Evolution of the Tic22 gene family

Data @ github

Ingroup

Time to finish analysing the Tic22 gene family. I've used "blast_and_align.py" to search the genomes in Phytozome to find an ingroup dataset.

Results Tic22.nex: The initial analysis of the identified eukaryote proteins. No real surprises there. A gene duplication happened prior to plants moving to land life. This gene duplication resulted in the two clades Tic22-III and Tic22-IV. Selaginella moellendorffii is sister to two Physcomitrella patens proteins. This of course is a bit strange, but something I've seen in analyses of other gene families also. The Selaginella + Physcomitrella clade is sister to the Tic22-IV clade, albeit with very low support (pp. 0.55) so nothing to really worry about. Following the gene duplication, the evolutionary rate seems to be roughly the same in the two clades. Hence, the pattern found in Tic20 is not present here.

Outgroup

For the outgroup, I've used "reciprocal_blast.py" on the 66 cyanobacterial genomes I found at NCBI, Cyanobase.org and JGI. Hence, the cyanobacterial sequences in the datasets are all reciprocal BLAST matches to either Tic22-III or Tic22-IV. Preliminary MrBayes analyses of the two datasets master/Tic22.nex and master/Tic22_outgroup.short.nex are found in the directory "mrbayes".
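For reference, the reciprocal best-hit idea can be sketched as below. This is a minimal illustration and not the contents of "reciprocal_blast.py"; the function names, the (subject, bit score) input layout and the example sequence names are all assumptions made for the sketch.

```python
def best_hit(hits):
    """Return the subject with the highest bit score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(forward, reverse):
    """Pair each query with its best subject if that subject's best hit is the query.

    forward: query name -> list of (subject, bitscore) from query vs. genome
    reverse: subject name -> list of (hit, bitscore) from subject vs. query database
    (hypothetical layout; a real script would parse BLAST output into this form)
    """
    pairs = []
    for query, hits in forward.items():
        subject = best_hit(hits)
        if subject is None:
            continue
        # Keep the pair only if the search back returns the original query first.
        if best_hit(reverse.get(subject, [])) == query:
            pairs.append((query, subject))
    return pairs


# Made-up names and scores, for illustration only.
forward = {
    "atTic22-III": [("cyano_1", 120.0), ("cyano_2", 80.0)],
    "atTic22-IV": [("cyano_3", 60.0)],
}
reverse = {
    "cyano_1": [("atTic22-III", 118.0)],
    "cyano_3": [("other_protein", 90.0), ("atTic22-IV", 55.0)],
}
print(reciprocal_best_hits(forward, reverse))
```

Here only the first query keeps its match, because the best hit of "cyano_3" in the search back is a different protein.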

Results Tic22_outgroup.short.nex: In this analysis I've added the identified potential outgroup sequences from cyanobacteria. The N- and C-terminal ends of the sequences have been removed to eliminate uncertainty in the alignment. The tree has been rooted using Gloeobacter violaceus as sister-group to the rest. Support for the topology in the cyanobacterial clade is generally low. One sequence, from Synechococcus sp. WH5701, is found in the eukaryote Tic22-IV clade and has a really long branch. This result is probably due to a mistake in homology assessment (i.e. it is not homologous to Tic22) and the sequence should probably be removed from the dataset. The Selaginella + Physcomitrella clade persists as sister to the Tic22-IV clade and now has a pp. support of 0.82. Will have to look into that.

Next step will be to confirm that the cyanobacterial sequences are representative of the outgroup (and hence confirm that Tic22 originates from cyanobacteria). I have therefore done a "blast_and_align.py" analysis of a representative (?) selection of different bacterial genomes.
The results from this analysis are found in "all.Tic22-III.names.linsi.fst" and "all.Tic22-III.names.muscle.fst" and do not look very good. Will therefore try a more stringent approach, saving only one sequence per genome, and then inspect the result manually.

Result: This method also failed to provide any good indication that Tic22 originates from any bacterial lineage other than cyanobacteria.

Reciprocal BLAST analyses of 54 bacterial WGS datasets identified three matches!

Additional genome datasets to query

Will search the following databases using "blast_and_align.py" and the two sequences from A. thaliana as query sequences.

Tic22-III (after manually inspecting the BLAST results):

  • Cyanidioschyzon_merolae - Two candidate sequences saved
  • Cyanophora_paradoxa - Three somewhat poorly aligned sequences saved
  • Marchantia_polymorpha - No matches

Tic22-IV (after manually inspecting the BLAST results):

  • Cyanidioschyzon_merolae - Two candidate sequences saved
  • Cyanophora_paradoxa - Three somewhat poorly aligned sequences saved
  • Marchantia_polymorpha - Saving all five poorly aligned sequences

Added and aligned the new sequences to the additional_plant_genomes/Tic22_outgroup.2.fst file.

Cleaning up the alignment

Will exclude sequences from "Tic22_outgroup.linsi.fst" based on what the alignment looks like and the results from the previous phylogenetic analyses I made.
Two sequences stand out in the dataset. They are from Synechococcus_sp_WH5701 (as mentioned earlier) and Cyanothece_PCC_7425. The position of the latter in the phylogeny "Tic22_outgroup.short.nex.con.tre" is "reasonable" among the other cyanobacteria. However, the way it aligns to the rest of the dataset does not look good.
The current dataset contains 106 sequences, of which 10 are new and come from the three species Cyanidioschyzon merolae, Cyanophora paradoxa and Marchantia polymorpha (saved in master/Tic22_outgroup.2.linsi.nex). Will analyse this dataset with MrBayes like this:


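begin mrbayes;

[run the likelihood calculations through BEAGLE on the GPU]
set Usebeagle = Yes;
set Beagledevice = GPU;

[all 1458 alignment columns are protein; let MrBayes sample among the fixed amino acid models]
charset protein = 1 - 1458;
prset aamodelpr = mixed;

mcmc ngen=5000000 printfreq=1000 samplefreq=1000;

end;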

Results: Terminated analysis after nearly 5000000 generations. StdDev is ~0.08 and the analysis does not seem to converge.

Anabaena sp. PCC 7120

Apparently I had not downloaded this genome from Cyanobase. I have now downloaded it to "db/cyanobacteria/all" and formatted it as a BLAST database. Before that I had to fix the FASTA headers, which in the original file do not start with ">". After that I ran a "blast_and_align.py" analysis on the Anabaena sp. PCC 7120 genome using atTic22-III and -IV as query sequences.
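The header fix could look something like the sketch below. I did not record exactly how the headers were laid out, so this rests on a loud assumption: a line made up only of letters, "*" or "-" is sequence, and any other non-empty line is a header. A header consisting of letters only would be misread as sequence, so the output needs a manual check. The function name and the example lines are hypothetical.

```python
import re

# Assumption: sequence lines contain only letters, '*' (stop) or '-' (gap).
SEQUENCE_LINE = re.compile(r"^[A-Za-z*\-]+$")


def fix_fasta_headers(lines):
    """Prefix '>' to header lines that lack it; drop blank lines."""
    fixed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith(">") or SEQUENCE_LINE.match(line):
            fixed.append(line)  # already a proper header, or a sequence line
        else:
            fixed.append(">" + line)  # treat as a header missing its '>'
    return fixed


# Made-up example lines.
raw = ["all0001 hypothetical protein\n", "MKTLLV\n", "\n", ">alr0002\n", "MSTA\n"]
print("\n".join(fix_fasta_headers(raw)))
```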

Result: atTic22-III has two reasonably good matches. atTic22-IV has no good matches. Results in "outgroup/cyanobacteria/Anabaena_sp_PCC_7120".

Reciprocal BLAST analysis - Bacteria

Each of the two A. thaliana sequences had a reciprocal BLAST match to a sequence from Escherichia_coli_UTI89. Should they be included in the analysis? No match to any other sequences.

Aligned the two new sequences to the "Tic22_outgroup.2.linsi.nex" dataset, and saved as "master/Tic22_outgroup.3.linsi.fst". The new Anabaena sp. PCC 7120 sequence was removed from the dataset along with two sequences from Marchantia polymorpha (see "master/Tic22_outgroup.4.linsi.fst").

Analysed master/Tic22_outgroup.4.linsi.fst with zorro (cutoff value 0.4), and then with MrBayes. The latter program reports "Division 1 has 287 unique site patterns".

Also added the three bacterial sequences (from Escherichia coli UTI89, Flavobacterium johnsoniae UW101 and Ehrlichia canis str. Jake) identified in the reciprocal_BLAST analysis (saved to "master/Tic22_outgroup.5.linsi.fst"). The three bacterial sequences are very divergent from the rest of the sequences and will not be included in later analyses.

The position of Selaginella and Physcomitrella

Preliminary analyses have shown that a clade of sequences from Selaginella and Physcomitrella is sister to the clade including atTic22-IV. The Selaginella/Physcomitrella clade is hence not sister to both land plant clades. It is therefore difficult to tell whether the gene duplication resulting in atTic22-III and -IV happened before or after the transition to land life. I have added the rest of the Selaginella and Physcomitrella sequences from the "blast_and_align.py" analysis (results stored in "ingroup/tic20-III and -IV") to the "master/Tic22_outgroup.5.linsi.fst" dataset and aligned using mafft-linsi. One sequence from each species was kept in the alignment after manual inspection.

Checking the topology before the analysis had finished showed that the Physcomitrella and Selaginella sequences end up in unexpected positions in the tree. Will remove them and create the "final" dataset to analyse ("master/Tic22_outgroup.7.fst"), which contains 101 sequences. Columns in the matrix with a probability of 0.4 or greater (as estimated by zorro) were analysed in MrBayes on the UoL cluster Alice.
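The column filtering step can be sketched as follows. This is a minimal illustration, assuming one zorro score per alignment column, already on the same scale as the 0.4 cutoff; the function name, the dictionary layout and the toy sequences are hypothetical and not taken from the scripts actually used.

```python
def filter_columns(alignment, scores, cutoff=0.4):
    """Keep only alignment columns whose score is at or above the cutoff.

    alignment: sequence name -> aligned sequence (all of equal length)
    scores:    one confidence value per column (assumed same scale as cutoff)
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(scores)}:
        raise ValueError("alignment length and number of scores differ")
    keep = [i for i, score in enumerate(scores) if score >= cutoff]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment and made-up scores, for illustration only.
toy = {"seq_a": "MK-LV", "seq_b": "MKALV"}
toy_scores = [0.9, 0.1, 0.4, 0.39, 1.0]
print(filter_columns(toy, toy_scores))
```

Columns scoring exactly 0.4 are retained, matching the "0.4 or greater" criterion.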

I also created a reduced dataset ("master/Tic22_outgroup.8.fst") that includes 47 sequences from the following species:

Ingroup

  • Aquilegia coerulea
  • Arabidopsis thaliana
  • Brachypodium distachyon
  • Brassica rapa
  • Chlamydomonas reinhardtii
  • Cyanidioschyzon merolae
  • Cyanophora paradoxa
  • Glycine max
  • Malus domestica
  • Oryza sativa
  • Panicum virgatum
  • Phaseolus vulgaris
  • Physcomitrella patens
  • Prunus persica
  • Selaginella moellendorffii
  • Sorghum bicolor
  • Thellungiella halophila
  • Vitis vinifera
  • Volvox carteri
  • Zea mays

Outgroup

  • Anabaena sp. PCC 7120
  • Synechococcus elongatus PCC 7942
  • Synechococcus sp. PCC 7002
  • Synechocystis sp. PCC 6803
  • Gloeobacter violaceus

In relation to previously published results

Tripp et al. 2012 included a sequence from Anabaena sp. PCC 7120 in their dataset. I don't have this sequence in my dataset, so obviously there were no reciprocal best BLAST matches between any sequences in that genome and the query sequences I used. In order to compare my results to Tripp's I'm going to include it in the dataset for the next analysis. "Synechocystis sp. PCC 6803" is also mentioned in the paper as well as included in the current alignment. Let's keep that one in further analyses. In the paper they also say that:

"tic22 mutants show a similar phenotype as omp85 mutants and a physical interaction between Tic22 and Omp85 was observed." Tripp et al. 2012

How do the phylogenetic results from the two gene families relate to that?

In this paper, they also present the 3D structure for Tic22 in Anabaena. This could be useful for determining where to truncate the N- and C-terminal ends of the alignment.

Furthermore, this also has to be checked:

According to the tree of life proposed by Cavalier-Smith T. thermophilus branched off from the common clade before cyanobacteria (52). Thus, the observed symmetry in the structure of the cyanobacterial Tic22 and the “single domain” structure of the protein from T. thermophilus points to a gene duplication event in case of the cyanobacterial protein (Fig. 5).

Strict clock vs. non-clock model

We observe that branch lengths in the two land plant Tic22 clades are very similar. This would indicate that the evolutionary rates in the two clades have been similar following the gene duplication. To test this, I have analysed the dataset "Tic22_outgroup.8.linsi.nex" using both a strict molecular clock model and a non-clock model. Initial analyses, comparing the harmonic means of the marginal likelihoods for the whole dataset (using the Gloeobacter sequence as outgroup), showed strong support for a non-clock model. Hence, the whole dataset does not appear to have evolved under a strict molecular clock. I will follow up this analysis with a test of strict clock vs. non-clock evolution in the land plant clades by excluding the cyanobacterial, C. merolae, C. paradoxa and V. carteri sequences from the next analysis. The dataset is called "Tic22_clock.fst" and includes only land plant sequences and a C. reinhardtii sequence (as outgroup) from "Tic22_outgroup.8.linsi.fst". Analysed "master/Tic22_outgroup.8.linsi.fst" with zorro (cutoff value 0.4), and then with MrBayes.
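The arithmetic behind the model comparison is sketched below. MrBayes reports the harmonic mean of the sampled likelihoods itself (via sump), so this is only an illustration of the calculation; the function names are hypothetical and the numbers in the example are placeholders, not results from these analyses.

```python
import math


def harmonic_mean_lnl(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given sampled log-likelihoods.

    ln HM = ln(n) - ln(sum_i exp(-lnL_i)), computed with a log-sum-exp
    shift so that large negative log-likelihoods do not overflow.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one sample")
    negated = [-value for value in log_likelihoods]
    shift = max(negated)
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in negated))
    return math.log(n) - log_sum


def two_ln_bayes_factor(lnl_model_a, lnl_model_b):
    """2 * ln(Bayes factor) in favour of model A over model B."""
    return 2.0 * (lnl_model_a - lnl_model_b)


# Placeholder marginal likelihood estimates, NOT values from this study.
non_clock = -1000.0
strict_clock = -1010.0
print(two_ln_bayes_factor(non_clock, strict_clock))
```

A positive value favours the first model; values above roughly 10 are commonly read as very strong support. The harmonic mean estimator is known to be unstable, so the comparison is best treated as a rough guide.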

I have reduced the dataset to 42 sequences by removing duplicated sequences (e.g. different gene models for the same gene). The new files are called Tic22.9*. The dataset was then aligned with linsi and analysed with zorro. Using a zorro cutoff value of 0.4, the matrix was then analysed with MrBayes 3.2 using a mixed amino acid model and an Independent Gamma Rates relaxed clock model (IGR; Lepage et al. 2007) for 5 000 000 generations on the Alice cluster at the University of Leicester.