By contrast, 3D structures reveal this portion of the rRNA to be compact and double helical. The incorrect representation of helices leads to downstream inaccuracies in the domain architecture. The 5S rRNA is positioned and oriented as an adjunct to Domain 2, based on its location and interactions in the 3D structure. These representations are available for Escherichia coli , Thermus thermophilus , Haloarcula marismortui and Saccharomyces cerevisiae.
Base-pairing and base-stacking interactions were obtained from the library of RNA interactions FR3D 26 and confirmed by inspection and in-house code. Helices contain paired bases. Optimum helical definitions maximize intra-helical base stacking and minimize inter-helical base stacking. For base pairing, we used the geometric criteria of Leontis and Westhof We assumed that each nucleotide belongs uniquely to no more than one helix and that a helix contains contiguously stacked bases.
The majority of helices obtained by this method agree with those obtained by Gutell 8. However, some of their helical boundaries required subtle revision. For example, new helix 49b remains represented by a loop in structure 3D , as it is in structure phylo.
We incorporated new Helices 25a, 26a, 49a and 49b. Sequence conservation across all phylogeny was used to evaluate the generality of various helix and domain models. Sequences came from 19 eukaryotic species, 67 bacterial species and 36 archaeal species. The probability p i of a nucleotide type at a given position was approximated by its fractional occupancy at that position in the aligned sequences.
The Shannon Entropy H at each position was calculated from the probabilities p i according to Equation 1 30 , The Shannon Entropy ranges from 0 to 2. The minimum value indicates that the nucleotide type at that position is universally conserved. The maximum value corresponds to equivalent populations of all four nucleotide types. The 23S rRNA was partitioned into domains such that each helix is placed uniquely in one domain, using helices defined as described above.
We used statistics on molecular interactions, and calculations of 2D folding propensities, compactness and sphericity to evaluate domain models. The domain boundaries were defined by molecular interactions including base-pairing, base-stacking, base-phosphate and base-sugar interactions listed here by priority.
Domain boundaries were adjusted to minimize interactions between domains and to maximize interactions within domains. Each nucleotide belongs uniquely to one domain. Because Mfold cannot predict folds with non-canonical base pairs, Helix 26a was replaced with a canonical helix composed of Watson—Crick base pairs. The substitution is not expected to affect the energetics of the predicted folds, as the loop E motif and the substitute duplex have similar folding energies We isolated putative domains from the 3D structure of the LSU and computed various characteristics of the isolated domains using the 3V software package We estimated domain compactness with sphericity 36 , which ranges from 0 to 1.
A sphericity of 1 is most compact. We quantified molecular interactions of all of the rRNA, both helical and non-helical. Local secondary interactions were excluded. The diagonal elements of these matrices i,i describe the number of the intra-domain interactions for Domain i , whereas the off diagonal elements i,j contain the total number of interactions between Domain i and Domain j. The two halves of the extended single-stranded region are seen to associate by contiguous base-pairing interactions.
Nucleotides — are paired with nucleotides —, to form what we call Helix 26a. Inspection of the 3D structure confirms Helix 26a Figure 3 a and b , which was inferred previously by Fox and Gutell 23 , 40 and by Leontis and Westhof Helix numbers and nucleotide numbers are indicated.
The base-pair interactions within the central singled-stranded region linking nucleotides AC to UG are highlighted in orange. Inspection of Helix 26a in the 3D structure of the ribosome reveals that G is extruded. This helix falls clearly within the parameters of the loop E motif 24 , For comparison, an example of a different loop E motif is shown in Figure 3 c and d. This alignment suggests that the pairing interactions of Helix 26a are conserved over the entire phylogenetic tree.
In addition, we have inspected all available 3D structures of ribosomes 10—18 and observe that Helix 26a is conserved, as a helix, and more specifically, as a loop E motif, in each.
The alignment shows that the nucleotides of Helix 26a cluster into three groups according to the extent of conservation, as observed previously The third conservation group, at the termini of the helix, contains Watson—Crick pairs —, — and — These base pairs conserve canonical pairing but not base identity. Sequence analysis of the sarcin ricin loop E motif Helix 95, Figure 3 c and d , also an loop E motif, reaffirms this pattern, with a single difference: the trans Hoogsteen—Hoogsteen A—A base pair in position — mutates in Bacteria to the less stable U—C base pair In general, the sequence and pairing interactions of Helix 26a are more highly conserved than those of Helix RNA melting experiments previously demonstrated that a loop E helix has stability comparable to that of a double-stranded Watson—Crick helix of a similar length Using inspection of 3D structures and geometric calculations of interaction, we have partitioned the nucleotides of the 23S rRNA into helices.
Each nucleotide and each base pair belongs uniquely to no more than one helix. Contiguously stacked bases are allocated to a common helix.
The nucleotides within each of these helices are given in Table 1. Helix 25a was noted previously in Haloarcula Marismortiu 16 and E. A domain is defined here as a compact and modular structure, stabilized by a self-contained nexus of molecular interactions that suggest the ability to fold autonomously. Each rRNA helix is considered to be autonomous and to belong uniquely to a single domain.
We use computation and inspection to infer which regions of rRNA best satisfy these criteria. We use statistics of molecular interactions, 2D folding simulations and calculations of compactness and sphericity to evaluate domain models.
Fox and Gutell previously suggested that the 23S rRNA might contain a central core region forming a basic scaffold from which the domains emerge Their central core, proposed before the determination of 3D structures of ribosomes, consists of Helices 26a, 26, 47, 48, 61, 72 and 73 our numbering scheme.
The Fox and Gutell proposal was, to our knowledge, never incorporated into a secondary representation or a domain model of the 23S rRNA. The apparently single-stranded extension on Helix 61 Figure 4 a is tightly coiled onto other elements of the Domain 0 Figure 4 b and c. Similarly, the apparently single-stranded extensions of Helix 26 fold back on each other and on the terminus of Helix The location and interactions of Domain 0 with other domains are illustrated in 3D in the Supplementary Figures S1 and S2.
Domain 0 corresponds well with the ancestral core of the LSU in the Bokov and Steinberg model of ribosomal evolution The helices are numbered and distinguished by color. The coloring of helices is consistent with Figure 2. Helix 26a, with an extruded G and non-canonical base pairs, is orange. Domain 0 orange forms the central core of the 23S rRNA, to which all other domains are rooted. Domains 0—VI are colored as in Figure 1 b. The selection of Helices 25, 26, 26a, 61, 72 and 73 as elements of Domain 0 is based on several criteria.
The domain elements are tightly networked with each other and are less integrated with surrounding rRNA. The six helices of Domain 0 are linked by 16 interlocking interactions Supplementary Table S4. Helices 61 and 73 are further stabilized by interactions with each other.
Helix 25a intercalates between Helices 72 and 73 and also forms an A-minor interaction with Helix 26 Helix 72 interacts with Helix By contrast, interactions of Helices 26, 61 and 73 with their traditional domains of origin are limited to terminal regions of the rRNA elements. Domain 0 appears to play a structural role. It contains a cleft that embraces the A- and P-regions of the peptidyl transfer center PTC and holds them in proximity. Helix 25a, which was not identified in the Fox and Gutell model, is included in Domain 0.
Helix 1 is wholly contained within Domain I. Results from the program Mfold 33 suggest that Domain 0 is an autonomous folding unit. Theoretical RNA rings, sequences artificially designed according to coding constraints 53 , 54 seem homologous to tRNAs 55 , 56 , Two main approaches have been developed and used to recover accretion histories of ribosomal RNAs.
Both consider secondary structure subcomponents of rRNAs as units undergoing this process. One approach is based on homology, character polarity 58 and cladistic comparisons to infer accretion history from comparisons among numerous sequences This classical comparative biology method uses parsimony as its main conceptual tool 60 and was also used to recover evolution of molecular functions 61 , 62 and protein accretion 63 , Various empirical tests show that this method recovers actual histories better than chance 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , A second approach is structure-based, and assumes that the ribosome grew from its spatial core towards its periphery, with the most ancient structural subcomponents located at the physical center of the ribosome, and the more recent ones at its periphery 76 , 77 , The method corresponds to that of spatial comparisons in disciplines such as plant community ecology.
Structures encompass large amounts of information: in ribosomes, contact biases between amino acids and nucleotide triplets recover the very ancient evolution of genetic code codon-amino acid assignments Though reasonable, the structural method lacks to our knowledge further empirical tests in contexts of reconstructing biomolecular histories, but one of its merits is that for each taxon for which accurate structural data are available, it produces slightly different histories, enabling to search for consensuses.
The theoretical premises of the structural approach are in observations that ontogenies of different structures recover their phylogenies: chemical prebiotic evolution 80 ; genetic code evolution 81 ; embryology 82 , 83 ; and ecological communities Spatial variation in vegetation can reconstruct the ontogeny of forests forest succession 85 , but plant colonization at forest periphery and clearings differ from de novo colonization of areas where no forest is adjacent and no humus exists: primary and secondary successions differ In addition, the structural model unrealistically assumes equal ribosomal growth in all directions from the core to the periphery 87 , 88 , Its name, the onion peeling model, is formally incorrect in onions, peripheral rings are most ancient , reflecting emphasis on structure rather than historical process Overall, one can assume that both approaches complete each other, one recovering history using phylogenetic methods, and the other using principles from ecology and embryology for historical reconstruction.
Accretion ranks of the 16S rRNA secondary structure subcomponents according to cladistic- and structure-based methods differ Fig. The highest percentage of secondary structure subelements with reasonable match between accretion ranks from the two methods is for 16S rRNA domain 3, the lowest percentage is for domain 2.
Notably, domain 4, presumed most ancient and consisting of two secondary structure subelements, has one element where both methods are highly congruent, and have very different ranks for the other subelement. Accretion rank of 16S rRNA structural subelements according to the structural onion model periphery most recent 78 ranks therein from Fig. Accretion ranks are divided by the highest rank according to that method structural, 27; phylogeny, 39 , then multiplied by The overall impression resulting from Fig.
Hence, for almost half of the secondary structure subelements, we do not know the accretion rank. A method clustering RNA secondary structures found two main RNA secondary structure groups, one characterized by small, presumably ancient tRNA-like secondary structures, and a presumed more derived group, characterized by larger, rRNA-like secondary structures, including viruses 91 , The decision to assume it is most ancient was not only based on the inclusion of tRNAs in that cluster.
The same rationale was applied to functional tRNA species, ranking as most ancient those with the highest diversity of isoacceptor tRNAs Note that this clustering is phenotypic, based on secondary structure similarities, not phylogenetic. Results show that tRNA-like RNAs have few unpaired nucleotides within stems bulges ; for rRNA-like secondary structures, the proportion of bulges among all unpaired nucleotides is greater.
Bulges are targets for regulation and enzymatic degradation, properties of advanced metabolism. In prebiotic conditions, these might be disadvantageous, increasing degradation risks. This assumption about the evolutionary direction of secondary structures was tested explicitly on tRNAs from diverse organisms organelles, Archaea, Bacteria, Eukaryota and Megavirales.
Results were overall positive weakest in Eukaryota , confirming tRNA-rRNA polarity: two independent scales of evolutionary ranks, one for amino acids, and one for RNA secondary structures, converge Note that the phylogenetic and the structural methods also make polarity assumptions. In the former, these are deduced from cladistic parsimony principles 95 , in the latter, from structure: the more peripheral a structural element in the ribosome, the more recent, including information on stacking interactions among subdomains 78 , Hence, it suffers from sampling biases, and from some level of circularity: biological data are used to infer on biological phenomena, a caveat it shares with the phylogenetic method.
A possible solution to this is to use as reference theoretical minimal RNA rings, a set of short sequences designed in silico according to few basic constraints: the shortest possible sequence coding for a start and a stop codon, and once for each of the 20 biogenic amino acids.
These constraints define at most 25 circular RNA sequences of 22 nucleotides, which code according to partially overlapping codons, along three consecutive translation rounds, for a start codon, 20 different amino acids, and a stop codon. The stop codon is physically next to the start codon, closing the RNA ring. These RNA rings, mainly defined by coding sequences, resemble ancestral tRNAs 97 , 98 , with a predicted anticodon and its corresponding cognate amino acid for each RNA ring The theoretical minimal RNA rings realistically mimic primitive RNAs and their evolution, along several coding properties 99 , , , and primary and secondary structure properties 50 , 56 , These properties coevolve with the genetic code integration order of the cognate amino acid matching the anticodon defined by homology of the RNA rings with ancestral tRNAs 50 , 56 , 57 , 99 , , , However, we do not yet understand what determines these complex evolutionary trajectories.
Notably, the tRNA-rRNA scores obtained for secondary structures of these RNA rings, correlate, as observed for real tRNAs 56 , with the evolutionary ranks of integration of the cognate amino acids matching their predicted anticodons This parallels the result described in the previous section for regular tRNAs and the genetic code integration order of their cognate amino acid Here too, the polarity results from this order, not from phylogenetic reconstruction.
As plausible proto-tRNAs, they are used here as references for ancestral RNAs, in line with results of evolutionary analyses of their different properties 50 , 56 , 57 , 99 , , , The method assumes that high similarities with RNA ring secondary structures indicate ancient structural subelements, and low similarities recent 16S rRNA structural subelements.
These similarities are then compared with accretion ranks produced by each of the phylogenetic and the structural hypotheses, expecting: 1. The quantification of similarities between secondary structures is identical to previous analyses 56 , 57 , 91 , Optimal secondary structures of spliced RNA rings were predicted by Mfold Four secondary structure properties are extracted from secondary structures, as shown as example for structural subelement h45 from the archaean Thermus thermophilus 16S rRNA Fig.
Boundaries between secondary structure subelements are from Fig. Subelement h44 ranges from nucleotides to Its only external loop is from nucleotides to Hence, a total of 41 nucleotides are considered unpaired, including the external loop. Similarities between two secondary structure pairs are estimated by Pearson correlation coefficients r between these four variables as obtained for each secondary structure Fig.
Table 1 presents the four secondary structure variables for AB for all 22 alternative splicings of that RNA ring. Such data were obtained for all 25 RNA rings. Similar secondary structure data for 22 alternative splicings of RNA ring 13, called AL, were presented previously 57 , therein Table 3. For each comparison, Fig. For each datapoint, the X-axis is defined by the value obtained for the AB secondary structure, and the Y-axis by the value obtained for the corresponding variable for the 16S secondary structure subelement shown in Fig.
These pairings are not arbitrary: the x- and y-axis values are for the same secondary structure property, but for a different secondary structure x-axis, RNA ring 25; y-axis, rRNA structural subelement, in this case h45 of Thermus thermophilus. Similarities are estimated by r, the more positive r, the more similar the secondary structures.
Each datapoint represents one of the four variables extracted from secondary structures, Y-axis values are from Fig. The secondary structure variables of all secondary structure subelements of two Archaea, Thermus thermophilus and Sulfolobus solfataricus Table 2 , two bacteria, Escherichia coli and Streptomyces coelicolor Table 3 , and the 18S rRNA of two eukaryotes, Homo sapiens and Saccharomyces cerevisiae Table 4. There are 25 RNA rings, each 22 nucleotide long.
These are considered according to the splicing matching homology with ancestral tRNAs, as shown previously Table 1 in 50 , 57 , , and Table 2 in Each RNA ring can be spliced at 22 positions, and a different optimal secondary structure predicted by Mfold exists for RNA ring sequences spliced at each potential splicing position. Four secondary structure variables are extracted from each of these secondary structures.
Table 1 presents as an example these four variables for the 22 alternative splicings of a specific RNA ring, RNA ring For each of the about 45 structural subelements of small rRNA subunits of the 6 examined organisms, the four secondary structure variables are extracted, as was done for the RNA ring structures at step 3. The secondary structures of RNA rings are compared to the secondary structures of rRNA structural subelements by analyses as presented in Fig.
These analyses plot the values obtained for each of the 4 secondary structure variables of a rRNA structural subelement as a function of the corresponding values obtained for a given RNA ring secondary structure. Figure 3 presents comparisons between 16S rRNA subelement h45 of Thermus thermophilus and two RNA ring 25 secondary structures, one obtained by splicing that ring at position 7, and one at position According to our hypothesis, the about 45 rSs comparing a given RNA ring secondary structure to all rRNA structural subelements are potential estimates of the accretion order of the rRNA secondary structures.
These rSs are compared to the accretion order of the rRNA secondary structure subelements, as these were determined by other methods and published by other authors separately for each cladistic and structural accretion ranks. This comparison is done by calculating the Pearson correlation coefficient between the rS and the accretion orders, producing rH, one for the cladistic method, rHphyl, and one for the structural method, rHstru. The z transformation linearizes the scale of r, which is not linear.
Hence, each of the RNA ring secondary structures produces one rHphyl and one rHstru per organism. For each organism, there are rHphyls and rHstrus. The minimal and maximal rHphyls and rHstrus for each organism are in Table 5. For any given RNA ring secondary structure, there are 6 rHphyls and 6 rHstrus, because analyses were done for 6 organisms.
Further analyses describe general patterns within these data, according to RNA rings, and according to splicing positions. Percentages of negative rHphyls and rHstrus for each RNA ring calculated among the rHphyls and among the rHstrus, pooling all organisms are used in the y axis of Fig. There are 25 RNA rings. Hence, for a given splicing position, there are 25 rHphyls and 25 rHstrus. Percentages of negative rHphyls and rHstrus for each splicing position, calculated from these rHphyls and rHstrus, consist the y axis in Fig.
Analyses in Table 5 , Figs. Hence, these are not biased representations of the data. There are 25 theoretical minimal RNA rings. Each has exactly 22 nucleotides, hence each RNA ring has 22 alternative splicing positions.
Different splicings produce different sequences forming different secondary structures, as shown for RNA ring 25, AB, in Table 1 , and previously for RNA rings 9 , therein Table 1 and 13 57 , therein Table 3.
The secondary structure variables shown in Table 1 and Figs. Table 5 shows the most negative and the most positive rH correlations between secondary structure similarities and accretion ranks according to the phylogenetic and the structural method for each of the six organisms rHphyl and rHstru, respectively.
Similarities rS were between the secondary structure variables described in Tables 2 — 4 and corresponding variables for the secondary structures formed by each of the 22 alternative splicings of each of the 25 theoretical minimal RNA rings. Considering that the main prediction of the working hypothesis expects negative correlations, it is notable that in each organism, the absolute values of the negative correlation is larger than the absolute value of the positive correlation, besides for one among 12 comparisons, according to the structural method, for Sulfolobus solfataricus.
In addition, percentages of negative rHs are significantly greater for rHphyl than rHstru within three among six species, Sulfolobus , Streptomyces and yeast. In Homo , percentages of negative rHstru were significantly greater than percentages of negative rHphyl.
The overall pattern is that results match the working hypothesis, and this more for rHphyl than rHstru. The opposite occurs in Homo. This could be interpreted as due to recent evolution of small rRNA structure in that species, but would require additional analyses and data from other species. A second noteworthy point is that the most positive correlations are in 7 among 12 cases with RNA ring 2, which has a predicted anticodon for a stop codon, coding sometimes for selenocysteine. This is presumably one of the latest amino acids integrated in the genetic code 21 st.
This result fits the prediction that the most positive correlations between accretion ranks and secondary structure similarities would correspond to RNA rings with recent cognates.
In other words, these secondary structures would not be references for initial RNAs starting the accretion process, but for the latest RNAs in the accretion process. Patterns confirm several points: 1. Most correlations between accretion ranks and secondary structure similarities are negative as expected by the working hypothesis, for most RNA rings; 2. In most cases, there are more negative correlations for the phylogenetic than the structural method for reconstructing accretion ranks; 3.
Percentages of negative correlations decrease with the genetic code integration order of the cognate amino acid of RNA rings see above comments for RNA ring with selenocysteine as predicted cognate.
Particularly noteworthy is that results of analyses presented here for the small rRNA subunit are in line with results obtained for the large rRNA subunit These analyses compared structural subelements of the large rRNA subunit with the same RNA ring secondary structures as those used here.
This has been suggested by several previous lines of analyses presented in the Introduction 10 , 11 , 12 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 56 , 57 , 91 , 92 , expanding upon evidences for common origins for tRNAs and rRNAs 1 , 2 , 3 , 4 , 7 , 8 , 9. Analyses presented here for the small rRNA subunit show greater congruence between accretion orders derived from the secondary structure method used here and the phylogenetic method than between the former and the structural method. Similar analyses done for the large rRNA subunit produce qualitatively similar results, independently confirming our method and evolutionary conclusions.
Overall, both phylogenetic and structural methods produce accretion orders that are congruent with the secondary structure method applied through the tRNA-rRNA axis of RNA secondary structure evolution.
It is probable that the structural methods are more prone to errors due to evolutionary convergences than the phylogenetic method, though convergences remain the main difficulty in reconstructing evolution. Bloch, D. Life Biosph. Biosystems 17 , — Tracing the evolution of RNA structure in ribosomes. Acids Res. Google Scholar. The natural history of transfer RNA and its interactions with the ribosome.
Root-Bernstein, M. The ribosome as a missing link in the evolution of life. Root-Bernstein, R. Cryptic tRNAs in chaetognath mitochondrial genomes. Alignment and tree reconstruction performance of four recent structural alignment methods was compared with exclusively sequence-based approaches.
As empirical data, we used a hexapod rRNA data set to study the influence of nucleotide interdependencies in sequence alignment and tree reconstruction.
Structural alignment methods delivered significantly better sequence alignments compared with pure sequence-based methods.
Also, structural alignment methods delivered better trees judged by topological congruence to simulation base trees. However, the advantage of structural alignments was less pronounced and even vanished in several instances. This can be interpreted as a stronger sensitivity of mixed model setups to nonphylogenetic signal. Secondary structure consideration clearly influenced sequence alignment and tree reconstruction of ribosomal genes. Although sequence alignment quality can considerably be improved by the use of secondary structure information, the application of mixed models in tree reconstructions needs further studies to understand the observed effects.
As ribosomal RNA rRNA sequences are highly structured, with large regions exhibiting conserved base pairing patterns, incorporating this information in sequence alignment and tree reconstruction might help to reduce errors associated with these problems Jow et al. In the present study, we tested this idea on simulated and empirical data using structure-informed alignments and tree reconstructions.
To discriminate between the impact of secondary structures in alignment and tree search analyses, simulated and empirical data were aligned with both structural approaches and exclusively sequence-based methods. Our goal was to systematically compare the performance between these approaches in relation to different average branch lengths of model trees. Secondary structure information has been advocated to improve the alignment of rRNA sequences e.
Structure-aided alignment uses knowledge about the secondary structure of mitochondrial and nuclear ribosomal genes. Their function in protein biosynthesis is mostly maintained by their secondary and tertiary structures. Consequently, structure features are the targets of natural selection, and primary sequences may vary, as long as the functional domains remain structurally conserved. However, computer-based alignment approaches, which consider predicted secondary structures of large RNA molecules have only been published recently RNAsalsa, Stocsits et al.
An important feature of these methods is the ability to predict reasonable secondary structures for large rRNA sequences, which can be used to guide the alignment. Previous methods for structure prediction of noncoding RNA nc RNA , relying on thermodynamic folding Zuker and Stiegler or the McCaskill algorithm McCaskill , which computes base pairing probability matrices, were mostly unable to generate biologically reasonable structures of sequences spanning more than about nucleotides Higgs Consequently, the consideration of rRNA structures in previous phylogenetic analyses relied on manual interpretation Kjer , Dohrmann et al.
However, algorithmic frameworks are preferable to manual approaches for structure-aided rRNA alignments. The observed site covariation patterns of paired sites of rRNA sequences do not display independent phylogenetic information Jow et al. But, in tree reconstructions, interdependence of corresponding sites can be taken into account Schoeniger and von Haeseler , Rzhetsky , Tillier and Collins , Tillier and Collins , Stephan , Higgs , Parsch et al. Most recent phylogenetic approaches ignore site covariation.
If variation of paired sites is strongly correlated, but treated as independent, phylogenetic information is scored twice, thus leading to biased support Jow et al. Recent phylogenetic studies corroborate the proposed superiority of the application of mixed substitution models over standard DNA models in phylogenetic analyses based on rRNA sequences Jow et al.
We simulated alignments mimicking several different evolutionary scenarios based on three structure-annotated rRNA sequences and three different starting trees fig. To examine 1 if structure consideration in the alignment process and 2 if application of RNA substitution models both can improve reliability, simulated reference alignments were subsequently degapped and realigned with seven different alignment methods: three traditional primary sequence-based methods and four new structure informed alignment approaches.
Performance of these approaches were compared according to the reference alignments. Based on the aligned data sets, we further conducted tree reconstructions with Bayesian inference. In a first setup, standard DNA models were applied and the resulting trees were compared with trees based on the reference alignment. Setup simulation analyses. Flowchart representing the setup of the simulation analyses.
A comprehensive data set of nuclear 18S and 28S ribosomal genes of Hexapoda was chosen as the empirical example fig. To discriminate between the impact of secondary structures in alignment and tree reconstruction, the 18S and 28S sequences were initially aligned with both an exclusively sequence-based approach and a structure-aided alignment approach.
Based on these alignments, we subsequently compiled four different setups for Bayesian analyses: i combined sequence and structure alignment considering base pairings in the tree search, ii combined sequence and structure alignment and tree search without consideration of base pairings, iii simple sequence alignment considering base pairings in the tree search, and iv simple sequence alignment and tree search without considering base pairings.
To check the performance of each analysis, the resulting trees were compared with previous morphological and molecular studies on the hexapod phylogeny.
Setup hexapod phylogeny. Flowchart representing the alignment and substitution model setup for the hexapod tree reconstruction analyses. However, these data sets only span about a few bp or less and are therefore quite unsuitable to test the performance of alignment methods on rRNA sequences. In the present study, we consequently rely on a different approach to generate data sets simulating rRNA genes.
RNAsim Guo et al. It generates sequences by using free energy optimization of secondary structure folding as a proxy of sequence fitness.
In a population genetics model, thermodynamic stability of sequences is assumed as their phenotype. Mutations substitution, insertion and deletion events are fixed by their contribution to the sequences's secondary structure. Thermodynamic alone does not cause rRNA-specific structure conformation in vivo e. RNAsim tries to surmount this by defining an optimal free energy E opt at the beginning of each simulation. Additionally, a gamma distribution of among site rate heterogeneity is implemented, assigning a prior mutation rate to each site.
RNAsim requires a structure-annotated sequence string and a phylogenetic tree as input. Different evolutionary scenarios were simulated using three taxon topologies with different branch lengths fig. We also applied seven different scaling factor values of branch lengths to model the impact of relative substitutions rates. The scaling factors of population size, thermodynamic folding, and substitution to indel ratio were set to default values in RNAsim Guo et al.
Each simulation was replicated ten times. Root trees. The three different root trees for the RNAsim analyses. The sequence-based methods used were 1 ClustalW Thompson et al. Four structure informed alignment methods are currently able to handle large rRNA data sets.
Potential stem fragments are first aligned and a nucleotide alignment is done by postprocessing. For the alignment of multiple sequences a so called Four-way Consistency objective function, calculated from the base-pairing probabilities of each sequence, is used in a progressive method with a subsequent iterative refinement step. RNAsalsa is a framework to align structural RNA sequences by utilizing existing knowledge about structure patterns, adapted constraint directed thermodynamic folding algorithms, and comparative evidence methods.
It automatically and simultaneously generates both individual secondary structure predictions within a set of homologous RNA genes and a consensus structure and takes sequence and secondary structure information into account as part of the alignment's scoring function.
All alignment methods were applied with default settings. As structure constraints for the RNAsalsa, we employed secondary structures of B. Mimachlamys varia 18S was downloaded from the European Ribosomal data base see for a detailed description of the RNAsalsa method Stocsits et al. Alignment accuracy was assessed with the sum-of-pairs SPS score, which is calculated as the number of correctly aligned nucleotide pairs divided by the total number of nucleotide pairs in the reference alignment.
Normal distribution of the alignment scores for each branch length were tested with a Kolmogorov—Smirnov test. Afterward, we used an one-way analysis of variance with Tukey's post hoc test to test, if the programs significantly differ in their performance.
We additionally performed an analysis of covariance with a posteriori Sidak test to test for significant differences in sensitivity of the alignment programs to increasing branch lengths i.
We used the parallel version of MrBayes 3. This doublet model assumes that base pairing is converted into another by a two-step process via an intermediate state.
Although several specific RNA models are currently available, we focused on only one RNA model as an example to scrutinize the effects of structure consideration in tree reconstruction. Including several different RNA models would have been beyond the scope of the present study. In the first setup, all alignments including the reference alignment were analyzed under a general time teversible GTR; Yang et al.
Priors were set to: Uniform priors were applied to the shape parameter of the gamma distribution and the topology, stationary nucleotide frequencies, substitution rates of the different models were covered with a Dirichlet prior, and an exponential prior ten was selected for branch lengths. All analyses were run with two different Metropolis-coupled Markov chains MCMC four chains, , generations and every 1,th generation sampled. The resulting trees were compared with the Robinson—Foulds distance measure Robinson and Foulds Data sets were divided into stem and loop partitions.
Priors and MCMC settings were applied similar to the first setup. In addition to the Robinson—Foulds score, which estimates purely topological distances, we applied the Kuhner—Felsenstein distance score, which also considers branch length differences Kuhner and Felsenstein The Kolmogorov—Smirnov test showed that tree distances were not normally distributed, therefore, nonparametric tests Friedman rank sum test and the Wilcoxon signed rank test were used.
The Friedman test was applied to calculate the ranking and subsequently the Wilcoxon test was used to compare pairwise differences between the alignment methods. We used only complete or nearly complete sequences to infer reliable individual secondary structures. Consequently, we only considered 18S sequences with a minimum length of 1, bp and 28S sequences with a minimum length of 3, bp. We only considered taxa, which were represented by both genes.
This resulted in a data set representing 88 hexapod species. As outgroups, two crustaceans, two chelicerates, one myriapod, and a tardigrade were used supplementary material S1 , Supplementary Material online.
Default settings for gap opening 1. As structure constraints, we employed the nuclear 18S and 28S structure models of A. The stringency settings for adoption of secondary structures in different alignment steps were relaxed 0. The aligned data sets were subsequently masked with the program Aliscore Misof and Misof , which identifies ambiguously aligned regions in multiple sequence alignments. The following settings of Aliscore were used: Window size was set to four positions, gaps were treated as ambiguous characters, and the maximum number of possible pairwise comparisons were applied.
Aliscore is currently not able to detect base pairings. This causes conflicts, as RNAsalsa estimates positional homologies considering sequence and structure information in corresponding base pairs. Thus, regions that are structurally conserved but variable at the sequence level might be determined as randomly similar ambiguously aligned by Aliscore.
As we considered positions, which are part of the consensus structure as structurally conserved, these positions were retained as paired positions in the data set.
Tree reconstructions were also conducted with MrBayes 3. RNAsalsa extracts a primary structure constraint of the input alignment, which was subsequently used to apply doublet models on the conserved regions of the L-INS-i data set.
Different substitution rates were covered by a gamma distribution with four categories Yang et al. For each setup, two independent runs were performed with one cold and three hot chains.
0コメント