Figure 2 | Genome Biology

Figure 2

From: Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA

Example of transcript modeling from a set of protein and mRNA alignments using DACMs. (a) The DACM inputs are mRNA (r1...r6) and protein (p1, p2) sequences that have been aligned to a genomic sequence S. Each individual local alignment is a level 1 transcript model (L1TM) and constitutes a node of a graph DACM1. (b) This graph has three possible directed edges: same_molecule, maximal_intron_size, and genomic_molecule_order. Each corresponds to a different relationship and connects two nodes if they, respectively: are alignments produced by the same mRNA or the same protein; are separated by a distance smaller than a user-defined threshold (for example, 75 kilobases); and are collinear on the molecule of origin (mRNA or protein) and on the genomic DNA. There are nine maximal paths along the three combined edges, which reduce DACM1 into the nine nodes (r1 to r6 and p1', p1", p2) of a graph DACM2, each representing a level 2 transcript model (L2TM). Note that the reduction of DACM1 splits nodes p1,1 to p1,5 into two DACM2 nodes (p1' and p1") because of the absence of a genomic_molecule_order edge between p1,3 and p1,4. (c) DACM2 has three possible edges, inclusion, extension (for mRNAs) and genomic_overlap (for proteins), which respectively connect two nodes if: they overlap and their overlapping introns are identical; they overlap and their overlapping introns are identical but the second node also extends the first in the 3' direction; and the spans of the two nodes have overlapping genomic coordinates. The reduction follows either the extension rule for mRNA edges or the genomic_overlap rule for protein edges and produces here the five nodes of graph DACM3 (mRNA nodes R1 to R3 and protein nodes P1 and P2), which represent level 3 transcript models (L3TMs).
(d) DACM3 has two possible edges, genomic_overlap and compatible_splicing_structure, which connect (combine) protein and mRNA transcript models if, respectively, they have overlapping genomic coordinates and the protein transcript model does not have any exons in introns of the mRNA transcript model. To reduce the graph, Exogean first identifies the paths that contain both edges; from these, the reduction consists of grouping all nodes that are connected to the same RNA node. This generates the three nodes of a graph DACM4 (RP1 to RP3), which represent level 4 transcript models (L4TMs). These L4TMs are the final transcript models generated by the DACM expert annotation. (e) Graphical representation of the DACM expert annotation output: the final transcript models RP1 to RP3 are represented on the genomic sequence S. No information has been lost during the three graph reductions. Note that transcript models produced by the DACM component of Exogean are not yet final, and will be further examined and potentially extended when looking for splicing and start/stop signals.
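
The DACM1 to DACM2 reduction described in panel (b) can be illustrated with a small sketch. This is not Exogean's code: the Alignment class, the linked and reduce_dacm1 functions, and all coordinates below are hypothetical and chosen only for illustration. The sketch also assumes a simplification, namely that it is enough to test consecutive alignments of the same molecule in genomic order, and that the distance between two alignments is the gap between the end of the first and the start of the second on the genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One local alignment (an L1TM). All field names are hypothetical."""
    molecule: str   # identifier of the mRNA or protein, e.g. "p1"
    mol_start: int  # coordinates on the molecule of origin
    mol_end: int
    gen_start: int  # coordinates on the genomic sequence S
    gen_end: int


def linked(a, b, max_intron):
    """Return True if a -> b satisfies all three edge relations of panel (b)."""
    same_molecule = a.molecule == b.molecule
    # Assumed distance: genomic gap between the two alignments.
    maximal_intron_size = 0 <= b.gen_start - a.gen_end < max_intron
    # Collinear: same order on the molecule and on the genome.
    genomic_molecule_order = (a.mol_end <= b.mol_start
                              and a.gen_end <= b.gen_start)
    return same_molecule and maximal_intron_size and genomic_molecule_order


def reduce_dacm1(alignments, max_intron=75_000):
    """Chain consecutive linked alignments; each chain stands for one L2TM."""
    chains = []
    ordered = sorted(alignments, key=lambda a: (a.molecule, a.gen_start))
    for aln in ordered:
        if chains and linked(chains[-1][-1], aln, max_intron):
            chains[-1].append(aln)   # extend the current path
        else:
            chains.append([aln])     # start a new path
    return chains


# Invented coordinates for five alignments of protein p1. The fourth one
# maps to an earlier part of the protein than the third, so the pair is
# not collinear and the chain is split there, as for p1' and p1".
p1 = [
    Alignment("p1", 1, 50, 1000, 1150),
    Alignment("p1", 51, 100, 2000, 2150),
    Alignment("p1", 101, 150, 3000, 3150),
    Alignment("p1", 20, 60, 4000, 4120),
    Alignment("p1", 61, 120, 5000, 5180),
]

chains = reduce_dacm1(p1)
print([len(chain) for chain in chains])  # [3, 2]
```

With these invented coordinates the five alignments of p1 fall into two chains of three and two alignments, mirroring the split at p1,3 and p1,4 noted in the caption. Lowering max_intron below the gaps between alignments would break every link and leave each alignment as its own transcript model.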