Genome cartography through domain annotation
© BioMed Central Ltd 2001
Published: 3 July 2001
The evolutionary history of eukaryotic proteins involves rapid sequence divergence, addition and deletion of domains, and fusion and fission of genes. Although the protein repertoires of distantly related species differ greatly, their domain repertoires do not. To account for the great diversity of domain contexts and an unexpected paucity of ortholog conservation, we must categorize the coding regions of completely sequenced genomes into domain families, as well as protein families.
Delivery of the human genome draft sequence by publicly funded and corporate projects promises to precipitate significant biomedical advances this century. To rise to this challenge, biologists must become adept at navigating the vast expanses of genomic DNA data that may seem, at first glance, to be devoid of features. Yet lying beneath this facade of uniformity are rich veins of knowledge awaiting exploitation. Surveying and signposting this apparently bland genomic landscape should guide investigators towards experiments that address specific hypotheses about gene function.
But in what language are the signposts to be written? Different communities of biologists speak in dialects that are not always mutually comprehensible, particularly with respect to the umbrella term 'function'. Where one investigator might be interested in designing active-site inhibitors using high-resolution protein structural data, another might require information on gene pathways, and another might be focused on relating genotype to phenotype. If the genome is to offer up its secrets to all scientific communities, its surveyors need to adopt universal vocabularies.
Prediction or experimental finding?
A key to the databases mentioned in this article
Clusters of orthologous groups of proteins, generated from the comparison of protein sequences encoded in 34 complete genomes, representing 26 major phylogenetic lineages.
A distributed sequence annotation system software client and database server for the annotation of protein sequences.
Software for the automatic annotation of eukaryotic genomes.
Annotation and searching with gene, SNP, and cross-genome
Automatic annotations of sequence databases with gene and genomic information, including chromosome, genetic and molecular maps.
A dynamic, controlled vocabulary applicable to the annotation of eukaryotic genomes. Includes knowledge of the role of genes and proteins within cells.
A database of human genes that maps genes, proteins and diseases. Provides information on gene function.
Proteome analysis database based on Pfam, SMART, Prosite, PRINTS and ProDom protein and domain family databases and the SWISS-PROT and TrEMBL sequence databases. Also contains software for the annotation of protein sequences using these databases.
Interface to a database of sequence and descriptive information correlated with genetic loci.
Mammalian homology and comparative maps. Tools and databases from the Jackson Laboratory for the comparison of mammalian
Protein families database containing multiple sequence alignments and hidden Markov models.
Database of protein fingerprints based on protein motifs.
Protein domain database based on an automatic compilation of
Prosite profiles are protein domain profiles constructed from multiple sequence alignments of proteins from families of related sequences.
Protein domain families database containing multiple sequence alignments and hidden Markov models, based on a smaller set of domains than Pfam but designed to find domains that are more difficult to detect.
The Stanford microarray database of raw and normalized data from microarray experiments, including interfaces for data retrieval and
Protein sequence databases. SWISS-PROT represents a 'gold standard' of annotation.
Database of protein families based on hidden Markov models of multiple protein sequence alignments.
The second type of annotation relies not on empirical observations but rather on predicted evolutionary relationships. All genes that are thought to have arisen from a common ancestor are defined as homologs: where additional copies have arisen by gene duplication within a single genome they are defined as paralogs, whereas corresponding genes in different species are orthologs. Sometimes homologous gene products have strong sequence similarities, such that an inference of homology is straightforward; one such example is the Drosophila melanogaster gene branchless, which encodes a homolog of human fibroblast growth factor (FGF). On other occasions, protein homologs have subtle or indiscernible sequence similarities that try the patience and expertise of genome wayfarers. For example, human FGF and interleukin-1α have sequences whose similarity is statistically insignificant but, given their highly similar tertiary structures and their similar growth-factor-type functions, they are in fact very likely to be homologs.
The importance of annotating the genomic landscape on the basis of homology is that protein homologs almost invariably have similar tertiary structures and frequently also have similar functions. The surveying and way-marking of each new gene, therefore, need not always be an arduous process of discovering structure and function from scratch, since clues can be inferred from what has already been experimentally gleaned from its homologs. By applying the concept of homology, the problem of broadly predicting the functions of all genes is brought within the realms of possibility.
There is a pitfall to be avoided when annotating genes by homology: homology is defined on the basis of evolution, rather than function. On one hand, homologs may be related only by evolution and not by similarities in molecular mechanism; relatives of enzymes that now lack catalytic sites are just such examples. On the other hand, examples abound of divergent homologs, or even non-homologs, whose functions overlap. Consequently, homology assignment indicates only an approximate direction for future empirical determination of function, perhaps analogous to laying down a compass bearing rather than an exact map reference.
Domains are homologous portions of sequences that are encoded in different gene contexts and have survived the evolutionary tests of time without fragmentation. In three dimensions, domains are observed to be compact units of structure, often with a hydrophobic interior and a hydrophilic exterior, and they are not divisible into smaller units. Consequently, domains represent the finite vocabulary of protein evolution: if domains are words, then multidomain proteins are complete sentences.
Just as there is a dictionary or lexicon for every language, there is one - or in this case several - for the vocabulary of domains. Pfam [11,12] is the widest-ranging lexicon of domain families, predicting at least one domain for more than two-thirds of all entries in the SWISS-PROT protein database. SMART [14,15] is a more concise collection, focusing on those domains that are widespread and difficult to detect. Prosite [16,17] also has a dictionary of domain profiles, as do Celera (whose collection is called Panther) and The Institute for Genomic Research, TIGR [19,20].
Each of these resources detects domains using numerical representations of multiple sequence alignments, either hidden Markov models (HMMs) or generalized profiles (GPs). Although the constructions of HMMs and GPs are very different, formally they are equivalent. Homology assignments are guided by comparisons of HMMs or GPs with protein sequence databases, and by implementation of an upper threshold value for E, the number of unrelated sequences expected purely by chance that are aligned with a particular score, or higher, in the search. This procedure has been shown on many occasions to identify subtle, yet informative, sequence similarities among distant homologs.
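The role of the E threshold can be illustrated with a minimal Python sketch. It assumes that the search tool has already converted each alignment score into a per-sequence P-value (a tool-specific step not shown here) and uses the standard approximation that E is the P-value multiplied by the number of sequences searched. The hit records, the database size and the threshold of 0.01 are hypothetical values chosen for illustration only.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    score: float    # alignment score against the HMM or profile
    p_value: float  # chance of this score or higher for ONE unrelated sequence (assumed given)


def expected_false_hits(p_value: float, database_size: int) -> float:
    """E-value: number of unrelated sequences expected to reach this score by chance."""
    return p_value * database_size


def significant_hits(hits, database_size, e_threshold=0.01):
    """Keep only hits whose E-value does not exceed the chosen upper threshold."""
    return [h for h in hits
            if expected_false_hits(h.p_value, database_size) <= e_threshold]


# Hypothetical search results against a database of 100,000 sequences.
hits = [
    Hit("seqA", 85.2, 1e-12),
    Hit("seqB", 21.4, 2e-6),
    Hit("seqC", 9.7, 3e-3),
]
DATABASE_SIZE = 100_000

kept = significant_hits(hits, DATABASE_SIZE, e_threshold=0.01)
print([h.sequence_id for h in kept])
```

Because E scales with database size, the same alignment score becomes less convincing as the database grows, which is why a threshold on E rather than on the raw score is the more portable criterion.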
Taken together, Pfam [11,12], SMART [14,15] and the other domain resources carry considerable redundancy, although each has its own merits. An additional resource, InterPro [22,23], has been derived in part to avoid the onerous task of querying each of these domain lexicons separately for each protein sequence of interest. InterPro (release 3.0) combines the domain and motif sets of Prosite [16,17], Pfam [11,12], ProDom, PRINTS and SMART [14,15] in a hierarchical manner and is thereby able to provide annotation for 74% of all SWISS-PROT (protein database) and TrEMBL (translated DNA sequence database) entries.
The pros and cons of the domain-centric view of a genome
It is important to emphasize that annotation of individual proteins or complete proteomes using domains is achieved automatically rather than by manual curation. This is relevant, because a newly sequenced genome's proteome is generally in a high state of flux, with additions and deletions resulting from sequence updates, enhanced understanding of gene structure and identification of previously overlooked genes. The animal genome sequencing projects already underway will proceed in the same manner as the publicly funded human project, through numerous draft stages and on towards completion. Providing up-to-date, and necessarily automatic, annotation of incomplete genomic data will be essential.
One might suggest that gene annotation by use of domains is a relatively short-term measure that will be made redundant by results from high-throughput studies of non-vertebrate model organisms. It can be argued that detailed predictions of the functions of most human genes can be inferred from studies of their orthologs, as these are the most likely members of a family to have similarities in molecular and cellular roles. In contrast to previous expectations, the human genome draft publications found that the great majority of human genes have no orthologs in each of three important model organisms whose genome sequences are known, namely Drosophila (fruitfly), Caenorhabditis elegans (nematode worm) and Saccharomyces cerevisiae (baker's yeast) [1,2]. Thus, domain identification will continue to contribute to gene-function prediction until such time as reliable results from high-throughput studies on a more closely related organism, such as the mouse, are available.
Comparing proteomes using domain families
Deconvolution of human proteins into their constituent domains has also played a major role in understanding the evolution of chordates [1,2]. Three significant differences were detected between the repertoires of human domain families and those of the nematode worm, fruitfly, the mustard cress Arabidopsis thaliana and yeast. First, only a small proportion (7%) of human domain families are absent from the other proteomes. Second, numerous domain families were greatly expanded in terms of the number of members in humans, whereas others were considerably reduced. Finally, the human proteome contained significantly more combinations of domains. These findings demonstrated that domain 'invention' in the chordate lineage has made only a minor contribution to proteome diversity, whereas expansions and contractions of domain families, and domain additions and deletions, have all added greatly to proteome innovation.
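The three kinds of comparison described above (families present in one proteome only, changes in family size, and novel domain combinations) can be sketched in a few lines of Python. The two toy 'proteomes' below are entirely hypothetical, as are the protein identifiers; real analyses would start from domain assignments produced by a resource such as Pfam or SMART.

```python
from collections import Counter

# Hypothetical toy proteomes: protein id -> ordered domain architecture.
human = {
    "h1": ("SH3", "SH2", "Kinase"),
    "h2": ("SH2", "Kinase"),
    "h3": ("PDZ", "SH3"),
    "h4": ("KRAB", "ZnF"),
}
fly = {
    "f1": ("SH2", "Kinase"),
    "f2": ("PDZ",),
    "f3": ("ZnF",),
}


def families(proteome):
    """Set of domain families found anywhere in the proteome."""
    return {domain for arch in proteome.values() for domain in arch}


def family_sizes(proteome):
    """Number of proteins containing each family at least once."""
    return Counter(domain for arch in proteome.values() for domain in set(arch))


def architectures(proteome):
    """Set of distinct ordered domain combinations."""
    return set(proteome.values())


# 1. Families found in the human set but in no fly protein.
unique_to_human = families(human) - families(fly)

# 2. Expansion (positive) or contraction (negative) of shared families.
human_sizes, fly_sizes = family_sizes(human), family_sizes(fly)
size_change = {d: human_sizes[d] - fly_sizes[d]
               for d in families(human) & families(fly)}

# 3. Domain combinations seen only in the human set.
new_combinations = architectures(human) - architectures(fly)

print(sorted(unique_to_human))
print(size_change)
print(sorted(new_combinations))
```

In this toy case most of the novelty comes from new combinations of shared domains rather than from new families, which is the pattern the draft genome analyses reported on a far larger scale.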
Such studies demonstrate the power of comparative proteomics in relating gene content to the evolution of eukaryotic organisms. But as with the argument that the complexity of organisms is only loosely coupled to total gene number, the question of whether the number of representatives of a domain family in a proteome is directly related to that family's contribution to cellular and organismal function remains open. It is hoped that future proteome comparisons will progress beyond the simple enumeration of homologous genes and domains towards an understanding of the biology that underlies the variability of domain family sizes.
Domain-centric annotation is but one of many methods. Although it provides information about molecular structure, function and evolution, by and large it is unable to predict other functional aspects, such as cellular or organismal role, protein-binding partners or post-translational modifications. Fortunately, other views that address these aspects have been incorporated with domain predictions into web-based resources such as LocusLink, GeneCards, euGenes and Ensembl. Each of these sites represents a confluence of diverse information sources that are mapped to specific regions of genomic sequence. Whilst navigating these sites it should be borne in mind that they often fail to distinguish explicitly between annotations that are experimentally derived and those that are predicted by homology-based methods. Nevertheless, these sites are a significant boon to biologists since they provide views of genomes from multiple vantage points, from protein tertiary structure through to SNPs and on to human disease.
An improved navigation
As viewed now, the human genome appears to be a relatively featureless landscape, punctuated by islands of annotations for well-characterized and biomedically important genes. As biological sciences progress into a more knowledge-rich, as well as data-rich, era, the cartography of this genome will become more complex, with numerous different functional characteristics being assigned to a growing fraction of human genes. It will be increasingly important to restrict descriptions of function to a common and broad vocabulary that is compatible with computational approaches. Fortunately, a cross-community approach to dealing with this issue is already underway. The Gene Ontology project (GO) [27,28] has created an initial hierarchy of defined terms that encompasses many of the flavours of 'function' commonly described in biology. GO has begun to permeate genomics and it will do so more rapidly as its scope and attention to detail improve.
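Why a hierarchy of defined terms suits computational approaches can be shown with a short Python sketch: a gene annotated with a specific term is automatically retrievable under every broader term above it. The miniature vocabulary, its parent links and the gene names below are simplified, hypothetical stand-ins and are not taken from the GO files themselves.

```python
# Hypothetical miniature vocabulary: term -> set of direct parent terms.
PARENTS = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"catalytic activity"},
    "catalytic activity": {"molecular function"},
    "DNA binding": {"binding"},
    "binding": {"molecular function"},
    "molecular function": set(),
}


def ancestors(term, parents=PARENTS):
    """All broader terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_with(term, annotations):
    """Genes annotated with `term` itself or with any more specific term."""
    return sorted(gene for gene, terms in annotations.items()
                  if any(t == term or term in ancestors(t) for t in terms))


# Hypothetical annotations: gene -> set of most specific terms assigned.
annotations = {
    "geneA": {"protein kinase activity"},
    "geneB": {"DNA binding"},
    "geneC": {"kinase activity"},
}

print(genes_with("catalytic activity", annotations))
print(genes_with("molecular function", annotations))
```

Storing parents as a set lets a term have more than one parent, so the same traversal works whether the vocabulary is a strict tree or a more general graph.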
Two further domain-centric approaches look set to guide the efficient navigation of genomes: orthology prediction and the partitioning of a domain family into subfamilies with distinct functions. Orthology is a valuable concept from which to infer functional information between species, and orthologs from the genomes of 30 bacteria, archaea and the yeast Saccharomyces cerevisiae can be predicted directly using the COGs database [29,30]. Pairs of orthologs from animal genomes are also available on the web (for example from the Jackson Laboratory). No resource is yet available, however, that accurately predicts sets of orthologs for several multicellular eukaryotes, such as the fruitfly, worm, Arabidopsis, mouse and human. This situation will inevitably change on completion of the human and mouse genome sequences.
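One widely used heuristic for proposing pairs of orthologs between two genomes is the reciprocal best hit: two genes are paired if each is the other's highest-scoring match. The Python sketch below illustrates that heuristic only; it is not claimed to be the procedure used by any resource cited here, and the gene names and similarity scores are hypothetical.

```python
def best_hits(scores):
    """scores: {(query, subject): similarity score}. Returns {query: best subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical similarity scores from searching each proteome against the other.
human_vs_mouse = {
    ("hA", "mA"): 200, ("hA", "mB"): 90,
    ("hB", "mB"): 150,
    ("hC", "mB"): 120,   # hC's best hit is mB, but mB prefers hB
}
mouse_vs_human = {
    ("mA", "hA"): 198,
    ("mB", "hB"): 149, ("mB", "hC"): 118,
}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

The unpaired gene hC shows the main limitation of the heuristic: lineage-specific duplications produce one-to-many relationships that a simple pairwise criterion cannot represent.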
A more difficult problem is the partitioning of a homologous domain family into multiple subfamilies representing multiple functions. Homologous proteins with divergent sequences frequently have distinct functions that are characterized by contrasting patterns of conserved amino acids. One productive approach to the analysis and prediction of functional subtypes identifies key sites in multiple protein sequence alignments that specify the different functional subtypes. This method has performed well in defining functional subtypes, with prediction accuracies of up to 96%.
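The idea of 'contrasting patterns of conserved amino acids' can be made concrete with a deliberately crude Python sketch: it reports alignment columns that are invariant within each subfamily yet differ between subfamilies. This criterion is an assumption made for illustration and is much simpler than the published method referred to above; the subfamily names and aligned sequences are hypothetical.

```python
def subtype_specific_sites(subfamilies):
    """subfamilies: {name: [aligned sequences of equal length]}.

    Returns 0-based alignment columns that are invariant within every
    subfamily but are not identical across all subfamilies.
    """
    lengths = {len(seq) for seqs in subfamilies.values() for seq in seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (length,) = lengths

    sites = []
    for col in range(length):
        # Residues observed at this column, collected separately per subfamily.
        per_family = [{seq[col] for seq in seqs} for seqs in subfamilies.values()]
        conserved_within = all(len(residues) == 1 for residues in per_family)
        distinct_between = len(set().union(*per_family)) > 1
        if conserved_within and distinct_between:
            sites.append(col)
    return sites


# Hypothetical alignment of two functional subtypes of one domain family.
alignment = {
    "subtype_X": ["MKDLH", "MRDLH", "MKDIH"],
    "subtype_Y": ["MKELH", "MRELH", "MKEVH"],
}

print(subtype_specific_sites(alignment))
```

Here only the third column (index 2) qualifies: the first and last columns are conserved across the whole family and so carry no subtype information, while the second and fourth vary within a subtype.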
A key element in ensuring the general utility of genomic data lies in collating predicted with experimentally derived observations. A central function will be provided by sophisticated web forums that accumulate and automatically present integrated functional data. The best example so far is the Distributed Annotation System, or DAS, which seeks to amalgamate annotations donated by experimentalists worldwide. Even these schemes, however, will face a major challenge in integrating functional information from the huge datasets that arise, for example, from microarray [35,36] and proteomics experiments. The human genomic landscape, relatively featureless now, will soon be teeming with evidence, pointers, and clues. Ultimately, the success of the genome projects will be measured not in the completion of sequences, but in how access to integrated data opens up new avenues of research and therapy.
1. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351.
3. Jacq B: Protein function from the perspective of molecular interactions and genetic networks. Briefings Bioinformatics. 2001, 2: 38-50.
4. LocusLink. [http://www.ncbi.nlm.nih.gov/LocusLink]
5. GeneCards. [http://bioinformatics.weizmann.ac.il/cards]
6. euGenes. [http://iubio.bio.indiana.edu:8089/]
7. Ensembl. [http://www.ensembl.org/]
8. Sutherland D, Samakovlis C, Krasnow MA: branchless encodes a Drosophila FGF homolog that controls tracheal cell migration and the pattern of branching. Cell. 1996, 87: 1091-1101.
9. Zhang J, Cousens LS, Barr PJ, Sprang SR: Three-dimensional structure of human basic fibroblast growth factor, a structural homolog of interleukin 1α. Proc Natl Acad Sci USA. 1991, 88: 3446-3450.
10. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995, 269: 496-512.
11. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2000, 28: 263-266.
12. Pfam. [http://www.sanger.ac.uk/Pfam]
13. SWISS-PROT. [http://www.expasy.ch/sprot/]
14. Schultz J, Milpetz F, Bork P, Ponting CP: SMART: a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA. 1998, 95: 5857-5864.
15. SMART. [http://smart.embl-heidelberg.de/]
16. Hofmann K, Bucher P, Falquet L, Bairoch A: The PROSITE database, its status in 1999. Nucleic Acids Res. 1999, 27: 215-219.
17. Prosite. [http://www.isrec.isb-sib.ch/profile/]
18. Celera, Inc. [http://www.celera.com]
19. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O: TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 2001, 29: 41-43.
20. TIGRFAMs. [http://www.tigr.org/TIGRFAMS/]
21. Hofmann K: Sensitive protein comparisons with profiles and hidden Markov models. Briefings Bioinformatics. 2000, 1: 167-178.
22. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001, 29: 37-40.
23. InterPro. [http://www.ebi.ac.uk/interpro]
24. ProDom. [http://www.toulouse.inra.fr/prodom.html]
25. PRINTS. [http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/]
26. Jackson IJ: Mouse genomics: making sense of the sequence. Curr Biol. 2001, 11: R311-R314.
27. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genet. 2000, 25: 25-29.
28. Gene Ontology. [http://www.geneontology.org]
29. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29: 22-28.
30. COGs. [http://www.ncbi.nlm.nih.gov/COG/]
31. Mammalian Homology and Comparative Maps. [http://www.informatics.jax.org/menus/homology_menu.shtml]
32. Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. 2001, 307: 1113-1143. 10.1006/jmbi.2001.4513.
33. Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000, 303: 61-76. 10.1006/jmbi.2000.4036.
34. Distributed Sequence Annotation System (DAS). [http://stein.cshl.org/das/]
35. Sherlock G, Hernandez-Boussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA: The Stanford microarray database. Nucleic Acids Res. 2001, 29: 152-155.
36. The Stanford Microarray Database (SMD). [http://genome-www4.stanford.edu/MicroArray/SMD]
37. Hoogland C, Sanchez JC, Tonella L, Binz PA, Bairoch A, Hochstrasser DF, Appel RD: The 1999 SWISS-2DPAGE Database. Nucleic Acids Res. 2000, 28: 286-288.