
Table 2 Towards future-proof annotations

From: Roadblock: improved annotations do not necessarily translate into new functional insights

To ensure that improved annotations lead to meaningful insights into biological function, it is essential that they are accurate, user friendly and sympathetic to the needs of the end user. As we highlight in the main text, the status quo means that information does not readily flow between protein and transcriptomic annotations, limiting its uptake by those undertaking functional studies. The reach of future annotations is likely to be maximised by engaging the widest possible community of scientists in developing updated annotations, both to ensure that they are fit for purpose and to identify potential solutions from other fields (see point 1 for an example). Here, we highlight some initial suggestions based on our experience, with the aim of starting a dialogue with the wider scientific community.

Genomic annotations:

1. The standard reference genome will likely need to adapt to incorporate human genomic diversity as more individuals are sequenced. It may be helpful to consider moving away from the use of ‘genomic coordinates’ to ‘genomic space’: a set of ‘averaged’ coordinates onto which individual genomes can be projected, akin to the standardised 3-dimensional brain space used in the neuroimaging community [21].

2. An updated genomic reference is likely to remain the best reference space for mapping and viewing transcriptomic data, as well as information about DNA (and RNA) modifications as these emerge.

Transcriptomic annotations:

3. It would be extremely useful for future annotations to flag whether a given transcript has been derived from computational reconstruction or is directly supported by sequence data, either from long-read nucleic acid sequencing or from a complete peptide sequence (see points 5 and 6), producing ‘predicted’ and ‘high confidence’ sets, respectively. These complementary annotations could be kept somewhat separate (analogous to the current manually vs. automatically derived transcriptomic annotations), allowing researchers to use whichever best suits their needs and aiding harmonisation with protein annotations.

4. Current transcriptomic annotations are assembled largely based on nucleic acid sequences. They therefore miss out on corroborating information from orthogonal sources, including proteomic sequencing. Future annotation pipelines would benefit from the inclusion of an increased diversity of input information sources.
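As a minimal sketch of the flagging proposed in point 3, the example below sorts transcripts into ‘high confidence’ and ‘predicted’ sets according to the evidence recorded for each. The record fields (`long_read_support`, `complete_peptide_support`) and the function names are hypothetical and are not part of any existing annotation format; a real pipeline would need curated criteria for what counts as support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptEvidence:
    """Evidence recorded for one transcript (hypothetical schema)."""
    transcript_id: str
    long_read_support: bool = False         # directly sequenced with long reads
    complete_peptide_support: bool = False  # backed by a complete peptide sequence


def confidence_set(evidence):
    """Return the annotation set a transcript belongs to under point 3."""
    if evidence.long_read_support or evidence.complete_peptide_support:
        return "high confidence"
    # Anything without direct support is treated as a computational prediction.
    return "predicted"


def split_annotation(records):
    """Group transcript IDs into the two complementary annotation sets."""
    sets = {"high confidence": [], "predicted": []}
    for evidence in records:
        sets[confidence_set(evidence)].append(evidence.transcript_id)
    return sets


# Illustrative records with made-up identifiers.
records = [
    TranscriptEvidence("tx1", long_read_support=True),
    TranscriptEvidence("tx2"),
    TranscriptEvidence("tx3", complete_peptide_support=True),
]
annotation_sets = split_annotation(records)
```

Keeping the evidence flags on each record, rather than storing only the final label, would let the orthogonal sources mentioned in point 4 be added later without changing how the sets are derived.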

Protein annotations:

5. Peptide sequence data need to be readily available if they are to feed into transcriptomic annotations. However, although attempts have been made to collate peptide information [22], no centralised repository of peptide sequence data currently exists. Instead, researchers undertaking protein sequencing typically deposit raw spectral data. Thus, there is a need for a centralised repository of peptide sequence information to allow harmonisation with other data sources. This database would either need researchers to submit sequence information or would need to derive sequences from spectral data. Notably, in both cases this would require curators to set quality control thresholds and, potentially, to derive ‘high confidence’ and ‘low confidence’ peptide sets (akin to the complementary transcriptomic annotations proposed in point 3).

6. The current UniProt annotation focuses on producing a single record of full-length protein sequences, relying heavily on manual curation. This allows the inclusion of an ‘annotation score’ giving a measure of confidence that a record is accurate. However, the challenge of producing ‘full length’ sequences from often partial sequence data applies as much to peptide sequences as to nucleic acid sequences. Indeed, in the case of data derived from approaches employing tryptic digestion, the peptides are mostly short, and overlap between peptides only occurs in cases of incomplete trypsin cleavage, making de novo reconstruction more difficult than for short-read RNA-seq data. Note that even the highest UniProt annotation score (“Experimental evidence at protein level”) does not guarantee that the individual sequences are accurate. As for transcriptomic annotations (point 3), confidence in the accuracy of individual sequences will be maximised by harmonising across orthogonal sources of information. It will likely be appropriate to move from a single annotation to the production of multiple protein annotations (e.g. ‘high confidence’ and ‘predicted’ sets) to allow researchers to select the most appropriate annotation for their needs.
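The reconstruction problem described in point 6 can be illustrated with a small in silico digest. This is a sketch under a simplified cleavage rule (trypsin cuts after K or R unless the next residue is P); the function names and the toy sequence are invented for illustration and do not represent a real protein or a specific proteomics tool.

```python
def cleavage_sites(sequence):
    """Positions after which trypsin cuts (simplified rule: after K/R, not before P)."""
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def tryptic_peptides(sequence, max_missed=0):
    """Return (start, end, peptide) tuples, allowing up to max_missed missed cleavages."""
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    last = len(bounds) - 1
    peptides = []
    for i in range(last):
        # Each extra boundary skipped corresponds to one missed cleavage.
        for j in range(i + 1, min(i + 1 + max_missed, last) + 1):
            peptides.append((bounds[i], bounds[j], sequence[bounds[i]:bounds[j]]))
    return peptides


toy_protein = "AAKCCRDDKEE"  # made-up sequence, not a real protein

complete = tryptic_peptides(toy_protein)                 # complete digestion
partial = tryptic_peptides(toy_protein, max_missed=1)    # one missed cleavage allowed
```

With complete digestion the peptides tile the sequence end to end and share no residues, so nothing in the peptides themselves indicates their order. Only the longer peptides produced by a missed cleavage (for example `AAKCCR`, which spans `AAK` and `CCR`) provide the overlap needed to link neighbouring fragments.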

Scientific literature:

7. Harmonisation of annotations with the scientific literature (e.g. in the context of exon naming highlighted in the main text) is challenging. To facilitate the uptake of transcriptomic information in functional studies, it would be beneficial for existing protein records to include additional data from transcriptomic annotations. For example, peptide sequences could be annotated with the locations of exon boundaries and the Ensembl IDs of these exons to increase the usage of standardised naming by those conducting functional studies. Furthermore, it may be useful to allow the direct submission of additional (non-sequence) information by the community (e.g. the ‘colloquial’ names for individual exons) so that this information can ‘flow’ backwards into nucleotide and protein annotations, thereby improving their consistency with the wider scientific literature.

8. Reporting guidelines should be developed and mandated by publishers to standardise nomenclature across the different fields and minimise ambiguity in publications.
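As an illustration of the exon-boundary annotation suggested in point 7, the sketch below projects exon junctions onto protein coordinates from the coding length of each exon. The function name, the input layout and the exon identifiers are placeholders invented for this example; real records would carry Ensembl exon IDs and would need to account for untranslated regions before this arithmetic applies.

```python
def exon_boundaries_in_protein(coding_exons):
    """Map exon junctions onto protein residue positions.

    coding_exons: list of (exon_id, coding_length_nt) in 5'->3' order,
    counting only coding nucleotides.
    Returns (upstream_exon, downstream_exon, residue, phase) tuples, where
    residue is the 1-based position of the first residue encoded at least
    partly by the downstream exon, and phase is the number of nucleotides
    of that codon contributed by the upstream exon (0 means the junction
    falls between codons).
    """
    boundaries = []
    offset = 0  # coding nucleotides upstream of the current junction
    for (exon_id, length), (next_id, _) in zip(coding_exons, coding_exons[1:]):
        offset += length
        complete_codons, phase = divmod(offset, 3)
        boundaries.append((exon_id, next_id, complete_codons + 1, phase))
    return boundaries


# Placeholder exon identifiers and lengths, for illustration only.
exons = [("exonA", 100), ("exonB", 50), ("exonC", 30)]
junctions = exon_boundaries_in_protein(exons)
```

Here the first junction falls one nucleotide into codon 34, so residue 34 is split across `exonA` and `exonB`, whereas the second junction falls cleanly between residues 50 and 51. Storing such positions alongside a protein record is one way the exon naming used in functional studies could be tied back to standardised identifiers.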