- Open Access
Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive
© The Author(s) 2016
- Received: 13 June 2016
- Accepted: 29 November 2016
- Published: 30 December 2016
Gene annotations, such as those in GENCODE, are derived primarily from alignments of spliced cDNA sequences and protein sequences. The impact of RNA-seq data on annotation has been confined to major projects like ENCODE and Illumina Body Map 2.0.
We aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. We found 56,861 junctions (18.6%) in at least 1000 samples that were not annotated, and their expression associated with tissue type. Junctions well expressed in individual samples tended to be annotated. Newer samples contributed few novel well-supported junctions, with the vast majority of detected junctions present in samples before 2013. We compiled junction data into a resource called intropolis available at http://intropolis.rail.bio. We used this resource to search for a recently validated isoform of the ALK gene and characterized the potential functional implications of unannotated junctions with publicly available TRAP-seq data.
Considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. The rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote. Now is perhaps an appropriate time to update incomplete annotations to include splicing present in the now-stable snapshot provided by the SRA.
Gene annotations such as those compiled by RefSeq  and GENCODE  are derived primarily from alignments of spliced complementary DNA (cDNA) sequences and protein sequences [3, 4]. So far, the impact of RNA sequencing (RNA-seq) data on annotation has been limited to a few projects including ENCODE  and Illumina Body Map 2.0 .
To measure how much splicing variation present in publicly available RNA-seq datasets is missed by annotation, we aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the hg19 genome assembly with Rail-RNA  and compared exon-exon junction calls to exon-exon junctions from annotated transcripts. We compared exon-exon junctions rather than full transcripts because junction calls from short RNA-seq reads are considerably more reliable than assembled transcripts . Details of our alignment protocol are reviewed in Methods. All alignment was performed in the cloud using Amazon Web Services (AWS) Elastic MapReduce, costing 72 US cents per sample, as computed in Methods.
We considered only Illumina platforms because of their ubiquity and high base-calling accuracy. Specifically, the samples we aligned were obtained by querying the SRA metadata SQLite database of the R/Bioconductor package SRAdb  as of April 2015 for all Illumina-sequenced human RNA-seq samples.
Gene annotations from which exon-exon junctions were extracted and unioned to obtain a list of annotated junctions
Number of junctions
We compiled the junction calls and associated coverage levels for 21,504 SRA RNA-seq samples into a resource called intropolis available at http://intropolis.rail.bio. Using this resource, we addressed several questions that are fundamental to our understanding of the transcriptome and informative for analyses by individual investigators.
Reproducibility of junction calls across alignment protocols
Relationship between annotation and expression of splice junctions
We next asked whether annotated junctions represent the diversity of junction expression observed in the population at large. We considered an RNA-seq junction to be well supported in our data if it was observed in a large number of samples. We calculated the number of junctions that appeared in at least S samples across a range of cutoffs. For each RNA-seq junction we considered, we also evaluated whether it appeared in annotation. We considered the following levels of evidence: (1) fully annotated junctions; (2) separately annotated junctions (typically exon-skipping events), where both the donor and acceptor sites appear in one or more junctions from annotation, but never for the same junction; (3) alternative donor and acceptor sites, where only either the donor or the acceptor site appears in one or more junctions from annotation; and (4) novel junctions, where neither donor nor acceptor site is found in any annotated junction.
We also took an investigator-focused view of the relationship between annotation and expression. Most investigators collect only a small number of samples for their study. We restricted attention to samples where at least 100,000 RNA-seq junctions were found to rule out obviously small RNA-seq samples and samples that were mislabeled as RNA-seq on the SRA. In each sample, we counted the number of instances where a read maps across a junction. (A read mapping across two junctions thus contributes two instances.) The total number of such “junction overlaps” across samples is a measure of the total expression of junctions across the SRA. Most of the reads that map to junctions map to annotated junctions (Fig. 2 b). In 10,090 of a total of 10,311 samples that meet our criterion of 100,000 junctions observed, more than 95% of junction overlaps correspond to annotated junctions.
This represents only the bulk coverage of junctions. We can also consider the number of junctions observed, regardless of coverage. In 3389 out of 10,311 samples, we observe that fewer than 80% of junctions appear in annotation (Fig. 2 c). So while the most highly covered junctions are well annotated, there is a large number of junctions that are well covered across multiple samples but may not appear in any given small subset of samples.
Technical and biological variation in junction expression across samples
We next explored variation across the 21,504 samples we processed. We wanted to see the combination of technical and biological factors that contribute to variation in unannotated junction expression. In this analysis, we considered only the 56,861 unannotated junctions found in at least 1000 samples of the 21,504, and the subset of 21,057 samples of the 21,504 with at least 100,000 reads each. We performed a principal component analysis (PCA) on the data matrix where rows correspond to the 56,861 unannotated junctions and columns correspond to the 21,057 samples. (See Methods for technical details of the decomposition.)
We further examined samples belonging to specific groups that generated well-characterized datasets. Both the SEQC consortium and ABRF  studied universal human reference RNA (UHRR) and human brain RNA reference (HBRR) samples constructed by the MACQ-III consortium for quality control. UHRR comprises total RNA from ten different cancer cell lines representing various human tissues, while HBRR samples comprise total RNA from several donors across several brain regions. Both groups studied these samples in four different mixture ratios—0:1, 1:3, 3:1, and 1:0—with each sample sequenced at multiple sites. The four mixtures separate well, and each lies on a radial line passing through the singular point on the left. Data from the two groups are separated because they used different sequencing depths and read lengths.
The four SEQC UHRR:HBRR sample ratios form four clusters distinguished by PC2, and the ABRF UHRR:HBRR sample ratios form clusters distinguished by both PC1 and PC2. Observe that there is a singular point where all points appear to converge (Fig. 4). Here, the number of junctions detected in a sample approaches zero. A radial line extending from the singular point rotating clockwise across the plot passes over UHRR:HBRR sample ratios in the same order for ABRF as it does for SEQC. Though ABRF and SEQC have some overlap in managing investigators, they are two different projects that employed randomized study designs, making a strong case that PC2 is distinguishing mostly biological rather than technical factors.
Lymphoblastoid cell lines, typically made from HapMap samples, are extensively present in the SRA. Different studies cluster together and are again placed on a radial line going through the singular point; each study used very different sequencing depths and read lengths. Searching the SRA metadata, we could classify a number of samples as brain and blood. Again, these samples fall along radial lines through the singular point. The biggest separation in PC2 is between brain and blood, two tissue types that are well represented in the SRA.
Novel junction discovery over time
We proceeded to measure the accumulation of “confidently called” junctions over calendar time. A junction was “confidently called” if it was found in at least 20 reads across all samples. We measured the discovery date of a junction as the earliest submission date to the BioSample database  from among all samples in which the junction was found by Rail-RNA. The ≥20-read curve has noticeable spikes in 2009 and 2011 but appears to decelerate significantly before 2013, by which time 96.1% of junctions were discovered.
But inspection of the relationship between documentation date and discovery date suggests that only the first GENCODE release introduced new junctions with significantly earlier discovery dates than other releases (Fig. 6 b). The reason for this phenomenon is that junctions appearing first in GENCODE’s first release are present in many more samples (median = 5825) than junctions appearing first in other GENCODE releases (median = 602 samples) (Fig. 6 d).
Application to ALK isoform discovery
We have compared the variation in our database intropolis to standard gene annotations.
intropolis associates each junction with the set of samples where the junction was called and the number of reads spanning the junction in that sample, enabling biological investigators to gain new insights. Here, we give a simple example application involving the anaplastic lymphoma kinase (ALK) gene.
ALK is frequently mutated or aberrantly expressed in cancers including neuroblastoma [22–25] and non-small-cell lung adenocarcinoma, where in particular it has been found to participate in the fusion gene EML4-ALK . Cancers with ALK abnormalities are often responsive to treatment with ALK inhibitors such as crizotinib . ALK is a good therapeutic target because it is rarely expressed in normal adult tissue . A novel ALK transcript variant present in about 11% of melanomas and occasionally in other cancer subtypes was recently identified . The transcript is described as resulting from a de novo alternative transcription initiation (ATI) site in ALK intron 19 and is dubbed ALK A T I . The kinase activity of ALK A T I is found to be suppressed by various ALK inhibitors, and a patient with ALK A T I -expressing metastatic melanoma is shown to exhibit significant tumor shrinkage after treatment with crizotinib.
Top ten samples across the 21,504 analyzed in this paper in order of descending junction inclusion ratio D, as defined in the table
Sample (i.e., run)
Description of sample
coverage A for ALK
coverage B for ALK
NHEM.f_M2: normal human
Non-small cell lung
NHEM_M2: normal human
H2228, an EML4-ALK-expressing
lung adenocarcinoma cell line
NHEM.f_M2: normal human
Potential functional implications of previously unannotated junctions
We lastly searched for evidence that unannotated and partially annotated junctions were functionally relevant. In , Hupe et al. performed translating ribosome affinity purification followed by RNA sequencing (TRAP-seq) of brain and kidney samples from mouse. TRAP is a technology that isolates translating RNAs from intact tissues, potentially from targeted cell types. Thus, we examined the extent of our novel and partially annotated junctions presumably being translated. We aligned the six kidney and nine brain TRAP-seq samples from their study using Rail-RNA and lifted the resulting exon-exon junction coordinates over from mouse (mm10) to human (hg19) (see Methods). Of the 112,825 junctions found across the TRAP-seq samples whose liftovers were also found in at least one SRA sample, 86,954 (77.1%) were fully annotated, 10,771 (9.5%) were exon skips, 12,410 (11.0%) had alternative donors or acceptors, and 2690 (2.4%) were novel. These data suggest that a significant fraction of unannotated junctions are likely conserved across species: more than 3% of unannotated junctions found in more than 1000 SRA samples have analogs likely translated in mouse. Furthermore, of the 84,185 junctions found across the TRAP-seq samples whose liftovers were also found in at least 1000 SRA samples, 81,482 (96.9%) were fully annotated, 1089 (1.3%) were exon skips, 1464 (1.7%) had alternative donors or acceptors, and 150 (0.2%) were novel. So there is significant evidence that many previously unannotated or partially annotated junctions are translated into proteins and therefore have potentially novel functional relevance.
We have measured variation in junction expression across thousands of RNA sequencing samples. Our analysis demonstrates both the strengths and weaknesses of relying on current annotation for RNA-seq analysis. We have also used our population-level view of transcription to understand the potential hazards of analyzing individual samples without a clear understanding of the background variation in junction discovery levels. We have shown evidence that some unannotated and partially annotated junctions in human have translated analogs in mouse. We have introduced a resource, intropolis, for others to investigate junction variation, and we have provided an example of the utility of our resource in the case of ALK gene expression.
While we observed many unannotated junctions shared by thousands of RNA-seq samples from independent investigations, some of these are likely false positive calls due to incorrect placement of reads, sample-specific variation, and regions where the reference genome is incorrectly assembled. Rail-RNA  is designed to be parsimonious and conservative, and its junction calls agree closely with those of other aligners (Fig. 1). When an intropolis junction appears in many samples, our evidence suggests that the call is reliable; e.g., 99.8% of junctions found in at least 8000 samples from the SRA are also found in gene annotation. That said, individual novel junctions in intropolis should be used carefully and verified by other means, such as rtPCR, where appropriate.
Our study also suggests that the rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote (Fig. 5). This has implications for projects and tools that use gene annotations; if annotations have been incomplete up to this point, now is perhaps an appropriate time to update them to include splicing present in the now-stable snapshot provided by the SRA.
As highlighted by Fig. 2 a, b, considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless are present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. Quantification of poorly expressed genes may thus be improved by incorporating information about annotated and unannotated splicing events. Along related lines,  develops a local splicing variation (LSV) formalism that jointly analyzes multiple junctions from the same gene using splicing graphs. The authors show a significant impact of considering novel (unannotated) junctions on their conclusions. Using this, or similar methodology, on the intropolis database to go beyond the single-junction analysis presented here may be an interesting avenue for future research.
Our approach to synthesizing large public RNA-sequencing datasets offers the opportunity to study transcription more deeply than ever before. Further, intropolis is a step toward establishing public resources that facilitate rapidly querying existing RNA-seq data.
Identifying annotated junctions
Following , we extracted junctions from transcripts across all the latest “empirical” gene annotation tracks in the UCSC Genome Browser  for hg19 and hg38 except GENCODE  and Ensembl . (While GENCODE’s tracks are also in the UCSC Genome Browser, we chose to download them from the GENCODE website http://www.gencodegenes.org/releases/ instead: as of 24 January 2016, GENCODE v22 was the latest GENCODE track listed, but GENCODE v24 had already been released.) Empirical tracks are based on alignments of, e.g., spliced cDNA and protein sequences and are listed in Table 1. Annotation tracks based on algorithmic predictions from genome sequence (Augustus, GeneID, Genscan, N-SCAN, and SGP) were excluded because they comprise transcripts that were not directly observed in experiment. Ensembl was excluded because GENCODE is already a merge of Ensembl and HAVANA transcripts. After junction coordinates were extracted, all hg38 coordinates were lifted over to hg19 where feasible, and the union of all junctions was taken. Liftover of junctions was performed using the UCSC liftOver utility  with command-line parameters -ends 2 -minMatch=1.0. Since the intropolis database was formed from alignments to only the hg19 chromosomal assembly, only those junctions corresponding to the hg19 chromosomal assembly were kept to form a final list of annotated junctions. Table 1 lists all gene annotations used to determine our set of annotated junctions. We froze these annotations on 24 January 2016 and compressed them into an archive available at http://verve.webfactional.com/misc/jan_24_2016_annotations.tar.gz. We ran the script https://github.com/nellore/runs/blob/master/sra/rip_annotated_junctions.py with PyPy v2.5.0 to extract junctions from these annotations, performing coordinate conversions from hg38 to hg19 where appropriate. The final list of junctions we defined as “annotated” is available at https://github.com/nellore/runs/blob/master/sra/annotated_junctions.tsv.gz.
Selecting human SRA samples
Samples were selected by querying the SRA metadata SQLite database of the R/Bioconductor package SRAdb . The database was downloaded from http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz, but this file is updated regularly. The version of SRAmetadb.sqlite.gz we used was updated on 1 April 2015, and it is available at ftp://ftp.ccb.jhu.edu/pub/langmead/sra_junctions/SRAmetadb.sqlite.gz. We selected all run_accessions from the sra table with platform = ’ILLUMINA’, library_strategy = ’RNA-Seq’, and taxon_id = 9606 (human) that also had URLs for FASTQs on the European Bioinformatics Institute server listed in the fastq table. Our query may be reproduced with the script https://github.com/nellore/runs/blob/master/sra/define_and_get_fields_SRA.R compatible with R v3.1.0.
Alignment of human SRA samples with Rail-RNA
Rail-RNA v0.1.7b  was used for alignment. We aligned to hg19 rather than the more recent hg38 assembly because of hg19’s continued prevalence, including use by the GEUVADIS consortium  in its study of 462 lymphoblastoid cell line (LCL) samples as well as the GTEx consortium  in its ongoing large-scale study of gene expression across human tissues. We performed a single pass of alignment; that is, reads were not realigned after junctions were discovered to improve alignments of short-anchored reads. See the “Junction detection” subsection below. Alignment was performed in the cloud using AWS Elastic MapReduce on Elastic Compute Cloud spot instances, i.e., standardized units of computing capacity. Spot instances permit bidding for computing to save money, where bids that equal or exceed a market price are fulfilled. However, if the market price drops below a bid, instances could be lost, and a computational job could fail. So saving money by bidding for spot instances comes with risk, and rather than aligning all samples in one batch, we distributed this risk by dividing alignment up into 43 batches of about 500 samples each. Analysis of each batch was itself divided into (1) a preprocessing job flow, which downloaded and preprocessed compressed FASTQs from the European Bioinformatics Institute’s mirror of the SRA, writing results to Amazon’s cloud storage service S3; and (2) an alignment job flow, which was configured to write only exon-exon junction coordinates and the number of reads in each sample mapping across each detected junction. Each preprocessing job flow was run on a cluster of 21 c3.2xlarge instances, each with 8 Intel Xeon E5-2680 v2 (Ivy Bridge) processing cores and 15 GB of RAM. Each alignment job flow was run on a cluster of 61 c3.8xlarge instances, with 32 Intel Xeon E5-2680 v2 (Ivy Bridge) processing cores and 60 GB of RAM. Summing the sizes of the 43 compressed files output by the 43 runs gives 5.3 GB, about the size of an alignment BAM for a single RNA-seq sample. Our alignment runs may be reproduced by following the instructions at https://github.com/nellore/runs/blob/master/sra/README.md.
Alignment was performed over a period of eight days. There were 21,506 samples spanning 62.2 trillion nucleotides initially selected for alignment, but two samples (run accession numbers SRR651690 and DRR023700) were not found on the European Bioinformatics Institute server and were therefore excluded. We used the Amazon Cost Explorer to compute total cost; summing across eight days of activity, it came to US$15,393.69, or 72 cents per sample. Costs divided up by Amazon service over the period of computational activity may be viewed at https://github.com/nellore/runs/blob/master/sra/hg19.costs.csv.
Rail-RNA’s junction detection method, discussed in detail in the Rail-RNA study , begins by using Bowtie 2  in local alignment mode (–local) to align each read to the genome. If a read’s highest scoring alignment is soft-clipped, the read is retained and used for junction discovery. Otherwise, it is not used for junction discovery, on the principle that the parsimonious explanation for the read is that it is exonic. Reads with soft-clipping are then divided into short, overlapping segments called readlets. Readlets are aligned to the reference genome, and the alignments are clustered into sets of mutually compatible alignments. A gap between consecutively aligning readlets in a cluster is called as an exon-exon junction if an appropriate two-base motif (e.g., GT and AG) appears on either side of the corresponding intron in the reference. If multiple clusters are tied for largest, indicating an ambiguously mapped read, Rail-RNA refrains from using that read for junction discovery.
Rail-RNA’s approach is both parsimonious, seeking to explain alignments with as few junctions as possible, and conservative, ignoring evidence from multi-mappers. Accordingly, for this study, we value precision over recall in order to make reliable statements about junctions missed by annotation. The approach could underestimate (1) the number of reads mapping across a junction in a sample, and (2) the number of samples in which a given junction is found. Since Rail-RNA excludes reads that align to the genome end-to-end from its junction discovery algorithm, it is also liable to miss junctions in a given gene for which there is a processed pseudogene. Details on Rail-RNA’s single-pass alignment algorithm may be found in Sections S.18 and S.19 of the Rail-RNA study .
Reproducing main figures
All data underlying Figs. 1, 2, 3, 5, and 6 are reproducible with the Python v2.7 script https://github.com/nellore/runs/blob/master/sra/tables.py, which was run using PyPy v2.5.0. These figures as well as Fig. 4 were generated with the Mathematica v10.3.1 notebook; see https://github.com/nellore/runs/blob/master/sra/preprint_figures.nb. SEQC/MAQC-III consortium junction data were downloaded from http://0-www.nature.com.brum.beds.ac.uk/nbt/journal/v32/n9/extref/nbt.2957-S4.zip. BioSample submission dates for 77 SRA runs (0.3% of the samples we studied) were not found on the server, so these runs were excluded from the analyses involving junction discovery dates presented in Figs. 5 and 6.
Analysis of TRAP-seq samples
All 15 mouse TRAP-seq samples were taken from the study SRP031883; individual run accession numbers are provided in the Rail-RNA manifest file https://github.com/nellore/runs/blob/master/sra/translatome.manifest. These samples were aligned to mm10 on a local computer cluster with Rail-RNA v0.2.3b, and junction output available at https://github.com/nellore/runs/blob/master/sra/mm10_translatome_junctions.tsv.gz may be recovered with the script https://github.com/nellore/runs/blob/master/sra/translatome.sh. Junctions were subsequently lifted over to hg19 with the UCSC liftOver utility  using the command-line parameters -ends 2 -minmatch=1.0; that is, we lifted over only the two-base motifs on either end of each intron and required that all four motif bases had mappings in the liftover. The script https://github.com/nellore/runs/blob/master/sra/translatome.py calls the liftOver utility and writes lifted-over junctions and their presence in human annotation. Lifted-over junctions may be downloaded at https://github.com/nellore/runs/blob/master/sra/translatome_mm10_to_hg19_junctions.tsv.gz, where the format of this file is described in translatome.py. Statistics on the presence of lifted-over junctions in human SRA samples reported in the main text were computed by https://github.com/nellore/runs/blob/master/sra/get_final_translatome_stats.sh.
Analysis of novel ALK isoform
The junction inclusion ratio D discussed in the main text is defined as follows. Suppose the number of instances where junctions are overlapped by reads (i.e., the junction overlap count) in ALK exons 1–19 is A, and the junction overlap count in ALK exons 20–29 is B. The normalized difference D=(B−A)/(A+B) is close to 1 when exons 1–19 are unexpressed compared to exons 20–29, and close to -1 when exons 20–29 are unexpressed compared to exons 1–19.
The ALK analysis may be reproduced by first filtering intropolis for junctions in ALK with the script https://github.com/nellore/runs/blob/master/sra/alk.sh, and then running the Mathematica 10.3.1 notebook https://github.com/nellore/runs/blob/master/sra/alk.nb. Samples found were checked manually for their descriptions on the SRA at http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/sra, and the UCSC Genome Browser screenshot of Fig. 7 was created using the Genome Browser’s PDF/PS utility.
Principal component analysis
where j indexes samples and p j is the read length in sample j.
Scripts for reproducing the PCA analysis are available in the sra subdirectory of https://github.com/nellore/runs and described in https://github.com/nellore/runs/blob/master/sra/README.md. The output of the analysis sourced the Mathematica 10.3.1 notebook https://github.com/nellore/runs/blob/master/sra/preprint_figures.nb for generating Fig. 4.
Liftover of intropolis
http://intropolis.rail.bio also provides a version of intropolis with junction coordinates lifted over from hg19 to hg38. This was accomplished with the UCSC liftOver utility  using command-line parameters -ends 2 -minmatch=1.0; that is, we lifted over only the two-base motifs on either end of each intron and required that all four motif bases had mappings in the liftover, as in the TRAP-seq analysis. The script https://github.com/nellore/runs/blob/master/sra/liftover_intropolis.py reproduces our liftover.
We thank the reviewers for their insights, which strengthened the manuscript and in particular led us to pursue the reanalysis of publicly available TRAP-seq samples.
Availability of data and materials
AN, BL, and JTL conceived and designed the study. JAH obtained the list of SRA samples to align. AN and BL performed the alignment. AEJ, JPF, and KDH performed the PCA analysis. AN, AEJ, LCT, SW, RAP, and NK performed the other analyses. AN created all figures and tables. AN, BL, and JTL wrote a first draft of the paper, and AEJ, JPF, LCT, and KDH contributed to revisions. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
No ethics approval was required for this work.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014; 42(D1):756–63.View ArticleGoogle Scholar
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 2012; 22(9):1760–74.View ArticlePubMedPubMed CentralGoogle Scholar
- Thibaud-Nissen F, Souvorov A, Murphy T, DiCuccio M, Kitts P. Eukaryotic genome annotation pipeline. 2013. https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/books/NBK169439/.
- Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M. The Ensembl automatic gene annotation system. Genome Res. 2004; 14(5):942–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Consortium EP, et al. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004; 306(5696):636–40.View ArticleGoogle Scholar
- Illumina Body Map 2.0 on ArrayExpress. http://www.ebi.ac.uk/arrayexpress/browse.html?keywords=E-MTAB-513&expandefo=on. Accessed 10 Dec 2016.
- Nellore A, Collado-Torres L, Jaffe AE, Morton J, Pritt J, Alquicira-Hernández J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016:575. https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/pubmed/27592709.
- Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, Harrow J, Bertone P, Consortium R, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013; 10(12):1177–84.View ArticlePubMedGoogle Scholar
- Zhu Y, Stephens RM, Meltzer PS, Davis SR. SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinforma. 2013; 14(1):19.View ArticleGoogle Scholar
- Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015; 43(D1):670–81.View ArticleGoogle Scholar
- Consortium SI, et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014; 32(9):903–14.View ArticleGoogle Scholar
- Zhang W, Yu Y, Hertwig F, Thierry-Mieg J, Zhang W, Thierry-Mieg D, Wang J, Furlanello C, Devanarayan V, Cheng J, et al. Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biol. 2015; 16(1):1.View ArticleGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29(1):15–21.View ArticlePubMedGoogle Scholar
- Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013; 41(10):108–8.View ArticleGoogle Scholar
- Li S, Tighe SW, Nicolet CM, Grove D, Levy S, Farmerie W, Viale A, Wright C, Schweitzer PA, Gao Y, et al. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat Biotechnol. 2014; 32(9):915–25.View ArticlePubMedPubMed CentralGoogle Scholar
- Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012; 40(D1):57–63.View ArticleGoogle Scholar
- Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010; 28(10):1045–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Jaffe AE, Shin J, Collado-Torres L, Leek JT, Tao R, Li C, Gao Y, Jia Y, Maher BJ, Hyde TM, et al. Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci. 2015; 18(1):154–61. doi:10.1038/nn.3898.View ArticlePubMedGoogle Scholar
- Consortium EP, et al. A user’s guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011; 9(4):1001046.View ArticleGoogle Scholar
- Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464(7289):768–72.View ArticlePubMedPubMed CentralGoogle Scholar
- Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, Spielman RS. Polymorphic cis-and trans-regulation of human gene expression. PLoS Biol. 2010; 8(9):2213.View ArticleGoogle Scholar
- Janoueix-Lerosey I, Lequin D, Brugieres L, Ribeiro A, de Pontual L, Combaret V, Raynal V, Puisieux A, Schleiermacher G, Pierron G, et al. Somatic and germline activating mutations of the ALK kinase receptor in neuroblastoma. Nature. 2008; 455(7215):967–70.View ArticlePubMedGoogle Scholar
- Mossé YP, Laudenslager M, Longo L, Cole KA, Wood A, Attiyeh EF, Laquaglia MJ, Sennett R, Lynch JE, Perri P, et al. Identification of ALK as a major familial neuroblastoma predisposition gene. Nature. 2008; 455(7215):930–5.View ArticlePubMedPubMed CentralGoogle Scholar
- George RE, Sanda T, Hanna M, Fröhling S, Luther II W, Zhang J, Ahn Y, Zhou W, London WB, McGrady P, et al. Activating mutations in ALK provide a therapeutic target in neuroblastoma. Nature. 2008; 455(7215):975–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Chen Y, Takita J, Choi YL, Kato M, Ohira M, Sanada M, Wang L, Soda M, Kikuchi A, Igarashi T, et al. Oncogenic mutations of ALK kinase in neuroblastoma. Nature. 2008; 455(7215):971–4.View ArticlePubMedGoogle Scholar
- Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S-I, Watanabe H, Kurashina K, Hatanaka H, et al. Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature. 2007; 448(7153):561–6.View ArticlePubMedGoogle Scholar
- Shaw AT, Kim DW, Nakagawa K, Seto T, Crinó L, Ahn MJ, De Pas T, Besse B, Solomon BJ, Blackhall F, et al. Crizotinib versus chemotherapy in advanced ALK-positive lung cancer. N Engl J Med. 2013; 368(25):2385–94.View ArticlePubMedGoogle Scholar
- Iwahara T, Fujimoto J, Wen D, Cupples R, Bucay N, Arakawa T, Mori S, Ratzkin B, Yamamoto T. Molecular characterization of ALK, a receptor tyrosine kinase expressed specifically in the nervous system. Oncogene. 1997; 14(4):439–49.View ArticlePubMedGoogle Scholar
- Wiesner T, Lee W, Obenauf AC, Ran L, Murali R, Zhang QF, Wong EW, Hu W, Scott SN, Shah RH, et al. Alternative transcription initiation leads to expression of a novel ALK isoform in cancer. Nature. 2015; 526(7573):453–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Donlin LT, Jayatilleke A, Giannopoulou EG, Kalliolias GD, Ivashkiv LB. Modulation of TNF-induced macrophage polarization by synovial fibroblasts. J Immunol. 2014; 193(5):2373–83.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang Q, Wang HY, Bhutani G, Liu X, Paessler M, Tobias JW, Baldwin D, Swaminathan K, Milone MC, Wasik MA. Lack of TNF α expression protects anaplastic lymphoma kinase-positive T-cell lymphoma (ALK + TCL) cells from apoptosis. Proc Nat Acad Sci. 2009; 106(37):15843–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Hupe M, Li MX, Gillner KG, Adams RH, Stenman JM. Evaluation of trap-sequencing technology with a versatile conditional mouse model. eLife. 2016; 5:e11752. doi:10.7554/eLife.11752.Google Scholar
- Vaquero-Garcia J, Barrera A, Gazzara MR, Gonzalez-Vallinas J, Lahens NF, Hogenesch JB, Lynch KW, Barash Y. A new view of transcriptome complexity and regulation through the lens of local splicing variations. Elife. 2016; 5:11752.View ArticleGoogle Scholar
- Farkas MH, Grant GR, White JA, Sousa ME, Consugar MB, Pierce EA. Transcriptome analyses of the human retina identify unprecedented transcript diversity and 3.5 Mb of novel transcribed sequence via significant alternative splicing and novel genes. BMC Genomics. 2013; 14(1):486.View ArticlePubMedPubMed CentralGoogle Scholar
- Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, et al. Ensembl 2015. Nucleic Acids Res. 2015; 43(D1):662–9.View ArticleGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002; 12(6):996–1006.View ArticlePubMedPubMed CentralGoogle Scholar
- Lappalainen T, Sammeth M, Friedländer MR, AC’t Hoen P, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501(7468):506–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, et al. The genotype-tissue expression (GTEx) project. Nat Genet. 2013; 45(6):580–5.View ArticleGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012; 9(4):357–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011; 53(2):217–88.View ArticleGoogle Scholar
- Exon-exon junction dataset at Figshare. https://0-dx-doi-org.brum.beds.ac.uk/10.6084/m9.figshare.3811680.v1. Accessed 10 Dec 2016.
- Scripts for analysis at Figshare. https://0-dx-doi-org.brum.beds.ac.uk/10.6084/m9.figshare.3811629.v1. Accessed 10 Dec 2016.
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al. The Consensus Coding Sequence (CCDS) Project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009; 19(7):1316–23.View ArticlePubMedPubMed CentralGoogle Scholar
- Temple G, Gerhard DS, Rasooly R, Feingold EA, Good PJ, Robinson C, Mandich A, Derge JG, Lewis J, Shoaf D, et al. The completion of the Mammalian Gene Collection (MGC). Genome Res. 2009; 19(12):2324–33.View ArticlePubMedPubMed CentralGoogle Scholar
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004; 32(Database issue):23.View ArticleGoogle Scholar