Figure 2 | Genome Biology

Figure 2

From: Improving RNA-Seq expression estimates by correcting for fragment bias

Nucleotide distribution surrounding fragment ends and calculation of bias weights. (a) Sequence logos showing the distribution of nucleotides in a 23 bp window surrounding the ends of fragments from an experiment primed with 'not not so random' (NNSR) hexamers [11]. The 3' end sequences are complemented (but not reversed) to show the sequence of the primer during first-strand synthesis (see Figure 1). The offset is calculated so that zero is the 'first' base of the end sequence and only non-negative values are internal to the fragment. Counts were taken only from transcripts mapping to single-isoform genes. (b) Sequence logo showing normalized nucleotide frequencies after reweighting by initial (not bias-corrected) FPKM in order to account for differences in abundance. (c) The background distribution for the yeast transcriptome, assuming uniform expression of all single-isoform genes. The difference between the 5' and 3' distributions is due to the ends being primed from opposite strands. Comparing (c) to (a) and (b) shows that while the bias is confounded with expression in (a), the abundance normalization reveals the true bias to extend from 5 bp upstream to 5 bp downstream of the fragment end. Taking the ratio of the normalized nucleotide frequencies (b) to the background (c) for the NNSR dataset gives bias weights (d), which further reveal that the bias is partially due to selection for upstream sequences similar to the strand tags, namely TCCGATCTCT in first-strand synthesis (which selects the 5' end) and TCCGATCTGA in second-strand synthesis (which selects the 3' end). Although the weights here are based on independent frequencies, we found correlations among sites in the window and take these into account in our full model to produce more informative weights (see Supplementary methods in Additional file 3).
A similar figure for the standard Illumina Random Hexamer protocol, and plots similar to (d) for all datasets in the paper, can be found in Figures S1 and S2 of Additional file 1, respectively.
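The ratio described for panel (d) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the independent per-position frequencies only (not the full model with correlations among sites), and it assumes that "reweighting by FPKM" means each fragment-end window contributes 1/FPKM of its transcript. The function names, the two-base toy windows, and the FPKM values are all hypothetical.

```python
NUCS = "ACGT"


def window_frequencies(windows, weights=None):
    """Per-position nucleotide frequencies over equal-length windows.

    `weights` optionally gives one weight per window (assumption: 1/FPKM
    of the source transcript, to normalize for abundance).
    """
    if weights is None:
        weights = [1.0] * len(windows)
    width = len(windows[0])
    counts = [{n: 0.0 for n in NUCS} for _ in range(width)]
    for seq, w in zip(windows, weights):
        for i, base in enumerate(seq):
            if base in counts[i]:  # skip ambiguous bases such as N
                counts[i][base] += w
    freqs = []
    for col in counts:
        total = sum(col.values())
        freqs.append({n: (col[n] / total if total else 0.0) for n in NUCS})
    return freqs


def bias_weights(observed, background, floor=1e-9):
    """Ratio of observed to background frequency at each position."""
    return [
        {n: obs[n] / max(bg[n], floor) for n in NUCS}
        for obs, bg in zip(observed, background)
    ]


# Toy data (hypothetical): windows around fragment ends and the FPKM of
# the transcript each fragment came from.
end_windows = ["AC", "AC", "GT"]
fpkm = [2.0, 2.0, 1.0]
observed = window_frequencies(end_windows, [1.0 / f for f in fpkm])

# Background: every window in the transcriptome, counted once each
# (uniform expression).
background = window_frequencies(["AC", "GT", "GT", "GT"])

weights = bias_weights(observed, background)
```

In this toy case the two fragments from the more abundant transcript together count as much as the single fragment from the less abundant one, so a weight above 1 at a position reflects enrichment relative to the background rather than higher expression.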
