SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty

Wu, Euphy Y.; Singh, Noor P.; Choi, Kwangbom; Zakeri, Mohsen; Vincent, Matthew; Churchill, Gary A.; Ackert-Bicknell, Cheryl L.; Patro, Rob; Love, Michael I.

doi:10.1186/s13059-023-03003-x

Software
Open access
Published: 12 July 2023

Euphy Y. Wu¹,
Noor P. Singh²,
Kwangbom Choi³,
Mohsen Zakeri²,
Matthew Vincent³,
Gary A. Churchill³,
Cheryl L. Ackert-Bicknell⁴,
Rob Patro² &
Michael I. Love¹,⁵ (ORCID: orcid.org/0000-0001-8401-0545)

Genome Biology volume 24, Article number: 165 (2023)

Abstract

Detecting allelic imbalance at the isoform level requires accounting for inferential uncertainty, caused by multi-mapping of RNA-seq reads. Our proposed method, SEESAW, uses Salmon and Swish to offer analysis at various levels of resolution, including gene, isoform, and aggregating isoforms to groups by transcription start site. The aggregation strategies strengthen the signal for transcripts with high uncertainty. The SEESAW suite of methods is shown to have higher power than other allelic imbalance methods when there is isoform-level allelic imbalance. We also introduce a new test for detecting imbalance that varies across a covariate, such as time.

Background

Genome-wide association studies (GWAS) have identified tens of thousands of genomic loci that are associated with complex traits or diseases, many of which are located in non-coding regulatory regions [1]. One potential mechanism by which allelic variation in these non-coding regions may affect phenotype is that the variants reside in transcription factor (TF) binding sites and influence the activities of TFs and transcription. Such a non-coding region may be referred to as a cis-regulatory element (CRE). Individuals that are heterozygous at such a variant may exhibit imbalanced allelic expression at any genes regulated by the CRE. With RNA-sequencing (RNA-seq) experiments, it is possible to observe such imbalance in allelic expression in the sequenced reads for those individuals who are also heterozygous for a variant in the exons of a regulated gene; other mechanisms of allelic imbalance, e.g., sequences affecting splicing or post-transcriptional regulation, can also be detected. Recent advances in long-read technologies enable reconstruction of individual diploid genomes/transcriptomes, which leads to more accurate analysis at allele and isoform resolution. Analysis of allelic imbalance (AI) has the potential for higher power to detect cis-genetic regulation than analysis of total expression, as trans-regulatory and non-genetic effects on expression level are controlled for when comparing the two alleles within samples [2,3,4,5,6,7,8]. Such effects controlled for in AI analysis include both biological variability and technical artifacts that may distort total expression levels across genes. AI is also a powerful analysis to reveal cis-genetic regulation in heterozygous individuals that varies across samples representing different conditions, tissues, spatial contexts, or time periods [9,10,11,12,13].

AI can be isoform-specific, but due to low statistical power and challenges to statistical inference caused by multi-mapping, AI is often measured at the exon level or gene level. If different isoforms are subject to regulation from different sets of CRE, and these harbor genetic variants for which the individual is heterozygous, then such isoforms may exhibit different strength or direction of AI, as has been observed recently in an analysis of genomic imprinting at isoform level [14] and in a survey of expression in GTEx using long-read technology [15]. Being able to detect AI at the isoform level could provide insight into mechanisms of complex traits and diseases. One challenge is that AI can only be observed when the individuals under study are heterozygous at an exonic variant. Furthermore, only a subset of reads that can be aligned or probabilistically assigned to a transcript or gene will provide allelic information. As described by Raghupathy et al. [16], an RNA-seq read can fall into various categories of multi-mapping with respect to gene, isoform, and allele, providing different information for expression estimation or “quantification” at different levels. The uncertainty in measuring the expression level from multi-mapping is referred to here as inferential uncertainty. When variants lie in exons that are not constitutive, reads overlapping these exonic variants provide information for isoform-level AI. Conversely, if exonic variants lie in constitutive exons, or in exons of dominant isoforms, AI can be effectively detected by existing methods such as phASER [8] or WASP [17], which count reads aligning to gene haplotypes. WASP, for example, examines imbalance in the pileup of reads mapped to the genome, after correction for technical bias due to differential mapping rates of the two alleles [18]. WASP can be followed up by methods such as ASEP for statistical inference of gene-level AI across a population of individuals that may be homozygous or heterozygous for regulatory variants [19].

A subset of existing methods are able to detect cis-genetic regulation at sub-gene resolution from short-read RNA-seq. Paired Replicate Analysis of Allelic Differential Splicing Events (PAIRADISE) can extract more information about allelic exon inclusion events, by counting reads that overlap both an informative splice junction and an exonic variant for which a subject is heterozygous [20]. Within the PAIRADISE framework, reads are mapped to personalized genomes based on phased genotypes. PAIRADISE provides a statistical model for detecting allele-specific splicing events, by aggregating allelic exon inclusion within individuals, and builds upon their previous method GLiMMPS to detect splicing quantitative trait loci (QTL) across donors of all genotypes [21]. IDP-ASE combines counts of reads from short-read RNA-seq falling along regions of exons with better resolved isoforms and alleles using long reads [22]. A potential limitation for approaches that count reads overlapping specific regions of genes is that these may not be able to fully aggregate allelic information from paired-end reads overlapping multiple informative features along the length of a transcript. For PAIRADISE, some cases of isoform-level AI may be missed when focusing on reads overlapping splice junctions, such as allele-specific differences in length of 5′ or 3′ untranslated regions (UTR).

Other method publications that have demonstrated quantification of expression at allelic- and isoform-level include EMASE [16], kallisto [23], mmseq [24], and RPVG [14]. EMASE proceeds in a similar manner to PAIRADISE, by first constructing a diploid reference; however in this case, EMASE aligns reads to a diploid transcriptome, constructed via the g2gtools software. The EMASE authors found that hierarchical assignment of reads based on their information content in some cases outperformed equal apportionment as would occur using EM-based algorithms such as RSEM [25], kallisto [23], and Salmon [26] with a diploid reference transcriptome. mmseq allows for alignment of reads to a diploid reference transcriptome using Bowtie [27] and additionally can take into account gene-, isoform-, and allelic-multi-mapping when performing inference across alleles in its mmdiff step [28]. mmdiff computes posterior distributions of expression of each feature via Gibbs sampling. Features can be aggregated at various levels of resolution by summing the posterior expression estimates within each sample. Aggregation also has proved an effective strategy in non-allelic contexts, as demonstrated in tximport [29], SUPPA [30], and txrevise [31]. mmseq also provides a method mmcollapse [28] to perform data-driven aggregation of features to reduce marginal posterior variance, although this procedure cannot currently be combined with differential analysis across alleles (i.e., AI analysis).

We introduce a suite of methods, Statistical Estimation of Allelic Expression using Salmon and Swish (SEESAW), for allelic quantification and inference of AI patterns across samples. With the objective of detecting isoform-level AI, we introduce a strategy to group isoforms based on their transcription start sites (TSS). Through simulation, we show that aggregating isoform-level expression estimates to the TSS level can have higher sensitivity than either gene- or isoform-level analysis. SEESAW utilizes Salmon [26] to estimate expression with respect to an allele-specific reference transcriptome, and a non-parametric test Swish [32] to test for AI. Swish incorporates inferential uncertainty into differential testing and makes no assumption of the distributional model of the data. SEESAW follows the general framework of mmseq and mmdiff for haplotype- and isoform-specific quantification and uncertainty-aware inference. Here, the SEESAW methods were applied to simulated data to benchmark against previously developed methods for detection of AI within heterozygous individuals, making use of multiple individuals as biological replicates. We applied SEESAW to an F1 mouse time course dataset, where it detected genes containing both gene-level AI and isoform-level AI. SEESAW can detect cases of AI that are consistent across all samples, differential AI across two groups of samples, or dynamic AI over a covariate, with a new correlation-based test. The statistical testing in SEESAW is available via the Swish function in the fishpond package [33] including a software vignette for allelic analysis.

Results

SEESAW

We first briefly describe the estimation and statistical testing steps in SEESAW (Fig. 1), which combines both existing and new functionality across a number of software packages, with further details provided in the “Methods” section. SEESAW assumes that phased genotypes are available, and is primarily designed for multiple replicates or multiple conditions of organisms with the same genotype. This can occur with multiple replicates of an F1 cross, or cell lines from individual human donors across developmental time points [34,35,36], or across conditions [37,38,39,40]. First, g2gtools is used to create a diploid transcriptome, which is indexed by Salmon [26] and used for estimating allele-specific expression with bootstrap replicate datasets to assess inferential uncertainty across genes, isoforms, and alleles (detailed SEESAW pipeline shown in Additional file 1: Fig. S1). This approach for allelic quantification, mapping reads to a custom diploid transcriptome, has been demonstrated as a successful strategy in previous work [16, 23, 24, 41], similarly for mapping reads to a spliced pangenome graph [14] or to a custom diploid genome for allelic read counting [7, 42,43,44,45]. Next, SEESAW facilitates importing the estimated allelic counts at various levels of aggregation: no aggregation (labelled hereafter “isoform,” or equivalently “transcript”/“txp”), transcription start site aggregation (“TSS”), or gene-level aggregation (“gene”). Finally, we leverage the Swish [32] tool for differential expression analysis to test across the two alleles within samples via the Wilcoxon signed rank test statistic, averaging over inferential replicates, and using the qvalue package [46] applied to permuted datasets for defining false discovery rate (FDR) bounded sets of features. The steps for testing dynamic allelic imbalance are provided in the “Methods” section.
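The testing step described above can be illustrated with a minimal sketch. It computes a Wilcoxon signed-rank statistic on the per-sample allelic log fold changes and averages it over inferential replicates. This is an assumption-laden illustration in Python, not the fishpond implementation (which is in R): the function names, the pseudocount of 1, and the signed-sum form of the statistic are hypothetical choices, and the permutation step used to obtain q-values is omitted.

```python
import math


def signed_rank_stat(diffs):
    """Signed-rank statistic: ranks of |d| (mid-ranks for ties), signed by d, summed.

    Zero differences are dropped, as in the classical Wilcoxon procedure.
    """
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next absolute value ties with this one.
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        mid_rank = (start + end) / 2 + 1  # average 1-based rank of the tie block
        for k in range(start, end + 1):
            ranks[order[k]] = mid_rank
        start = end + 1
    return sum(r if d > 0 else -r for r, d in zip(ranks, nonzero))


def mean_stat_over_replicates(alt, ref, pseudocount=1.0):
    """alt[r][s] and ref[r][s] hold counts for inferential replicate r, sample s.

    The pseudocount is an illustrative assumption to keep the log ratio finite.
    """
    stats = []
    for alt_r, ref_r in zip(alt, ref):
        lfc = [math.log2((a + pseudocount) / (b + pseudocount))
               for a, b in zip(alt_r, ref_r)]
        stats.append(signed_rank_stat(lfc))
    return sum(stats) / len(stats)


# Hypothetical counts: 2 inferential replicates x 4 samples for one feature.
alt_counts = [[30, 42, 25, 38], [28, 45, 27, 36]]
ref_counts = [[20, 21, 24, 19], [22, 20, 28, 18]]
observed = mean_stat_over_replicates(alt_counts, ref_counts)
```

Averaging the statistic over replicates means a feature whose apparent imbalance flips sign between bootstrap replicates receives a weaker statistic than one whose imbalance is stable, which is how inferential uncertainty enters the test in this sketch.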

Simulation

Simulation of an F1 cross based on the Drosophila melanogaster transcriptome was used to assess method performance when the true AI status of each transcript was known. iCOBRA diagrams [47] were used to assess the sensitivity, or true-positive rate (TPR), and the FDR at nominal FDR thresholds of 1%, 5%, and 10%. Sensitivity was assessed per transcript, where detection of AI for a gene-level method was propagated to each of the gene’s expressed isoforms. We used Integrative Genomics Viewer (IGV) [48] to visualize the distribution of HISAT2 [49] aligned reads along the reference genome, after removing allelic-biased multi-mapping reads with WASP [17]. While SEESAW uses reads mapped to the diploid transcriptome with Salmon, examining genome-aligned reads with IGV allowed us to identify examples of reads that contained both allelic- and isoform-level information (Additional file 1: Fig. S2).

Notably, SEESAW with TSS aggregation had the highest overall sensitivity at 5% and 10% nominal FDR, above any of the gene- or isoform-level methods (Fig. 2). The reason for the higher overall sensitivity can be seen when stratifying by types of AI, as in Fig. 2B; SEESAW with TSS aggregation was able to detect discordant AI on isoforms within a gene that could be masked after aggregation to the gene level. Discordant AI refers to the case where isoforms within a gene have opposite directions of AI, while concordant AI refers to the case where isoforms within a gene have the same direction of AI. Gene-level SEESAW, gene-level mmdiff, and WASP lost sensitivity for these discordant cases of AI. In addition, SEESAW with TSS aggregation or gene-level aggregation, gene-level mmdiff, and WASP had similar sensitivity at 1% nominal FDR considering both discordant and concordant AI, and these methods had observed FDR for this nominal cutoff in the range of 0–2%.
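A small numeric illustration of why gene-level aggregation can mask discordant AI: when two isoforms have opposite imbalance of similar magnitude, their summed allelic counts look balanced. The counts, identifiers, and pseudocount below are hypothetical and chosen only to make the cancellation exact.

```python
import math


def allelic_lfc(alt, ref, pseudocount=1.0):
    """Log2 ratio of alternative to reference allele counts (pseudocount assumed)."""
    return math.log2((alt + pseudocount) / (ref + pseudocount))


# Hypothetical allelic counts for two isoforms of one gene, each with its own TSS.
isoforms = {
    "txA": {"tss": "tss1", "alt": 150, "ref": 50},
    "txB": {"tss": "tss2", "alt": 50, "ref": 150},
}


def aggregate_lfc(isoforms, key=None):
    """Sum allelic counts per group (per TSS if key='tss', else whole gene)."""
    totals = {}
    for rec in isoforms.values():
        group = rec[key] if key else "gene"
        alt, ref = totals.get(group, (0, 0))
        totals[group] = (alt + rec["alt"], ref + rec["ref"])
    return {g: allelic_lfc(a, r) for g, (a, r) in totals.items()}


def is_discordant(group_lfc):
    """True if at least one group favors each allele."""
    values = list(group_lfc.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)


gene_lfc = aggregate_lfc(isoforms)        # gene-level: counts cancel out
tss_lfc = aggregate_lfc(isoforms, "tss")  # TSS-level: opposite signs preserved
```

Here the gene-level log fold change is exactly zero, while the two TSS groups show imbalance of about 1.57 in opposite directions.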

Gene-level SEESAW, gene-level mmdiff, and WASP had higher sensitivity than SEESAW using TSS aggregation when AI was concordant across all isoforms of a gene. This is expected, as aggregation at the appropriate level strengthens the AI signal while reducing inferential uncertainty, thereby increasing power. For example, SEESAW had the strongest power to detect AI when information about the grouping of transcripts by true AI signal was used to aggregate allelic counts (“oracle” in Fig. 2A). Neither SEESAW nor mmdiff at the isoform level had as high sensitivity as methods that aggregated signal. UpSet diagrams [50] of the sets of transcripts called by each method compared to the true AI transcripts indicated the highest overlap among the gene-level methods and TSS or oracle aggregation (Additional file 1: Fig. S3).

We demonstrated that error control could be lost if we used allelic log fold change (LFC) in the aggregation step as described in the “Methods” section (Additional file 1: Fig. S4). Aggregating transcripts by the LFC is a form of “double dipping” as it makes use of the counts across samples twice: once to determine allelic LFC for aggregation then again to test for allelic imbalance. Such a procedure can lead to loss of error control as described elsewhere [51]. LFC-based p-values did not follow a uniform distribution under the null hypothesis and would lead to increased FDR with increased sample size.

As SEESAW makes use of Salmon for quantification, with inferential uncertainty measured via bootstraps, we evaluated the accuracy of uncertainty estimation using \(n = 10, 20, 30, 50,\) and \(100\) bootstraps to define 95% bootstrap intervals for the estimated counts (see the “Methods” section). Coverage was evaluated by comparing intervals to the true, simulated counts. Overall, these results indicated that the default of \(n = 30\) inferential replicates (bootstrap samples) was sufficient for estimation of inferential uncertainty, with coverage rates close to the target 95%. The coverage rate increased slightly with the number of inferential replicates, with a range of \(\sim\)2%. The coverage rate increased more across aggregation levels, with a range of up to \(\sim\)10% (Additional file 1: Table S1). The increments in coverage rate were less than 0.5% when comparing 30 to 50 inferential replicates within each aggregation level and expression bin. The difference in coverage rate was still small when comparing 30 and 100 inferential replicates, across all three expression bins (Additional file 1: Fig. S5). We found that 30 inferential replicates struck a good compromise between precision of uncertainty information and computation time and storage.
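The coverage evaluation can be sketched as follows, assuming percentile-based intervals computed from the bootstrap replicates of each feature. The exact interval construction is defined in the paper's “Methods” section and may differ; the function names and the toy numbers here are hypothetical.

```python
def percentile(sorted_vals, q):
    """Percentile by linear interpolation between order statistics, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_interval(reps, level=0.95):
    """Two-sided percentile interval from one feature's bootstrap replicates."""
    ordered = sorted(reps)
    alpha = (1 - level) / 2
    return percentile(ordered, alpha), percentile(ordered, 1 - alpha)


def coverage_rate(boot, truth, level=0.95):
    """Fraction of features whose interval contains the true (simulated) count."""
    hits = 0
    for reps, true_count in zip(boot, truth):
        lo, hi = bootstrap_interval(reps, level)
        hits += lo <= true_count <= hi
    return hits / len(truth)


# Hypothetical example: 3 features, 5 bootstrap replicates each.
boot_counts = [
    [10, 11, 12, 13, 14],
    [100, 101, 102, 103, 104],
    [5, 5, 6, 6, 7],
]
true_counts = [12, 110, 6]
rate = coverage_rate(boot_counts, true_counts)  # 2 of 3 intervals cover the truth
```

In a real evaluation the same calculation would be repeated for each number of bootstraps and stratified by aggregation level and expression bin, as in the comparison described above.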

We assessed if the use of inferential uncertainty by Swish resulted in better error control compared to AI analysis with a beta-binomial generalized linear model applied to Salmon estimated counts, without taking into account uncertainty. Swish had better control of the FDR at all levels of aggregation compared to a beta-binomial generalized linear model applied to estimated counts with Benjamini-Hochberg correction of p-values to control the FDR (Additional file 1: Fig. S6). Salmon estimated counts can have high uncertainty particularly across alleles and isoforms, which often have high similarity in terms of sequence. Collapsing isoforms to TSS or gene level reduces uncertainty, but allelic multi-mapping remains for AI analysis [16]. This motivated our use of allelic testing methods such as mmdiff and Swish that take into account inferential uncertainty for estimated allelic counts.

We also assessed the performance of SEESAW compared to a new inference pipeline from the WASP developers, called WASP2 (Additional file 1: Fig. S7). WASP2 was equally sensitive as WASP in detecting gene-level AI, while it had less sensitivity than SEESAW to detect discordant AI signal as expected since it follows a similar approach to WASP. While we used the locfdr package [52] for multiple test correction for WASP, we found Benjamini-Hochberg [53] correction performed well for computing FDR-bounded sets for WASP2.
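For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is the standard step-up procedure, shown as a plain-Python sketch with a hypothetical function name; it is not taken from the WASP2 code.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order.

    Each sorted p-value p_(k) is scaled by n / k, then a running minimum is
    taken from the largest rank downward so adjusted values are monotone.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four features.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
# Features with adjusted value below 0.05 form the 5% FDR-bounded set.
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```

With these inputs only the first feature falls below the 5% threshold after adjustment.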

Osteoblast differentiation time course

We used Swish to test for AI at various levels (gene, isoform, TSS) in a time course experiment of differentiating osteoblasts from an F1 mouse, with C57BL/6J dams crossed with CAST/EiJ sires (see the “Methods” section). Following creation of the diploid reference transcripts and quantification steps of the SEESAW pipeline, we first tested for consistent AI across all nine time points (the “global AI test”). While exploring the osteoblast differentiation data, we observed that for isoforms of a gene with TSS that were near each other (within 50 bp), these isoforms often shared similar estimated allelic fold change as calculated with SEESAW. To facilitate data visualization, strengthen biological signal, and reduce inferential uncertainty, we grouped any transcripts with TSS within 50 bp of each other (referred to here as “fuzzy TSS groups”, to contrast with strict basepair-resolution TSS grouping). We tested at different levels of resolution: gene level, isoform level, and TSS level. To compare across these levels, we looked at genes in common: a gene was considered significant for global AI at isoform level or TSS group level if at least one isoform or TSS group within the gene was significant (nominal FDR \(<5\%\)). Isoform-level testing for global AI returned the most genes, with 6116 significant genes, followed by gene-level with 5701 genes, and TSS-level grouping with 5573 genes. The majority of genes (4625) were in common across all three levels of resolution (UpSet plot [50] provided in Additional file 1: Fig. S8).
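The fuzzy TSS grouping can be sketched as a chaining rule on sorted TSS positions within each gene: a new group starts whenever the gap to the previous TSS exceeds the tolerance. This is one plausible reading of "within 50 bp of each other" and is an assumption, as are the function name, the input layout, and the omission of strand handling. The label format (gene identifier, hyphen, group number) mirrors identifiers such as ENSMUSG00000018593-5 used in the text.

```python
from collections import defaultdict


def fuzzy_tss_groups(transcripts, max_gap=50):
    """Assign each transcript to a TSS group within its gene.

    transcripts: iterable of (tx_id, gene_id, tss_position).
    Consecutive TSS positions no more than max_gap apart are chained into
    the same group; max_gap=0 gives strict basepair-resolution grouping.
    """
    by_gene = defaultdict(list)
    for tx_id, gene_id, tss in transcripts:
        by_gene[gene_id].append((tss, tx_id))

    labels = {}
    for gene_id, items in by_gene.items():
        items.sort()
        group = 0
        prev_tss = None
        for tss, tx_id in items:
            if prev_tss is None or tss - prev_tss > max_gap:
                group += 1  # gap too large: open a new TSS group
            labels[tx_id] = f"{gene_id}-{group}"
            prev_tss = tss
    return labels


# Hypothetical transcripts: (transcript, gene, TSS coordinate).
example = [
    ("tx1", "geneA", 1000),
    ("tx2", "geneA", 1030),
    ("tx3", "geneA", 1075),
    ("tx4", "geneA", 5000),
    ("tx5", "geneB", 200),
]
fuzzy = fuzzy_tss_groups(example)              # tx1, tx2, tx3 share a group
strict = fuzzy_tss_groups(example, max_gap=0)  # every distinct TSS is its own group
```

Note that chaining lets tx1 and tx3 share a group even though their TSS are 75 bp apart, because tx2 bridges them. A stricter rule that requires all pairs to be within the tolerance would split them.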

Gene-level aggregation had high overlap with TSS-level, indicating that, at least for global AI testing, most of the AI signal was not masked by discordant direction of AI among isoforms within a gene. Among genes displaying global AI under aggregation to the gene level, the TSS groups within those genes often had estimated imbalance in the same direction as the gene imbalance: 97.3% of significant genes had all of their TSS groups with significant AI having the same direction as the gene-level estimate. However, among the remaining 2.7% of genes, SEESAW was able to detect interesting examples of genes that had different directions of AI among their isoforms. A complete list of the 134 genes showing these significant and discordant patterns within gene is provided in Additional file 1: Table S2 and in the Zenodo deposition. For example, Fuca2 exhibited discordant AI with the CAST/EiJ (CAST) allele more highly expressed than C57BL/6J (B6) for one of the two leftmost (more 5′) TSS but less expressed than B6 for the rightmost (more 3′) TSS, with both TSS groups significant at \(< 5\%\) FDR (Fig. 3A).

Another of the 134 genes with a discordant pattern was Sparc, the most highly expressed gene at the last time point in the osteoblast differentiation time course. Sparc is known to be critical for bone development. With fuzzy TSS aggregation and count filtering, Sparc displayed four transcript groups, where group 5 (ENSMUSG00000018593-5) had positive allelic LFC (CAST > B6) and the other groups had negative allelic LFC (Fig. 4).

We additionally tested for “dynamic AI” using the correlation test implemented in Swish for testing changes in allelic log fold change over a continuous covariate. We again tested at gene level, isoform level, and TSS level. Gene-level dynamic AI testing returned the largest number of significant genes (nominal FDR \(< 5\%\)): 57 genes displayed dynamic AI at gene level, 49 genes at TSS level, and 23 at isoform level (Additional file 1: Fig. S9). Those significant genes shared across all levels only represented a third of those detected at gene level, where another third were shared only between gene level and TSS level. Thus TSS-level aggregation appeared to help recover signal that would be lost if only testing at the isoform level.
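The idea behind the dynamic AI test can be sketched as the correlation between per-sample allelic log fold change and a continuous covariate, averaged over inferential replicates. The sketch below uses Pearson correlation and a pseudocount of 1, both of which are illustrative assumptions, and it omits the permutation step that yields p-values and q-values. It is not the fishpond code.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 by convention if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def dynamic_ai_stat(alt, ref, covariate, pseudocount=1.0):
    """Mean over inferential replicates of cor(allelic LFC, covariate).

    alt[r][s] and ref[r][s] hold counts for replicate r and sample s.
    """
    cors = []
    for alt_r, ref_r in zip(alt, ref):
        lfc = [math.log2((a + pseudocount) / (b + pseudocount))
               for a, b in zip(alt_r, ref_r)]
        cors.append(pearson(covariate, lfc))
    return sum(cors) / len(cors)


# Hypothetical time course: 4 time points, 2 inferential replicates.
days = [2, 4, 6, 8]
ref_counts = [[40, 40, 40, 40], [40, 40, 40, 40]]
rising_alt = [[10, 20, 40, 80], [12, 22, 38, 78]]   # alt allele gains over time
falling_alt = [[80, 40, 20, 10], [78, 38, 22, 12]]  # alt allele loses over time

rising = dynamic_ai_stat(rising_alt, ref_counts, days)
falling = dynamic_ai_stat(falling_alt, ref_counts, days)
flat = dynamic_ai_stat(ref_counts, ref_counts, days)
```

Two TSS groups of the same gene with statistics of opposite sign, like `rising` and `falling` here, correspond to the kind of opposing trends over time that testing at the TSS level is meant to expose.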

Interestingly, we detected genes such as Rasl11b that had isoform-level AI trending in different directions over time (Fig. 3B, Additional file 1: Figs. S10 and S11). Rasl11b exhibited dynamic AI for two TSS groups, with the CAST allele more lowly expressed than the B6 allele for TSS group “1” from day 2 to day 6, roughly balanced from day 8 to day 10, and finally with CAST more highly expressed from day 12 to day 18. The other TSS group “3” had almost the opposite allelic ratio behavior: CAST more highly expressed earlier in time but both alleles tending toward balanced, low expression at the end of the time course. While for Rasl11b this pattern was also significant when testing at the isoform level, other genes such as Calcoco1 demonstrated the advantage of grouping features: Calcoco1 exhibited dynamic AI for two TSS groups, “5” and “6”, which differed in the direction of change in the imbalance (Additional file 1: Figs. S12 to S14). Here, the p-value and q-value for TSS group “6” were reduced when aggregating counts from the isoform to TSS-group level.

Discussion

Our new suite of methods, SEESAW, can be used to obtain allele-specific abundance with bootstrap replicates used to capture inferential uncertainty across genes, isoforms, and alleles, and to perform statistical testing of global or dynamic AI. We propose to aggregate estimates of allelic expression of isoforms by their TSS to increase statistical power in testing for AI that is a result of heterozygous variants in the promoter or in CRE that affect a particular promoter. We introduced two different AI testing procedures: global AI to test for the existence of consistent allelic fold changes across samples, and dynamic AI to test for non-zero correlation between the log allelic fold change and a continuous covariate. SEESAW can also be used to test differential AI between two groups, as introduced in Zhu et al. [32], or more complex designs using a general regression framework. Differential AI testing and differential correlation AI testing are shown as examples in the allelic analysis software vignette. The above tests utilize nonparametric testing, thus making no assumption on the distribution of the data itself. Nonparametric testing had better performance on the simulated data than a standard beta-binomial generalized linear model. In simulation, we demonstrated that SEESAW on TSS level had the highest sensitivity in the case that AI was discordant within gene, and achieved an FDR that was close to the nominal value at all levels of resolution (gene-, TSS-, or isoform-level testing), implying SEESAW can maintain error control despite high and heterogeneous levels of uncertainty. SEESAW at gene level performed comparably to existing methods such as WASP and gene-level mmdiff. For the osteoblast differentiation experiment, SEESAW was able to recover some genes with discordant isoform-level AI across all time points and was able to detect genes with isoform-level AI that changed over time in different directions.

Currently, SEESAW does not support alignment of haplotypes across individuals of different genotype. SNP-based analysis simplifies this problem, but at a loss of information, as evidence of AI may be distributed across multiple exonic variants within a transcript. A newly developed approach RPVG [14] maps RNA-seq reads to a spliced pangenome, and then provides haplotype-specific transcript abundance estimates for each individual. It would require further work for the methods presented here to group individuals by their haplotype combinations per gene and perform across-sample inference while accounting for estimation uncertainty using Swish. Another limitation of our current approach is that grouping isoforms together based on their TSS reveals shared promoter-based regulation, but may miss isoform-specific AI caused by intronic variation or variation that affects nonsense-mediated decay. IDP-ASE and PAIRADISE provide inference on AI of splicing events, and these methods could be considered for detection of these cases. Alternatively, the framework of SEESAW can be adapted and used with other aggregation rules for different biological purposes, e.g., aggregating isoforms by various splicing events in a manner similar to SUPPA or txrevise. While SEESAW can be used at various levels of resolution, from transcript or TSS group up to gene level, if the focus of interest is gene-level AI, we found that WASP and WASP2 were equally sensitive and had good control of false discoveries, using locfdr and Benjamini-Hochberg correction, respectively. Additionally, the ASEP [19] method can be applied to allelic counts from WASP and allows for detection of gene-level AI across a population using a mixture model to account for the unobserved regulatory variants: individuals that are heterozygous for exonic variants may be either homozygous or heterozygous for regulatory variants. The analytical consequences of multiple regulatory SNPs and varying degree of linkage disequilibrium (LD) of these to the exonic SNP, with respect to detection of AI, have been described previously by Xiao and Scott [4].

While here we relied on gene annotation to group together transcripts and reduce inferential uncertainty of allelic expression estimates, another approach would be to use data-driven aggregation methods such as mmcollapse and Terminus [54]. We were not able to perform differential testing across alleles with mmdiff after aggregation with mmcollapse. A future direction that may improve performance with the inclusion of Terminus in the SEESAW pipeline would be stratification of different null distributions for test statistics in Swish based on aggregation level (transcript, transcript-group, or gene level).

After having detected AI, a natural next step is to try to understand the mechanism of cis-genetic regulation. It is possible to associate the AI seen on transcripts or genes with one or more regulatory variants, either through phasing or usage of population-level LD to establish the search space. The list of candidate regulatory SNPs can be further refined by integrating allelic signal within CRE at the epigenomic level, including allelic binding of proteins [42, 55,56,57], allelic accessibility [58, 59], or allelic methylation [60]. Alternatively, search for altered transcription factor binding motifs can be combined with RNA or protein abundance of potential regulators to winnow down the list of candidate causal regulatory variants [11, 61, 62]. It may also be of interest to detect in which cell types the allelic signal may be strongest or exclusively present, as has been investigated in recent methods for single cell allelic expression or accessibility datasets [63,64,65]. Finally, we note that a number of methods have shown that AI can be effectively integrated with total expression across individuals of all three genotypes [5, 13, 66,67,68,69]. This approach uses more information and so should produce a gain in sensitivity, as well as extending beyond genes harboring exonic variants, which is a limitation for AI-based methods.

Conclusions

Here we present a new suite of methods, called SEESAW, for quantifying and testing AI. SEESAW offers analysis at various levels of resolution (isoform, TSS, gene level) and has significantly improved performance compared to existing methods when the imbalance occurs at the isoform level. SEESAW provides statistical testing for global AI (across all samples), dynamic AI (differences along a continuous axis such as time), or differential AI (across groups of samples). The statistical testing in SEESAW is available in an R/Bioconductor package, fishpond [33], with an associated software vignette and visualizations designed specifically for allelic analysis.