Fig. 1 | Genome Biology

Fig. 1

From: GTM-decon: guided-topic modeling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes

GTM-decon overview. a Inferring cell-type-specific (CTS) topics from scRNA-seq reference data. In brief, GTM-decon infers CTS topics from scRNA-seq data with a guided topic modeling approach that uses the cell-type labels from the reference. High prior values are assigned to the topic corresponding to the labeled cell type, and lower prior values are assigned to the other topics, enabling the model to learn a genes-by-CTS-topics matrix in which each topic is anchored to a specific cell type. This matrix is then used to infer cell-type proportions in bulk RNA-seq data using standard topic modeling, capturing variations in cell-type proportions between healthy and diseased tissue. The probabilistic graphical model (PGM) diagram depicts the data generative process assumed by the proposed guided topic model. Suppose there are K cell types in the scRNA-seq data. For each cell indexed by \(m\in \{1,\dots ,M\}\), we use a K-dimensional Dirichlet-distributed cell-type topic mixture \({{\varvec{\uptheta}}}_{m}\sim \mathrm{Dir}({{\varvec{\upalpha}}}_{m})\) to represent the statistical uncertainty of the noisy cell-type label \({y}_{m}\in \{1,\dots ,K\}\). Specifically, we clamp the Dirichlet hyperparameter \({\alpha }_{m,{y}_{m}}\) to a relatively high value while setting the remaining values \({\alpha }_{m,{k}{\prime}}\) (\({y}_{m}\ne {k}{\prime}\)) to relatively low values (i.e., 0.9 and [0.01, 0.1], respectively, in the cartoon illustration of M = 8 cells and K = 3 cell types). The non-zero prior values for the K – 1 unobserved cell types allow the cell-type mixture variable \({{\varvec{\uptheta}}}_{m}\) to have non-zero density over those cell types as dictated by the scRNA-seq data likelihood and therefore account for potentially mislabeled cell types. Suppose there are \({N}_{m}\) reads in total in cell \(m\).
Each scRNA-seq read \(i\in \left\{1,\dots ,{N}_{m}\right\}\) is assumed to originate from one of the K CTS topics with the categorical rates fixed to the cell-type mixture, i.e., \({z}_{i,m} \sim \mathrm{Cat}({{\varvec{\uptheta}}}_{m})\). Given the cell-type topic assignment \({z}_{i,m}\in \{1,\dots ,K\}\), the ith read is then mapped to one of the G genes as indexed by \({x}_{i,m}\) with categorical rates set to \({{\varvec{\upphi}}}_{{z}_{i,m}}\), which itself is a G-dimensional Dirichlet variable with flat hyperparameter \(\beta\), i.e., \({x}_{i,m}\sim \mathrm{Cat}({{\varvec{\upphi}}}_{{z}_{i,m}})\). To infer the latent variables, namely the cell-type mixture proportion \({{\varvec{\uptheta}}}_{m}\sim \mathrm{Dir}({{\varvec{\upalpha}}}_{m})\), the CTS topic assignment \({z}_{i,m}\) for each read, and the CTS topic distributions \({\varvec{\Phi}}\), we employ an efficient collapsed variational Bayes algorithm as detailed in the “Methods” section. The genes-by-CTS-topic matrix \(\widehat{{\varvec{\Phi}}}\) estimated from the scRNA-seq reference then serves as a template for inferring the cell-type mixing proportions \({{\varvec{\uptheta}}}_{j}\) of a bulk RNA-seq sample j. This uses essentially the same inference algorithm as in the scRNA-seq data modeling, except that the prior has a flat hyperparameter (e.g., \({\alpha }_{k}=1\forall k\) by default), \(\widehat{{\varvec{\Phi}}}\) is fixed, and only the expected total reads allocated to each CTS topic are inferred (i.e., \({E}_{q}[{n}_{.,j,k}]={E}_{q}[{\sum }_{i}[{z}_{i,j}=k]]\)). b Phenotype-guided modeling of bulk RNA-seq data. GTM-decon can also use phenotype labels as a guide for topic inference to model sparsified bulk transcriptomes in a disease study. In this design, instead of having each row as a cell and each column as a cell type, each row corresponds to a bulk sample and each column to a phenotype class.
For each subject j, we set the topic hyperparameter \({\alpha }_{j,{y}_{j}}\) based on the noisy phenotype label \({y}_{j}\) of the subject. The inference algorithm is the same as that used for modeling the scRNA-seq reference data. Given a test subject \({j}{\prime}\), the inferred topic mixture \({{\varvec{\uptheta}}}_{{j}{\prime}}\) represents the phenotypic probabilities of the subject. c Nested-guided topic model for detecting cell-type-specific differentially expressed genes between phenotypes. In this nested design, we treat the phenotype as level 1 and the cell types as level 2. The pretrained genes-by-CTS-topic distribution \(\widehat{{\varvec{\Phi}}}\) learned in panel a is used to initialize the topic distributions for each phenotype in a sparsified bulk transcriptome disease study. As illustrated in the cartoon, for example, for 2 phenotypes and 3 cell types, there are 6 topics. GTM-decon then fine-tunes the combined cell-type-specific topic distribution by running the same algorithm described in panel b. The resulting topic distributions reflect the phenotypic influences on CTS gene distributions, which are the statistics for conducting differential expression analysis in a case–control study design.
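
The generative process in panel a can be illustrated with a minimal NumPy sketch. It is only a forward simulation under stated assumptions, not the GTM-decon software or its collapsed variational Bayes inference. The function names `guided_alpha` and `sample_cells`, the 0-based labels, the gene count G = 20, and the value β = 1 are hypothetical choices made for illustration. The prior values 0.9 and 0.01 follow the cartoon values quoted in the caption.

```python
import numpy as np


def guided_alpha(labels, n_topics, high=0.9, low=0.01):
    """Build per-cell Dirichlet hyperparameters alpha_m (hypothetical helper).

    The entry at the labeled cell type y_m is set to `high`; the other
    K - 1 entries are set to `low`, so they keep non-zero prior mass.
    """
    labels = np.asarray(labels)
    alpha = np.full((labels.size, n_topics), low, dtype=float)
    alpha[np.arange(labels.size), labels] = high
    return alpha


def sample_cells(alpha, phi, n_reads, rng):
    """Simulate gene counts per cell from the caption's generative process.

    theta_m ~ Dir(alpha_m); z_{i,m} ~ Cat(theta_m); x_{i,m} ~ Cat(phi_{z_{i,m}}).
    `phi` is a K-by-G matrix whose rows are gene distributions per topic.
    """
    n_cells, n_topics = alpha.shape
    n_genes = phi.shape[1]
    counts = np.zeros((n_cells, n_genes), dtype=np.int64)
    for m in range(n_cells):
        theta = rng.dirichlet(alpha[m])                   # cell-type mixture
        z = rng.choice(n_topics, size=n_reads, p=theta)   # topic of each read
        reads_per_topic = np.bincount(z, minlength=n_topics)
        for k in range(n_topics):
            # Reads assigned to topic k are mapped to genes with rates phi_k.
            counts[m] += rng.multinomial(reads_per_topic[k], phi[k])
    return counts


# Cartoon-sized example: M = 8 cells, K = 3 cell types (0-based labels).
rng = np.random.default_rng(0)
labels = [0, 0, 1, 1, 1, 2, 2, 2]
alpha = guided_alpha(labels, n_topics=3)
# Assumed toy values: G = 20 genes, flat hyperparameter beta = 1.
phi = rng.dirichlet(np.ones(20), size=3)
counts = sample_cells(alpha, phi, n_reads=100, rng=rng)
```

For the bulk deconvolution step the caption describes, the same sampling scheme would apply with a flat prior (all entries of α equal to 1) in place of `guided_alpha`, while the actual method fixes the estimated matrix and infers the mixture rather than sampling it.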
