ChromGene: gene-based modeling of epigenomic data

Jaroszewicz, Artur; Ernst, Jason

doi:10.1186/s13059-023-03041-5

Method
Open access
Published: 07 September 2023

ChromGene: gene-based modeling of epigenomic data

Artur Jaroszewicz^1,2 &
Jason Ernst ORCID: orcid.org/0000-0003-4026-7853^{1,2,3,4,5,6,7}

Genome Biology volume 24, Article number: 203 (2023) Cite this article

2466 Accesses
1 Citations
22 Altmetric
Metrics details

Abstract

Various computational approaches have been developed to annotate epigenomes on a per-position basis by modeling combinatorial and spatial patterns within epigenomic data. However, such annotations are less suitable for gene-based analyses. We present ChromGene, a method based on a mixture of learned hidden Markov models, to annotate genes based on multiple epigenomic maps across the gene body and flanks. We provide ChromGene assignments for over 100 cell and tissue types. We characterize the mixture components in terms of gene expression, constraint, and other gene annotations. The ChromGene method and annotations will provide a useful resource for gene-based epigenomic analyses.

Background

Genome-wide maps of epigenomic marks, such as histone modifications from ChIP-seq experiments and chromatin accessibility from DNase-seq or ATAC-seq experiments, provide valuable information for annotating the genome in a cell type-specific manner [1,2,3,4,5,6,7]. Notably, approaches have been developed for annotating the genome into “chromatin states” based on the combinatorial and spatial patterns of epigenomic marks inferred de novo from the data. These different chromatin states can correspond to different classes of genomic elements, including enhancers, promoters, and repressive regions [8,9,10]. Annotations from these methods have been used for a diverse range of applications, including understanding gene regulation and genetic variants associated with disease [2, 11, 12].

Typically, chromatin states annotate the genome on a per-position basis. However, for some applications with epigenomic data, it is desirable to conduct gene-based analyses, as is common with transcriptomic data [13,14,15], but taking full advantage of epigenomic data to generate gene-based annotations is less straightforward than for per-position annotations. The challenge with gene-based annotations is that the combination of epigenomic marks will vary along a gene in a position-dependent manner. Furthermore, protein-coding genes differ vastly in length, ranging from a few hundred base pairs to over 2 Mb (median 30 kb) [16]. One strategy for gene-based annotations is to focus on the chromatin state at the transcription start site (TSS) [8, 17]. While such an approach is largely independent of varying gene lengths, it ignores potentially important information throughout the gene body. A simple alternative strategy would be based on averaging each mark’s signal across the entire gene [18]. However, such an approach loses information about which marks co-occur along the gene, and it could be heavily confounded by gene length.

Another strategy has been to partition genes into regions and cluster the genes based on mark signal in those regions. For example, one study partitioned genes into five regions: the 500-bp region upstream of the TSS, the 500-bp region downstream of the TSS, and the remaining gene body into equal thirds [19]. The study then averaged the per-mark signal within each region, which were then clustered with the k-medoids algorithm. However, such an approach is dependent on the specific choices of the partition and loses information due to averaging within each partition.

Other work has solved related, but different, problems. EPIGENE used hidden Markov models (HMMs) to identify transcription units [20], but is not designed for clustering them. Hierarchical or multi-tiered HMMs have been proposed to capture broader domains, but are not designed to provide a single annotation per gene in a cell or tissue type [21,22,23]. (Henceforth, we will use the term “cell type” instead of “cell or tissue type” for ease of presentation.) EpiAlign was proposed to align chromatin states between two regions, such as genes, and identify corresponding regions [24], but is also not designed for de novo gene-based annotations.

There is thus a need for a principled, model-based method that can be used to generate de novo annotations of known genes based on the combinatorial and spatial information in data of multiple epigenomic marks. To address this, we introduce ChromGene. ChromGene uses a mixture of HMMs to model the combinatorial and spatial information of epigenomics maps throughout a gene body and flanking regions. Furthermore, ChromGene can learn a common model across multiple cell types and use it to generate per-gene annotations for each. ChromGene is distinguished from other methods in that its focus is assigning a single annotation for each gene, which can span an arbitrary length, as opposed to an annotation for each position in the genome.

Here, we apply ChromGene to ChIP-seq data of histone marks and DNase-seq data from over 100 cell types and produce per-gene annotations for each one. We describe these annotations with respect to their mark emissions and relate them to gene expression data and other external data, including genes with high probability of loss-of-function (LoF) intolerance (pLI) [25]. We show that ChromGene annotations have better agreement with gene expression and stronger Gene Ontology (GO) and cancer gene set enrichments than other methods. We expect the ChromGene annotations we have produced will be a resource for gene-based epigenomic analyses, and that the methodological approach will be useful for applications to other epigenomic data.

Results

Overview of ChromGene method

ChromGene models the set of epigenomic data across genes with a mixture of HMMs [9] (“Methods,” Fig. 1). The set of epigenomic data for each gene, along with a flanking region at each end (default 2 kb), is binarized at fixed-width bins (default 200 bp), indicating observations of each epigenomic mark, before being input to ChromGene. The data for a given gene in a cell type is modeled as being generated by one of M HMMs, each with S “hidden states” (or simply “states”) (black boxes, Fig. 1a). There are no constraints on allowed transitions between states of one HMM, but transitions between states of different HMMs are not allowed (Fig. 1b). There is also an initial state distribution, or the probability of seeing a state at the first position of the gene flanking region, over states of the HMMs. We note that otherwise, ChromGene does not directly model gene position information. The prior probability that a gene belongs to a specific mixture component, that is, an individual HMM, corresponds to the sum of initial probabilities of the states of that component. The emission distribution is modeled with a product of independent Bernoulli random variables following ChromHMM [9] (Fig. 1c). For given values of M and S, the parameters of the model are learned from the data using an expectation–maximization approach aimed to maximize the likelihood of the model parameters given the data (Fig. 1d). Once a model is trained, ChromGene computes the posterior probability of each of M mixture components generating each gene’s data in a given cell type. Finally, each gene is then assigned to the component for which it has the greatest posterior probability (Fig. 1e). We note that while ChromGene also assigns individual bins to hidden states based on the per-state posterior probability assignments (Fig. 1e), they are not our focus here, as there are effective model-based methods for position-level annotations.

To model data from multiple cell types, ChromGene can be applied analogously to the “concatenated” approach of ChromHMM [2, 28], which gives cell type-specific assignments based on a common model. To do this, ChromGene treats the same gene in different cell types as if it were different genes in the same cell type when learning the model. Using this common model, a gene is independently assigned to a mixture component for each cell type.

To develop ChromGene, we first noted that a mixture of HMMs can be equivalently expressed using a single, specially constructed HMM. In this single HMM, there is a “dummy state” that is associated with the emission of a “dummy mark,” which is only present before or after genes and their included flanking regions (Fig. 1a, “Methods”). All states within mixture components are initialized to allow transitions to and from this dummy state and between states of the same mixture component, but transitions between different mixture components are disallowed by enforcing those transition probabilities to be 0. Given the equivalence to a single HMM, we trained the ChromGene model by using ChromHMM after making enhancements to ChromHMM to better handle some technical numerical stability issues in this setting (“Methods”). We note that this strategy of reducing a mixture of HMMs to a single per-position HMM is general, and it could in principle be used to also provide other region-based annotations or to extend other software designed for per-position annotations, which may differ from ChromHMM in modeling assumptions.

ChromGene generates distinct gene-level chromatin annotations

We applied ChromGene to imputed data for ten histone modifications (H3K9me3, H3K36me3, H4K20me1, H3K79me2, H3K4me1, H3K27ac, H3K9ac, H3K4me3, H3K4me2, and H3K27me3), histone variant H2A.Z, and DNase-seq data from 127 cell types, binarized at 200 bp resolution [1, 27] across 19,919 protein-coding genes with 2 kb flanking regions [16] using the “concatenated” approach. We focused our analysis on a model with M = 12 mixture components and S = 3 states per component to balance model expressivity with having meaningful distinctions between individual components and states within a component (“Methods,” Additional File 1: Fig. S1).

Based on the emission parameters of the model (Fig. 2a), along with the relationship of the components to external data not used in model learning (discussed below), we gave each component a candidate annotation (Fig. 3, Additional File 2: Table S1), which we will use to refer to these components henceforth. Eight of the annotations (“strong_trans_enh,” “strong_trans,” “trans_enh,” “trans_cons,” “trans_K36me3,” “trans_K79me2,” “weak_trans_enh,” and “znf”) (“trans”: “transcribed”, “enh”: “enhancer”, “cons”: “constrained”, “K36me3”: “H3K36me3”, “K79me2”: “H3K79me2”) had at least one state with H3K36me3 or H3K79me2, both transcription-associated histone modifications, present at > 31% of positions. All of these annotations had at least one state associated with high frequency of promoter-associated H3K4me3 or H3K4me2 (> 73%) and limited detection of the repressive mark H3K27me3 (< 15%). For four of these annotations (“strong_trans_enh”, “strong_trans”, “trans_enh”, and “trans_cons”), all three of its corresponding states had a high frequency (> 28%) of at least one mark. In contrast, the annotations “trans_K36me3”, “trans_K79me2”, “weak_trans_enh”, and “znf” all had one state with low frequency of all marks (< 7%). Annotation “znf” is notable in that H3K36me3 co-occurs with the repressive mark H3K9me3. This annotation shows 11-fold enrichment for zinc finger named (ZNF) genes on average. This is consistent with previous findings that ZNF genes enrich for per-position chromatin states associated with this combination of marks [8, 27], but in contrast to previous work, we also provide a direct model-based annotation of all genes associated with an epigenomic pattern highly enriched in ZNF genes. Three of the annotations (“strong_trans_enh”, “trans_enh”, and “weak_trans_enh”) had a state in which the H3K4me1 mark had the greatest frequency (> 78% in all cases) and limited H3K4me3 (< 5% of positions), consistent with previously described putative enhancers [8, 27, 29], suggesting these ChromGene annotations contain intragenic enhancers. Annotations “trans_K36me3” and “trans_K79me2” both had a state with a high frequency of active marks typically found at promoters or enhancers, a state with a low frequency of all marks, and a state dominated by H3K36me3 and H3K79me2, respectively.

The four remaining annotations (“poised”, “bivalent”, “low”, “quiescent”) lacked any state with the transcription-associated marks, H3K36me3 or H3K79me2, at a high frequency (< 12% for all states). Annotations “poised” and “bivalent” had states with high frequencies of other marks. Notably, the “bivalent” annotation had a state associated with the high presence of the repressive mark H3K27me3 in combination with H3K4me1/2/3, along with another state associated with H3K27me3 alone. Annotation “low” only had one state associated with moderate levels of epigenomic marks, none of which were associated with transcription. No state of the “quiescent” annotation had any detected modifications (all emissions < 1%).

The assignments for all 19,919 genes across 127 cell types are provided (Additional File 3) [30, 31]. To visualize the assignments, we sampled 2000 genes and clustered them based on assignment to ChromGene annotations across all cell types (“Methods,” Fig. 2b).

ChromGene relationship with gene expression levels

We first investigated how a gene's assignment to a ChromGene annotation relates to its expression level. We compared ChromGene annotations to matched gene expression data for 56 cell types [1] (“Methods”). We separated cell type-gene combinations by their ChromGene annotation and analyzed the distribution of gene expression values (RPKM) for each annotation. We observed that genes assigned to different annotations had varying levels of expression (Fig. 4). The two annotations with the highest expression (“strong_trans_enh” and “strong_trans”) both had states with a high frequency of H3K36me3 and H3K79me2. The four annotations with the lowest expression (“poised,” “bivalent,” “low,” and “quiescent”) each had a median RPKM of less than 1. Notably, the “quiescent” and “bivalent” annotations had a large overlap in distribution of gene expression levels, despite the lack of the former with any state with substantial frequency of epigenomic marks, and the association of the latter with epigenomic marks in multiple states. This highlights how ChromGene provides additional information about genes not captured by gene expression.

For each ChromGene annotation, we also determined how consistent average gene expression levels were across cell types. Specifically, for each annotation and cell type, we calculated the median log₁₀(RPKM + 0.1) expression and analyzed the medians as a function of annotation (Additional File 1: Fig. S3). We then calculated the mean within-annotation variance across cell types (0.026) and the total variance of the computed medians across cell types and annotations (0.783). We found that the median expression variances within annotations were significantly smaller than across annotations (p-value < 10⁻⁴, permutation test), suggesting that ChromGene annotations are informative of expression regardless of cell type.

We also quantified how predictive ChromGene assignments are of gene expression and compared it to several baselines. To do this, we first separated genes into two groups: “expressed” genes with RPKM ≥ 1 and “unexpressed” genes with RPKM < 1. We held out genes on one chromosome at a time as a test set and used genes on all other chromosomes as a training set. We calculated the median expression for each ChromGene annotation for genes in the training set and used this as a predictor for the expression of genes in the held-out chromosome, and we found that the ChromGene assignment was a strong predictor of whether genes were expressed or unexpressed (across 56 cell types: mean AUROC = 0.893, standard deviation = 0.021). We also calculated the mean squared error (MSE) between the predicted and observed log₁₀(RPKM + 0.1) expression values across all cell types, and we found that predicted expression values were close to the true expression (across 56 cell types: mean MSE = 0.418, standard deviation = 0.050; Pearson r = 0.763). We repeated the evaluations for the following three baseline models (“Methods”).

TSS model

The TSS model clusters genes based only on epigenomic mark information at the TSS. This model only considers one position per gene, and thus does not capture additional information throughout the gene.

Gene average model

The gene average model clusters genes based on the average epigenomic mark binarized values throughout the gene, which is equivalent to using ChromGene with only one state per HMM. This model has two key disadvantages: First, it does not take advantage of spatial information; instead, it treats heterogeneous spatial data as if it were homogeneous, eliminating potentially useful information. Second, it is more likely to be biased by gene length because epigenomic marks preferentially associate with specific genic regions that scale differently with gene length (see below).

Collapsed model

The collapsed model clusters genes using a single state per mixture component, as in the gene average model, but here the single state is found by first training a normal multi-state ChromGene model, then “collapsing” each multi-state HMM into a single-state HMM by taking the weighted average of the states in that component. This is meant to show that the differences between ChromGene and a single-state model are likely due to incorporating spatial information of epigenomic marks instead of different instantiations of the models.

We implemented each baseline method to have an identical number of clusters as our ChromGene model. ChromGene assignments were significantly more predictive of gene expression than the three baseline methods (Table S2, AUROC = 0.893 vs 0.818, 0.889, and 0.888; MSE = 0.418 vs 0.685, 0.429, and 0.438; Pearson r = 0.763 vs 0.601, 0.757, and 0.753; p < 10⁻⁴ for all comparisons, binomial test, for ChromGene vs TSS, gene average, and collapsed models, respectively, “Methods”).

ChromGene reduces association of gene length with clusters

We next analyzed the relationship between the ChromGene annotations and the lengths of genes assigned to them (Fig. 5). We found that the length distribution of most annotations largely overlapped and were concentrated between 10 and 100 kb, but there were a few exceptions. Notably, “low” and “trans_cons” had median gene lengths around 100 kb, while “quiescent” was mostly composed of genes shorter than 10 kb. This “quiescent” annotation was strongly enriched (9.2 fold) for olfactory genes, which have a median length of 1034 bp and are expressed at low levels in most cell types (mean expression = 0.04 RPKM, median expression = 0).

Because gene length distributions varied across ChromGene annotations, we wanted to ensure that ChromGene provided substantial information beyond gene length. We thus evaluated the amount of information shared between ChromGene annotations and gene length, and for comparison, conducted the same evaluations for annotations from the three baseline models. Specifically, for each model and each cell type, we calculated the mutual information of the gene’s assignment and its length. We expected that the baseline TSS model would share the least information with gene length, as it only incorporates data from the TSS and not the whole gene, while among the models that incorporate data from the whole gene, ChromGene would share the least information. Indeed, we found that across cell types, the TSS model had the lowest mutual information (mean of 0.10), and ChromGene had significantly lower mutual information (mean of 0.31) than the gene average and collapsed models (mean of 0.46 and 0.48, respectively, p < 10^–30, binomial test, “Methods,” Additional File 1: Fig. S4). This indicates that of the models that use information throughout the gene, ChromGene annotations are least likely to simply reflect gene length.

Characterizing ChromGene assignments across cell types

We hypothesized that for a given gene, some ChromGene annotations were more likely to be consistently assigned across cell types, while other annotations were more likely to co-occur with other annotations. To test this, we conducted enrichment analyses for the co-occurrence of ChromGene assignments across cell types (excluding those that could be considered essentially biological replicates) (Fig. 6, “Methods”). We found that certain pairs of ChromGene annotations are strongly depleted for a given gene across pairs of cell types. For example, genes assigned in one cell type to “trans_cons,” which is highly enriched with housekeeping genes, are strongly depleted for the “quiescent” annotation in other cell types (176 fold depletion compared to expected, corresponding to a −7.46 log₂ ratio), consistent with the “quiescent” annotation’s enrichment for olfactory genes. On the other hand, we also found certain pairs of annotations enriched for the same gene across cell types. For example, genes assigned to “poised” in some cell type were enriched for both “weak_trans_enh” and “low” annotations in others, reflecting this annotation’s association with genes that can be activated or repressed in other cell types [32].

As assignments of genes to different annotations across cell types could be expected based on technical variability, we also estimated a confusion matrix for pairs of “replicate” cell types (Additional File 1: Fig. S5a, “Methods”). We found 81.1% concordance in assignments between replicate cell types, as compared to 10.2% expected by the number of genes assigned to each annotation. We next followed a similar process to calculate a contingency table for non-replicates—the probability of a given gene being assigned to one annotation for one cell type given it was assigned to another annotation in another cell type (Additional File 1: Fig. S5b). We found that between pairs of randomly chosen non-replicate cell types, ChromGene assignments were 57.1% concordant, which is substantially lower than between replicates (81.1%) and substantially higher than random assignment (10.2%).

We further defined the “cell type specificity” of each ChromGene annotation by dividing the diagonal of the contingency table (across non-replicate cell types) by the confusion matrix diagonal (the probability biological replicates would be assigned to the same annotation), then subtracting this value from 1 to obtain a “cell type specificity score” so that higher values correspond to more cell type-specific annotations. We found that the annotations that varied the least across cell types were “quiescent,” “strong_trans,” “znf,” “bivalent,” and “trans_K36me3” with cell type specificity scores from 0.16 to 0.24 (Fig. 3, Additional File 2: Table S1). The most cell type-specific annotations were “weak_trans_enh,” “low,” and “trans_enh” with scores of 0.49, 0.43, and 0.41 respectively. “Weak_trans_enh” having the highest cell type specificity score is consistent with this annotation being primarily associated with enhancers (Additional File 1: Fig. S6), which are known to be highly cell type-specific [2, 33]. The “low” annotation also had high cell type specificity despite having similarly low expression as the “quiescent” annotation, and it was often assigned to the annotation “poised” in other cell types (Fig. 6).

We next evaluated the extent to which changes in a gene’s ChromGene assignment across pairs of cell types were reflected (on average) in a change in expression of that gene (Additional File 1: Fig. S7a). We found that for most pairs of ChromGene annotations, changes in expression were largely consistent with their overall individual expression patterns (Fig. 4, Additional File 1: Fig. S7c,e). For example, if a gene was assigned in cell type i to a low-expression annotation, such as “quiescent,” and a high-expression annotation in cell type j, such as “strong_trans_enh,” then the expression of that gene in cell type j would typically be substantially higher than in i (“Methods”). Interestingly, we found a few pairs of ChromGene annotations for which expression did not vary as much as expected based on individual patterns (Additional File 1: Fig. S7c,e). Together, these results show that while changes in ChromGene assignment across cell types are typically reflected in expression changes for their associated annotations, some genes do not follow this pattern.

ChromGene annotations are differentiated by gene set enrichments

We next analyzed the enrichment of ChromGene annotations with respect to various gene sets (Additional File 2: Table S1). These gene sets included ZNF-named genes, constitutively unexpressed genes, housekeeping genes, olfactory genes, “biological processes” Gene Ontology (GO) terms, and cancer-related gene sets [1, 16, 34,35,36,37,38].

ZNF genes had the highest enrichment for the “znf” annotation (11.0 fold, median p < 10⁻²⁰⁰, hypergeometric test, “Methods”). Constitutively unexpressed genes (RPKM < 1 in all 56 cell types with matched expression available) [1] were most enriched in the “quiescent” annotation (5.8 fold, p < 10⁻³⁰⁰). In contrast, a set of previously defined housekeeping genes, based on broad and constant expression levels [39], was most enriched in the “strong_trans” annotation (3.0 fold, p < 10⁻³⁰⁰), which had a 131 fold depletion for constitutively unexpressed genes. For olfactory genes [34], we observed the strongest enrichment in the “quiescent” annotation (9.2 fold enrichment, p < 10⁻²⁰⁰), which contained 75.1% of all olfactory genes.

For the GO term enrichments [35], we calculated an adjusted enrichment p-value for each GO term gene set for each cell type and ChromGene annotation (hypergeometric test, Bonferroni corrected for the number of combinations of 12 ChromGene annotations and 6036 GO terms). For each ChromGene annotation, we identified GO terms that were enriched in the majority of cell types (adjusted p < 0.01) (“Methods,” Additional File 1: Fig. S8a, Additional File 2: Table S1). The ChromGene annotations “strong_trans_enh” and “strong_trans,” which had the highest expressed genes on average, had the greatest number of such GO terms, constituting 26 and 44% of the significant enrichments, respectively (Fig. 7c). The most significant terms for “strong_trans_enh” and “strong_trans” included core regulatory and metabolic processes terms such as “SRP-dependent cotranslational protein targeting to membrane” and “cytoplasmic translation” for “strong_trans_enh” and “mRNA splicing, via spliceosome” and “mRNA processing” for “strong_trans.” The ChromGene annotation “bivalent” had several neuron- and development-related enriched GO terms, including “neuron differentiation” and “anterior/posterior pattern specification.” In contrast, the “quies” term, also associated with low expression, showed significant enrichments for terms such as “sensory perception of smell” consistent with its enrichment for olfactory genes. The different GO enrichments for annotations with similar expression levels suggest ChromGene annotations provide information beyond expression.

To more directly investigate whether ChromGene annotations provides information beyond expression in the context of GO term enrichments, we compared GO term enrichments for gene sets defined based on only gene expression to those defined by both their gene expression and ChromGene assignment. Specifically, we first computed GO term enrichment for unexpressed or minimally expressed genes (“inactive genes”, RPKM < 1.0) in “Brain Germinal Matrix” using the same procedure as above (“Methods”), then found the most enriched term. Then, for each of the ChromGene annotations associated with lower average expression (“znf,” “poised,” “bivalent,” “low,” and “quiescent”), we took the subset of the inactive genes that also had the ChromGene annotation. For each subset, we repeated the GO term enrichment analysis and found the most enriched GO term. We took the most enriched GO term for each of these sets, then analyzed the enrichment p-values in all the other sets considered in this analysis (Fig. 7a). We found that the GO term most enriched in inactive genes, “sensory perception of smell (GO:0007608),” was substantially more enriched when subsetting to “quiescent” genes despite having fewer genes in the set (2.5 fold vs 9.1 fold, adjusted p < 10⁻²⁴ vs p < 10⁻⁵², Bonferroni corrected, “Methods”). This subset of “quiescent” genes was also enriched in this GO term when using the set of all inactive genes as the background (3.1 fold, adjusted p < 10⁻²⁴). Additionally, we found that for each of the other ChromGene annotation subsets, its most enriched GO term had a more significant p-value than for all inactive genes. For example, the “bivalent” subset was significantly enriched in “anterior/posterior pattern specification (GO:0009952)” (5.4 fold, adjusted p < 1⁻¹⁴), consistent with the role of bivalent genes in development [40], but the group of all inactive genes was only marginally enriched (1.8 fold, adjusted p = 0.016).

We repeated the analysis with highly expressed genes (RPKM > 100) using the three ChromGene annotations with the highest average expression (“strong_trans_enh,” “strong_trans,” and “trans_enh”), and again found increased significance of notable GO terms when using ChromGene annotations to subset these expressed genes. For example, in the “Fetal Brain Female” cell type, we found the ChromGene annotation “trans_enh” was significantly enriched for “neural tube development (GO:0021915)” (61.5 fold, adjusted p < 0.01), but was not significantly enriched using expression only (13.2 fold, adjusted p = 1.0) (Fig. 7b). When using a background of highly expressed genes instead of all genes, we observed a marginally significant enrichment for this GO term in “trans_enh” (4.7 fold enrichment, p < 0.002).

To determine whether ChromGene annotations provide additional information beyond gene expression regardless of cell type, we compared the number of GO terms enriched when splitting genes by ChromGene assignments to splitting genes randomly. We found that for both the unexpressed (< 1 RPKM) and highly expressed (> 100 RPKM) genes, for all 56 cell types tested, splitting genes by ChromGene assignments yielded more enriched GO terms (mean of 51.6 vs 1.8 for RPKM < 1, mean of 25.1 vs 5.2 for RPKM > 100, p < 10⁻¹⁶, binomial test) (Additional File 1: Fig. S8b,c, “Methods”). As previous studies have assumed that increased GO significance corresponds to greater biological significance [41, 42], these results would suggest increased biological relevance of using ChromGene annotations. However, we note that it is possible a method’s annotations could yield more significant GO enrichments while not necessarily being more biologically relevant, which would be difficult to determine without inherent ground truth. Together, these results suggest that using ChromGene in tandem with expression information can be used to identify subsets of genes with more specific biological roles than by expression alone. Further, they show that ChromGene can provide useful biological information even in the absence of significant expression.

Across all “biological process” GO terms and all ChromGene annotations, we saw an average of 133 enriched GO terms per cell type (adjusted p < 0.01). This was substantially more than for the three baseline methods (TSS, gene average, collapsed), where we saw an average of 53, 109, and 82 gene sets significantly enriched (adjusted p < 0.01), respectively. These results demonstrate that ChromGene annotations yield more “biological process” GO term enrichments than the baseline methods.

We next took 967 cancer-related gene sets, produced by the Cancer Cell Line Encyclopedia and implicated in various types of cancer [37], and, as before, performed a gene set enrichment analysis for each ChromGene annotation. Interestingly, we found that unlike for the “biological process” GO terms, significant enrichment for cancer gene sets was most likely to be found in lowly expressed annotations, particularly “poised,” “bivalent,” and “low,” which constituted the majority of enrichments, with 21, 60, and 16%, respectively (Fig. 7d). The enrichment of the “poised” and “bivalent” annotations is consistent with previous observations that misregulation of poised and bivalent chromatin regions are associated with cancer [43, 44]. In total, we found that ChromGene had an average of 151 cancer gene sets enriched per cell type. In comparison, the baseline methods (TSS, gene average, collapsed) had 71, 72, and 70 gene sets enriched (hypergeometric test, p < 0.01, Bonferroni corrected for the number of combinations of 127 cell types, 12 annotations, and 967 gene sets), respectively, substantially less than for ChromGene. These results further support that ChromGene annotations are more consistent with established cancer gene sets than the baseline approaches.

ChromGene separates genes by pLI scores

We next explored the relationship between ChromGene annotations and gene constraint, specifically those with high pLI scores (≥ 0.9) [25]. Genes with high pLI scores have strong evidence that the gene is intolerant to loss of function, i.e., haploinsufficient.

The three ChromGene annotations most enriched for high pLI score genes were “trans_cons,” “strong_trans_enh,” and “strong_trans” (Fig. 8), with an average of 49.6, 35.8, and 33.3% of their assigned genes having high pLI scores across cell types, respectively, compared to 18.1% expected by chance. These three annotations were also among the four annotations associated with the highest expression.

Notably, “trans_cons” had a substantially larger percentage of high-pLI genes (49.6%) than “trans_enh” (14.7%), despite a positive correlation between mean ChromGene annotation pLI score and expression (Spearman r from 0.18 to 0.39, mean r = 0.27 across cell types, all p-values < 10⁻¹⁰⁰, “Methods”), and “trans_cons” having slightly lower overall expression than “trans_enh” (median RPKM = 12.73 and 13.89, respectively). This difference is consistent with the substantial difference in gene length between the “trans_cons” and “trans_enh” annotations (Fig. 5), and a previously noted positive correlation between gene length and pLI [25]. We still saw a difference in percentage of high-pLI genes between “trans_cons” and “trans_enh” after correcting for gene length, although it did decrease, with 28.8 and 20.9%, respectively, of annotated genes having high pLI scores (“Methods,” Additional File 1: Fig. S9, Additional File 2: Table S1). Another difference between “trans_cons” and “trans_enh” is that genes annotated as “trans_cons” are more likely than “trans_enh” to be assigned to the “strong_trans” or “trans_K79me2” annotations in other cell types (Fig. 6, Additional File 1: Fig. S5b).

Among the lowly expressed ChromGene annotations, “low” and “quiescent” had the greatest difference in proportion of high-pLI genes (21.7 and 1.0%, respectively), although this difference was smaller after correcting for gene length (8.9 vs 2.9%, respectively) (“Methods”). Compared to the “quiescent” annotation, the “low” annotation contained longer genes and genes more likely to be assigned in other cell types to ChromGene annotations with a greater proportion of high-pLI genes. Similar patterns also held when considering mean pLI (Additional File 1: Fig. S10). These results further show that ChromGene captures additional information beyond expression.

Discussion

Here, we introduced ChromGene, a principled, model-based method that uses a mixture of HMMs to annotate genes based on maps of multiple epigenomic marks across genes. ChromGene’s focus on gene assignments is complementary to well-established approaches for generating per-position assignments [9, 10].

We applied ChromGene to imputed data for 12 epigenomic marks to annotate genes in over 100 cell types. We showed that ChromGene annotations frequently reflect distinct gene expression levels. In cases where ChromGene annotations had similar gene expression levels, we found they differed on other important properties such as gene set enrichments, including for high pLI score genes, reflecting that ChromGene annotations capture information beyond gene expression levels. We also showed that ChromGene yielded better agreement with gene expression data and more significant enrichments for cancer gene sets and GO terms than baseline approaches.

As ChromGene directly assigns genes to annotations, it will likely be preferable for many gene-centric analyses relative to per-position chromatin state assignments. While ChromGene annotations do correspond to patterns previously identified by per-state annotations, ChromGene annotations are more expressive in that they can associate multiple per-position patterns with the same gene-based annotation. Relating per-position annotations directly to genes has often involved using post hoc approaches mapping these annotations to individual genes. However, such approaches are subject to loss of information and can be biased by gene length. ChromGene mitigates these issues by directly providing a model-based annotation of individual genes. We note that ChromGene annotations are less likely to have advantages over per-position annotations when the object of study is not inherently gene-centric, such as studying enrichments for GWAS-identified variants.

There are multiple possible additional types of applications and extensions of ChromGene that can be investigated in future work. Although here, we applied ChromGene to protein-coding genes, ChromGene is more general and can also be applied to other pre-defined sets of genomic intervals such as long non-coding RNAs and pseudogenes. In this study, we combined our training data in a “concatenated” mode, where chromosomes across different cell types are treated as separate, and a single model is trained that applies to each epigenome. However, the input data for ChromGene can also be “stacked,” where the emissions of marks across cell types are all observed simultaneously [28, 45]. This can be used to annotate genes with patterns of variation learned across cell types. In this study, we applied ChromGene to a set of 12 imputed data sets to have a comparable set of input features across many cell types. However, ChromGene can be applied with other choices for the input data. In this study, we did not analyze ChromGene’s per-position annotation of genomic bins to individual mixture component hidden states, as we see the gene-level annotations as the main value for ChromGene relative to existing per-position annotations. However, investigating the use of ChromGene’s per-position annotations for specific applications could be a direction for future work.

Conclusions

ChromGene generates gene-level annotations, analogous to chromatin states at the position level, by modeling the combinatorial and spatial pattern of epigenomic marks using a mixture of HMMs. We used ChromGene to assign nearly 20,000 protein-coding genes for over 100 cell types to twelve gene-level chromatin annotations. We have shown that these annotations not only capture variation in gene expression, but also provide additional information. The assignment of genes to ChromGene annotations, along with the trained ChromGene model and software we used to generate them, are publicly available [30, 31]. We expect that the ChromGene assignments we have generated will be a useful resource for gene-based analysis in and across many cell types and that the approach will be useful for gene-based annotations for data from additional marks, cell types, or species.

Methods

ChromGene model

For a single cell type, ChromGene uses a mixture of multivariate HMMs to model the combinatorial and spatial patterns within multiple epigenomic maps across a set of genes to derive a single assignment per gene. Each gene is assumed to be generated by one of M fully connected HMMs, each with S states. Each of these M HMMs is defined by three sets of parameters: (1) the initial probability for each state within the HMM, (2) the probability of transitioning from one state to another state in the same HMM, and (3) the probability of observing each of E binarized epigenomic mark emissions given a state in the HMM. Although not explicitly trained, each HMM’s prior probability of being chosen for a gene is equal to the sum of the initial probabilities of its states.

More formally, we denote mixture components as m, 1 ≤ m ≤ M, where M is the total number of components; we denote states as s, 1 ≤ s ≤ S, where S is the number of states per component. The total number of states is thus \(M\times S\), and each hidden state is denoted h_m,s. The initial probability, transition probabilities, and emission probabilities are defined as follows:

Initial probabilities

The initial probability for each state, denoted \({\tau }_{m,s}\), corresponds to the probability that a randomly chosen gene starts in state h_m,s. The sum of all the initial probabilities of states in a mixture component m corresponds to the initial probability of a gene starting in component m:\({\tau }_{m}={\sum }_{s=1}^{S}{\tau }_{m,s}\) and thus the prior probability of the gene being assigned to component m.

Transition probabilities

Each mixture component is modeled as a fully connected HMM. The transition parameters \({\alpha }_{ms,ms{\prime}}\) correspond to the probability of a state s within component m, h_m,s, transitioning to a state \(s'\) within the same component \(m,h_{m,s'}\). Transitions from states within one component to states in another component are not allowed.

Emission probabilities

The probability of observing a specific combination of marks in a state s of component m, h_m,s, is calculated based on the product of independent Bernoulli random variables, with emission parameters \({\beta }_{m,s,e}\), over all marks following the approach of ChromHMM [9]. \({\beta }_{m,s,e}\) and \((1-{\beta }_{m,s,e})\) represent the probability of observing mark e as present or not present at a specific position in state \({h}_{m,s}\), respectively.

To apply ChromGene to multiple cell types, we apply the same modeling approach but treat data from multiple additional cell types as if they were additional genes in one cell type. This leads to a common model across cell types, but with cell type-specific gene assignments. This approach is analogous to the “concatenated” model learning approach of ChromHMM [2, 28].

Model learning

Parameter initialization

The initial parameters of states across all mixture components are initialized to sum to 1 using a Dirichlet distribution with \({\alpha }_{m,s}=1\). The emission parameters are initialized for each state by first randomly assigning each gene to one of the M components. Then, each position in the genic region and flanks is assigned uniformly at random to one of the S states within that component. Finally, for each of the \(M\times S\) hidden states, we set the initial emission probability of each epigenomic mark to its average frequency of being observed present across all positions assigned to the state. For each component m and state s in the component, the transition parameters from the state to each state in its component are initialized to sum to 1 using a Dirichlet distribution with \({\alpha }_{s}=1\).

Expectation–Maximization procedure

After initializing parameters, ChromGene trains the parameters iteratively with the Baum-Welch algorithm, a special case of the Expectation–Maximization (EM) algorithm, until convergence. In the E step, ChromGene takes the current parameter values (emission, transition, and initial probabilities) and calculates the log likelihood of the data being observed given those values, \(\mathrm{log}(L(\theta ;X,H))=\mathrm{log}(p(X,H|\theta ))\), where \(\theta =\{{\theta }_{\tau }, {\theta }_{\alpha }, {\theta }_{\beta }\}\), the set of initial, transition, and emission parameters, respectively, X is the observed data, and H is the set of true, but hidden state assignments. In this step, ChromGene calculates the posterior probability of each position of each gene being generated by each of the \(M\times S\) hidden states.

In the M step, these posterior probabilities are used to re-estimate the parameters \(\theta\) so that the new parameter values maximize the expected likelihood in the E step, i.e., \({\theta }^{t+1}={\text{argmax}}_{\theta }{E[}_{H|X, {\theta }^{t}}log(L(\theta ;X,H))]\). In our application, we trained for 200 iterations, as is default in ChromHMM, and parallelized the E step computation across 8 cores using the option `-p 8`. For faster computation, we also only evaluated on a randomly selected subset of the data in each iteration.

Implementation on top of ChromHMM

To implement ChromGene on top of ChromHMM, we first noted that a mixture of HMMs can be defined in terms of a single HMM with certain transitions disallowed. Specifically, a set of M fully connected HMMs, each acting as a mixture component, can be represented with a single, larger HMM, and multiple genes within the same file can be handled with the addition of a “dummy state” and “dummy mark.” In this single HMM, each non-dummy state of the HMM can transition into and out of other states within its component and a single dummy state, but not directly to states of other HMM components (Fig. 1a,b). From the dummy state, transitioning to any component is allowed, but only through the use of the dummy mark, which forces transitions into and out of the dummy state.

We structured the input data so that all genes in a chromosome and cell type (after adding 2-kb flanks on both ends and reversing genes on the negative strand) were concatenated head to tail, but with a single “dummy position” separating the genes, and a dummy position at the beginning and end of the data. In this dummy position, the emission for all the original input marks is set to 0, but we include an emission of 1 for a single, new mark designated the “dummy mark” (Fig. 1c). This dummy mark has an emission of 0 within all extended genic regions, and 1 only for the dummy position (Fig. 1d). Upon reaching the dummy position, the model forces transition to the dummy state. From this dummy state, transition to any of the mixture components is allowed.

This structure also allowed us to compactly represent all genes without having separate files for each gene and cell type. For a single cell type, we had 23 chromosome files (chromosomes 1–22 and X). For 127 cell types, this yielded 127 × 23 = 2921 total input data files. We note that in theory, we could apply an equivalent model without dummy marks and states using ChromHMM by creating a separate file for each combination of gene and cell type. However, such a process for our training scenario would produce approximately 2.5 million files, which can be over system limits for the number of files.

We generated initial parameter files for the initial parameters described above. For M mixture components, S states per component, and the dummy state, we had a total of (M × S) + 1 states. The initial probability for each state in each component is set to 0, except for the dummy state, which is set to 1. The transition probabilities from any state in a component to each other state in the same component was allowed to be nonzero and normalized so that the sum of these probabilities was 0.95. The remaining 0.05 probability was set to be the probability of transitioning from the component’s states to the dummy state. The probability of transitions from the dummy state to each of the other M × S states corresponds to the initial parameters, which, as described above, were initialized using a Dirichlet distribution with \({\alpha }_{m,s}=1\) for all components so that they summed to 1. The transition probability from a state in a component to each state in different components was set to 0.

Finally, the emission probabilities for the non-dummy marks in the non-dummy states were initialized as described above. For the dummy state, the emission probability of the epigenomic marks was set to 0, and the emission of the dummy mark was set to 1; in the non-dummy states, the emission probabilities for the dummy mark was set to 0. All input files were generated using the script `generate_chromgene_input_files.py` with default parameters [30, 31].

We passed the data and the initialized parameter files into ChromHMM’s `LearnModel` function and used several optional flags. We used the `-e 0` and `-t 0` flags to enforce that a 0 or 1 emission or transition parameter in the initial parameter file remains at that value. We also used the `-scalebeta` flag to increase numerical stability in our setting (below, “Handling numerical stability issues”). We used the flag `-n 100` to sample 100 files for each iteration, and `-d -1` to allow the log likelihood to increase between iterations. The log likelihood can increase between iterations because different subsamples of the data are evaluated on different iterations. We trained for 200 iterations, as is default in ChromHMM. We used version 1.18 of ChromHMM.

After training the model, we used ChromHMM to generate posterior probability assignment files. We used these files to calculate the posterior probability of each of the M HMMs of generating each gene in each cell type; finally, we took the HMM with the highest posterior probability to derive hard assignments. These assignments were generated using the script `chromgene_posteriors_to_components.py` with default parameters [30, 31].

Handling numerical stability issues

Implementing a mixture of HMM raised some numerical stability issues not encountered with standard, single HMMs in ChromHMM. In this setting, it is expected that for some genes, there will be specific individual HMMs that would have essentially zero probability of generating the observations for the gene. Previously, to deal with numerical stability issues, ChromHMM used a single approach to scaling, which was to rescale both the forward (alpha) variables of the HMMs and the backward (beta) variables at a specific position by the sum of the forward variables. This followed previous standard presentations of HMMs [46, 47]. However, this scaling approach led to numerical overflow in our setting. Namely, the magnitudes of the backward variables at a specific position can be vastly greater than the sum of the forward variables, and this difference can be continued to be amplified over long sequences, which in some cases can lead to numerical overflow. This likely happens specifically in this setting because the forward variables for a sequence can be dominated by one mixture component while the backward variables for a sequence can be dominated by another mixture component. To handle the overflow, we rescaled the backward variables based on the sum of the backward variables, which is valid as previously noted [48].

While rescaling the backward variables by the sum of the backward variables eliminated the numerical overflow, numerical underflow issues were then encountered. Specifically, situations were encountered in certain cases for individual genes where the product of each forward and backward variable for every state was 0, which would lead to division by 0 and no meaningful overall output. This also likely happens in certain situations in which one mixture component dominates the forward variables and another one the backward variables. To avoid the numerical underflow, if a backward variable for a state at a position fell below 10⁻³⁰⁰, we set its value to 10⁻³⁰⁰. We did the same for forward variables, except if the emission probability product for a state at the position was actually 0, in which case the forward variable was kept at 0. These changes were sufficient to lead to a numerically stable procedure.

This alternative procedure for handling numerical stability can now be accessed in ChromHMM by using the `-scalebeta` flag.

Hyperparameter selection

In ChromGene, there are two hyperparameters that must be set: M, the number of mixture components, and S, the number of states per component. For M, the number of mixture components, we considered values ranging from 8 to 20 in increments of two. For S, the number of states per component, we considered each value ranging from 2 to 5, while S = 1 corresponded to the gene average baseline.

The choice of M trades off the ability of the model to capture additional classes of genes while maintaining meaningful distinctions between each annotation. To assess this trade-off, we calculated ChromGene’s reproducibility, defined as the percentage of identical assignments in cell types that can be considered replicates (below, Calculation of confusion matrix and contingency table), as a function of M and S. We found that reproducibility is more sensitive to the choice of M than S (Additional File 1: Fig. S1a). We found that there was a large drop in reproducibility between M = 12 mixture components and 14 or more. On the other hand, the drop in reproducibility from M = 8 and M = 10 to M = 12 was relatively minimal, suggesting 12 mixture components could provide a reasonable trade-off for the expressivity of the model without having any pair of mixture components that are largely redundant (Additional File 1: Fig. S1b).

The choice of S trades off the expressivity of the state representation within a mixture component with the overall interpretability of each mixture component. A mixture component with too many states could have “redundant states,” where two or more states have very similar emission parameters. To investigate this trade-off, we calculated the mean Manhattan distance between the emission parameters of the two closest states within a mixture, for values of M = 8 to 20, and S = 2 to 5 (Additional File 1: Fig. S1c). We note that S = 1 is excluded from this analysis because two states are needed to calculate a distance. Unlike with reproducibility, this metric was more sensitive to the choice of S than M. We found that the mean Manhattan distance in models using S = 4 or S = 5 states per mixture have a similarly low value, suggesting that these models have pairs of “redundant” states that are not well-differentiated by emission parameters (Additional File 1: Fig. S1d). However, there is still an appreciable gap between S = 3 and both S = 4 and S = 5, supporting the choice of S = 3 to trade off model expressivity while maintaining a relatively interpretable model with limited redundancy between sub-states of the model.

ChromGene annotation visualization

To visualize ChromGene assignments across cell types and genes, we first randomly subsampled 2000 of the 19,919 protein-coding genes. For each pair of these genes, i and j, we calculated a pairwise distance between them as \(1-\mathrm{mean}\left(\mathrm{I}\left({m}_{i,c}={m}_{j,c}\right)\right)\), where m_g,c corresponds to the ChromGene assignment for gene g in cell type c, I(x) is the indicator function, and the mean is across the 127 cell types. Columns, which correspond to cell types, were ordered the same way as previously [1]. Rows were ordered using scipy’s hierarchical clustering function, using optimal leaf ordering and average (UPGMA) linkage.