Fig. 2 | Genome Biology

Fig. 2

From: Statistical learning quantifies transposable element-mediated cis-regulation

Influential TE-embedded cis-regulatory information resides up to 500 kb from gene promoters. A Overview of the weighting process whereby the cis-regulatory influence of TEs decreases as a function of the distance to the closest promoter. The scheme depicts a protein-coding gene with two alternative promoters (in orange), coding for two alternative isoforms (in gray). Gaussian kernels with a maximum value of 1 and of varying bandwidth L are centered on each promoter. Before being added to the corresponding element in the matrix N, each TE is weighted as a function of its distance to the closest gene promoter. TEs overlapping exons (gray boxes) and promoters (orange boxes) of the gene are excluded. B To find the bandwidth L leading to the smallest prediction error, the root-mean-squared error (RMSE) was computed for each validation fold and averaged across the five folds for each tested value of L. C Overview of the experimental design of the hESC “perturbome” [50]. hESC cell lines carrying a stably integrated dox-inducible transgene overexpression construct were established from individual cells. In each of the 441 transgene overexpression experiments, dox-treated samples (dox+) are compared to the same cell line in the absence of dox (dox−). Note that the number of replicates per experiment varies. D Histogram depicting the number of times each Gaussian kernel bandwidth L (either TAD-informed or TAD-agnostic) led to the smallest mean validation RMSE in a 5-fold cross-validation scheme for the 441 transgene overexpression experiments. TAD-informed (red): the cis-regulatory weights linking integrants to genes were restricted by topologically associating domain (TAD) boundaries. TAD-agnostic (black): TAD boundaries were not considered. Individual mean RMSE estimations for GATA6, KLF4, and NEUROG1 are shown as illustrative examples. E Estimation of the cis-regulatory activity of TE subfamilies upon KLF4 overexpression [33] using the matrix N computed with \(L=250\) kb
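
The weighting described in panel A can be illustrated with a minimal sketch. The caption states only that the kernels are Gaussian, peak at 1, and have bandwidth L; the exact functional form used here, exp(-d^2 / (2 L^2)), is an assumption, as are the function names `gaussian_weight` and `te_weight` and the use of single coordinates for TEs and promoters. Exclusion of TEs overlapping exons or promoters is not modeled.

```python
import math


def gaussian_weight(distance_bp, bandwidth_bp):
    """Gaussian kernel with maximum value 1 at distance 0.

    Assumed form: exp(-d^2 / (2 * L^2)). The caption does not give
    the exact parameterization of the bandwidth L.
    """
    return math.exp(-0.5 * (distance_bp / bandwidth_bp) ** 2)


def te_weight(te_position, promoter_positions, bandwidth_bp):
    """Weight a TE by its distance to the closest promoter of a gene.

    Positions are hypothetical single base-pair coordinates on the
    same chromosome.
    """
    closest = min(abs(te_position - p) for p in promoter_positions)
    return gaussian_weight(closest, bandwidth_bp)


# Hypothetical gene with two alternative promoters, L = 250 kb.
promoters = [1_000_000, 2_000_000]
L = 250_000
weight_near = te_weight(1_250_000, promoters, L)  # 250 kb from closest promoter
weight_far = te_weight(1_500_000, promoters, L)   # 500 kb from closest promoter
```

Under this assumed form, a TE located exactly one bandwidth away from its closest promoter receives a weight of exp(-0.5), roughly 0.61, and the weight drops further as the distance increases.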
