Figure 1
From: Modeling gene expression using chromatin features in various cellular contexts

Modeling pipeline. Genes longer than 4,100 bp were extended and divided into 81 bins. The chromatin feature density in each bin was logarithm-transformed and then used to determine the best bin (the bin whose signal has the strongest correlation with the expression values). To avoid log2(0), a pseudocount was added to each bin; the pseudocount was optimized using one-third of the genes in each dataset (D1) and then applied to the other two-thirds of the genes (D2) for the rest of the analysis. D2 was divided into a training set (TR) and a testing set (TS) in a ten-fold cross-validation manner. A two-step model was built using the training set: first, a classification model C(X) was learned to discriminate the 'on' and 'off' genes; then a regression model R(X) was learned to predict the expression levels of the 'on' genes. Finally, the correlation between the predicted expression values for the testing set, C(TS_X)*R(TS_X), and the measured expression values of the testing set (TS_Y) was used to measure the overall performance of the model. TSS, transcription start site; TTS, transcription termination site; RMSE, root-mean-square error.
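
The steps in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, the pseudocount of 1, the use of absolute Pearson correlation to pick the best bin, and the threshold classifier and identity regressor standing in for C(X) and R(X) are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def log_transform(density, pseudocount):
    """log2(density + pseudocount) for every gene (row) and bin (column)."""
    return [[math.log2(v + pseudocount) for v in row] for row in density]

def best_bin(log_density, expression):
    """Index of the bin most strongly correlated with expression.

    Assumption: 'strongest' is taken as the largest absolute Pearson correlation.
    """
    n_bins = len(log_density[0])
    corrs = [pearson([row[b] for row in log_density], expression)
             for b in range(n_bins)]
    return max(range(n_bins), key=lambda b: abs(corrs[b]))

def two_step_predict(classify, regress, xs):
    """Combine the two models as C(x) * R(x); 'off' genes (C = 0) are predicted as 0."""
    return [classify(x) * regress(x) for x in xs]

# Toy data: 4 genes x 3 bins (the figure uses 81 bins per gene).
density = [[0, 1, 5], [0, 3, 2], [0, 7, 9], [0, 15, 1]]
expression = [1.0, 2.0, 3.0, 4.0]

log_density = log_transform(density, pseudocount=1)  # hypothetical pseudocount
bin_index = best_bin(log_density, expression)
features = [row[bin_index] for row in log_density]

# Hypothetical stand-ins for the learned models C(X) and R(X).
classify = lambda x: 1 if x > 1.5 else 0   # 'on' (1) versus 'off' (0)
regress = lambda x: 1.0 * x                # predicted expression level

predicted = two_step_predict(classify, regress, features)
performance = pearson(predicted, expression)
```

In the pipeline described by the caption, the classifier and regressor would be fitted on TR and evaluated on TS within each cross-validation fold; the sketch fixes them by hand only to show how the two outputs are multiplied and scored.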