The identification of enhancer–promoter interactions (EPIs), especially condition-specific ones, is important for the study of gene transcriptional regulation. Existing experimental approaches for EPI identification are still expensive, and available computational methods either do not consider or have low performance in predicting condition-specific EPIs.
We developed a novel computational method called EPIP to reliably predict EPIs, especially condition-specific ones. EPIP is capable of predicting interactions in samples with limited data as well as in samples with abundant data. Tested on more than eight cell lines, EPIP reliably identifies EPIs, with an average area under the receiver operating characteristic curve of 0.95 and an average area under the precision–recall curve of 0.73. Tested on condition-specific EPIs, EPIP correctly identified 99.26% of them. EPIP also achieves better accuracy than two recently developed methods.
In this project, cell-line-specific enhancer-promoter interactions (EPIs) in human cells were analyzed. Enhancer data from two sources, active gene promoters, and Hi-C data from multiple labs in seven cell lines were used for this analysis. Using 31 features in the enhancer and promoter regions, including both common and region-specific features, we designed a machine learning model that predicts cell-specific EPIs with higher performance than the existing prediction tools and also works with a varying number of features. To design the model, we used an ensemble supervised machine learning classifier named AdaBoostClassifier. Among the 31 features, 14 region-specific features were computed for each of the enhancer and promoter regions (28 in total) and 3 features were common to both regions. The 14 region-specific features include 9 histone modification features, 4 transcription factor features and 1 chromatin accessibility feature. The features were divided into overlapping partitions, each of which trains a separate incremental learner in the ensemble classifier.
The model was tested on five different test data sets and compared with the state-of-the-art methods TargetFinder (Whalen et al., 2016) and Ripple (Roy et al., 2015). The model performed well on all five test data sets and outperformed TargetFinder and Ripple in predicting cell-specific EPIs.
Talukder, A., Saadat, S., Li, X., & Hu, H. (2019). EPIP: a novel approach for condition-specific enhancer–promoter interaction prediction. Bioinformatics, 35(20), 3877-3883.
It is still challenging to predict interacting enhancer-promoter pairs (IEPs), partially because of our limited understanding of their characteristics. To understand IEPs better, here we studied the IEPs in nine cell lines and nine primary cell types.
By measuring the bipartite clustering coefficient of the graphs constructed from these experimentally supported IEPs, we observed that one enhancer is likely to interact with either none or all of the target genes of another enhancer. This observation implies that enhancers form clusters, and every enhancer in the same cluster synchronously interacts with almost every member of a set of genes and only this set of genes. We observed that an enhancer can be up to two megabase pairs away from other enhancers in the same cluster. We also noticed that although a fraction of these clusters of enhancers do overlap with super-enhancers, the majority of the enhancer clusters are different from the known super-enhancers.
Our study showed a new characteristic of IEPs, which may shed new light on distal gene regulation and the identification of IEPs.
We used chromatin contact data from five labs, multiple enhancer sources and active promoter regions to confirm an interesting phenomenon: enhancers work in clusters to form EPIs. From the collected data, we generated a total of ten EPI data sets. We calculated the Bipartite Clustering Coefficient (BCC) of the enhancer-promoter interaction networks extracted from the ten data sets and used it to verify the hypothesis.
The BCC values of the enhancers were close to 1 in all ten EPI data sets. This means that enhancers tend not to share promoters partially. Instead, a set of enhancers shares a set of promoters, and every enhancer in that set interacts with every promoter in it. This indicates that enhancers take part in EPIs by forming separate clusters. We also found that these clusters are cell specific.
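The BCC computation can be illustrated with a minimal sketch. It assumes one common definition of the bipartite clustering coefficient (the average Jaccard overlap between an enhancer's promoter set and the promoter sets of every other enhancer that shares at least one promoter with it), which may differ in detail from the formula used in the study. The function name `enhancer_bcc` and the toy interaction list are hypothetical.

```python
from collections import defaultdict


def enhancer_bcc(epi_pairs):
    """Return {enhancer: BCC} for an iterable of (enhancer, promoter) pairs.

    Assumed definition: mean Jaccard overlap of promoter sets between an
    enhancer and each other enhancer sharing at least one promoter with it.
    """
    targets = defaultdict(set)     # enhancer -> promoters it contacts
    regulators = defaultdict(set)  # promoter -> enhancers contacting it
    for enhancer, promoter in epi_pairs:
        targets[enhancer].add(promoter)
        regulators[promoter].add(enhancer)

    bcc = {}
    for enhancer, promoters in targets.items():
        # Other enhancers that share at least one promoter with this one.
        neighbours = set()
        for promoter in promoters:
            neighbours |= regulators[promoter]
        neighbours.discard(enhancer)
        if not neighbours:
            continue  # coefficient is undefined without any neighbour
        overlaps = [
            len(promoters & targets[other]) / len(promoters | targets[other])
            for other in neighbours
        ]
        bcc[enhancer] = sum(overlaps) / len(overlaps)
    return bcc


# Toy network (hypothetical): e1 and e2 share exactly the same promoters,
# e3 and e4 share only one of e3's two promoters, e5 has no neighbour.
example_pairs = [
    ("e1", "p1"), ("e1", "p2"),
    ("e2", "p1"), ("e2", "p2"),
    ("e3", "p3"), ("e3", "p4"),
    ("e4", "p4"),
    ("e5", "p5"),
]
bcc = enhancer_bcc(example_pairs)
print(bcc)
```

Under this definition, a value of 1 means that an enhancer and each of its neighbours contact exactly the same promoters, which is the all-or-none sharing pattern described above; partial sharing pulls the value below 1.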
Talukder, A., Hu, H., & Li, X. (2021). An intriguing characteristic of enhancer-promoter interactions. BMC genomics, 22(1), 1-13.
Pairs of interacting transcription factors (TFs) have previously been shown to bind to enhancers and promoters and contribute to their physical interactions. However, to date, we have limited knowledge about such TF pairs. To fill this void, we systematically studied the co-occurrence of TF-binding motifs in interacting enhancer–promoter (EP) pairs in seven human cell lines. We discovered 423 motif pairs that significantly co-occur in enhancers and promoters of interacting EP pairs. We demonstrated that these motif pairs are biologically meaningful and significantly enriched with motif pairs of known interacting TF pairs. We also showed that the identified motif pairs facilitated the discovery of the interacting EP pairs. The developed pipeline, EPmotifPair, together with the predicted motifs and motif pairs, is available at https://doi.org/10.6084/m9.figshare.14192000. Our study provides a comprehensive list of motif pairs that may contribute to EP physical interactions, which facilitates generating meaningful hypotheses for experimental validation.
We annotated enhancer-promoter (EP) pairs in seven cell lines (GM12878, HMEC, HUVEC, IMR90, K562, KBM7 and NHEK) using the FANTOM-annotated enhancer regions with active markers, the active promoter regions of GENCODE-annotated gene transcripts and chromatin contacts from Rao et al., 2014. We obtained 649 non-redundant TF motifs by filtering the known motifs in the JASPAR and CIS-BP databases (Khan et al., 2018; Weirauch et al., 2014). TF motif modules were predicted from the concatenated sequences of the EP pairs and mapped to the 649 non-redundant TF motifs. The mapped motif pairs with one motif in the enhancer and the other in the promoter region were kept. The predicted motif pairs were analyzed for homogeneity and compared with the known interacting TF motifs from the BioGRID database (Stark et al., 2006). The information from the predicted motif pairs was used to train a lasso machine learning model to distinguish EPIs from non-interacting EP pairs.
We predicted hundreds of TF motif pairs with one motif located in the enhancer and the other in the promoter region. These motif pairs were highly significant and mostly shared across different cell lines. The predicted motif pairs were also enriched with known interacting TF pairs in almost every cell line. We also found the predicted motif pairs useful for training the lasso machine learning model to efficiently identify EPIs. Based on the weights of the lasso model, we selected the 147 non-redundant motif pairs that were most significant for the model to classify EPIs. We identified known TF pairs associated with 72 of the 147 motif pairs. Of these 72 motif pairs, 64 could be associated with interacting TF pairs.
Wang, S., Hu, H., & Li, X. (2022). A systematic study of motif pairs that may facilitate enhancer–promoter interactions. Journal of Integrative Bioinformatics, 19(1).
Human cells require certain amounts of oxygen to function and have adaptive responses to a low-oxygen (hypoxic) environment. Cancer cells replicate rapidly, at a rate that depletes their oxygen supply and forces them to rely on these responses to survive. These responses rely on the activation of the HIF1A transcription factor (TF). Consequently, HIF1A-targeted treatments have been proposed as a way to cure cancers. One approach to targeting HIF1A is through its protein-protein interactions.
In this study, we systematically investigated potential TF interactions of HIF1A using computational methods. We identified 201 potential HIF1A TF cofactors, many of which were supported by existing literature. These cofactors were conserved across multiple cell lines and crucial to regulating HIF1A-regulated pathways.
ChIP-seq data from cell lines representing the most common cancer types were collected, run through Trimmomatic to trim adaptor sequences and filter low-quality reads, and mapped to the human genome. The ChIP-seq peaks of the HIF1A binding regions were defined using MACS2.
The SIOMICS tool was applied to the sequences to identify motifs, which were then compared to known motifs in the JASPAR database. A predicted motif was considered similar to a known motif if the STAMP comparison E-value between them was less than 1.0E-05; the TF corresponding to that known motif was then considered to play a regulatory role. We obtained 201 HIF1A cofactors this way.
We also compared predicted motifs and predicted motif pairs across the eight cell lines. Predicted motifs were considered similar if their STAMP E-values were less than 1.0E-8, and predicted motif pairs were considered similar if their corresponding motifs were similar. The corresponding predicted TF pairs were compared to known TF-TF interactions in BioGRID. Finally, we studied the predicted HIF1A cofactors by comparing them with known HIF1A-interacting TFs in BioGRID, HPRD, and BIND.
We used hypergeometric testing to test the enrichment of known interacting TF pairs in the predicted ones. We also analyzed the binding specificity of the cofactors and their common target genes using annotatePeaks. We assigned every ChIP-seq peak to a gene according to the peak's distance to the nearest transcription start site.
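The enrichment test can be sketched as an upper-tail hypergeometric probability. This is a minimal illustration, not the study's code: the function name, its parameters and the numbers in the example are hypothetical, and the choice of background population (all candidate TF pairs) is an assumption.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, known, predicted, overlap):
    """Return P(X >= overlap) under the hypergeometric distribution.

    population: all candidate TF pairs (assumed background)
    known:      candidate pairs that are known interacting pairs
    predicted:  number of predicted pairs (draws without replacement)
    overlap:    predicted pairs that are also known interacting pairs
    """
    upper = min(known, predicted)
    total = comb(population, predicted)
    tail = sum(
        comb(known, k) * comb(population - known, predicted - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 candidate pairs, 4 of them known,
# 3 predicted pairs of which 2 are known.
p_value = hypergeom_enrichment_pvalue(population=10, known=4, predicted=3, overlap=2)
print(round(p_value, 4))  # 0.3333
```

A small p-value indicates that the predicted pairs contain more known interacting pairs than expected from random draws out of the same background.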
Approximately 65.6% of predicted motifs across all cell lines were similar to known motifs at a STAMP E-value cutoff of 1.0E-5, and this fraction increased to 83.8% when the cutoff was raised to 1.0E-4. Approximately 88.9% of the predicted motifs in each cell line were also predicted in other cell lines. These shared motifs were not due to overlapping ChIP-seq peaks.
We collected 29 known curated cofactors of HIF1A that were TFs with known motifs and found that 21 of them were in the list of predicted cofactors. The missing cofactors either were predicted but excluded because they did not meet the cutoff or were paralogous to other predicted cofactors.
Zhang, Y., Wang, S., Hu, H. et al. (2022) A systematic study of HIF1A cofactors in hypoxic cancer cells. Sci Rep 12, 18962.
Small proteins (SPs) are necessary for many cellular functions, but there are not many tools for identifying small proteins in prokaryotes. We introduced PSPI, a deep learning-based approach to identifying prokaryotic SPs. It performed better than several existing tools at identifying both prokaryotic and eukaryotic SPs. We also found evidence to suggest many SPs may contain short linear motifs.
We collected prokaryotic and eukaryotic SPs at most 100 amino acids long from various sources to serve as positive data. We created negative data by converting each positive sequence into a sequence of codons, permuting that sequence, and converting it back into amino acids. We also made negatives by concatenating microRNA sequences and then randomly partitioning the concatenated sequence.
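The first negative-data strategy can be sketched as follows. Several details are assumptions rather than facts from the study: each amino acid is back-translated to one arbitrary codon from the standard genetic code, the permutation is applied at the nucleotide level (the text does not say whether codons or bases are permuted), and stop codons created by the shuffle are simply dropped. The function name `shuffled_negative` is hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Assumption: use the first codon listed for each amino acid.
AA_TO_CODON = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODON.setdefault(aa, codon)


def shuffled_negative(protein, rng):
    """Back-translate, shuffle the nucleotides, and translate again."""
    nucleotides = [base for aa in protein for base in AA_TO_CODON[aa]]
    rng.shuffle(nucleotides)
    codons = (
        "".join(nucleotides[i:i + 3]) for i in range(0, len(nucleotides), 3)
    )
    # Assumption: stop codons produced by the shuffle are skipped.
    return "".join(CODON_TO_AA[c] for c in codons if CODON_TO_AA[c] != "*")


positive = "MKTAYIAKQR"  # hypothetical small-protein fragment
negative = shuffled_negative(positive, random.Random(0))
print(negative)
```

Shuffling at the nucleotide level keeps the base composition of the positive sequence while destroying its amino acid order; a codon-level permutation would instead preserve the amino acid composition exactly.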
PSPI uses an LSTM-based architecture. Two versions of the model were made. The first takes a sequence of amino acids and converts it into a series of 1-of-20 vectors. The second also includes counts of (n,k)-mers, which are gapped k-mers that span at most n amino acids.
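Both input encodings can be illustrated with a short sketch. The 1-of-20 encoding is standard; the (n,k)-mer counting reflects one possible reading of the description (k residues kept from a window of at most n consecutive residues, with skipped positions written as `.` and the first and last window positions always kept), which is an assumption and may not match PSPI's exact definition. The function names are hypothetical, and the counting function assumes k ≥ 2.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(protein):
    """Encode each residue as a 1-of-20 vector."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS] for aa in protein]


def gapped_kmer_counts(protein, n, k):
    """Count gapped k-mers spanning at most n residues (requires k >= 2)."""
    counts = Counter()
    for start in range(len(protein)):
        for span in range(k, n + 1):
            end = start + span
            if end > len(protein):
                break
            # Keep the first and last residue of the window and choose
            # the remaining k - 2 kept positions from the interior.
            interior = range(start + 1, end - 1)
            for middle in combinations(interior, k - 2):
                kept = {start, end - 1, *middle}
                pattern = "".join(
                    protein[i] if i in kept else "." for i in range(start, end)
                )
                counts[pattern] += 1
    return counts


encoded = one_hot("MK")
kmer_counts = gapped_kmer_counts("MKTA", n=4, k=2)
print(sorted(kmer_counts))  # ['K.A', 'KT', 'M..A', 'M.T', 'MK', 'TA']
```

Anchoring each pattern at the first and last position of its window avoids counting the same gapped k-mer more than once for a given start position.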
PSPI showed an excellent AUROC and AUPR when classifying prokaryotic SPs. When trained using eukaryotic SPs, PSPI was excellent at identifying eukaryotic SPs as well. It performed better than csORF-finder (Zhang et al., 2022), MiPepid (Zhu and Gribskov, 2019), and DeepCPP (Zhang et al., 2021). PSPI's performance improved significantly when (4,2)-mers were included as input features.
Weston M, Hu H and Li X (2024), PSPI: A deep learning approach for prokaryotic small protein identification. Front. Genet. 15:1439423