Enhancers are distal regulatory regions that come into contact with the promoter region of their target gene during transcription because of the folded chromatin structure. Enhancers can be several dozen to a couple of thousand base pairs (bps) long. They can be located far upstream or downstream of their target genes (Pennacchio et al., 2013). Although the longest distance between enhancers and their targets validated by low-throughput experiments is about one mega base pairs (Mbps) (Furlong et al., 2018; Lettice et al., 2002), recent high-throughput experiments showed that the distance can be larger than two Mbps in many cases (Javierre et al., 2016; Rao et al., 2014).
Promoters are regions upstream of genes. Enhancers and several other factors, such as RNA polymerase, come together at the promoter region before gene transcription. The size of a promoter region ranges from around 100 base pairs to several kilobase pairs (Sharan, 2007).
A promoter region can be identified relative to the location of the transcription start site (TSS) of a gene. The proximal promoter typically starts 1 kilobase pair upstream of the TSS and ends 100 base pairs downstream of it. Proximal promoter regions often contain CpG islands.
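The window described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any published tool: the function name `proximal_promoter`, its parameters, and the choice to clamp the start at zero are assumptions made for the example, and "upstream" is interpreted relative to the gene's strand.

```python
from typing import Tuple


def proximal_promoter(tss: int, strand: str,
                      upstream: int = 1000,
                      downstream: int = 100) -> Tuple[int, int]:
    """Return (start, end) of a promoter window around a TSS.

    Hypothetical helper: defaults follow the 1 kb upstream / 100 bp
    downstream convention mentioned in the text. Coordinates are plain
    genomic positions with start <= end.
    """
    if strand == "+":
        # Upstream means smaller coordinates on the forward strand.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Upstream means larger coordinates on the reverse strand.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the chromosome start (an assumption of this sketch).
    return max(start, 0), end
```

For a gene on the reverse strand the window is mirrored, so the longer arm lies at higher coordinates than the TSS.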
Because of the folded structure of chromatin, accessible distal regions may come into contact with each other. Enhancers and promoters are two such regions. An enhancer-promoter interaction (EPI), together with several transcription factors and the RNA polymerase enzyme, initiates the gene transcription process.
Because an enhancer can lie anywhere from close to its target promoter to more than two Mbps away, it is still challenging to identify interacting enhancer-promoter pairs. Moreover, a study demonstrated that only 40% of enhancers regulate their nearest promoters and that one enhancer may regulate multiple genes (Andersson et al., 2014), making the identification of EPIs even harder.
Several computational approaches have been developed based on the correlation of epigenomic signals in enhancers and those in promoters (Andersson et al., 2014; Corradin et al., 2014; Ernst et al., 2011; Thurman et al., 2012). One challenge of using these methods is finding a proper correlation threshold to reduce false EPI predictions (Roy et al., 2015; Whalen et al., 2016). Recently, supervised learning-based methods have been developed, such as IM-PET (He et al., 2014), PETModule (Zhao et al., 2016), Ripple (Roy et al., 2015) and TargetFinder (Whalen et al., 2016). These methods commonly use genomic and epigenomic data, such as those from DNase I hypersensitive sites sequencing (DNase-seq) and histone modification-based chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq), to extract features for EPI predictions. IM-PET, Ripple and PETModule use random forests as their classifier, while TargetFinder is based on boosted trees. These methods either do not consider condition-specific EPI predictions or perform poorly on them (Roy et al., 2015).
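To make the threshold problem of the correlation-based approaches concrete, the following minimal sketch pairs enhancers with promoters whose signal profiles across cell types are correlated above a cutoff. It does not reproduce any of the cited methods: the function names, the use of Pearson correlation, the default threshold of 0.8 and the toy signal values are all assumptions for illustration, and a realistic pipeline would also restrict candidate pairs by genomic distance.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long signal vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / sqrt(vx * vy)


def correlated_pairs(enhancers: Dict[str, Sequence[float]],
                     promoters: Dict[str, Sequence[float]],
                     threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return (enhancer, promoter, r) for every pair with r >= threshold."""
    hits = []
    for e_name, e_signal in enhancers.items():
        for p_name, p_signal in promoters.items():
            r = pearson(e_signal, p_signal)
            if r >= threshold:
                hits.append((e_name, p_name, r))
    return hits


# Toy signals across four hypothetical cell types.
enh = {"e1": [1.0, 2.0, 3.0, 4.0]}
prom = {"p1": [2.0, 4.0, 6.0, 8.0], "p2": [4.0, 3.0, 2.0, 1.0]}
predicted = correlated_pairs(enh, prom)
```

Lowering the threshold admits more pairs and more false positives, and raising it discards true interactions, which is the trade-off noted in the paragraph above.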
We developed a supervised ensemble machine learning tool that can efficiently predict cell-specific enhancer-promoter interactions and can handle missing features.
After analyzing the enhancer-promoter interactions derived from chromatin interaction data from five different sources, we confirmed a property of cell-specific enhancers that can help to detect enhancer clusters.
Pairs of interacting transcription factors (TFs) bind to enhancers and promoters and contribute to their physical interactions. We systematically studied the co-occurrence of TF-binding motifs in interacting cell-specific enhancer-promoter pairs.
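One common way to ask whether two motifs co-occur in interacting pairs more often than expected is a one-sided hypergeometric test. The sketch below is only an illustration of that general idea, not the procedure used in the study: the function name and its arguments are hypothetical, and it assumes that each interacting pair is simply scored for the presence of motif A in the enhancer and motif B in the promoter.

```python
from math import comb


def cooccurrence_pvalue(n_pairs: int, n_a: int, n_b: int, n_both: int) -> float:
    """One-sided hypergeometric p-value for motif co-occurrence.

    n_pairs: total number of interacting enhancer-promoter pairs
    n_a:     pairs whose enhancer contains motif A
    n_b:     pairs whose promoter contains motif B
    n_both:  pairs with motif A in the enhancer AND motif B in the promoter

    Returns P(X >= n_both) when the n_b promoter hits are distributed at
    random among the n_pairs pairs.
    """
    upper = min(n_a, n_b)
    total = comb(n_pairs, n_b)
    tail = sum(comb(n_a, k) * comb(n_pairs - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / total
```

With 10 pairs, 5 enhancers carrying motif A and 5 promoters carrying motif B, observing all 5 together gives a p-value of 1/252, whereas observing no overlap gives 1. In practice the p-values for all tested motif pairs would need a multiple-testing correction.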
HIF1A is a TF that forms structural and functional protein-protein interactions with other TFs to promote gene expression in hypoxic cancer cells. Here we systematically studied HIF1A cofactors in eight cancer cell lines and discovered 201 potential HIF1A cofactors, which included 21 of the 29 known HIF1A cofactors in public databases. These 201 cofactors were statistically and biologically significant, with 19 of the top 37 cofactors in our study directly validated in the literature. The remaining 18 were novel cofactors supported by the literature. These discovered cofactors can be essential to HIF1A's regulatory functions and may lead to the discovery of new therapeutic targets in cancer treatment.
Small proteins (SPs) are pivotal in various cellular functions such as immunity, defense, and communication. Despite their significance, existing computational tools still have suboptimal performance in SP identification. To fill this gap, we introduce PSPI, a deep learning-based approach designed specifically for predicting SPs. We showed that PSPI had a high accuracy in predicting general SPs. Compared with three existing tools, PSPI was faster and showed greater precision, sensitivity, and specificity. The PSPI tool, which is freely available at https://www.cs.ucf.edu/~xiaoman/tools/PSPI/, will be useful for identifying and studying SPs.
We systematically analyzed the recurrence of EPIs across 49 Hi-C and 95 HiChIP datasets. We found that the majority of EPIs identified in a given sample were also present in other samples, regardless of the assay type (Hi-C or HiChIP) or the enhancer annotations used. Interestingly, EPIs that appeared unique to individual samples were typically surrounded by fewer neighboring EPIs, suggesting they may not represent truly sample-specific interactions. Our findings indicate that most human EPIs have already been captured and that cells primarily reuse subsets of these shared EPIs across different cell types and conditions.
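The notion of recurrence can be illustrated with a small sketch that computes, for each sample, the fraction of its EPIs that also appear in at least one other sample. This is a simplified illustration rather than the analysis itself: the function name is hypothetical, and it assumes that EPIs have already been reduced to comparable (enhancer, promoter) identifiers, whereas real data would require matching interactions by overlapping genomic coordinates.

```python
from typing import Dict, Set, Tuple

EPI = Tuple[str, str]  # (enhancer id, promoter id), assumed pre-matched


def recurrence_fraction(samples: Dict[str, Set[EPI]]) -> Dict[str, float]:
    """Fraction of each sample's EPIs found in at least one other sample."""
    result = {}
    for name, epis in samples.items():
        # Pool the EPIs of every other sample.
        others = set().union(*(s for n, s in samples.items() if n != name))
        result[name] = len(epis & others) / len(epis) if epis else 0.0
    return result


# Toy example with three hypothetical samples.
toy = {
    "A": {("e1", "p1"), ("e2", "p2")},
    "B": {("e1", "p1")},
    "C": {("e3", "p3")},
}
fractions = recurrence_fraction(toy)
```

In this toy example half of the EPIs in sample A recur elsewhere, all of those in B do, and none of those in C do.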
scATAC-seq provides an unprecedented opportunity to study EPIs in mammals. However, the sequencing depth of current scATAC-seq data is often too low to cover active interacting EP pairs. Here we are developing a deep learning-based approach that considers the chromatin environments around EP pairs together with scRNA-seq data to predict EPIs. Tested on a small dataset, the method worked well. We are still actively testing it on large datasets.