Enhancers are distal regulatory regions that come into contact with the promoter of their target gene during transcription because of the folded chromatin structure. Enhancers range from several dozen to a couple of thousand base pairs (bps) in length. They can be located far upstream or far downstream of their target genes (Pennacchio et al., 2013). Although the longest distance between an enhancer and its target validated by low-throughput experiments is about one mega bps (Mbps) (Furlong et al., 2018; Lettice et al., 2002), recent high-throughput experiments showed that the distance can be larger than two Mbps in many cases (Javierre et al., 2016; Rao et al., 2014).
Promoters are regions located upstream of genes. Before gene transcription, the enhancer region contacts the promoter, and several protein factors such as RNA polymerase bind in the promoter region. The promoter region ranges in size from around 100 base pairs to several kilobase pairs (Sharan, 2007).
A promoter region can be identified relative to the location of the transcription start site (TSS) of a gene. The proximal promoter typically starts 1 kilobase pair upstream of the TSS and ends 100 base pairs downstream of it. Proximal promoter regions often contain CpG islands.
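The window described above can be written down as a small coordinate calculation. The following is a minimal sketch, not a procedure taken from this work: the function name `proximal_promoter`, the 1-based inclusive coordinate convention, and the clamping at position 1 are all assumptions made for illustration. The default offsets (1000 bp upstream, 100 bp downstream) follow the typical window stated in the text.

```python
def proximal_promoter(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of the proximal promoter around a TSS.

    Assumes 1-based, inclusive chromosome coordinates (an illustrative
    convention). On the '-' strand, "upstream" of the gene corresponds
    to higher chromosome coordinates.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp so the window never extends before the first base.
    return max(start, 1), end


if __name__ == "__main__":
    print(proximal_promoter(5000, "+"))  # (4000, 5100)
    print(proximal_promoter(5000, "-"))  # (4900, 6000)
```

Handling the strand explicitly matters because the same offsets produce mirrored windows for genes on opposite strands.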
Because of the folded structure of chromatin, accessible distal regions may come into contact with each other. Enhancers and promoters are two such regions. An enhancer-promoter interaction (EPI), together with several transcription factors and the RNA polymerase enzyme, initiates the gene transcription process.
Because the distance between an enhancer and its target promoter varies widely and can exceed two million base pairs (Javierre et al., 2016; Rao et al., 2014), it is still challenging to identify interacting enhancer-promoter pairs. Moreover, a study demonstrated that only 40% of enhancers regulate their nearest promoters and that one enhancer may regulate multiple genes (Andersson et al., 2014), which makes the identification of EPIs even harder.
Several computational approaches have been developed based on the correlation of epigenomic signals in enhancers with those in promoters (Andersson et al., 2014; Corradin et al., 2014; Ernst et al., 2011; Thurman et al., 2012). One challenge in using these methods is finding a proper correlation threshold that reduces false EPI predictions (Roy et al., 2015; Whalen et al., 2016). More recently, supervised learning-based methods have been developed, such as IM-PET (He et al., 2014), PETModule (Zhao et al., 2016), Ripple (Roy et al., 2015) and TargetFinder (Whalen et al., 2016). These methods commonly use genomic and epigenomic data, such as those from DNase I hypersensitive sites sequencing (DNase-seq) and histone modification-based chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq), to extract features for EPI predictions. IM-PET, Ripple and PETModule use random forests as their classifiers, while TargetFinder is based on boosted trees. These methods either do not consider condition-specific EPI predictions or perform poorly on them (Roy et al., 2015).
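To make the threshold problem of correlation-based methods concrete, the sketch below links an enhancer to a promoter whenever the Pearson correlation of their signals across cell types reaches a cutoff. It is an illustrative simplification, not the implementation of any cited method: the function names, the signal values, and the cutoffs of 0.8 and 0.3 are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / sqrt(vx * vy)


def link_by_correlation(enhancer_signal, promoter_signal, threshold):
    """Return (enhancer, promoter, r) for every pair with r >= threshold."""
    links = []
    for e_id, e_sig in enhancer_signal.items():
        for p_id, p_sig in promoter_signal.items():
            r = pearson(e_sig, p_sig)
            if r >= threshold:
                links.append((e_id, p_id, r))
    return links


if __name__ == "__main__":
    # Hypothetical signal levels measured in four cell types.
    enhancers = {"E1": [1.0, 2.0, 3.0, 4.0], "E2": [4.0, 1.0, 3.0, 2.0]}
    promoters = {"P1": [2.0, 4.0, 6.0, 8.0], "P2": [4.0, 3.0, 2.0, 1.0]}
    for cutoff in (0.8, 0.3):
        print(cutoff, link_by_correlation(enhancers, promoters, cutoff))
```

In this toy data the stricter cutoff keeps only the E1-P1 pair, while the looser one also admits E2-P2, which shows how the predicted set of EPIs depends directly on the chosen threshold.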
We developed a supervised ensemble machine learning tool that efficiently predicts cell-specific enhancer-promoter interactions and can handle missing features.
By analyzing enhancer-promoter interactions derived from chromatin interaction data from five different sources, we confirmed a property of cell-specific enhancers that can help detect enhancer clusters.
Pairs of interacting transcription factors (TFs) bind to enhancers and promoters and contribute to their physical interactions. We systematically studied the co-occurrence of TF-binding motifs in interacting cell-specific enhancer-promoter pairs.
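One simple way to tabulate motif co-occurrence is to count, for each (enhancer motif, promoter motif) combination, the fraction of interacting pairs in which both are present. The sketch below illustrates that idea only; it is not the analysis procedure used in this work, and the function names and the TF labels (`TF_A` to `TF_D`) are hypothetical placeholders.

```python
from collections import Counter
from itertools import product


def count_motif_pairs(pairs):
    """Count co-occurring (enhancer motif, promoter motif) combinations.

    `pairs` is an iterable of (enhancer_motifs, promoter_motifs), where each
    element is a collection of motif names found in that region.
    """
    counts = Counter()
    for enh_motifs, prom_motifs in pairs:
        # Use sets so a motif repeated within one region is counted once.
        for e, p in product(set(enh_motifs), set(prom_motifs)):
            counts[(e, p)] += 1
    return counts


def cooccurrence_fraction(pairs):
    """Fraction of enhancer-promoter pairs containing each motif combination."""
    pairs = list(pairs)
    if not pairs:
        return {}
    counts = count_motif_pairs(pairs)
    return {combo: c / len(pairs) for combo, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical motif content of three interacting enhancer-promoter pairs.
    interacting = [
        ({"TF_A", "TF_B"}, {"TF_C"}),
        ({"TF_A"}, {"TF_C", "TF_D"}),
        ({"TF_B"}, {"TF_D"}),
    ]
    for combo, frac in sorted(cooccurrence_fraction(interacting).items()):
        print(combo, round(frac, 2))
```

Computing the same fractions on a set of non-interacting pairs would give a background against which enriched motif combinations could be identified.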