Thrust 2: Computational Genomics and Precision Health
For example, we have used supervised learning to decipher enhancers in the genome. As the name suggests, these are regions of the genome that enhance gene expression. We are also interested in mapping other parts of the regulatory genome using unsupervised methods, such as clustering the genome into different functional regions. These could include enhancers, silencers, insulators, and promoters, which often have complex signatures. Because of the three-dimensional nature of the genome, regulatory regions such as enhancers may be situated thousands of base pairs away from the genes that they regulate. In addition, there may be many-to-many associations between these regulators and the genes that they regulate. With the overall goal of mapping the regulatory genome, these are some of our current data science projects in genomics.
- Mapping genomic enhancers at scale using variants of deep neural networks (DNNs) and speeding them up using accelerators such as Graphics Processing Units (GPUs), as in this paper.
- Clustering the genome into different functional domains and then labeling and visualizing these clusters. This extends our work in which we used a DNN to identify enhancers and then worked to make the model interpretable, as in this paper.
- Mapping the interactions of epigenomic regulators of the genome for transcriptional regulation. We have already mapped the non-canonical interactions of microRNA (short RNA strings), such as here and here, and we are extending our work to map out the entire interactome space of gene expression regulation. Another application of this work is to increase the specificity of genome editors. Read our article on the CRISPR-Cas genome editing technology (in collaboration with University of Washington and Johns Hopkins Bioengineering) here. View visualization here.
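To make the clustering idea in the list above concrete, here is a minimal sketch of unsupervised grouping of genomic bins by their signal profiles. It is a toy k-means written in plain Python, not the method used in any of our papers. The bin features (`mark_a_signal`, `mark_b_signal`), their values, and the choice of two clusters are all hypothetical and chosen only for illustration.

```python
# Illustrative only: toy k-means over per-bin signal vectors.
# The bins, feature names, and values below are made up for this example.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, iterations=20):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    # Deterministic start (an assumption for reproducibility):
    # use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical features per genomic bin: (mark_a_signal, mark_b_signal)
bins = [
    (0.90, 0.10), (0.10, 0.90), (0.80, 0.20),
    (0.20, 0.80), (0.85, 0.15), (0.15, 0.95),
]
labels, centroids = kmeans(bins, k=2)
print(labels)
```

In practice the clusters would still need to be labeled (for example as enhancer-like or promoter-like) using known annotations, which is the labeling and visualization step mentioned above.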
Overall, we believe that “Life is Computation”. With that vision in mind, we have also applied natural language processing (NLP) to the language of the genome, specifically using NLP for error correction of genomic reads, here.
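As a rough illustration of why sequencing reads can be treated like text, the sketch below splits reads into overlapping k-mers (the "words") and flags k-mers that occur rarely, which is the general intuition behind k-mer based error correction. This is a generic toy example, not the approach taken in our publications. The reads, the value of `k`, and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_kmers(counts, min_count=2):
    """Return k-mers seen fewer than min_count times, sorted alphabetically.

    Rare k-mers are candidates for sequencing errors; the threshold here
    is a hypothetical value, not a recommended setting.
    """
    return sorted(kmer for kmer, n in counts.items() if n < min_count)


# Toy reads: the third read differs from the others at one position.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
rare = suspicious_kmers(counts)
print(rare)
```

The choice of `k` strongly affects which k-mers look rare, which is one reason tuning it automatically is useful.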
Representative Publications
- KRATOS: Context-Aware Cell Type Classification and Interpretation Using Joint Dimensionality Reduction and Clustering
- Lerna: Transformer Architectures for Configuring Error Correction Tools for Short- and Long-Read Genome Sequencing
- Simultaneous learning of individual microRNA-gene interactions and regulatory comodules
- Combinatorial screening of biochemical and physical signals for phenotypic regulation of stem cell-based cartilage tissue engineering
- Athena: Automated tuning of k-mer based Genomic error correction Algorithms using Language Models
- Aikyatan: mapping distal regulatory elements using convolutional learning on GPU
- Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers
- Sarvavid: a domain specific language for developing scalable computational genomics applications
Older Publications