Deep learning for predicting catalytic and small RNA in the genome

Background: The central dogma of biology holds that DNA is transcribed into RNA, which is translated into proteins, which carry out functions throughout the cell. However, as time passes, we are discovering more and more exceptions to this dogma. One such exception is small non-coding RNAs (sRNAs): these short fragments of RNA don’t get translated into proteins; instead, they fold into small structures and carry out many key functions, some of them catalytic, in the bacterial cell. sRNAs are uniquely versatile: they can interact with both protein and nucleic acid targets, they are responsible for bacterial responses to environmental stimuli, and they can serve as virulence mechanisms.

While the arrival of AlphaFold greatly advanced our ability to characterize proteins using neural networks and natural language processing methods, sRNA discovery has yet to benefit from these advances. This is in part because short RNAs pose a similar but more challenging version of the protein folding problem: their encoding sequences can be up to 100 times shorter, they are more variable in structure, and the genomic regions that encode them do not contain as many common features. In the few million base pairs of DNA that make up a bacterial genome, finding new sRNAs using current methods is akin to searching for a needle in a haystack. As a result, the majority of small RNAs are hypothesized to be unknown, playing the role of biological ‘dark matter’ in the cell.

Project: In this project, we will apply recent advances in Deep Learning and Natural Language Processing (NLP) to develop a novel method for identifying and characterizing small RNAs. Our goals are to 1) design and test neural network architectures and tools that have recently succeeded in DNA representation, including autoencoders, convolutional neural networks, DNA-BERT, and possibly non-Euclidean embeddings, 2) interrogate well-performing models to identify and extract features of bacterial sRNAs, and 3) design robust test cases that enable the comparison of these novel methods to existing methods for sRNA identification in bacterial genomes.
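
For candidates wondering what "DNA representation" might look like in practice, the following is a minimal illustrative sketch, not the lab's actual pipeline. It shows two common ways of turning a DNA string into model input: one-hot encoding (often fed to convolutional networks) and overlapping k-mer tokens (the style of input used by k-mer-based language models such as DNA-BERT). The function names, the choice of k, and the window sizes are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

BASES = "ACGT"  # column order of the one-hot encoding (an illustrative choice)


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as rows of 4 values, one column per base in BASES.

    Any symbol outside A/C/G/T (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a DNA string into overlapping k-mers with a stride of 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def windows(seq: str, size: int, stride: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for fixed-size windows slid along a sequence.

    Hypothetical helper: scanning a genome window by window is one simple way
    to produce candidate regions for a classifier to score.
    """
    for start in range(0, len(seq) - size + 1, stride):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    toy_genome = "ACGTACGGTA"  # made-up sequence, not real data
    for start, window in windows(toy_genome, size=6, stride=2):
        print(start, window, kmer_tokens(window, k=3))
```

Either representation could then be passed to a PyTorch model; which one works best for sRNA identification is among the open questions the project sets out to explore.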

Selected candidate(s) can receive a stipend directly from the faculty advisor. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

Professor: Tal Korem
Center/Lab: Systems Biology
Location: PH18-200
We develop data analysis methods for multi-omic microbiome data. We focus on integrating clinical, microbiome, lifestyle and environmental data in a way that advances from statistical associations to actionable insights that can be used in clinical practice.

Project Timeline

Earliest starting date: 10/15/2022
End date:
Number of hours per week of research expected during Fall 2022: ~12

Candidate requirements

Skill sets: Students should be familiar with the Unix environment, Python, and PyTorch or a similar machine learning library. In addition, students should have taken at least one advanced class in machine learning or have equivalent knowledge.
Student eligibility: ~~freshman~~, ~~sophomore~~, junior, senior, master’s
International students on F1 or J1 visa: eligible
Academic Credit Possible: Yes

Columbia Data Science Institute (DSI) Scholars Program