Rapid, targeted, massive-scale genome reconstruction

Background: Genomes are inextricably tied to life as we know it, encoding all the molecular information used by organisms. Next-generation DNA sequencing has made it possible to read genomes at scale from organisms that inhabit complex environments, rather than only from organisms typically studied in the lab. Alongside this, algorithmic development is beginning to reveal the complex biology of genomes.

Project: We have developed an algorithm that uses sequencing data to reconstruct the surrounding genomic content of an input DNA sequence. Already, this algorithm has found multiple applications in our research, from inferring the global structure of genomes to improving protein sequence reconstruction for deep learning models. We are currently employing it to study the relationship between genetic variation in the human gut microbiome and human health. Due to the algorithm’s generality and value, we intend to release it as a tool for the research community. Our current implementation, however, has limited scalability in both space and time, stemming from sub-optimal data structures and algorithmic subroutines, as well as from being implemented in Python. The goal of this project is to design and implement a highly scalable version of this algorithm. We anticipate that a successful project will require addressing both of the above limitations: redesigning the algorithm to improve space and time complexity, and producing an efficient implementation in a compiled language such as C, C++, or Rust. Developing an improved, robust implementation will accelerate many downstream applications that rely upon this algorithm, open the door to new applications, and lead to an important publication in its own right. This is an ideal project for a student who is interested in efficient and scalable algorithm design and in genome sciences, and who is looking for a challenging but rewarding experience that teaches valuable skills. This project is a great stepping-stone for further research in this area, particularly downstream applications in our lab.

Selected candidate(s) can receive a stipend directly from the faculty advisor. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

Professor: Tal Korem
Center/Lab: Systems Biology
Location: PH18-200
We develop data analysis methods for multi-omic microbiome data. We focus on integrating clinical, microbiome, lifestyle and environmental data in a way that advances from statistical associations to actionable insights that can be used in clinical practice.

Project Timeline

Earliest starting date: 10/15/2022
End date:
Number of hours per week of research expected during Fall 2022: ~12

Candidate requirements

Skill sets: Students should have familiarity with the Unix environment. Experience writing efficient code in a compiled language, such as C, C++, or Rust, is desirable. However, individuals with limited experience but a strong motivation to learn these skills are encouraged to apply; you will be well supported.
Student eligibility: ~~freshman~~, ~~sophomore~~, junior, senior, master’s
International students on F1 or J1 visa: eligible
Academic Credit Possible: Yes

Columbia Data Science Institute (DSI) Scholars Program