Unsupervised clustering of the human gut microbiome: revealing the biases, optimizing the parameters

Background: The human gut microbiome is a heterogeneous community of bacterial species. Many human diseases are associated with changes in the microbiome, and understanding the interaction between gut bacteria and human health is therefore expected to revolutionize healthcare.

Next-generation sequencing is enabling researchers to probe the human-microbiome relationship by providing a readout of the microbiome’s genetic content. This technology, however, yields a mixture of DNA sequences derived from all the genomes in the sample. A critical step in analyzing these data is therefore to cluster the sequences back into the genomes from which they originated, a process termed “binning”. The biological conclusions drawn from a study depend heavily on the performance of this clustering step. It has recently been shown that current clustering algorithms are biased, failing to correctly bin genomic sequences that are shared between genomes. These sequences are of great interest and clinical relevance, as they often encode antibiotic resistance genes that are the root cause of resistance outbreaks. Consequently, current clustering algorithms systematically bias the conclusions drawn from our studies.

Project: To improve binning algorithms, it is essential to understand their biases. We are developing a comprehensive binning evaluator (ComBinE) that uses simulated data for evaluation. Using ComBinE, the first aim of this project is to comprehensively characterize the bias in binning algorithms as a direct function of sample properties such as diversity and the number of shared sequences. Once these biases are characterized, we will report them to the research community through scientific publication, increasing awareness of this important issue. The second aim is to develop a recommendation algorithm that, given a dataset and its properties, such as sample diversity, provides optimal parameter choices for binning. Because of the complexity of the problem, we anticipate that a machine learning-based recommender will be well suited to this aim, and we expect this project to explore such approaches.
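To give a rough sense of why simulated data helps here, the sketch below scores a binning result against a known ground truth, separately for sequences that come from one genome and sequences shared by several. It is not ComBinE: the function name `membership_recall`, the input layout, and the toy data are all hypothetical, and labeling each bin by majority vote is an assumed simplification.

```python
from collections import Counter, defaultdict


def membership_recall(truth, bins):
    """Recall of (contig, genome) memberships, split by unique vs shared contigs.

    truth: dict contig -> set of source genomes (known only because data is simulated)
    bins:  dict contig -> bin label; contigs missing from the dict are unbinned
    """
    # Label each bin with the genome that contributes most of its unique contigs.
    votes = defaultdict(Counter)
    for contig, bin_label in bins.items():
        genomes = truth[contig]
        if len(genomes) == 1:
            votes[bin_label].update(genomes)
    bin_genome = {b: c.most_common(1)[0][0] for b, c in votes.items()}

    found = Counter()
    total = Counter()
    for contig, genomes in truth.items():
        kind = "shared" if len(genomes) > 1 else "unique"
        # A shared contig counts once for every genome it truly belongs to.
        total[kind] += len(genomes)
        assigned = bin_genome.get(bins.get(contig))
        if assigned in genomes:
            found[kind] += 1
    return {kind: found[kind] / total[kind] for kind in total}


# Toy example (hypothetical): two genomes, one contig shared by both.
truth = {"c1": {"A"}, "c2": {"A"}, "c3": {"B"}, "c4": {"B"}, "c5": {"A", "B"}}
bins = {"c1": "bin1", "c2": "bin1", "c3": "bin2", "c4": "bin2", "c5": "bin1"}
recall = membership_recall(truth, bins)
print(recall)  # {'unique': 1.0, 'shared': 0.5}
```

In this toy case every unique contig is recovered, but the shared contig can be placed in only one bin, so half of its true memberships are lost. This is the kind of bias that a ground truth makes measurable.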

This project is eligible for a matching fund stipend from the Data Science Institute. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

Professor: Tal Korem
Center/Lab: Systems Biology
Location: PH18-200
We develop data analysis methods for multi-omic microbiome data. We focus on integrating clinical, microbiome, lifestyle and environmental data in a way that advances from statistical associations to actionable insights that can be used in clinical practice.

Project Timeline

Earliest starting date: 10/15/2022
End date:
Number of hours per week of research expected during Fall 2022: ~12

Candidate requirements

Skill sets: Students should be familiar with the Unix environment and Python. Bioinformatics experience is desirable. However, individuals with limited experience but strong motivation to learn these skills are encouraged to apply; you will be well supported.
Student eligibility: ~~freshman~~, sophomore, junior, senior, master’s
International students on F1 or J1 visa: eligible
Academic Credit Possible: Yes

Columbia Data Science Institute (DSI) Scholars Program