Understanding the structure and function of the human gut microbiome is expected to revolutionize healthcare because of its many associations with human disease. A critical step in microbiome analysis is clustering, in which genomic sequences of unknown origin are assigned to the latent genomes present in the sample. Current clustering methods rely on mixture models, which fail to correctly model genomic sequences shared across multiple genomes. These sequences are of great importance, often encoding antibiotic resistance genes that drive resistant outbreaks. This project’s goal is to develop a clustering algorithm that effectively clusters both shared and unique genomic sequences. We have developed two probabilistic models, both based on hierarchical Poisson factorization, that have already produced promising results. The project will refine these models by robustly evaluating them, determining their limitations, and designing new models that improve upon them. A successful project will enable, for the first time, scalable and comprehensive reconstruction of bacterial genomes. In turn, this will enable large-scale analysis of antimicrobial resistance in the context of the human gut microbiome. We anticipate that a successful project will result in an exciting publication.
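
To illustrate why a factorization can represent shared sequences where a hard mixture assignment cannot, here is a minimal sketch of plain (non-hierarchical) Poisson factorization fitted by maximum likelihood with multiplicative updates. It is not the project's hierarchical Bayesian models, which are not described here. The count matrix layout (rows as sequences, columns as samples), the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def poisson_nll(X, W, H, eps=1e-10):
    """Poisson negative log-likelihood of X under rate W @ H (constant dropped)."""
    rate = W @ H + eps
    return float(np.sum(rate - X * np.log(rate)))

def poisson_factorize(X, k, n_iter=200, seed=0, eps=1e-10):
    """Fit X ~ Poisson(W @ H) with non-negative W (n x k) and H (k x m).

    Uses the classic multiplicative updates for the KL/Poisson objective.
    This is a simplified stand-in, not a hierarchical model.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.gamma(1.0, 1.0, size=(n, k)) + eps
    H = rng.gamma(1.0, 1.0, size=(k, m)) + eps
    for _ in range(n_iter):
        R = X / (W @ H + eps)                       # ratio of data to current rate
        W *= (R @ H.T) / (H.sum(axis=1) + eps)      # update sequence loadings
        R = X / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update per-sample abundances
    return W, H

# Hypothetical toy data: two genomes, six samples, seven sequences.
# The last sequence is shared, so it loads on both genomes.
rng = np.random.default_rng(1)
abundance = np.array([[20, 5, 1, 30, 2, 10],
                      [2, 25, 15, 1, 20, 3]], dtype=float)
membership = np.array([[1, 0], [1, 0], [1, 0],
                       [0, 1], [0, 1], [0, 1],
                       [1, 1]], dtype=float)
X = rng.poisson(membership @ abundance).astype(float)

W, H = poisson_factorize(X, k=2)
```

Each row of `W` holds non-negative loadings on every latent genome, so a shared sequence can carry weight on more than one factor instead of being forced into a single cluster.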


This project focuses on using text (from company statements and job postings) to better understand inequality in labor market outcomes. It will require some data scraping, managing large amounts of text data (e.g., captures from company websites), using NLP to better understand trends in the data, and using supervised machine learning (SML) to code key elements in the text data. It may also involve running descriptive statistics and comparing text data across time periods (e.g., how the language changes).


The goal of this project is to evaluate algorithms that predict gene expression directly from sequencing data. With the availability of large-scale sequencing data in ADSP and progress in machine learning methods, it is possible to model long-range interactions in the DNA sequence to infer intermediate phenotypes such as gene expression. First, we will test a deep-learning-based method called Enformer, which integrates long-range interactions (such as promoter-enhancer interactions) in the genome and predicts gene expression from sequence. Using available RNA-sequencing data on a small number of samples (e.g., the ROSMAP cohort), we will optimize the algorithm to improve prediction accuracy. Second, inferred expression in the ADSP cohorts will be used to test for association with Alzheimer’s Disease and related endophenotypes. Finally, we will incorporate datasets that will become available in the future, such as cell-type-specific ATAC-seq and disease-specific gene expression, to re-train the models and improve gene expression prediction directly from sequencing data.


The REACH OUT study is a multi-institutional collaboration funded by the Health Effects Institute to determine whether populations chronically exposed to higher levels of air pollution are at greater risk of severe COVID-19 outcomes. The project will involve working with a harmonized repository of electronic health record data from multiple healthcare institutions in NYC, linked at the ZIP code level to city-wide air pollution data and neighborhood-level census variables. The project aims are:

Continue reading

Tokamak fusion reactors produce vast amounts of information-rich data. Traditional approaches to data analysis struggle to cope with the scale of the data produced. In this project we aim to apply techniques from randomized numerical linear algebra to beat the curse of dimensionality. Interested students should have a solid understanding of linear algebra and probability, and be comfortable coding in MATLAB or Python.
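
As one example of the kind of randomized technique this could involve, here is a minimal sketch of a randomized SVD built on a random range finder. The project's actual methods are not specified, so the choice of algorithm, the function name `randomized_svd`, its parameters, and the synthetic test matrix are all assumptions made for illustration.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of A.

    Projects A onto a random low-dimensional subspace, then runs an
    exact SVD on the much smaller projected matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)

    # Random test matrix; A @ Omega samples the column space of A.
    Omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(A @ Omega)

    # Power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Small k x n problem solved exactly, then lifted back to m dimensions.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]

# Synthetic stand-in for a tall data matrix with low-rank structure.
rng = np.random.default_rng(42)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(A, rank=5)
A_approx = (U * s) @ Vt
```

The expensive decomposition is applied only to the small matrix `B`, so the cost is dominated by a few matrix products with `A`.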



Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program aims to engage and support undergraduate and master’s students in participating in data science-related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY