Understanding the structure and function of the human gut microbiome is expected to revolutionize healthcare due to its many associations with human disease. A critical step in microbiome analysis is clustering, in which genomic sequences of unknown origin are assigned to the latent genomes present in a sample. Current clustering methods rely on mixture models, yet these fail to correctly model genomic sequences shared across multiple genomes. Such shared sequences are of great importance, often encoding the antibiotic resistance genes that drive resistant outbreaks. This project’s goal is to develop a clustering algorithm that effectively clusters both shared and unique genomic sequences. We have developed two probabilistic models, both based on hierarchical Poisson factorization, that have already produced promising results. The project will refine these models: robustly evaluating the current models, determining their limitations, and designing new models that improve upon them. A successful project will enable, for the first time, scalable and comprehensive reconstruction of bacterial genomes. In turn, this will enable a large-scale analysis of antimicrobial resistance in the context of the human gut microbiome. We anticipate a successful project will result in an exciting publication.
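To give a flavor of the modeling approach, here is a minimal sketch (not the project's actual models) of Poisson matrix factorization fit to synthetic count data with multiplicative updates; the hierarchical Gamma priors of full hierarchical Poisson factorization are omitted, and all sizes and rates are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts: contigs x samples, generated from K latent genomes.
n_contigs, n_samples, K = 100, 20, 3
theta_true = rng.gamma(2.0, 1.0, size=(n_contigs, K))   # contig loadings
beta_true = rng.gamma(2.0, 1.0, size=(K, n_samples))    # genome abundances
Y = rng.poisson(theta_true @ beta_true)

def poisson_nll(Y, rate):
    """Poisson negative log-likelihood, up to a constant in Y."""
    return float(np.sum(rate - Y * np.log(rate + 1e-10)))

# Multiplicative updates that monotonically decrease the Poisson NLL
# (equivalent to NMF under the generalized KL divergence).
theta = rng.gamma(1.0, 1.0, size=(n_contigs, K))
beta = rng.gamma(1.0, 1.0, size=(K, n_samples))
nll_start = poisson_nll(Y, theta @ beta)
for _ in range(200):
    rate = theta @ beta + 1e-10
    theta *= (Y / rate) @ beta.T / beta.sum(axis=1)
    rate = theta @ beta + 1e-10
    beta *= theta.T @ (Y / rate) / theta.sum(axis=0)[:, None]
nll_end = poisson_nll(Y, theta @ beta)
print(nll_start, nll_end)  # the fit should improve substantially
```

Because each contig's loading vector theta can place mass on several latent genomes, a factorization of this shape can, in principle, represent sequences shared across genomes, which a hard mixture assignment cannot.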
The goal of this project is to evaluate algorithms that predict gene expression directly from sequencing data. With the availability of large-scale sequencing data in ADSP and recent progress in machine learning methods, it is now possible to model long-range interactions in the DNA sequence and infer intermediate phenotypes such as gene expression. First, we will test a deep-learning-based method called Enformer that integrates long-range interactions (such as promoter-enhancer interactions) in the genome to predict gene expression from sequence. Using available RNA-sequencing data from a small number of samples (e.g. the ROSMAP cohort), we will optimize the algorithm to improve prediction accuracy. Second, inferred expression in the ADSP cohorts will be used to test for association with Alzheimer’s Disease and related endophenotypes. Finally, we will incorporate datasets that become available in the future, such as cell-type-specific ATAC-seq and disease-specific gene expression, to re-train the models and further improve gene-expression prediction directly from sequencing data.
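A hedged sketch of the second step: testing association between inferred expression and a binary disease label with a simple two-sample Welch t statistic per gene. The expression values, effect size, and sample size here are all simulated stand-ins, not ADSP or ROSMAP data.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 400                                       # individuals
status = rng.integers(0, 2, size=n)           # 0 = control, 1 = case
# Simulated "inferred" expression: one gene associated with status, one null.
expr_assoc = rng.normal(0.0, 1.0, size=n) + 1.5 * status
expr_null = rng.normal(0.0, 1.0, size=n)

def t_stat(x, group):
    """Welch two-sample t statistic comparing cases vs. controls."""
    a, b = x[group == 1], x[group == 0]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

t_assoc = t_stat(expr_assoc, status)
t_null = t_stat(expr_null, status)
print(t_assoc, t_null)  # large statistic for the associated gene only
```

In practice one would use a regression framework with covariates (age, sex, ancestry principal components) rather than a raw t test, but the statistic above illustrates the per-gene association scan.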
Tokamak fusion reactors produce vast amounts of information-rich data. Traditional approaches to data analysis struggle to cope with the scale of the data produced. In this project we aim to apply techniques from randomized numerical linear algebra to beat the curse of dimensionality. Interested students should have a solid understanding of linear algebra and probability, and be happy coding in MATLAB or Python.
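As a taste of randomized numerical linear algebra, here is a minimal randomized low-rank approximation (Gaussian sketch, then a small exact SVD). The matrix is an arbitrary synthetic stand-in for, say, time-by-channel diagnostic data; sizes and ranks are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall matrix that is essentially low rank, plus a little noise.
m, n, true_rank = 2000, 300, 5
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
A += 1e-6 * rng.standard_normal((m, n))

def randomized_svd(A, k, oversample=10):
    """Sketch A with a Gaussian test matrix, then SVD the small projection."""
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for the range
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

U, s, Vt = randomized_svd(A, k=true_rank)
rel_err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
print(rel_err)  # tiny, since A is essentially rank 5
```

The point of the sketch is that the expensive factorization is performed on a (k + oversample)-column matrix rather than on A itself, which is what makes such methods scale to very large datasets.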
Computational methods for identifying protein-coding genes can leverage the conserved translational mapping of triplet codons to amino acids. However, non-coding genes, which are transcribed into RNAs but do not code for proteins, lack this structure, hindering their identification. Deep neural networks have shown tremendous promise in learning useful representations of unstructured data, including genomic data. Our lab is investigating the application of deep learning and natural language processing to learn representations useful for non-coding gene identification in bacterial genomes. We are seeking a student to contribute to this work. The goals of this project include 1) the identification and application of neural network architectures useful for identifying different classes of non-coding RNAs, 2) the interrogation of well-performing models to identify features of non-coding RNAs, and 3) the design of robust test cases enabling comparison of these novel methods against existing methods for non-coding RNA identification in bacterial genomes.
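One deliberately simple baseline of the kind goal 3 might compare against, on hypothetical data: classifying synthetic sequences from two artificial "RNA classes" by trinucleotide (3-mer) frequencies with a nearest-centroid rule. The class compositions are invented and the signal is artificially strong; any useful deep model should comfortably beat such a baseline on real data.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def simulate(probs, n, length=120):
    """Sequences drawn with a class-specific nucleotide composition."""
    return ["".join(rng.choice(list("ACGT"), size=length, p=probs)) for _ in range(n)]

def kmer_freqs(seq):
    """Normalized counts of overlapping 3-mers."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - 2):
        v[INDEX[seq[i:i + 3]]] += 1
    return v / v.sum()

# Class A is GC-rich, class B is AT-rich (an artificial, easy signal).
train_a, train_b = simulate([.15, .35, .35, .15], 50), simulate([.35, .15, .15, .35], 50)
test_a, test_b = simulate([.15, .35, .35, .15], 30), simulate([.35, .15, .15, .35], 30)

cent_a = np.mean([kmer_freqs(s) for s in train_a], axis=0)
cent_b = np.mean([kmer_freqs(s) for s in train_b], axis=0)

def predict(seq):
    f = kmer_freqs(seq)
    return "A" if np.linalg.norm(f - cent_a) < np.linalg.norm(f - cent_b) else "B"

preds = [predict(s) for s in test_a + test_b]
truth = ["A"] * 30 + ["B"] * 30
acc = np.mean([p == t for p, t in zip(preds, truth)])
print(acc)  # near-perfect on this easy synthetic task
```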
All volcanoes on Earth are driven by the degassing of volatile elements, mostly H2O and CO2, from their host magma. To model the degassing process, one needs to know the solubility laws of these volatiles. To that end, petrologists have been performing high-pressure, high-temperature experiments for sixty years to determine how much water and CO2 dissolve in magma as a function of pressure, temperature, melt composition (12 oxides), and oxidation state. To model how these fifteen parameters affect solubility laws, petrologists have relied on empirical interpolation between experimental data points, together with extrapolation using classical thermodynamic theory to infer the expected behavior beyond the experimental calibration range.
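As a toy version of the empirical fitting described above, here is a least-squares fit of a one-parameter power-law solubility model c = a * P^b to synthetic data. The data are generated, not experimental; the exponent b = 0.5 used to generate them reflects the classic result that H2O solubility grows roughly as the square root of pressure, while a = 0.4 is an invented constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "experiments": dissolved H2O (wt%) vs. pressure (MPa),
# generated from c = a * P**b with a = 0.4, b = 0.5, plus 2% noise.
P = np.linspace(10, 500, 40)
c = 0.4 * P**0.5 * np.exp(rng.normal(0, 0.02, size=P.size))

# Linear least squares on log(c) = log(a) + b * log(P).
X = np.column_stack([np.ones_like(P), np.log(P)])
coef, *_ = np.linalg.lstsq(X, np.log(c), rcond=None)
a_hat, b_hat = np.exp(coef[0]), coef[1]
print(a_hat, b_hat)  # close to the generating values 0.4 and 0.5
```

The real problem is of course fifteen-dimensional, which is exactly why simple interpolation breaks down and more flexible regression is attractive.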
Since the industrial revolution the atmosphere has continued to warm due to an accumulation of carbon. Terrestrial ecosystems play a crucial role in mitigating the effects of climate change by storing atmospheric carbon in biomass and in soils. In order to inform carbon-reduction policy, an accurate quantification of land-air carbon fluxes is necessary. Direct monitoring of surface carbon fluxes at a few locations across the globe provides valuable observations of this terrestrial CO2 exchange. However, these data are sparse in both space and time, and are thus unable to capture global spatiotemporal changes or rare extreme conditions (droughts, heatwaves). In this project we will first use synthetic data: sampling CO2 fluxes from a simulation of the Earth system at the observation locations, then using various machine learning algorithms (neural networks, boosting, GANs) to reconstruct the model’s CO2 flux at all locations. We will then evaluate the performance of each method using a suite of regression metrics. Finally, time permitting, we will apply these methods to real observations. This project provides a way of evaluating the performance of machine learning methods as they are used in Earth science.
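A minimal version of the synthetic-data experiment: a k-nearest-neighbour regressor (a simple stand-in for the neural networks and boosting mentioned above) reconstructs a smooth synthetic "flux" field from sparse samples, scored with R². The field, station locations, and grid are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def flux(lon, lat):
    """Smooth synthetic CO2 flux field over a unit domain."""
    return np.sin(2 * np.pi * lon) * np.cos(2 * np.pi * lat)

# Sparse "station" observations vs. a dense set of evaluation points.
obs = rng.uniform(0, 1, size=(500, 2))
y_obs = flux(obs[:, 0], obs[:, 1])
grid = rng.uniform(0, 1, size=(2000, 2))
y_true = flux(grid[:, 0], grid[:, 1])

def knn_predict(X_train, y_train, X_query, k=5):
    """Average of the k nearest training targets (Euclidean distance)."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

y_hat = knn_predict(obs, y_obs, grid)
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(r2)  # high, since the field is smooth and well sampled
```

Real station networks are far sparser and less evenly distributed than this uniform sample, which is precisely the regime where the choice of learning algorithm starts to matter.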