Natural Language Processing of social media data on COVID-19 travel pattern analysis

September 1, 2021 in Open Fall 2021

COVID-19 has transformed every aspect of people’s lives, including how they travel and work. This project aims to leverage social media data (e.g., tweets, Facebook posts, Reddit posts) to mine people’s travel patterns and the timeline of telecommuting using Natural Language Processing (NLP). Research questions are:

Predicting gene expression from sequencing in Alzheimer’s Disease

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

The goal of this project is to evaluate algorithms that predict gene expression directly from sequencing data. With the availability of large-scale sequencing data in ADSP and progress in machine learning methods, it is possible to model long-range interactions in the DNA sequence to infer intermediate phenotypes such as gene expression. First, we will test a deep-learning-based method called Enformer that integrates long-range interactions in the genome (such as promoter-enhancer interactions) and predicts gene expression from sequence. Using available RNA-sequencing data on a small number of samples (e.g., the ROSMAP cohort), we will optimize the algorithm to improve prediction accuracy. Second, inferred expression in the ADSP cohorts will be used to test association with Alzheimer’s Disease and related endophenotypes. Finally, we will incorporate datasets that will become available in the future, such as cell-type-specific ATAC-seq and disease-specific gene expression, to re-train the models and improve gene-expression prediction directly from sequencing data.

Race, Ethnicity, and Air pollution in COVID-19 Hospitalization OUTcomes (REACH OUT Study)

September 1, 2021 in Open Fall 2021

The REACH OUT study is a multi-institutional collaboration funded by the Health Effects Institute to determine whether populations that have been chronically exposed to higher levels of air pollution are at greater risk of severe COVID-19 outcomes. The project will involve working with a harmonized repository of electronic health record data from multiple healthcare institutions in NYC, linked at the ZIP code level to city-wide air pollution data and neighborhood-level census variables. The project aims are:

Randomized algorithms for plasma fusion data analysis

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

Tokamak fusion reactors produce vast amounts of information-rich data. Traditional approaches to data analysis struggle to cope with the scale of the data produced. In this project we aim to apply techniques from randomized numerical linear algebra to beat the curse of dimensionality. Interested students should have a solid understanding of linear algebra and probability, and be happy coding in MATLAB or Python.
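
As a rough illustration of the kind of technique involved, the sketch below applies a Gaussian random projection, one common building block in randomized numerical linear algebra. The posting does not say which methods the project will use, so the choice of technique, the function names (`gaussian_sketch`, `distance`), and the dimensions are all assumptions made for illustration.

```python
import math
import random


def gaussian_sketch(vectors, k, seed=0):
    """Project each d-dimensional vector to k dimensions with a random Gaussian map.

    Illustrative only: the same random k x d matrix is applied to every vector,
    scaled by 1/sqrt(k) so that squared lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    sketched = [[0.0] * k for _ in vectors]
    for row in range(k):
        # One row of the projection matrix, entries drawn i.i.d. from N(0, 1).
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for i, v in enumerate(vectors):
            sketched[i][row] = scale * sum(gj * vj for gj, vj in zip(g, v))
    return sketched


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    data_rng = random.Random(1)
    d, k = 2000, 200  # hypothetical sizes, chosen only for the demo
    x = [data_rng.random() for _ in range(d)]
    y = [data_rng.random() for _ in range(d)]
    sx, sy = gaussian_sketch([x, y], k)
    print("original distance:", distance(x, y))
    print("sketched distance:", distance(sx, sy))
```

The two printed distances should be close even though the sketched vectors are ten times shorter, which is the property that lets later computations run on the smaller representation.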

Reinforcement learning for vehicle routing

September 1, 2021 in Open Fall 2021

Vehicle routing is an extensively studied optimization problem. With advances in AI and big data, this project aims to solve vehicle routing problems (VRP) using reinforcement learning.
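
To make the idea concrete, here is a minimal sketch of tabular Q-learning on a toy routing instance. It assumes a single vehicle with no capacity or time-window constraints, which is far simpler than a realistic VRP, and it is not the project's method. The function names and hyperparameters are hypothetical.

```python
import itertools
import math
import random


def route_length(route, coords):
    """Total Euclidean length of a route given as a list of node indices."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))


def brute_force_length(coords):
    """Optimal depot-to-depot tour length by enumeration (tiny instances only)."""
    customers = range(1, len(coords))
    return min(route_length([0, *p, 0], coords)
               for p in itertools.permutations(customers))


def q_learning_route(coords, episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.2, seed=0):
    """Learn a tour over all customers with tabular Q-learning.

    Node 0 is the depot. A state is (current node, frozenset of visited customers),
    an action is the next unvisited customer, and rewards are negative distances.
    """
    rng = random.Random(seed)
    customers = range(1, len(coords))
    q = {}  # maps (state, action) -> estimated return, defaulting to 0.0

    def best_action(state, choices):
        return max(choices, key=lambda a: q.get((state, a), 0.0))

    for _ in range(episodes):
        node, visited = 0, frozenset()
        while len(visited) < len(coords) - 1:
            state = (node, visited)
            choices = [c for c in customers if c not in visited]
            if rng.random() < epsilon:
                action = rng.choice(choices)  # explore
            else:
                action = best_action(state, choices)  # exploit
            reward = -math.dist(coords[node], coords[action])
            next_visited = visited | {action}
            remaining = [c for c in customers if c not in next_visited]
            if remaining:
                next_state = (action, next_visited)
                future = max(q.get((next_state, a), 0.0) for a in remaining)
            else:
                # Last customer served: the vehicle still has to return to the depot.
                reward -= math.dist(coords[action], coords[0])
                future = 0.0
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            node, visited = action, next_visited

    # Greedy rollout of the learned policy.
    route, node, visited = [0], 0, frozenset()
    while len(visited) < len(coords) - 1:
        choices = [c for c in customers if c not in visited]
        node = best_action((node, visited), choices)
        route.append(node)
        visited = visited | {node}
    return route + [0]


if __name__ == "__main__":
    coords = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]  # made-up toy instance
    route = q_learning_route(coords)
    print(route, route_length(route, coords), brute_force_length(coords))
```

The table of states grows exponentially with the number of customers, so anything beyond a toy instance would need function approximation (for example, a neural policy) rather than a lookup table.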

Representation learning for the identification of bacterial non-coding RNAs

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

Computational methods for identifying protein-coding genes can leverage the conserved translational mapping of triplet codons to amino acids. However, non-coding genes, which are transcribed into RNAs but do not code for proteins, lack this structure, hindering their identification. Deep neural networks have shown tremendous promise in learning useful representations of unstructured data, including genomic data. Our lab is investigating the application of deep learning and natural language processing to learning representations useful for non-coding gene identification in bacterial genomes. We are seeking a student to contribute to this work. The goals of this project include 1) the identification and application of neural network architectures useful for identifying different classes of non-coding RNAs, 2) the interrogation of well-performing models in order to identify features of non-coding RNAs, and 3) the design of robust test cases that enable the comparison of these novel methods to existing methods for non-coding RNA identification in bacterial genomes.
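
The codon structure mentioned above can be illustrated with a simplified open reading frame (ORF) scan. This is a sketch of the general idea only, not one of the existing methods the project would compare against. It assumes the forward strand, ATG as the only start codon, and a hypothetical `min_codons` threshold. Non-coding RNAs offer no comparable start/stop signal to scan for.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) spans of forward-strand ORFs: ATG ... in-frame stop.

    Spans use Python slice conventions, so seq[start:end] includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # made-up sequence for the demo
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])  # prints: 2 14 ATGAAATTTTAG
```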

Spanish-Language Misinformation

September 1, 2021 in Open Fall 2021

I will use Twitter data to assess the popularity and reach of false claims affecting Latinos regarding COVID-19, the 2020 election, and the Biden presidency. I also plan to characterize these data using text analysis methods in order to recover general themes or topics and how they varied as a function of geography, time, and user characteristics. I have access to the Twitter academic API.

Columbia Data Science Institute (DSI) Scholars Program