The REACH OUT study is a multi-institutional collaboration funded by the Health Effects Institute to determine if populations who have been chronically exposed to higher levels of air pollution are at greater risk of severe COVID-19 outcomes. The project will involve working with a harmonized repository of electronic health record data from multiple healthcare institutions in NYC that will be linked at the zip code level to city-wide air pollution data and neighborhood-level census variables. The project aims are:

Continue reading

Tokamak fusion reactors produce vast amounts of information-rich data, and traditional approaches to data analysis struggle to cope with the scale. In this project we aim to apply techniques from randomized numerical linear algebra to beat the curse of dimensionality. Interested students should have a solid understanding of linear algebra and probability, and be happy coding in MATLAB or Python.
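As an illustration of the kind of randomized technique this project refers to, here is a minimal sketch of a randomized low-rank SVD based on a Gaussian range finder. It is not the project's method: the function name `randomized_svd`, the oversampling parameter, and the synthetic test matrix are all assumptions made for the example, and real tokamak data would need its own preprocessing.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate the top-`rank` SVD of A with a random projection.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    k = min(rank + oversample, n)
    # Project A onto k random directions to capture its dominant column space.
    Omega = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(A @ Omega)
    # Solve the small problem in the reduced space, then lift back.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]

# Synthetic stand-in data (an assumption): a 2000 x 300 matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 300))

U, s, Vt = randomized_svd(A, rank=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.2e}")
```

The expensive decomposition is performed on the small matrix `B` rather than on `A` itself, which is what makes this approach attractive when the data matrix is very large.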


Computational methods for identifying protein-coding genes can leverage the conserved translational mapping of triplet codons to amino acids. However, non-coding genes, which are transcribed into RNAs but do not code for proteins, lack this structure, which hinders their identification. Deep neural networks have shown tremendous promise in learning useful representations of unstructured data, including genomic data. Our lab is investigating the application of deep learning and natural language processing to learning representations useful for non-coding gene identification in bacterial genomes. We are seeking a student to contribute to this work. The goals of this project include 1) the identification and application of neural network architectures useful for identifying different classes of non-coding RNAs, 2) the interrogation of well-performing models in order to identify features of non-coding RNAs, and 3) the design of robust test cases that enable the comparison of these novel methods to existing methods for non-coding RNA identification in bacterial genomes.
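To make the contrast concrete, the triplet structure of protein-coding genes can be exploited with a very simple scan for open reading frames (a start codon followed in-frame by a stop codon). The sketch below is only an illustration of that idea under simplifying assumptions: it scans the forward strand only, accepts only ATG as a start codon, and uses a made-up toy sequence. The function name `find_orfs` is hypothetical and is not part of the lab's tooling.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) slices of forward-strand ORFs: ATG ... stop codon.

    `min_codons` counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence invented for this example.
toy = "CCATGAAACCCGGGTAACC"
orfs = find_orfs(toy)
for start, end in orfs:
    print(start, end, toy[start:end])
```

Non-coding RNA genes offer no comparable codon signal to scan for, which is the gap that learned sequence representations are meant to address.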


I will use Twitter data to assess the popularity and reach of false claims affecting Latinos regarding COVID-19, the 2020 election, and the Biden presidency. I also plan to characterize these data using text analysis methods in order to recover general themes or topics and to examine how they vary with geography, time, and user characteristics. I have access to the Twitter academic API.


Startup Pivoting

This project aims to apply data science to historical versions of startup websites to identify when startups pivot to new strategies.

Firm strategies (what firms choose to do or not to do, and why) represent the main way in which firms shape the economy. In a time of widely encompassing platforms, corporate-led cryptocurrencies, activist CEOs, and socially oriented corporations, characterizing how firms differ in their strategies, and in the choices they make, appears as important as ever. There is a need for tools to measure firm strategy.

As a student scholar, your role in this position would be to work in the nascent Measuring Strategy Lab, using natural language processing methods to devise new ways to understand and measure firm strategy. Specifically, using a large sample of startup websites downloaded through the Wayback Machine, you would develop systematic ways to understand when and why a startup changes its strategy, and how this predicts its performance. This work also builds on the Startup Cartography Project and is part of the ongoing effort to bring data science into strategy research.
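One simple baseline for spotting a possible change in strategy is to compare the text of consecutive website snapshots and flag sharp drops in similarity. The sketch below is an assumption-laden illustration, not the lab's method: it uses bag-of-words cosine similarity, a hypothetical threshold of 0.5, and invented snapshot texts in place of real Wayback Machine data.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts for a snapshot's text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flag_pivots(snapshots, threshold=0.5):
    """Flag snapshots whose text differs sharply from the previous one.

    `snapshots` is a list of (date, text) pairs sorted by date.
    The threshold is an arbitrary illustrative choice.
    """
    flagged = []
    for (_, prev_text), (date, text) in zip(snapshots, snapshots[1:]):
        sim = cosine_similarity(bag_of_words(prev_text), bag_of_words(text))
        if sim < threshold:
            flagged.append((date, sim))
    return flagged

# Invented toy snapshots, standing in for archived homepage text.
snapshots = [
    ("2015-01", "we deliver groceries to your door fast"),
    ("2015-07", "we deliver fresh groceries to your door fast"),
    ("2016-01", "logistics software platform for enterprise fleets"),
]
pivots = flag_pivots(snapshots)
print(pivots)
```

A low similarity score only signals that the wording changed; deciding whether the change reflects a real strategic pivot, and why it happened, is the harder question the project poses.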



Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program engages and supports undergraduate and master's students participating in data science-related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY