Representation learning for the identification of bacterial non-coding RNAs

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

Computational methods for identifying protein-coding genes can leverage the conserved translational mapping of triplet codons to amino acids. However, non-coding genes, which are transcribed into RNAs but do not code for proteins, lack this structure, which hinders their identification. Deep neural networks have shown tremendous promise in learning useful representations of unstructured data, including genomic data. Our lab is investigating the application of deep learning and natural language processing to learning representations useful for non-coding gene identification in bacterial genomes. We are seeking a student to contribute to this work. The goals of this project include 1) the identification and application of neural network architectures useful for identifying different classes of non-coding RNAs, 2) the interrogation of well-performing models in order to identify features of non-coding RNAs, and 3) the design of robust test cases that enable the comparison of these novel methods to existing methods for non-coding RNA identification in bacterial genomes.
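
To illustrate the codon structure mentioned above, here is a minimal sketch of a naive open-reading-frame scan. It is not the lab's method; the function name `find_orfs`, the `min_codons` parameter, and the restriction to the forward strand with the standard start and stop codons are all assumptions made for illustration. Protein-coding candidates show up as an ATG followed by in-frame triplets up to a stop codon, a signal that non-coding RNA genes do not provide.

```python
# Illustrative sketch only (hypothetical helper, not the project's pipeline):
# scan the forward strand for open reading frames using the triplet codon structure.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs for ATG...stop runs read in triplets.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # three possible reading frames on one strand
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the frame and keep scanning
    return sorted(orfs)
```

A non-coding RNA gene would generally pass through such a scan unnoticed, which is one way to see why learned representations are of interest here.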

Spanish-Language Misinformation

September 1, 2021 in Open Fall 2021

I will use Twitter data to assess the popularity and reach of false claims affecting Latinos regarding COVID-19, the 2020 election, and the Biden presidency. I also plan to characterize these data using text analysis methods in order to recover general themes or topics and to examine how they vary as a function of geography, time, and user characteristics. I have access to the Twitter academic API.

Startup Pivoting

September 1, 2021 in Open Fall 2021

This project aims to apply data science to historical versions of startup websites to identify when startups pivot to new strategies.

Firm strategies (what firms choose to do or not to do, and why) represent the main way in which firms shape the economy. In a time of widely encompassing platforms, corporate-led cryptocurrencies, activist CEOs, and socially oriented corporations, characterizing how firms differ in their strategies and in the choices they make appears as important as ever. There is a need for tools to measure firm strategy.

As a student scholar, your role in this position would be to work in the nascent Measuring Strategy Lab, using natural language processing methods to devise new ways to understand and measure firm strategy. Specifically, using a large sample of startup websites downloaded through the Wayback Machine, you would develop systematic ways to understand when and why a startup changes its strategy, and how such changes predict its performance. This work also builds on the Startup Cartography Project and is part of ongoing efforts to bring data science into strategy research.
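
One simple way to picture the task is sketched below, assuming that a pivot shows up as a sharp drop in textual similarity between consecutive website snapshots. This is only an illustrative baseline, not the lab's approach; the helper names (`term_counts`, `cosine`, `candidate_pivots`), the bag-of-words representation, and the `threshold` value are all hypothetical choices.

```python
# Illustrative baseline only (hypothetical helpers, not the lab's method):
# flag snapshot dates where website text diverges sharply from the previous snapshot.
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase bag-of-words counts over alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_pivots(snapshots, threshold=0.5):
    """Given (date, text) pairs sorted by date, return the dates whose text
    has cosine similarity below `threshold` with the preceding snapshot."""
    flagged = []
    for (_, prev_text), (date, cur_text) in zip(snapshots, snapshots[1:]):
        if cosine(term_counts(prev_text), term_counts(cur_text)) < threshold:
            flagged.append(date)
    return flagged
```

In practice the threshold would need calibration, and ordinary redesigns or copy edits could trigger false positives, so a flagged date is at most a candidate for closer inspection.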

The Language of Vaccine Hesitancy on Social Media

September 1, 2021 in Open Fall 2021

The project seeks to analyze a large corpus of vaccine (dis)information collected online, using NLP and other machine learning tools. Of particular interest is vaccine hesitancy among people of color and in immigrant and religious communities.

The market for highly-skilled labor: Evidence from professional sports

September 1, 2021 in Open Fall 2021

In this project, we will study historical player transfer data from European professional football (n > 1,000,000 transfers). To supplement this analysis, we will also exploit data on player and team performance, team ownership and management, team finances, and player agents.

The Value of Data

September 1, 2021 in Open Fall 2021

Many scholars and policymakers view establishing functioning data markets as essential for the digital economy to bring prosperity and stability to society at large. A key challenge is to determine the value of an individual’s specific data. Is one buyer’s data more valuable than another’s for an e-commerce platform? How much should each be paid?

Towards a combined geopolitical and technical understanding of regional Internet

September 1, 2021 in Closed Fall 2021, Closed Flexible Timeline

This research project aims to explore and develop methods to improve and diversify visualizations of the interactions between the Internet and topographical and geopolitical space (i.e., space and the political actors that rule over it) through a case study of a region of interest (virtually any region of interest to the student). The main intent of the project is to produce a set of maps and visualizations (including infographics where relevant), as comprehensive and diverse as possible, combining Internet mapping with the geographical and geopolitical context of that region. We will build on existing techniques for visualizing the Internet and discuss potential capacities to further model the Internet.

Columbia Data Science Institute (DSI) Scholars Program