Multilanguage analysis of book titles

May 18, 2020 in Project Summer 2020-2

Data visualization, statistics, and analysis of translation entries online. More details will be furnished upon request.

Multilanguage analysis of search results

May 18, 2020 in Project Summer 2020-2

Students will be given search entries, topics, or terms and will be required to analyze the algorithms behind search results across various engines and languages. More information and details are available upon request.

Phenotyping COVID-19 patients using NLP and clinical notes

May 18, 2020 in Project Summer 2020-2

Our lab is using clinical notes to phenotype COVID-19 patient outcomes. The aim is to better understand the sequelae of COVID-19 from clinical notes.

Social media echo chambers enhancing anxiety and depression: the effects of COVID-19

May 18, 2020 in Project Summer 2020-2

The question we ask is whether online echo chambers on social media networks enhance the anxiety and depression of individuals during the COVID-19 outbreak. More specifically, we want to measure the intensity of communication about COVID-19 within the echo chamber of individuals on Twitter and investigate its impact on their subsequent tweets, in terms of the level of anxiety and signs of depressive language in those tweets. We measure an individual's echo chamber by the number of users in their social network who tweeted about COVID-19. We build on an extensive dataset of Twitter users for whom we have identified a large number of demographic and geographic variables (such as gender, age, ethnicity, location by state, and political affiliation) as well as their social network.
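As a rough illustration of the echo-chamber measure described above, the following sketch counts, for each user, how many accounts in their network tweeted about COVID-19. The data structures, names, and toy data here are assumptions made for illustration only; the project's actual dataset and pipeline are not described in this summary.

```python
from typing import Dict, Set


def echo_chamber_size(
    user: str,
    network: Dict[str, Set[str]],
    covid_tweeters: Set[str],
) -> int:
    """Count the accounts in `user`'s network that tweeted about COVID-19.

    `network` maps a user to the set of accounts in their social network
    (a hypothetical representation, not the project's actual schema).
    """
    return len(network.get(user, set()) & covid_tweeters)


# Toy data, invented purely for illustration.
network = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice"},
}
covid_tweeters = {"bob", "dave", "erin"}

sizes = {u: echo_chamber_size(u, network, covid_tweeters) for u in network}
print(sizes)
```

A count like this could then be related to anxiety or depression scores computed on each user's later tweets; how those scores are derived is outside the scope of this sketch.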

Data For Good: The Consequences of Language Policing

January 15, 2020 in Project Spring 2020

Contestation over language use is an unavoidable feature of American politics. Yet, despite the rise of language policing on both sides of the aisle, we know surprisingly little about how ordinary citizens respond to norms governing language use from both in-group and out-group members. Following Munger (2017), I would like to leverage social media platforms such as Reddit and Twitter to evaluate whether injunctions to use particular words (e.g., undocumented immigrant, Latinx) are effective. I plan to use an experimental approach, where conditional on mentions of “illegal alien” or “Hispanic/Latino,” users are randomly assigned to receive a “language correction.” Outcome measures would include subsequent use of corrected terms, valence of user responses, and upvoting/liking/RTing behavior.
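The random assignment step described above might be sketched as follows. This is only a minimal illustration under stated assumptions: the trigger pattern, the `assign_conditions` helper, the even split between arms, and the sample posts are all hypothetical and are not taken from the project's actual design.

```python
import random
import re
from typing import Dict, Iterable, Tuple

# Hypothetical trigger pattern based on the terms mentioned in the description.
TRIGGER = re.compile(r"illegal alien|hispanic|latino", re.IGNORECASE)


def assign_conditions(
    posts: Iterable[Tuple[str, str]], seed: int = 0
) -> Dict[str, str]:
    """Randomly split users who mention a trigger term into two arms.

    `posts` is an iterable of (author, text) pairs. Only authors with at
    least one matching post are eligible; half are assigned to receive a
    "correction" and the rest serve as "control".
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    eligible = sorted({author for author, text in posts if TRIGGER.search(text)})
    rng.shuffle(eligible)
    half = len(eligible) // 2
    return {
        author: ("correction" if i < half else "control")
        for i, author in enumerate(eligible)
    }


# Invented sample posts, for illustration only.
posts = [
    ("u1", "News about an illegal alien case"),
    ("u2", "Hispanic voters in the primary"),
    ("u3", "Nice weather today"),
    ("u4", "Latino turnout is rising"),
    ("u5", "The term illegal alien appears in the statute"),
]
assignment = assign_conditions(posts)
print(assignment)
```

Conditioning on the mention first and randomizing afterward means the treated and control users are comparable on having used the term, so differences in later behavior can be attributed to the correction.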

Measuring Liberal Arts: Creating an Index for Higher Education

September 30, 2019 in Project Fall 2019

This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist for various tasks on a project that uses analysis of multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and developing and improving existing data science and machine learning skills. Tasks likely include: (1) Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data. (2) Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records. (3) Merging multiple forms of data into a single dataset. (4) Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.

Project: Classifying drug patents

January 16, 2019 in Project 2019

Taking out multiple patents on different aspects of a drug in order to cordon off competitors is standard practice in pharmaceuticals. In addition to primary patents, firms commonly attempt to acquire secondary patents on alternative forms of molecules, different formulations, dosages, and compositions, and new uses. Policymakers in the U.S. and globally have raised concerns that these secondary patents can raise drug prices and restrict access to medicines. One challenge in assessing the impact of these patents is that it is difficult and costly to know whether a given patent is “primary” or “secondary.”

Columbia Data Science Institute (DSI) Scholars Program