Large Language Models

Public health threats often trigger narratives of “othering” that cast blame upon marginalized groups and stoke further stigmatization. The record increase in hate crimes targeting Asian Americans since the onset of the pandemic is alarming yet not surprising. As tragic as these events are, even those hate crimes that are reported to police likely represent just a subset of the incidents, ranging from the mundane to the extreme, that Asian Americans have faced. According to AAPI Data, although Asian Americans have experienced hate incidents more often than the general population during this pandemic, they are also among the least likely to say they are “very comfortable” reporting hate crimes to authorities. Systematically documenting everyday manifestations of anti-Asian sentiment is immensely challenging.

Behavioral-level evidence that identifies and documents the extent and degree of anti-Chinese bias attributable to the pandemic, especially in everyday contexts, is hard to find. Much existing work on anti-Asian sentiment relies upon survey evidence documenting Asian Americans' self-reports of experiences and feelings, or survey evidence of attitudes towards Asians. Journalism outlets have tried to track incidents by aggregating news stories, providing crude estimates of intensity across the pandemic. While these approaches are useful, survey self-reports may be influenced by demand and social desirability effects; moreover, survey researchers exercise agency in prompting respondents' reactions rather than merely observing what people elect to do when left to their own devices.

We use a big-data approach to examine the emergence of anti-Chinese prejudice in naturalistic behaviors, specifically Yelp restaurant reviews. One of us previously conducted a study that relied only on Yelp scores and did not use the text of the reviews themselves. For this proposed work, we will leverage a large language model, GPT-3, to help detect anti-Chinese prejudice in millions of written comments on Yelp, and then compare its classifications with those produced by more traditional NLP methods (i.e., sentiment analysis based on existing dictionaries) and with human-coded results. Our work will focus on techniques for writing “prompts” that can be part of a big-data pipeline.
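
To make the planned comparison concrete, here is a minimal Python sketch of what one step of such a pipeline might look like. It is an illustration under stated assumptions, not the project's actual method: the few-shot examples, the prompt wording, the toy word list, and the function names (`build_prompt`, `lexicon_flag`, `cohen_kappa`) are all hypothetical, and no call to GPT-3 is made. The sketch only builds the prompt text that would be sent to a model, applies a crude dictionary-based flag, and measures agreement between two sets of labels with Cohen's kappa.

```python
from collections import Counter

# Hypothetical few-shot examples; a real study would draw these
# from a human-coded sample of reviews.
FEW_SHOT = [
    ("The dumplings were cold and the service was slow.", "no"),
    ("I refuse to eat here because of where the owners come from.", "yes"),
]

# Hypothetical toy lexicon standing in for an existing sentiment dictionary.
NEGATIVE_WORDS = {"cold", "slow", "dirty", "rude", "refuse"}


def build_prompt(review: str) -> str:
    """Assemble a few-shot classification prompt for a text-completion model."""
    lines = [
        "Decide whether each restaurant review expresses prejudice "
        "against Chinese people. Answer yes or no.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines += [f"Review: {text}", f"Prejudiced: {label}", ""]
    # The prompt ends where the model is expected to write its answer.
    lines += [f"Review: {review}", "Prejudiced:"]
    return "\n".join(lines)


def lexicon_flag(review: str) -> str:
    """Dictionary baseline: 'yes' if any word in the toy lexicon appears."""
    tokens = (t.strip(".,!?").lower() for t in review.split())
    return "yes" if any(t in NEGATIVE_WORDS for t in tokens) else "no"


def cohen_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Note that the dictionary baseline flags negative sentiment in general, so it would mark an ordinary complaint about cold food the same way as a prejudiced remark. Comparing each method's labels against human codes with `cohen_kappa` is one way to quantify that gap.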

Selected candidate(s) can receive a stipend directly from the faculty advisor. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

Professor: Mark Hansen
Center/Lab: Brown Institute
Location: Ground Floor, Pulitzer Hall
https://journalism.columbia.edu/faculty/mark-hansen

Project Timeline

Earliest starting date: 2022-10-15
End date: 2022-12-15
Number of hours per week of research expected during Fall 2022: ~10

Candidate requirements

Skill sets: Python, NLP, basic AI
Student eligibility: ~~freshman~~, ~~sophomore~~, ~~junior~~, ~~senior~~, master’s
International students on F1 or J1 visa: eligible
Academic Credit Possible: Yes

Columbia Data Science Institute (DSI) Scholars Program