Project: Measuring Liberal Arts: Creating an Index for Higher Education

This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist to support this work, which analyzes multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaboration, academic publication, and building on existing data science and machine learning skills.

Tasks likely include: (1) Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data. (2) Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records. (3) Merging multiple forms of data into a single dataset. (4) Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.
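
As a rough illustration of task (2), the sketch below pulls dates and meeting times out of a plain-text syllabus using only Python's standard library. The sample text, the patterns, and the function names are assumptions made for illustration; the project's actual records and pipeline are not described here, and entities such as names, places, and books would more likely call for an NLP package than for regular expressions.

```python
import re

# Hypothetical syllabus snippet, invented for this example.
SAMPLE = """SOCI 3000: Social Theory
Meets Tuesdays 2:10pm-4:00pm, Room 509
Week 1 (01/22/2019): Introduction
Week 2 (01/29/2019): Durkheim, The Division of Labor in Society
"""

# Assumed formats: MM/DD/YYYY dates and 12-hour times such as "2:10pm".
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s*(?:am|pm)\b", re.IGNORECASE)


def extract_dates(text):
    """Return every MM/DD/YYYY string in order of appearance."""
    return DATE_RE.findall(text)


def extract_times(text):
    """Return every 12-hour clock time, lowercased, in order of appearance."""
    return [match.lower() for match in TIME_RE.findall(text)]


if __name__ == "__main__":
    print(extract_dates(SAMPLE))
    print(extract_times(SAMPLE))
```

Patterns like these only cover the formats they were written for, so real syllabus records would likely need broader rules or a trained extraction model.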

One selected candidate will receive a stipend via the DSI Scholars program. Amount is subject to available funding.

Faculty Advisor

Professor Peter Bearman
Department/School: INCITE/Arts and Sciences
Location: Morningside Campus
About INCITE: By leveraging the ideas and empirical tools of the social and human sciences, INCITE conceives and conducts collaborative research, projects, and programs that generate knowledge, promote just, equitable societies, and enrich our intellectual environment. INCITE is home to big data projects, offering a vibrant mix of cutting-edge qualitative and quantitative projects. Our current grant portfolio exceeds $3 million and we have hired several DSI students in the past few years.

Project timeline

Earliest starting date: 03/01/2019
End date: 08/31/2019
Number of hours per week of research expected during Spring 2019: ~20
Number of hours per week of research expected during Summer 2019: ~20-40

Candidate requirements

Skill sets: Candidates must possess fluency in Python, SQL, and Git/GitHub, as well as a working knowledge of Amazon Web Services (especially EC2 instances and S3 buckets) and the Unix/SSH command line. The ideal candidate will also have knowledge of machine learning (such as trained classifiers and information extraction) and of NLP packages such as NLTK and spaCy. A basic knowledge of the R programming language, HTML, Docker, and distributed computing technologies (such as Apache Spark) is desirable but not required.
Student eligibility (as of Spring 2019): sophomore, junior, senior, master’s (freshmen are not eligible)
International students on F-1 or J-1 visas: eligible

Columbia Data Science Institute (DSI) Scholars Program