This project uses a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist to support this work, which analyzes multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and developing and improving data science and machine learning skills.

Tasks likely include:

  • Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data.
  • Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records.
  • Merging multiple forms of data into a single dataset.
  • Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires a browser automation tool such as Selenium.

This project is eligible for a matching fund stipend from the Data Science Institute. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

  • Professor: Peter Bearman
  • Department/School: Arts and Sciences
  • Location: 3078 Broadway - Morningside Campus
  • By leveraging the ideas and empirical tools of the social and human sciences, INCITE conceives and conducts collaborative research, projects, and programs that generate knowledge, promote just, equitable societies, and enrich our intellectual environment. INCITE is home to big data projects, offering a vibrant mix of cutting-edge qualitative and quantitative projects. Our current grant portfolio exceeds $3 million and we have hired several DSI students in the past few years.

Project Timeline

  • Earliest starting date: 10/1/2020
  • End date: 7/31/2020
  • Number of hours per week of research expected during Fall 2020: ~10

Candidate requirements

  • Skill sets: Fluency in Python, SQL, and Git/GitHub, as well as a working knowledge of the Unix command line. Knowledge of AWS services (especially EC2 and S3), NLP, machine learning classifiers, and the R programming language is desirable but not required.
  • Student eligibility: freshman, sophomore, junior, senior, master’s
  • International students on F1 or J1 visa: eligible
  • Academic Credit Possible: No