This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist to support this work, which analyzes multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and the development of data science and machine learning skills.

Tasks likely include:

  • Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data.
  • Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records.
  • Merging multiple forms of data into a single dataset.
  • Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.

One selected candidate will receive a stipend via the DSI Scholars program. The amount is subject to available funding.

Faculty Advisor

  • Professor Peter Bearman
  • Department/School: INCITE/Arts and Sciences
  • Location: Morningside Campus
  • About INCITE: By leveraging the ideas and empirical tools of the social and human sciences, INCITE conceives and conducts collaborative research, projects, and programs that generate knowledge, promote just, equitable societies, and enrich our intellectual environment. INCITE is home to big data projects, offering a vibrant mix of cutting-edge qualitative and quantitative projects. Our current grant portfolio exceeds $3 million and we have hired several DSI students in the past few years.

Project timeline

  • Earliest starting date: 03/01/2019
  • End date: 08/31/2019
  • Number of hours per week of research expected during Spring 2019: ~20
  • Number of hours per week of research expected during Summer 2019: ~20-40

Candidate requirements

  • Skill sets: Candidates must be fluent in Python, SQL, and Git/GitHub, and have a working knowledge of Amazon Web Services (especially EC2 instances and S3 buckets) and the Unix/SSH command line. The ideal candidate will also have knowledge of machine learning (such as trained classifiers and information extraction) and of NLP packages such as NLTK and spaCy. Basic knowledge of the R programming language, HTML, Docker, and distributed computing technologies (such as Apache Spark) is desirable but not required.
  • Student eligibility (as of Spring 2019): freshman, sophomore, junior, senior, master’s
  • International students on F-1 or J-1 visas: eligible