Taking out multiple patents on different aspects of a drug in order to cordon off competitors is standard practice in pharmaceuticals. In addition to primary patents, firms commonly attempt to acquire secondary patents on alternative forms of molecules, different formulations, dosages, and compositions, and new uses. Policymakers in the U.S. and globally have raised concerns that these secondary patents can raise drug prices and restrict access to medicines. One challenge in assessing the impact of these patents is that it is difficult and costly to know whether a given patent is “primary” or “secondary.”

Trained legal experts typically classify a patent as primary or secondary only after close reading, which is an obstacle to large-sample empirical analyses. I am hoping to classify patents from their text using natural language processing techniques, drawing on the full text of issued patents and on training sets based on hand-coding of some patents by lawyers and patent examiners.
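
To give a rough sense of what such a classifier could look like, here is a minimal sketch in Python using only the standard library. It is not the project's method: the bag-of-words naive Bayes model is just one simple baseline chosen for illustration, and the names `tokenize` and `NaiveBayesTextClassifier`, the training snippets, and their labels are all invented for this example rather than taken from real patents or real hand-coding.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        # Number of training documents per class, used for the class prior.
        self.class_doc_counts = Counter(labels)
        self.total_docs = len(labels)
        # Word frequencies accumulated separately for each class.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior score for each class."""
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the class label with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy snippets standing in for patent text (assumption: a real
# training set would use full patent text hand-coded by legal experts).
train_texts = [
    "a novel compound of formula I and pharmaceutically acceptable salts thereof",
    "a new chemical compound having the structure shown and its synthesis",
    "a crystalline polymorph form of the known compound",
    "an extended release tablet formulation comprising the known compound",
]
train_labels = ["primary", "primary", "secondary", "secondary"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("an extended release tablet formulation"))
print(clf.predict("a novel compound of formula I"))
```

A real version would need a far larger hand-coded training set, held-out evaluation against expert labels, and probably richer text features than raw word counts.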

This is an UNPAID research experience.

Faculty Advisor

  • Professor Bhaven Sampat
  • Department/School: Health Policy and Management / Public Health
  • Location: CUMC

Project timeline

  • Earliest starting date: 03/01/2019
  • End date: 05/31/2019
  • Number of hours per week of research expected during Spring 2019: ~5

Candidate requirements

  • Skill sets: Natural language processing and machine learning in Python (and/or R); ability to work with a large corpus of text data; ability to document code so that it is clear and replicable
  • Student eligibility (as of Spring 2019): freshman, sophomore, junior, senior, master’s
  • International students on F1 or J1 visa: eligible