Project: Classifying drug patents

Taking out multiple patents on different aspects of a drug in order to cordon off competitors is standard practice in pharmaceuticals. In addition to primary patents, firms commonly attempt to acquire secondary patents on alternative forms of molecules, different formulations, dosages, and compositions, and new uses. Policymakers in the U.S. and globally have raised concerns that these secondary patents can raise drug prices and restrict access to medicines. One challenge in assessing the impact of these patents is that it is difficult and costly to determine whether a given patent is “primary” or “secondary.”

Trained legal experts typically classify a patent as primary or secondary only after close reading, which is an obstacle to large-sample empirical analyses. I am hoping to classify patents based on their text using natural language processing techniques, drawing on the full text of issued patents and on training sets built from hand-coding of some patents by lawyers and patent examiners.
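To make the idea concrete, here is a minimal sketch of supervised text classification, assuming a bag-of-words multinomial naive Bayes model with add-one smoothing. The posting does not specify a method, so this is only one possible starting point. The names `tokenize`, `train`, `predict`, and `TOY_EXAMPLES` are hypothetical, and the four labeled snippets are invented for illustration. They are not real patent text or real expert labels.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy snippets standing in for hand-coded patents (assumption:
# real training data would be full patent text labeled by experts).
TOY_EXAMPLES = [
    ("a compound of formula I or a pharmaceutically acceptable salt thereof", "primary"),
    ("a novel compound having the structure of formula II", "primary"),
    ("an extended release tablet formulation comprising the compound", "secondary"),
    ("a method of treating migraine by administering a daily dosage", "secondary"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Collect per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TOY_EXAMPLES)
    print(predict(model, "a sustained release tablet formulation"))
```

In practice the training pairs would come from the hand-coded patents, and a held-out subset of those labels would be needed to check accuracy before applying any model to the full corpus.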

This is an UNPAID research experience.

Faculty Advisor

Professor Bhaven Sampat
Department/School: Health Policy and Management / Public Health
Location: CUMC

Project timeline

Earliest starting date: 03/01/2019
End date: 05/31/2019
Number of hours per week of research expected during Spring 2019: ~5

Candidate requirements

Skill sets: Natural language processing, machine learning in Python (and/or R); ability to work with a large corpus of text data; ability to document code so that it is clear and replicable
Student eligibility (as of Spring 2019): ~~freshman~~, ~~sophomore~~, junior, senior, master’s
International students on F1 or J1 visa: eligible

Columbia Data Science Institute (DSI) Scholars Program