Electronic Health Records (EHR) provide a rich, integrated source of phenotypic information. They allow automated extraction and recognition of phenotypes from EHR narratives and provide an efficient framework for conducting epidemiological and clinical studies. In addition, when EHR are linked to genetic data in electronic biorepositories such as eMERGE and All of Us, phenotype information embedded in EHR can be used to efficiently construct cohorts powered for genetic discoveries. However, repurposing data generated by healthcare processes for research introduces limitations, including data sparseness, low-quality data, and diagnostic errors. Phenotyping algorithms are developed to overcome these limitations, providing a robust means to assess case status.

Hidradenitis suppurativa (HS) is a painful, recurrent inflammatory skin disease. While specific ICD diagnosis codes exist for HS, patients experience unusually long diagnostic delays, which limits the use of ICD codes alone to identify HS patients in EHR. Two recent studies have found that 20-30% of people receiving healthcare to manage HS symptoms lack an HS diagnosis code, representing tens of thousands of patients. Our goal is to develop a phenotyping algorithm that identifies HS patients on the basis of symptom treatment, allowing us to overcome the limitation of delayed diagnoses.

We have longitudinal EHR data for approximately 3,000 HS patients, including diagnosis codes, procedure codes, and medications.

This project is eligible for a matching fund stipend from the Data Science Institute. This is not a guarantee of payment, and the total amount is subject to available funding.

Faculty Advisor

  • Professor: Lynn Petukhova, PhD
  • Department/School: Dermatology/P&S; Epidemiology/MSPH
  • Location: Russ Berrie; All work will be conducted remotely
  • The overall goal of our research program is to use information in the human genome to improve the care of patients who suffer from inflammatory skin diseases, and ultimately to prevent the onset of these disorders in healthy people. Our main focus right now is on initiating translational genetic studies of hidradenitis suppurativa (HS) by developing algorithms to build and study cohorts in electronic health records (EHR).

Project Timeline

  • Earliest starting date: 10/1/2020
  • End date:
  • Number of hours per week of research expected during Fall 2020: ~10

Candidate requirements

  • Skill sets:
    • Experience applying unsupervised machine learning and deep learning algorithms (both temporal and non-temporal), and interpreting and visualizing the results of these models. Experience applying these algorithms to healthcare datasets is preferable but not necessary. As with many healthcare models, interpretability is key, so an interest in methods that explain why a model makes a certain choice is a plus.
    • This project will require investigating various methodologies to identify the best ones for our specific problem area and dataset. Therefore, being comfortable reading research papers and applying their contents is a must.
  • Student eligibility: freshman, sophomore, junior, senior, master’s
  • International students on F1 or J1 visa: eligible
  • Academic Credit Possible: Yes