Natural Language Processing within the CONCERN Project

January 4, 2021 in Open Spring 2021, Open Summer 2021, Open Flexible Timeline 2021

The CONCERN project aims to develop models and tools that quantify clinician concern about patient deterioration in the inpatient setting for use in early warning scores. We have discovered and validated several signals within the Electronic Health Record (EHR) that measure clinician concern, and have demonstrated that our approach identified patients at risk of deterioration earlier than other methods, which focus only on physiological data. One of our approaches leverages the documentation of certain concepts in the narrative text of nursing notes that are consistent with concern about a patient. However, this narrative free text is not easily accessible: it is often mixed together with structured or templated text and varies across note types. The steps to be performed are

Measuring Liberal Arts: Creating an Index for Higher Education

September 8, 2020 in Open Projects Fall 2020

This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist for various tasks on a project that uses analysis of multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and developing and improving existing data science and machine learning skills. Tasks likely include: (1) Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data. (2) Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records. (3) Merging multiple forms of data into a single dataset. (4) Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.

Phenotyping COVID-19 patients using NLP and clinical notes

May 18, 2020 in Project Summer 2020-2

Our lab is using clinical notes to phenotype COVID-19 patient outcomes. The aim is to better understand the sequelae of COVID-19 from clinical notes.

Creating accurate and equitable data resources for precision public health using machine learning tools

January 15, 2020 in Project Spring 2020, Project Summer 2020

The objective of this project is to construct linkages across disparate public health data systems using machine learning tools and assess them for bias and equitable representation of subpopulations defined by demographic and socioeconomic factors.

Measuring Liberal Arts: Creating an Index for Higher Education

September 30, 2019 in Project Fall 2019

This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist for various tasks on a project that uses analysis of multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and developing and improving existing data science and machine learning skills. Tasks likely include: (1) Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data. (2) Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records. (3) Merging multiple forms of data into a single dataset. (4) Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.

Project: A Data-driven Approach for Improving the User Experience of Internet Users

January 11, 2019 in Project 2019

Our lives are heavily reliant on Internet-connected devices and services. However, to deliver the desired user experience over the Internet, network operators need to detect and diagnose various network events (e.g., disruptions, outages, misconfigurations) and resolve them in real time. We have developed an Internet-wide measurement infrastructure that collects performance metrics (e.g., latency, jitter, throughput, packet loss rate, signal strength) at regular intervals from vantage points deployed by real users (mobile phones, WiFi access points, and similar devices).

Project: Advancing Public Health Monitoring and Analytics in New York City through Development of a Master Person Index

January 11, 2019 in Project 2019

Data is central to the NYC Department of Health’s mission to protect and promote the health of all New Yorkers. The agency’s many programs often require large-scale record linkages that integrate data on individuals across multiple public health data systems and disease registries. We are implementing a Master Person Index (MPI) system to centralize, optimize, and standardize matching methodology for administrative data across the Department of Health.
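
To illustrate what linking records across systems can involve, here is a minimal sketch of deterministic matching on a normalized key. It is not the Department of Health's methodology: the field names (`first`, `last`, `dob`, `source_id`), the sample records, and the exact-match rule are all assumptions made for this example, and a production MPI would typically add probabilistic or fuzzy matching.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercase a name and drop accents, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = [c for c in decomposed if c.isalpha()]
    return "".join(letters).lower()


def match_key(record: dict) -> tuple:
    """Hypothetical match key: normalized first name, last name, and date of birth."""
    return (
        normalize_name(record["first"]),
        normalize_name(record["last"]),
        record["dob"],
    )


def build_index(records: list) -> dict:
    """Group source record IDs that share the same match key."""
    index = defaultdict(list)
    for record in records:
        index[match_key(record)].append(record["source_id"])
    return dict(index)


# Made-up records from two hypothetical source systems.
records = [
    {"source_id": "registry_a:1", "first": "José", "last": "O'Neil", "dob": "1980-02-03"},
    {"source_id": "registry_b:7", "first": "jose", "last": "ONeil", "dob": "1980-02-03"},
    {"source_id": "registry_b:9", "first": "Ana", "last": "Lee", "dob": "1975-11-30"},
]

index = build_index(records)
for key, source_ids in index.items():
    print(key, source_ids)
```

Exact matching on a normalized key is easy to audit but misses typos and name changes, which is one reason a centralized index would want a standardized and tunable matching method.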

Columbia Data Science Institute (DSI) Scholars Program