Evaluating Machine learning algorithms in Earth science

January 4, 2021 in Open Spring 2021, Open Summer 2021, Open Flexible Timeline

Since the industrial revolution, the atmosphere has continued to warm due to an accumulation of carbon. Terrestrial ecosystems play a crucial role in mitigating the effects of climate change by storing atmospheric carbon in biomass and soils. Informing carbon reduction policy requires an accurate quantification of land-air carbon fluxes. Direct monitoring of surface carbon fluxes at a few locations across the globe provides valuable observations of the terrestrial CO2 exchange. However, these data are sparse in both space and time, and thus cannot provide an estimate of global spatiotemporal changes or of rare extreme conditions (droughts, heatwaves). In this project we will first work with synthetic data: we will sample CO2 fluxes from a simulation of the Earth system at the observation locations and then use various machine learning algorithms (neural networks, boosting, GANs) to reconstruct the model’s CO2 flux at all locations. We will then evaluate the performance of each method using a suite of regression metrics. Finally, time permitting, we will apply these methods to real observations. This project provides a way of evaluating the performance of machine learning methods as they are used in Earth science.
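The workflow described above (sample a simulated field at sparse observation sites, fit a model, reconstruct everywhere, score against the known truth) can be sketched in a few lines. This is only an illustration under stated assumptions: the synthetic flux formula, the driver variables (temperature, radiation), and all function names are hypothetical, and an ordinary least-squares fit stands in for the neural networks, boosting, and GANs the project names.

```python
import numpy as np


def make_synthetic_flux(n_cells=2000, seed=0):
    """Hypothetical stand-in for a simulated Earth system (not a physical model)."""
    rng = np.random.default_rng(seed)
    temperature = rng.uniform(-10.0, 35.0, n_cells)  # illustrative driver
    radiation = rng.uniform(50.0, 350.0, n_cells)    # illustrative driver
    # Arbitrary smooth relationship between drivers and flux, plus noise.
    flux = (0.02 * radiation
            - 0.001 * (temperature - 20.0) ** 2 * radiation / 100.0
            + rng.normal(0.0, 0.3, n_cells))
    return np.column_stack([temperature, radiation]), flux


def design_matrix(drivers):
    """Polynomial features used by the least-squares stand-in model."""
    t, r = drivers[:, 0], drivers[:, 1]
    return np.column_stack([np.ones_like(t), t, r, t * r, t**2, t**2 * r])


def fit_least_squares(drivers, flux):
    """Fit coefficients using only the sampled (observed) cells."""
    coef, *_ = np.linalg.lstsq(design_matrix(drivers), flux, rcond=None)
    return coef


def regression_metrics(y_true, y_pred):
    """A small suite of regression metrics: RMSE, mean bias, and R^2."""
    err = y_pred - y_true
    ss_res = float(np.sum(err**2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err**2))),
        "bias": float(np.mean(err)),
        "r2": 1.0 - ss_res / ss_tot,
    }


def run_experiment(n_obs=50, seed=0):
    """Sample sparse sites, reconstruct all cells, score on unobserved cells."""
    drivers, flux = make_synthetic_flux(seed=seed)
    rng = np.random.default_rng(seed + 1)
    obs_idx = rng.choice(len(flux), size=n_obs, replace=False)
    coef = fit_least_squares(drivers[obs_idx], flux[obs_idx])
    reconstructed = design_matrix(drivers) @ coef
    held_out = np.ones(len(flux), dtype=bool)
    held_out[obs_idx] = False  # evaluate only where no observation was used
    return regression_metrics(flux[held_out], reconstructed[held_out])


if __name__ == "__main__":
    print(run_experiment())
```

Because the truth is known at every cell in the synthetic setting, each candidate method can be scored on the cells it never saw, which is what makes the comparison between methods fair.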

Reconstructions of the ocean carbon sink: Assessment of regional uncertainties

January 4, 2021 in Open Spring 2021, Open Summer 2021

The ocean significantly mitigates climate change by absorbing fossil fuel carbon from the atmosphere. Cumulatively since preindustrial times, the ocean has absorbed 40% of emissions. To understand past changes, diagnose ongoing changes, and predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. However, the ocean is poorly sampled, so we cannot do this directly from in situ measurements.

Streaming video analysis and optimization during Work-from-Home period

January 4, 2021 in Open Spring 2021, Open Summer 2021, Open Flexible Timeline 2021

The goal of this project is to collect anonymized traces from the Columbia network in order to analyze video traffic characteristics during the work/study-from-home period. This information will be used to develop various ML-based tools for Quality of Experience (QoE) measurement. To preserve user privacy, we will perform feature extraction at collection time and use anonymization techniques (e.g., IP address anonymization). Students will analyze and measure encrypted network traffic to provide ground truth for potential RL/ML algorithms that estimate video QoE and identify devices, applications, and events (e.g., the start of a video streaming session). These algorithms can serve as a basis for new video adaptation techniques (see, for example, https://wimnet.ee.columbia.edu/wimnet-team-wins-3rd-place-in-the-acm-mmsys20-twitch-grand-challenge/).
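As an illustration of what IP address anonymization at collection time can look like, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the Python standard library. The project description does not say which technique is used, so this is an assumption; the function name `anonymize_ip`, the 16-character truncation, and the example key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ip(address: str, secret_key: bytes) -> str:
    """Map an IP address to a stable pseudonym using a keyed hash.

    Assumption: pseudonyms only need to be consistent within one
    collection run, so the same address always maps to the same token.
    """
    ip = ipaddress.ip_address(address)  # raises ValueError on malformed input
    digest = hmac.new(secret_key, ip.packed, hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (hypothetical choice)


if __name__ == "__main__":
    key = b"example-key-replace-me"  # hypothetical; keep the real key secret
    print(anonymize_ip("192.0.2.10", key))
    print(anonymize_ip("2001:db8::1", key))
```

A keyed hash keeps flows from the same host linkable without storing the address itself, but it discards subnet structure; analyses that need prefixes to be preserved would call for a different scheme.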

Using Data Science to Improve Telephone Triage of Ophthalmology Patients

January 4, 2021 in Open Spring 2021

Health care professionals cannot examine every person calling the office with a question nor can they return every call. Therefore, medical offices seeking to improve the speed and efficiency of evaluating and triaging patients must utilize telephone personnel who are often non-clinical staff. These telephone triage personnel may be limited in their knowledge and ability to obtain the necessary details of the patient’s medical symptoms and direct medical care accordingly. Their role is not to make diagnoses by phone, but rather to collect sufficient data related to the patient’s complaints and assign them appropriately in order to get the patient to the right level of care with the right provider in the right place at the right time.

Using speech and language to identify patients at risk for hospitalizations and emergency department visits in homecare

January 4, 2021 in Closed Spring 2021, Closed Summer 2021, Closed Flexible Timeline 2021

This study is the first step in exploring an emerging and previously understudied data stream: verbal communication between healthcare providers and patients. A partnership between Columbia Engineering, the School of Nursing, Amazon, and the largest home healthcare agency in the US, the study will investigate how to use audio-recorded routine communications between patients and nurses to help identify patients at risk of hospitalization or emergency department visits. The study will combine speech recognition, machine learning, and natural language processing to achieve its goals.

California's Water Data Challenge: Generating User-guided Prediction of Water Supply in the Californian Rivers

September 8, 2020 in Open Projects Fall 2020

Freshwater supply is critical for managing and meeting human and ecological demands. However, while stocks of water in both natural and artificial reservoirs help increase availability, droughts, floods, and whiplash events affect the reliability of these systems, with grave consequences for water users. This risk is particularly salient in the state of California, where many local communities have been plagued by extreme hydrological events. In this research, we contribute to California’s Water Data Challenge, in which a diverse group of volunteers convened to form a multi-disciplinary team that addresses the crucial issue of extreme events in California using data science approaches. Members include researchers and professionals from a range of backgrounds in academia and the private sector. We combine a range of publicly available datasets with Machine Learning (ML) techniques to explore the predictability of extreme events during California’s water years. More specifically, we use a variety of water districts to showcase how ML prediction models can not only predict the flow of water at varying time horizons but also capture uncertainties posed by climate and human influences.

Data For Good: Developing Predictive Model for Project Cost Estimate

September 8, 2020 in Open Projects Fall 2020, Data For Good

NYC DDC has initiated a machine learning project to develop a predictive model for estimating the cost of projects and work items. Using the latest techniques in machine learning and advanced statistics, NYC DDC aims to develop a model that predicts the cost of future and active projects and construction work items in different phases of the project lifecycle, based on historical data. DDC has partnered with Microsoft, which is providing proof-of-concept guidance and making tools available for the proof-of-concept development. DDC is seeking the assistance of a data scientist from the Town and Gown program to develop the model.

Columbia Data Science Institute (DSI) Scholars Program