Gaussian Mixture Regression to obtain useful, actionable air pollution data from consumer-grade, low-cost monitoring devices
Our primary objective for this work is to build a Gaussian Mixture Regression (GMR) model that corrects for bias in low-cost particulate matter (PM2.5) sensors and can be used globally. We will select 5-10 diverse co-locations of reference PM2.5 monitors and low-cost PM2.5 sensors to build the GMR model, and we then plan to evaluate it on at least 20 independent co-location datasets that the model has not seen. Recently, our team showed that GMR provides a higher-quality correction factor for PurpleAir PM2.5 sensors than multiple linear regression and random forest, in terms of both correlation and accuracy. There has been an exciting recent rise in commercially available low-cost sensors (LCS), such as PurpleAir (www.purpleair.com) and Clarity sensors, which, when paired with machine learning (ML) based correction algorithms, demonstrate high accuracy compared to co-located reference-grade monitors [5,6]. So far, these corrections have been limited to the few LCS locations that are co-located with expensive reference-grade monitors, while the potential of the thousands of un-co-located sensors remains untapped. PurpleAir and similar devices have been deployed all over the world. Ideally, our global correction factor will allow more trustworthy data to be extracted from huge open-access air pollution databases such as PurpleAir's.
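To make the regression step concrete, the following is a minimal sketch of how a fitted Gaussian mixture can be turned into a correction: the mixture models the joint density of the raw sensor reading and the reference reading, and the corrected value is the conditional mean of the reference given the raw reading. It is illustrative only and is not the project's actual model. It assumes a single predictor (raw PM2.5) for brevity, whereas a real correction would likely include additional inputs, and the mixture parameters in `COMPONENTS` are made-up values standing in for parameters that would be estimated from co-location data.

```python
import math

# Hypothetical mixture parameters for the joint density of
# (x = raw sensor PM2.5, y = reference PM2.5), both in ug/m^3.
# In practice these would be estimated from co-location data (e.g. with EM);
# the numbers below are invented purely for illustration.
COMPONENTS = [
    {"weight": 0.6, "mean_x": 10.0, "mean_y": 8.0, "var_x": 16.0, "cov_xy": 12.0},
    {"weight": 0.4, "mean_x": 60.0, "mean_y": 35.0, "var_x": 400.0, "cov_xy": 200.0},
]


def log_weighted_density(x, comp):
    """Return log(weight * N(x; mean_x, var_x)) for one mixture component."""
    return (
        math.log(comp["weight"])
        - 0.5 * math.log(2.0 * math.pi * comp["var_x"])
        - 0.5 * (x - comp["mean_x"]) ** 2 / comp["var_x"]
    )


def gmr_predict(x, components):
    """Return the conditional mean E[y | x] under a Gaussian mixture over (x, y)."""
    # Responsibility of each component for this x, computed in log space
    # and shifted by the maximum to avoid numerical underflow.
    log_r = [log_weighted_density(x, c) for c in components]
    shift = max(log_r)
    r = [math.exp(v - shift) for v in log_r]
    total = sum(r)

    prediction = 0.0
    for resp, c in zip(r, components):
        # Each component contributes its own linear regression of y on x.
        slope = c["cov_xy"] / c["var_x"]
        cond_mean = c["mean_y"] + slope * (x - c["mean_x"])
        prediction += (resp / total) * cond_mean
    return prediction


if __name__ == "__main__":
    for raw in (5.0, 20.0, 80.0):
        print(f"raw={raw:6.1f}  corrected={gmr_predict(raw, COMPONENTS):6.2f}")
```

With one component this reduces to ordinary linear regression. With several, the prediction blends the component-wise regression lines according to how likely each component is for the given reading, which is what lets the correction vary across pollution regimes. The mixture itself could be fitted with an existing implementation such as scikit-learn's `GaussianMixture`.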
This project is eligible for a matching fund stipend from the Data Science Institute. This is not a guarantee of payment, and the total amount is subject to available funding.
Faculty Advisor
- Professor: Daniel M Westervelt
- Center/Lab: Lamont-Doherty Earth Observatory
- Location: LDEO
- All things aerosol, air quality, atmospheric chemistry, and climate change. Using data science methods to extract useful data out of environmental sensors.
Project Timeline
- Earliest starting date: 3/1/2022
- End date: 8/1/2022
- Number of hours per week of research expected during Spring/Summer 2022: ~10
- Number of hours per week of research expected during Summer 2022: ~30
Candidate requirements
- Skill sets: R/Python, machine learning methods, probabilistic methods (e.g. Gaussian Mixture Models), work with sensor data
- Student eligibility: freshman, sophomore, junior, senior, master’s
- International students on F1 or J1 visa: eligible
- Academic Credit Possible: Yes