Mixture models are a popular technique for clustering and density estimation due to their simplicity and ease of use. However, the success of these models relies crucially on specific assumptions they make about the underlying data distribution. Gaussian mixture models, for instance, assume that the subpopulations within the data are Gaussian-like, and can thus lead to poor predictions on datasets with more complex intrinsic structures. A common approach in such situations is to resort to more complex data models. An interesting, sparsely explored alternative is to find feature transformations that maintain the salient cluster information while simplifying the subpopulation structure, in effect making mixture models highly effective.
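As a toy illustration of this alternative (a minimal sketch, not the method to be developed in the project), consider two concentric rings: a Gaussian mixture fit on the raw coordinates fails, while the same model fit on a single hand-crafted feature, the distance from the origin, recovers the clusters.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: a structure a Gaussian mixture cannot capture in raw coordinates.
X, y = make_circles(n_samples=1000, factor=0.4, noise=0.05, random_state=0)

# GMM on the raw 2-D coordinates.
raw_pred = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# A simple feature transformation: distance from the origin. Each ring becomes an
# approximately Gaussian bump in this 1-D space.
radius = np.hypot(X[:, 0], X[:, 1]).reshape(-1, 1)
transformed_pred = GaussianMixture(n_components=2, random_state=0).fit_predict(radius)

print("ARI on raw features:        ", adjusted_rand_score(y, raw_pred))
print("ARI on transformed features:", adjusted_rand_score(y, transformed_pred))
```

The project's goal is to learn such transformations from data rather than hand-craft them as done here.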
Designing high-quality prediction models while maintaining social equity (in terms of ethnicity, gender, age, etc.) is critical in today’s world. Most recent research in algorithmic fairness focuses on developing fair machine learning algorithms such as fair classification, fair regression, or fair clustering. Nevertheless, it can sometimes be more useful to simply preprocess the data so as to “remove” sensitive information from the input feature space, thus minimizing potential discrimination in subsequent prediction tasks. We call this a “fair representation” of the data. A key advantage of using a fair data representation is that a practitioner can simply run any off-the-shelf algorithm and still maintain social equity, without having to address fairness explicitly.
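One very simple instance of such preprocessing (a sketch for illustration only, assuming a single observed sensitive attribute; it is not the representation-learning approach envisioned here) is to regress each feature on the sensitive attribute and keep only the residuals, so the transformed features are linearly uncorrelated with that attribute and any off-the-shelf model can be trained on them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
s = rng.integers(0, 2, size=n)                   # sensitive attribute (e.g., a binary group label)
X = rng.normal(size=(n, 3)) + 0.8 * s[:, None]   # features partly correlated with s

# Remove the component of X that is linearly predictable from s.
projector = LinearRegression().fit(s.reshape(-1, 1), X)
X_fair = X - projector.predict(s.reshape(-1, 1))

# The transformed features are (close to) uncorrelated with the sensitive attribute.
print(np.corrcoef(s, X[:, 0])[0, 1], np.corrcoef(s, X_fair[:, 0])[0, 1])
```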
We are requesting a DSI Scholar position for an undergraduate to work with me and my collaborator Ben Holtzman (Lamont-Doherty Earth Observatory, LDEO). We have been collaborating on the development of novel machine learning applications to seismology, specifically unsupervised feature extraction from spectral properties of large numbers of small earthquakes. Our first application was published in Science Advances last year (Holtzman, Pate, Paisley, Waldhauser, and Repetto, “Machine learning reveals cyclic changes in seismic source spectra in Geysers geothermal field,” Science Advances 4, eaao2929, doi:10.1126/sciadv.aao2929, 2018). Currently, we are building a synthetic dataset to better understand the features that control clustering behavior and to compare different clustering methods.
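A toy version of that comparison might look like the sketch below (entirely hypothetical synthetic spectra controlled by a single shape parameter, not our actual dataset), where the same spectral features are clustered with two different methods and scored against the known generating labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
freqs = np.linspace(1, 50, 100)
corner = np.repeat([5.0, 20.0], 200)                    # two hypothetical "source types"
spectra = 1.0 / (1.0 + (freqs[None, :] / corner[:, None]) ** 2)
spectra += 0.02 * rng.normal(size=spectra.shape)        # measurement noise

labels_true = np.repeat([0, 1], 200)
features = np.log10(np.abs(spectra) + 1e-6)             # log-amplitude spectral features
for name, model in [("k-means", KMeans(n_clusters=2, n_init=10, random_state=0)),
                    ("GMM", GaussianMixture(n_components=2, random_state=0))]:
    pred = model.fit_predict(features)
    print(name, adjusted_rand_score(labels_true, pred))
```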
Recently, there have been multiple failures around the world of large tailings dams that store mining waste, with devastating impacts (e.g., https://en.wikipedia.org/wiki/Brumadinho_dam_disaster). These dams are unique in that they continue to be raised as waste piles up and can reach heights of 400 m. The risk and impact of failure increase as the dam gets taller. There are several thousand such dams around the world. The concept of the project is to develop continuous status monitoring and risk analysis of these dams, automatically, using globally available satellite data from multiple bands as well as regularly updated climate data products. Overtopping of the dam during an intense or persistent rainfall event is the leading mode of failure. Foundation failure, which leads to liquefaction or deformation of the dam, is the second leading failure mode.
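Purely as an illustration of what an automated screening rule could look like (hypothetical numbers and a deliberately crude water-balance heuristic, not the proposed analysis), one might flag elevated overtopping risk when rainfall accumulated over a recent window approaches the storage implied by the dam's current freeboard.

```python
import numpy as np

def overtopping_risk_flag(daily_rain_mm, freeboard_m, runoff_coeff=0.6,
                          catchment_to_pond_ratio=3.0, window_days=7):
    """Crude screening rule (illustrative only): compare the water depth accumulated
    over a rainfall window with the depth available below the crest (the freeboard)."""
    # Rolling total rainfall over the window, in mm.
    recent = np.convolve(daily_rain_mm, np.ones(window_days), mode="valid")
    # Convert to an equivalent pond-depth increase in metres (assumed coefficients).
    inflow_depth_m = runoff_coeff * catchment_to_pond_ratio * recent / 1000.0
    return inflow_depth_m > freeboard_m  # True where the available freeboard is exceeded

rain = np.array([0, 5, 2, 0, 80, 120, 150, 60, 10, 0], dtype=float)  # mm/day, made up
print(overtopping_risk_flag(rain, freeboard_m=0.5))
```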
Our lab develops open-source text mining software called NimbleMiner (http://github.com/mtopaz/NimbleMiner). We will work on improving the software using the latest machine learning techniques.
A major obstacle to the decarbonization of electricity production systems is the multi-scale (space and time) variability of wind, solar, and hydro energy sources. Much work is being done to understand the high-frequency variations in these sources from the perspective of grid integration. However, as with rainfall and other natural systems, these variables can exhibit log-log fractal scaling in space and time, such that the variance of the process increases with temporal duration and with spatial scale. Focusing on high-frequency variations thus grossly understates the systemic risk associated with these sources. Appropriate national grid design, including electricity storage allocation, needs to consider both the periodic annual-cycle variations and the quasi-periodic inter-annual variability, which have larger variance, as well as the phase lags in these variations across space. The proposed project would explore the development of a multi-level, hierarchical spatio-temporal model for wind or solar using data from the continental USA and its subregions, to explore stochastic simulations and multi-scale predictions of the associated risk that can inform system design and the development of financial instruments.
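As a small illustration of the scaling issue (simulated data, not the proposed hierarchical model), one can estimate how the variance of aggregated totals grows with the aggregation scale; a log-log slope above 1 indicates that variability at longer durations is larger than independent high-frequency fluctuations would suggest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy persistent series standing in for fluctuations in wind or solar output: AR(1).
phi, n = 0.9, 4096
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

# Variance of non-overlapping block totals at increasing aggregation scales.
scales = [1, 2, 4, 8, 16, 32, 64]
block_var = []
for m in scales:
    totals = x[: (n // m) * m].reshape(-1, m).sum(axis=1)
    block_var.append(totals.var())

slope = np.polyfit(np.log(scales), np.log(block_var), 1)[0]
print("log-log slope of aggregated variance vs. scale:", slope,
      "(a slope of 1.0 would correspond to independent fluctuations)")
```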
The development of computational data science techniques in natural language processing (NLP) and machine learning (ML) to analyze large and complex textual information opens new avenues to study intricate processes, such as government regulation of financial markets, at a scale unimaginable even a few years ago. This project develops scalable NLP and ML algorithms (classification, clustering, and ranking methods) that automatically classify laws into various codes/labels, rank feature sets based on the use case, and induce the best structured representations of sentences for various types of computational analysis.
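For the classification component, a minimal baseline might resemble the sketch below (toy sentences and hypothetical code labels, not the project's corpus or its final algorithms): a TF-IDF representation feeding a linear classifier that assigns a code to each legal text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy legal-style snippets with hypothetical codes/labels.
texts = [
    "The broker-dealer shall register with the commission before soliciting clients.",
    "Disclosure of material risks is required in the annual report to shareholders.",
    "No person may engage in insider trading on the basis of material nonpublic information.",
    "Issuers must file a registration statement prior to any public offering of securities.",
]
codes = ["registration", "disclosure", "trading", "registration"]

# TF-IDF features (unigrams and bigrams) feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, codes)

print(clf.predict(["Quarterly reports must disclose all material changes in risk."]))
```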