Our lives are heavily reliant on Internet-connected devices and services. To deliver the desired user experience over the Internet, however, network operators need to detect, diagnose, and resolve network events (e.g., disruptions, outages, misconfigurations) in real time. We have developed an Internet-wide measurement infrastructure that collects performance metrics (e.g., latency, jitter, throughput, packet loss rate, signal strength) at regular intervals from vantage points deployed by real users (mobile phones, WiFi access points, etc.).
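
As a minimal sketch of the kind of metrics such an infrastructure derives, the snippet below computes mean latency, jitter (mean absolute difference between consecutive samples), and loss rate from a series of round-trip latency probes. The function name, the use of `None` to mark a lost probe, and the sample values are all illustrative assumptions, not the project's actual pipeline.

```python
def summarize_latency(samples_ms):
    """Summarize latency probes (milliseconds); None marks a lost probe.

    Returns (mean latency, jitter, loss rate), where jitter is the mean
    absolute delta between consecutive received samples.
    """
    received = [s for s in samples_ms if s is not None]
    loss_rate = 1 - len(received) / len(samples_ms)
    mean = sum(received) / len(received)
    jitter = sum(abs(a - b) for a, b in zip(received, received[1:])) / (len(received) - 1)
    return mean, jitter, loss_rate

# Example: six probes, one lost.
samples = [20.1, 22.3, None, 19.8, 35.0, 21.2]
mean, jitter, loss = summarize_latency(samples)
print(f"mean={mean:.1f}ms jitter={jitter:.1f}ms loss={loss:.0%}")
```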

We are interested in investigating how deaths and hospitalizations resulting from opioid overdoses cluster across space and time in the US. This analysis will be conducted with the aid of two comprehensive databases: 1) detailed mortality data across the US; and 2) a stratified sample of all hospitalizations in the US, which can be subset to select for opioid overdoses. Analyses will be extended by drug type (prescription drugs, fentanyl, etc.) and subject demographics (age, race, etc.). We have previously conducted similar cluster analyses for other health phenomena.
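
To illustrate the basic idea of space-time clustering, the hypothetical sketch below bins events by (county, month) and flags cells whose counts are well above the overall mean. A real analysis would apply a proper scan statistic to the mortality and hospitalization databases; the county names, dates, and threshold here are invented for illustration only.

```python
from collections import Counter

# Each event is a (county, month) pair; these values are synthetic.
events = [
    ("county_A", "2020-01"), ("county_A", "2020-01"), ("county_A", "2020-01"),
    ("county_A", "2020-01"), ("county_B", "2020-01"), ("county_A", "2020-02"),
    ("county_B", "2020-02"), ("county_C", "2020-02"), ("county_C", "2020-03"),
]

# Count events per space-time cell, then flag cells at >= 2x the mean count.
counts = Counter(events)
mean = sum(counts.values()) / len(counts)
flagged = [cell for cell, c in counts.items() if c >= 2 * mean]
print(flagged)
```

A production analysis would replace the `2 * mean` cutoff with a statistically calibrated threshold (e.g., a Poisson likelihood-ratio scan), but the binning-then-flagging structure is the same.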

Through ArXivLab we aim to develop the next generation of recommender systems for the scientific literature using statistical machine learning approaches. In collaboration with ArXiv, we are currently developing a new scholarly literature browser that will extract knowledge implicit in the mathematical and scientific literature, offer advanced mathematical search capabilities, and provide personalized recommendations.

Effective representations and analyses of symbolic data, such as lexical data (words) and networks (graphs), have become of great interest in recent years, due both to advancements in data collection in Natural Language Processing (NLP) and to the ubiquity of social networks. Such data often have no natural numerical representation and are typically described in terms of relational expressions or pairwise similarities. It turns out that finding numerical representations of such data in "hyperbolic" spaces, rather than in the more familiar Euclidean spaces, is a more effective way to preserve valuable relational information.
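
A small sketch of why hyperbolic spaces suit relational data: in the Poincaré disk model, distance grows without bound as points approach the unit boundary, which gives exponentially more "room" near the edge and lets tree-like hierarchies embed with low distortion. The function below computes the standard Poincaré distance; the specific points are illustrative.

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincaré disk/ball model of hyperbolic space.

    u and v are points with Euclidean norm strictly less than 1.
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm = lambda p: sum(x * x for x in p)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1 - sq_norm(u)) * (1 - sq_norm(v))
    return math.acosh(1 + 2 * diff / denom)

# Euclidean distance between these points is only 0.95, but the
# hyperbolic distance is much larger because one lies near the boundary.
print(poincare_distance((0.0, 0.0), (0.95, 0.0)))
```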

The Federal Communications Commission (FCC) and the Census Bureau regularly publish data on U.S. Internet availability, performance, and use, at granularities from census block to county and state. The project's goal is to answer questions based on the available data, such as "How reliable is Internet access?", "Who is deploying fiber where?", "Can we predict the reliability of different technologies?", and "Can we predict the deployment of fiber?"

The ocean has absorbed the equivalent of 41% of industrial-age fossil carbon emissions. In the future, the rate of this ocean carbon sink will determine how much of mankind's emissions remain in the atmosphere and drive climate change. To quantify the ocean carbon sink, surface ocean pCO2 must be known, but it cannot be measured from satellite; instead, it requires direct sampling across the vast and dangerous oceans. Thus, there will never be enough observations to directly estimate the carbon sink as it evolves. Data science can fill this gap by offering robust approaches to extrapolate from sparse observations to full-coverage fields, given auxiliary data that can be measured remotely.
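
The extrapolation idea can be sketched as a regression problem: learn a mapping from a remotely sensed covariate (e.g., sea-surface temperature) to pCO2 at the sparse sampled locations, then apply it wherever the covariate is available. Everything below is synthetic and hypothetical; real reconstructions use richer covariates and nonlinear models, not a single-variable linear fit.

```python
import math
import random

random.seed(0)

# Synthetic "truth": pCO2 depends linearly on SST plus noise (illustration only).
full_sst = [10 + 20 * i / 99 for i in range(100)]   # full-coverage covariate field
true_pco2 = [280 + 4.0 * t for t in full_sst]

# Only a sparse subset is directly observed (ship/buoy sampling), with noise.
idx = random.sample(range(100), 12)
obs_sst = [full_sst[i] for i in idx]
obs_pco2 = [true_pco2[i] + random.gauss(0, 2) for i in idx]

# Ordinary least squares fit on the sparse observations.
n = len(idx)
mx = sum(obs_sst) / n
my = sum(obs_pco2) / n
slope = sum((x - mx) * (y - my) for x, y in zip(obs_sst, obs_pco2)) / \
        sum((x - mx) ** 2 for x in obs_sst)
intercept = my - slope * mx

# Extrapolate to the full field and check error against the synthetic truth.
pred = [intercept + slope * t for t in full_sst]
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true_pco2)) / len(pred))
print(f"slope={slope:.2f}, rmse={rmse:.2f}")
```

The point of the sketch: a dozen noisy samples recover the underlying relationship well enough to fill in the other 88 grid points, which is the same leverage that remotely sensed auxiliary data provide at ocean scale.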

Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program engages and supports undergraduate and master's students in data science related research with Columbia faculty. The program's unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY