One method for identifying novel gene-disease associations is “trio analysis”, in which the DNA of an affected child (the proband) and of both parents is sequenced. The genetic information from the child and the parents allows for the rapid identification of compound heterozygous variants (i.e., two variants in the same gene, one inherited from the mother and the other from the father).
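As a rough illustration of the compound heterozygous definition above, the following Python sketch flags genes in which the child carries at least one heterozygous variant inherited from the mother and at least one inherited from the father. The record layout (`id`, `gene`, and alternate-allele counts for `child`, `mother`, `father`), the gene names, and the function name `find_compound_hets` are all hypothetical, and the inheritance rule is deliberately simplified; this is not the project's actual pipeline.

```python
from collections import defaultdict


def find_compound_hets(variants):
    """Return genes with at least one maternally and one paternally inherited
    heterozygous variant in the child.

    Each variant is a dict with hypothetical keys: 'id', 'gene', and
    'child' / 'mother' / 'father' giving alternate-allele counts (0, 1 or 2).
    """
    by_gene = defaultdict(lambda: {"maternal": [], "paternal": []})
    for v in variants:
        if v["child"] != 1:
            continue  # the child must be heterozygous at this site
        # Simplified rule: the origin is unambiguous only when exactly one
        # parent is heterozygous and the other carries no alternate allele.
        if v["mother"] == 1 and v["father"] == 0:
            by_gene[v["gene"]]["maternal"].append(v["id"])
        elif v["father"] == 1 and v["mother"] == 0:
            by_gene[v["gene"]]["paternal"].append(v["id"])
    return {
        gene: origins
        for gene, origins in by_gene.items()
        if origins["maternal"] and origins["paternal"]
    }


# Made-up example records (gene names and variant ids are placeholders).
example_variants = [
    {"id": "var1", "gene": "GENE_A", "child": 1, "mother": 1, "father": 0},
    {"id": "var2", "gene": "GENE_A", "child": 1, "mother": 0, "father": 1},
    {"id": "var3", "gene": "GENE_B", "child": 1, "mother": 1, "father": 0},
    {"id": "var4", "gene": "GENE_B", "child": 1, "mother": 1, "father": 0},
    {"id": "var5", "gene": "GENE_C", "child": 1, "mother": 0, "father": 0},
]
candidates = find_compound_hets(example_variants)
```

Here only `GENE_A` qualifies: both `GENE_B` variants come from the mother, and the `GENE_C` variant is absent from both parents. A real analysis would also need to handle variant quality, phasing, and allele frequency filters, none of which are modeled in this sketch.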

Our primary objective for this work is to build a Gaussian Mixture Regression (GMR) model, usable globally, that corrects for bias in low-cost particulate matter (PM2.5) sensors. We will select 5-10 diverse sites where reference PM2.5 monitors and low-cost PM2.5 sensors are co-located and use them to build the GMR model. Recently, our team showed that GMR provides a higher-quality correction factor for PurpleAir PM2.5 sensors than multiple linear regression and random forest, in terms of both correlation and accuracy. We then plan to evaluate this model on at least 20 independent co-location datasets that the GMR model has not seen. There has been an exciting recent rise in commercially available low-cost sensors (LCS), such as PurpleAir (www.purpleair.com) and Clarity sensors, which, when paired with machine learning (ML) based correction algorithms, demonstrate high accuracy compared to co-located reference-grade monitors [5,6]. So far, these corrections have been limited to the few LCS locations that are co-located with expensive reference-grade monitors, while the potential of the thousands of sensors that are not co-located remains untapped. PurpleAir and similar devices have been deployed all over the world. Ideally, our global correction factor will allow more trustworthy data to be extracted from huge open-access databases of air pollution data such as PurpleAir.
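To make the idea of a GMR correction concrete, here is a minimal Python sketch of the prediction step only: given a joint Gaussian mixture over the raw low-cost reading x and the reference value y, the corrected value is the conditional mean E[y | x]. The mixture parameters below are invented for illustration (in practice they would be fitted to co-location data, for example by expectation-maximization), and the function names are hypothetical; this is not the team's implementation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmr_predict(x, components):
    """Conditional mean E[y | x] under a joint Gaussian mixture over (x, y).

    Each component is a dict with hypothetical keys: 'weight', 'mean_x',
    'mean_y', 'var_x' and 'cov_xy'.
    """
    # Unnormalized responsibility of each component for this raw reading.
    resp = [c["weight"] * normal_pdf(x, c["mean_x"], c["var_x"]) for c in components]
    total = sum(resp)
    prediction = 0.0
    for r, c in zip(resp, components):
        # Each component contributes its own local linear regression of y on x.
        local = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        prediction += (r / total) * local
    return prediction


# Invented parameters: a "clean air" regime and a "polluted" regime.
example_components = [
    {"weight": 0.6, "mean_x": 10.0, "mean_y": 8.0, "var_x": 16.0, "cov_xy": 12.0},
    {"weight": 0.4, "mean_x": 60.0, "mean_y": 35.0, "var_x": 225.0, "cov_xy": 110.0},
]
corrected = gmr_predict(25.0, example_components)
```

With a single component the prediction reduces to an ordinary linear correction; with several components the correction blends local linear fits, weighted by how well each regime explains the raw reading. The sketch ignores other covariates such as humidity or temperature and does not guard against readings so far from every component that all responsibilities underflow to zero.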

Structural variants (SVs) are large genomic alterations that can be implicated in disease. This project will focus on using novel genomic techniques to identify structural variants in genomic cold cases with neurological disorders. These “cold” cases have previously remained unsolved with standard genomic approaches. We will use optical genome mapping and long-read sequencing, together with novel bioinformatic techniques, to detect and analyze structural variants.

Road traffic crashes involving child passengers, child pedestrians, and child bicyclists are the leading cause of death for people aged 5 to 15 years in the USA. A total of 10,344 children died on US roads in the decade from 2010-2019; a further 4.2 million were hospitalized. Urban design—meaning the overall physical form of cities—is a modifiable environmental feature that can be changed to reduce the immense burden due to child road traffic injuries. Altering the overall configuration of a city’s transportation network affects the way children and other road users routinely travel through urban space, thereby altering children’s risks for being injured or killed in a road traffic crash.

The WHO has identified scientific misinformation as a public health crisis, calling it an “infodemic.” Social media allows misinformation to spread quickly and out-compete scientifically grounded information. Dear Pandemic is an innovative, multidisciplinary, social media-based science communication project led by women scientists at several institutions across the US and the UK. Its mission is to educate and empower individuals to successfully navigate the overwhelming amount of information. The goals are: 1) to disseminate trustworthy, comprehensive, and timely scientific content about the pandemic to lay audiences, and 2) to promote media and health literacy, equipping readers to better manage the COVID-19 infodemic within their own networks. More than one year after launch, the project has a combined monthly reach of more than 5 million people across 4 social media channels (2 Facebook pages in English and Spanish; Instagram; and Twitter).

The impact of an outage, congestion, hijacking, and many other Internet phenomena depends on how many users or how much traffic uses the affected route, but researchers lack visibility into how important routes are and, distressingly, seem to have lost hope of obtaining this information without proprietary datasets or privileged viewpoints. We believe that there is hope: new measurement methods and changes in Internet structure make it possible to construct an “Internet Traffic Map” identifying the locations (logical and perhaps geographical) of Internet users and major services, the paths between users and major services, and the relative activity levels (traffic, queries, or number of users) routed along these paths. We will construct this map. The realization of an Internet Traffic Map will be an Internet-scale effort with Internet-scale consequences that reach far beyond the research community.

Clostridioides difficile infection (CDI) is highly associated with antibiotic exposure, but it is uncertain which classes of antibiotics confer the greatest risk for CDI. This project will use the MarketScan database, a large commercial insurance billing database containing 40 million patient records, outpatient antibiotic prescription data, and ICD-based disease information, to test for associations between specific antibiotic classes and risk for CDI.

Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program aims to engage and support undergraduate and master's students in participating in data science-related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY