The federal government spends billions of dollars a year supporting rural broadband (internet access), subsidizing build-out in low-density areas that do not have broadband (unserved areas). However, it is not clear whether the rural areas most in need are receiving a fair share of the funding. Using a very large dataset of broadband availability, census data and recent auction results, the project will analyze whether unserved areas with high racial diversity or lower median income are receiving a fair share of funding. Depending on team size, we will also attempt to create a shareable master dataset, building on OpenStreetMap and other sources, that provides key data points for census units.


Water joined gold, oil and other commodities traded on Wall Street, highlighting worries that the life-sustaining natural resource may become scarce across more of the world. In California, the biggest U.S. agriculture market and the world’s fifth-largest economy, this challenge is particularly acute. Farmers, hedge funds and municipalities are now able to prepare for the risks that future water availability issues may bring in the state.


Recent advances in genomic technologies have led to the identification of many novel disease-gene associations, enabling more precise diagnoses. Along with the technologies enabling rapid DNA sequencing, multiple computational approaches have been developed to identify structural variants (i.e. relatively large deletions and duplications of genomic sequences). These workflows can lead to the identification of different structural variants, raising the risk of missing disease-causing variants when using only one of those methods.


Recent advances in genomic technologies have led to the identification of many novel disease-associated genes, enabling more precise diagnoses. Along with the technologies enabling rapid DNA sequencing, multiple computational approaches have been developed to extract the genetic information from raw data, including the Broad Institute’s GATK, Seven Bridges’ GenomeGraph and Google’s DeepVariant. These workflows can lead to the identification of different genetic variants, raising the risk of missing disease-causing variants when using only one of these methods.


With the explosive growth of medical literature, making sense of medical evidence is harder than ever. The free-text form also makes it difficult to perform evidence retrieval or appraisal. There is a great need for tools and methods that can structure and reason over medical evidence. The goal of this project is to develop computational and symbolic methods to extract evidence from PubMed abstracts, integrate it with evidence derived from real-world clinical data (or practice-based evidence), and perform automated knowledge discovery and evidence reasoning. We also hope this research can support evidence-based medicine during the COVID-19 pandemic and provide opportunities for students to hone their skills in natural language processing, data mining, deep learning, and semantic knowledge engineering. We have solid preliminary results for the students to build upon. An open-source PICO parser that extracts Population, Intervention, Comparison and Outcome information from PubMed abstracts has been developed and published. Current COVID-19 literature has been downloaded from PubMed and pre-processed. Preliminary analyses are under way to investigate the patterns in the study populations in COVID-19 clinical studies. Our next steps include, but are not limited to, evidence summarization at the study level and evidence reasoning at the problem/topic level.


Decoding behavioral signifiers for the brain state of vigilance can have far-reaching implications for understanding actions and identifying disease. We are using high-resolution video recordings of mice as they navigate a maze, but have access to very few pre-determined behavioral signifiers. Several recent publications used computer vision to extract a variety of previously unreachable aspects of behavior, including animal pose estimation and distinguishable internal states. These descriptions allowed for the identification and characterization of dynamics, which then revealed an unprecedented richness in the behaviors that determine decision making. Applying such computational approaches in our maze, in the context of behaviors that have been validated to measure choice and memory, can reveal dimensions of behavior that predict or even determine psychological constructs like vigilance. DSI scholars would use pose estimation analysis to evaluate behavioral signifiers for choice and memory and relate them to our real-time concurrent measures of neural activity and transmitter release. The students would also have the opportunity to examine how disease models known to impair performance on our maze task affect any identified signifier.



Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program aims to engage and support undergraduate and master's students in participating in data science-related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY