Recent advances in genomic technologies have led to the identification of many novel disease-gene associations, enabling more precise diagnoses. Alongside the technologies enabling rapid DNA sequencing, multiple computational approaches have been developed to identify structural variants (i.e., relatively large deletions and duplications of genomic sequences). Different workflows can identify different sets of structural variants, so relying on only one of those methods raises the risk of missing disease-causing variants.

Recent advances in genomic technologies have led to the identification of many novel disease-associated genes, enabling more precise diagnoses. Alongside the technologies enabling rapid DNA sequencing, multiple computational approaches have been developed to extract genetic information from raw data, including the Broad Institute’s GATK, Seven Bridges’ GenomeGraph and Google’s DeepVariant. Different workflows can identify different sets of genetic variants, so relying on only one of these methods raises the risk of missing disease-causing variants.
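As a rough illustration of why relying on one workflow can miss variants, the sketch below compares call sets from several callers and reports which variants all callers agree on and which are reported by only one. It is a minimal sketch, not the project's method: the caller names, the toy variants, and the helper `compare_call_sets` are all hypothetical, and a variant is assumed to be identified by chromosome, position, reference allele and alternate allele.

```python
from collections import Counter

# Toy call sets keyed by a hypothetical caller name. Each variant is a
# (chromosome, position, reference allele, alternate allele) tuple.
# These values are invented for illustration, not real caller output.
calls = {
    "caller_a": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    "caller_b": {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")},
    "caller_c": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
}


def compare_call_sets(calls):
    """Summarise agreement between variant sets from several callers."""
    # Count how many callers report each variant.
    support = Counter(v for variants in calls.values() for v in variants)
    # Every variant reported by at least one caller.
    union = set(support)
    # Variants reported by all callers.
    consensus = {v for v, n in support.items() if n == len(calls)}
    # Variants that only a single caller reports, grouped by that caller.
    unique = {
        name: {v for v in variants if support[v] == 1}
        for name, variants in calls.items()
    }
    return union, consensus, unique


union, consensus, unique = compare_call_sets(calls)
print(len(union), "variants in total;", len(consensus), "found by every caller")
for name, variants in unique.items():
    print(name, "alone reports", sorted(variants))
```

In this toy example, a pipeline that used only `caller_a` would never see the `chr3` variant, which is the kind of gap that comparing several workflows is meant to expose.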

Since the industrial revolution, the atmosphere has continued to warm due to an accumulation of carbon. Terrestrial ecosystems play a crucial role in quelling the effects of climate change by storing atmospheric carbon in biomass and in the soils. Informing carbon reduction policy requires an accurate quantification of land-air carbon fluxes. Direct monitoring of surface carbon fluxes at a few locations across the globe provides valuable observations of the terrestrial CO2 exchange. However, this data is sparse in both space and time, and is thus unable to provide an estimate of global spatiotemporal changes or of rare extreme conditions (droughts, heatwaves). In this project we will first generate synthetic data by sampling CO2 fluxes from a simulation of the Earth system at observation locations, and then use various machine learning algorithms (neural networks, boosting, GANs) to reconstruct the model’s CO2 flux at all locations. We will then evaluate the performance of each method using a suite of regression metrics. Finally, time permitting, we will apply these methods to real observations. This project provides a way of evaluating the performance of machine learning methods as they are used in Earth science.
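The evaluation step can be sketched as follows. The abstract does not say which regression metrics the project uses, so the choice here (mean absolute error, root-mean-square error, mean bias and the coefficient of determination) is an assumption, and the function name `regression_metrics` and the toy flux values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compare reconstructed values against reference values.

    y_true: reference fluxes (e.g. from the simulation), one per location/time.
    y_pred: fluxes reconstructed by a machine learning method, same order.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # positive means the method overestimates on average

    # Coefficient of determination: 1 - residual variance / total variance.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "bias": bias, "r2": r2}


# Toy values for illustration only; units are arbitrary.
reference = [0.0, 2.0]
reconstructed = [1.0, 1.0]
print(regression_metrics(reference, reconstructed))
```

Running the same function on each method's reconstruction gives a like-for-like comparison. A constant prediction equal to the mean, as in the toy example, scores an R² of zero, which is a useful baseline when judging more complex models.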

Advances in genomic technologies have led to the identification of many novel disease-gene associations, allowing medical diagnoses to be more precise and tailored to an individual. However, the high number of variants present in each individual represents a significant challenge for the implementation of genomic medicine. The goal of this project is to enable the identification of novel genes associated with recessive disorders.

The goal of this project is to study the molecular background of various congenital disorders affecting the cranial nerves, which are important for the senses (hearing, vision, smell), facial muscle movements and more. Abnormal cranial nerve development can cause hearing loss, eye-movement disorders, facial weakness, loss of smell, and difficulties with respiration and swallowing. Some individuals may also have other motor, sensory, intellectual, behavioral and social disabilities. These disorders cause significant disability and are caused by genetic variants, often novel or de novo. Unfortunately, disorders affecting the 8th cranial nerve, or vestibulocochlear nerve (CN VIII), which is important in hearing and balance, have been largely understudied. As various cranial nerves can be affected together, such as in Moebius syndrome, and as the vestibulocochlear nerve (CN VIII) and facial nerve (CN VII) also share a path in the internal auditory canal, it is likely that these disorders share underlying genes or closely interacting genes. To investigate the genetic architecture of cranial nerve abnormalities, we propose a molecular investigation of an in-house CN VIII cohort and other cranial dysinnervation cohorts. We will study rare genomic variants (both small variants and structural variants) to identify shared molecular pathways and genes amongst individuals with cranial dysinnervation disorders.

COVID-19 has changed the way we use the internet, from taking classes to social interactions and entertainment. The FCC publishes a large dataset of network measurements from thousands of homes, with gigabytes of data. The project goal is to analyze the data and answer questions such as: Has the increased usage reduced internet speeds? Can we tell how much people are staying at home from data usage records? Is the increased use of video conferencing reflected in the upload metrics?

In 2013, the Chinese government launched its grand initiative to eradicate rural poverty by 2020. The initiative has made great progress since then, yet little rigorous empirical evidence is available due to data limitations. This project aims to use big data from both official and social media to analyze the trends, achievements, and challenges of this initiative, and to draw out its implications for the future and from a comparative perspective.

Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program engages and supports undergraduate and master’s students participating in data science-related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY