Call for Student Applications - Spring/Summer 2022

January 28, 2022 in Announcement

The Columbia University Data Science Institute is pleased to announce that the Data Science Institute (DSI) Scholars and Data For Good Scholars programs for Spring 2022 are open for applications.

The goal of the DSI Scholars Program is to engage Columbia University’s undergraduate and master’s students in data science research with Columbia faculty through a research internship. The program connects students with research projects across Columbia and provides student researchers with an additional learning experience and networking opportunities. Through unique enrichment activities, this program aims to foster a learning and collaborative community in data science at Columbia.

The Data For Good Scholars program connects student volunteers to organizations and individuals working for the social good whose projects have developed a need for data science expertise. Because they pose “real world” problems with real-world data, these projects are excellent opportunities for students to learn how data science is practiced outside the university setting and how to work effectively with people whose own subject area lies outside data science.

Comparison of four workflows for structural variants identification

January 28, 2022 in Open 2022, Open Flexible Timeline

Recent advances in genomic technologies have led to the identification of many novel disease-gene associations, enabling more precise diagnoses. Along with the technologies enabling rapid DNA sequencing, multiple computational approaches have been developed to identify structural variants (i.e., relatively large deletions and duplications of genomic sequences). Different workflows can identify different structural variants, which raises the risk of missing disease-causing variants when only one method is used. Unfortunately, many of the variants identified by those workflows are artifacts (i.e., absent from the biological sample), raising concerns that time and effort will be wasted on those artifacts instead of on analyzing the causative genetic variant. The goal of this project is to develop best practices that increase the chance of identifying causative structural variants while reducing the number of artifacts. We will use the raw data from whole-exome and whole-genome sequencing of patients with renal diseases. The students will be expected to (1) compare the output of 4 different tools for identifying structural variants and visualize the differences (using R or Python) and (2) identify the tool-specific parameters that increase the specificity and sensitivity of each tool in differentiating true variants from artifacts.
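As an illustration of what comparing tool outputs might involve, here is a minimal Python sketch that checks which tools report a matching call for a given structural variant. It is not the project's method: the tool names, the coordinates, the `SV` record layout, and the 50% reciprocal-overlap threshold are all assumptions chosen for the example.

```python
from typing import Dict, List, NamedTuple, Set


class SV(NamedTuple):
    """One structural variant call (hypothetical record layout)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL" or "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Overlap length divided by the longer call's length (0.0 if no match)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def supporting_tools(call: SV, calls_by_tool: Dict[str, List[SV]],
                     min_ro: float = 0.5) -> Set[str]:
    """Names of the tools with at least one call matching `call`."""
    return {
        tool
        for tool, calls in calls_by_tool.items()
        if any(reciprocal_overlap(call, other) >= min_ro for other in calls)
    }


# Made-up calls from four placeholder tools (assumption, not real data).
calls_by_tool = {
    "tool_a": [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 500, 900, "DUP")],
    "tool_b": [SV("chr1", 1100, 2100, "DEL")],
    "tool_c": [SV("chr1", 1000, 1200, "DEL")],
    "tool_d": [],
}

for call in calls_by_tool["tool_a"]:
    print(call, sorted(supporting_tools(call, calls_by_tool)))
```

Calls supported by several tools are often treated as less likely to be artifacts, but the threshold and the matching rule are design choices that would need to be evaluated on the actual data.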

D-Hacking

January 28, 2022 in Open 2022

It has long been recognized that there could be a tradeoff between optimizing the accuracy of machine learning predictions and satisfying definitions of fairness for protected or vulnerable groups. This tradeoff has led to an increased interest in finding models that both perform well and exhibit fairness properties.

DAPP: Blockchain Collaborative Expert Prediction Market Decentralized Applications

January 28, 2022 in Open 2022

Data For Good: Using NLP to Reveal the Cost of Corporate Human Rights Violations

January 28, 2022 in Open 2022

Rights CoLab (https://rightscolab.org/) is working with the Sustainability Accounting Standards Board (SASB) to define a strengthened set of disclosure standards that investors can use to persuade companies to improve labor rights for both their workforce and workers in supply chains. The project has two components: 1) a data science project, and 2) an Expert Group.

Database of variants/phenotype and ClinVar submissions

January 28, 2022 in Open 2022, Open Flexible Timeline

The IGM has performed diagnostic whole exome or whole genome sequencing on more than 5000 CUIMC patients with presentations including undiagnosed diseases of childhood, chronic kidney disease, fetal anomalies and neurological diseases (with a focus on epilepsy), among many others. These patients have been analyzed with a standardized diagnostic pipeline to identify single genotypes that are responsible for disease. Diagnostic genotypes are those considered likely to contribute to the patient’s presentation by consensus of the study team, a multidisciplinary group that includes population geneticists, molecular geneticists, clinicians, genetic counselors, bioinformaticians and analysts. The student will be expected to 1) build an easy-to-use interface and secure database for variant records and phenotype information, and 2) implement a web interface to facilitate variant submission to ClinVar, which is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

Detecting inherited compound heterozygous variants

January 28, 2022 in Open 2022, Open Flexible Timeline

One method for identifying novel gene-disease associations is “trio analysis”, in which the DNA of an affected child (the proband) and of both parents is sequenced. The genetic information from the child and the parents allows for the rapid identification of compound heterozygous variants (i.e., two variants in the same gene, one inherited from the mother and the other from the father).
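The definition above can be sketched in a few lines of Python. This is a simplified illustration, not the project's pipeline: the `TrioVariant` layout, the gene and variant identifiers, and the use of alternate-allele counts are assumptions, and the sketch ignores de novo variants, genotype quality, and variants carried by both parents.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class TrioVariant(NamedTuple):
    """One variant genotyped in a trio (hypothetical record layout)."""
    gene: str
    variant_id: str
    child: int   # alternate-allele count: 0, 1 or 2
    mother: int
    father: int


def parental_origin(v: TrioVariant) -> Optional[str]:
    """Return "maternal" or "paternal" when the origin is unambiguous."""
    if v.child != 1:
        return None  # only heterozygous calls in the child are considered
    if v.mother > 0 and v.father == 0:
        return "maternal"
    if v.father > 0 and v.mother == 0:
        return "paternal"
    return None  # carried by both parents or by neither: not resolved here


def compound_het_candidates(
    variants: List[TrioVariant],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map gene -> (maternal ids, paternal ids) when both lists are non-empty."""
    by_gene = defaultdict(lambda: {"maternal": [], "paternal": []})
    for v in variants:
        origin = parental_origin(v)
        if origin is not None:
            by_gene[v.gene][origin].append(v.variant_id)
    return {
        gene: (origins["maternal"], origins["paternal"])
        for gene, origins in by_gene.items()
        if origins["maternal"] and origins["paternal"]
    }


# Made-up genotypes for illustration only.
variants = [
    TrioVariant("GENE1", "v1", child=1, mother=1, father=0),
    TrioVariant("GENE1", "v2", child=1, mother=0, father=1),
    TrioVariant("GENE2", "v3", child=1, mother=1, father=0),
    TrioVariant("GENE3", "v4", child=1, mother=1, father=1),
]

print(compound_het_candidates(variants))
```

Only GENE1 is reported here, because it is the only gene with one variant traceable to each parent.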

Columbia Data Science Institute (DSI) Scholars Program