Columbia University Data Science Institute is pleased to announce that the Data Science Institute (DSI) and Data For Good Scholars programs for Fall 2019 are open for application.

The goal of the DSI Scholars Program is to engage Columbia University’s undergraduate and master’s students in data science research with Columbia faculty through a research internship. The program connects students with research projects across Columbia and provides student researchers with an additional learning experience and networking opportunities. Through unique enrichment activities, this program aims to foster a learning and collaborative community in data science at Columbia.

The Data For Good Scholars program connects student volunteers to organizations and individuals working for the social good whose projects have developed a need for data science expertise. As “real world” problems with real world data, these projects are excellent opportunities for students to learn how data science is practiced outside of the university setting and to learn how to work effectively with people for whom data science sits outside of their subject area.

Continue reading

Many of the cryptocurrency transactions have involved fraudulent activities including ponzi schemes, ransomware as well money-laundering. The objective is to use Graph Machine Learning methods to identify the miscreants on Bitcoin and Etherium Networks. There are many challenges including the amount of data in 100s of Gigabytes, creation and scalability of algorithms.

Continue reading

This project is the first comprehensive examination of African North Americans who crossed one of the U.S.-Canada borders, going either direction, after the Underground Railroad, in the generation alive roughly 1865-1930. It analyzes census and other records to match individuals and families across the decades, despite changes or ambiguities in their names, ages, “color,” birthplace, or other details. The main difficulty in making these matches is that the census data for people with a confirmed identity does not stay uniform decade after decade. Someone might be recorded not with their given name but instead a nickname (Elizabeth to Betsy); women can marry or get remarried and change their names; racial measures by a census taker may change (black to mulatto, or mulatto to white); someone might say they are from Canada, even when they were born in Kentucky, depending on how the question was asked; people who were estimating their ages might be 35 in 1870 and 40 in 1880 and 50 in 1890, for example.

Continue reading

Under United States securities laws corporations must disclose material risks to their operations. Human rights issues, especially in authoritarian countries, rarely show up in the information that data providers offer to investors, in part due to the risks to those subject to these abuses. The result is a dearth of data on human rights materiality and the tendency of investors to overlook human rights risks of the companies that they finance.

Continue reading

The function for much of the 3 billion letters in the human genome remain to be understood. Advances in DNA sequencing technology have generated enormous amount of data, yet we don’t have the tool to extract rules of how the genome works. Deep learning holds great potential in decoding the genome, in particular due to the digital nature of DNA sequences and the ability to handle large data sets. However, like many other applications, the interpretability of deep learning models hampers its ability to help understand the genome. We are developing deep learning architectures embedded with the principles of gene regulation and we will be leveraging millions of existing whole genome measurements of gene activity to learn a mechanistic model of gene regulation in human cells.

Continue reading

Mixture models are a popular technique for clustering and density estimation due to their simplicity and ease of use. However the success of these models relies crucially on specific assumptions these models make about the underlying data distribution. Gaussian mixture models, for instance, assume that the subpopulations within the data are Gaussians-like, and can thus lead to poor predictions on datasets with more complex intrinsic structures. A common approach in such situations is to resort to more complex data models. An interesting sparsely explored alternative is to find feature transformations that maintain the salient cluster information while simplifying the subpopulation structure, in effect making mixture models highly effective.

Continue reading

Designing high quality prediction models while maintaining social equity (in terms of ethnicity, gender, age, etc.) is critical in today’s world. Most recent research in algorithmic fairness focuses on developing fair machine learning algorithms such as fair classification, fair regression, or fair clustering. Nevertheless, it can sometimes be more useful to simply preprocess the data so as to “remove” sensitive information from the input feature space, thus minimizing potential discrimination in subsequent prediction tasks. We call this a “fair representation” of the data. A key advantage of using a fair data representation is that a practitioner can simply run any off-the-shelf algorithm and still maintain social equity without having to worry about it.

Continue reading

Author's picture

Columbia Data Science Institute (DSI) Scholars Program

The DSI Scholars Program is to engage and support undergraduate and master students in participating data science related research with Columbia faculty. The program’s unique enrichment activities will foster a learning and collaborative community in data science at Columbia.

Columbia University DSI

New York, NY