Mapping and disrupting business email compromise criminal networks

September 1, 2021 in Open Fall 2021

Business email compromise (BEC) is a prevalent cyber attack in which the attacker impersonates a figure of authority or legitimacy (e.g., the CEO or a business associate) and asks the target to wire money to a bank account controlled by the attacker. Based on FBI estimates, attackers have stolen over $22B through fraudulent wire transfers in the past several years. Such attacks have affected a very wide range of individuals and institutions, from the world’s largest and most sophisticated companies (e.g., Google, Facebook) to government and public entities, and even individuals whose house down payment was stolen by an attacker pretending to be their mortgage broker.

Mapping NYPD Subway Fare Evasion Enforcement

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

This project is the next phase in ongoing research to document how the MTA and NYPD use public resources to criminalize poverty at the subway turnstile, especially in Black and Brown communities.

Meta-analysis of single-cell genomic data to define cellular heterogeneity and dynamics in atherosclerotic vasculature

September 1, 2021 in Open Fall 2021

Atherosclerosis, a chronic inflammatory disease of the artery wall, is the underlying cause of human coronary heart disease. Single-cell genomics has catalyzed a revolution in the understanding of cellular heterogeneity and dynamics in atherosclerotic vasculature. The goal of the project is to perform a meta-analysis that leverages published data and our own single-cell genomic data. Meta-analysis allows integrated analysis of much larger cell numbers, helps resolve the full spectrum of cellular heterogeneity and dynamics in atherosclerotic vessels, and facilitates therapeutic translation. The DSI scholar will: (1) use the latest bioinformatic pipeline to integrate the existing scRNA-seq, CITE-seq, and scATAC-seq datasets; (2) analyze the integrated datasets using R/Bioconductor packages (e.g., Seurat); (3) interpret the data using pathway and network analysis. Some relevant workflows are available through the “Resources” page of our lab website at https://hanruizhang.github.io/zhanglab/.

Multi-class probabilistic clustering of the human gut microbiome

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

Understanding the structure and function of the human gut microbiome is expected to revolutionize healthcare due to its many associations with human disease. A critical step in microbiome analysis is a clustering stage, in which genomic sequences of unknown origin are assigned to latent genomes present in the sample. Current clustering methods rely on mixture models, yet these fail to correctly model genomic sequences that are shared across multiple genomes. These sequences are of great importance, often encoding antibiotic resistance genes that drive resistant outbreaks. This project’s goal is to develop a clustering algorithm that effectively clusters both shared and unique genomic sequences. We have developed two probabilistic models, both based on hierarchical Poisson factorization, that have already produced promising results. The project will refine these models by robustly evaluating them, determining their limitations, and designing new models that improve upon them. A successful project will enable, for the first time, scalable and comprehensive reconstruction of bacterial genomes. In turn, this will enable a large-scale analysis of antimicrobial resistance in the context of the human gut microbiome. We anticipate that a successful project will result in an exciting publication.
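To illustrate why a factorization can represent shared sequences where a hard mixture assignment cannot, here is a minimal sketch of plain Poisson factorization fitted by maximum likelihood with multiplicative updates. It is not the project's hierarchical Bayesian model; the function names, the toy coverage matrix, and the choice of two latent genomes are all assumptions made for illustration.

```python
import math
import random


def reconstruct(W, H):
    """Return the matrix of Poisson rates W @ H as nested lists."""
    n_cols = len(H[0])
    return [[sum(w * H[a][j] for a, w in enumerate(row)) for j in range(n_cols)]
            for row in W]


def kl_divergence(X, L, eps=1e-12):
    """Generalized KL divergence between counts X and rates L.

    Up to a constant in X, this is the negative Poisson log-likelihood.
    """
    total = 0.0
    for x_row, l_row in zip(X, L):
        for x, l in zip(x_row, l_row):
            total += l - x
            if x > 0:
                total += x * math.log(x / (l + eps))
    return total


def poisson_factorize(X, k, iters=300, seed=0, eps=1e-12):
    """Fit X[i][j] ~ Poisson(sum_a W[i][a] * H[a][j]) with non-negative W, H.

    Rows of X are sequences, columns are samples, k is the number of
    latent genomes. Uses multiplicative (KL-divergence) updates.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update W while holding H fixed.
        L = reconstruct(W, H)
        R = [[X[i][j] / (L[i][j] + eps) for j in range(m)] for i in range(n)]
        for i in range(n):
            for a in range(k):
                num = sum(R[i][j] * H[a][j] for j in range(m))
                W[i][a] *= num / (sum(H[a]) + eps)
        # Update H while holding W fixed.
        L = reconstruct(W, H)
        R = [[X[i][j] / (L[i][j] + eps) for j in range(m)] for i in range(n)]
        for a in range(k):
            col_mass = sum(W[i][a] for i in range(n)) + eps
            for j in range(m):
                num = sum(W[i][a] * R[i][j] for i in range(n))
                H[a][j] *= num / col_mass
    return W, H


def memberships(W, H):
    """Share of each sequence's expected count attributed to each latent genome."""
    mass = [sum(h_row) for h_row in H]
    result = []
    for row in W:
        parts = [w * s for w, s in zip(row, mass)]
        total = sum(parts)
        result.append([p / total for p in parts])
    return result


# Made-up coverage counts (sequences x samples): two sequences from one
# genome, two from another, and a final sequence carried by both.
X = [
    [10, 2, 8, 1],
    [20, 4, 16, 2],
    [1, 9, 2, 12],
    [2, 18, 4, 24],
    [11, 11, 10, 13],
]

W, H = poisson_factorize(X, k=2)
for seq_id, shares in enumerate(memberships(W, H)):
    print(seq_id, [round(s, 2) for s in shares])
```

Each row of the printed memberships sums to one, so a sequence may split its weight across several latent genomes instead of being forced into a single cluster. Non-negative factorizations are generally not unique, so the exact shares depend on initialization.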

Natural Language Processing of social media data on COVID-19 travel pattern analysis

September 1, 2021 in Open Fall 2021

COVID-19 has transformed every aspect of people’s lives, including how they travel and work. This project aims to use Natural Language Processing (NLP) on social media data (e.g., tweets, Facebook posts, Reddit posts) to mine people’s travel patterns and the timeline of telecommuting. Research questions are:

Organizational Statements and Inequality

September 1, 2021 in Closed Fall 2021

This project focuses on using text (from company statements and job postings) to better understand inequality in labor market outcomes. It will require some data scraping, managing large amounts of text data (e.g., captures from company websites), using NLP to better understand trends in the data, and using supervised machine learning (SML) to code key elements in the text. It may also involve running descriptive statistics and comparing text data across time periods (e.g., how the language changes).
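As a rough illustration of comparing language across time periods, the sketch below computes how the relative frequency of each word shifts between two sets of statements. The tokenizer, the function names, and the example sentences are all hypothetical and not drawn from the project's data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_rates(documents):
    """Relative frequency of each token across a list of documents."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def rate_changes(before, after):
    """Change in relative frequency (after minus before) for every token."""
    r0, r1 = term_rates(before), term_rates(after)
    return {tok: r1.get(tok, 0.0) - r0.get(tok, 0.0) for tok in set(r0) | set(r1)}


# Made-up statements standing in for two scraped time periods.
earlier = ["We hire the best talent.", "Join our fast team."]
later = ["We value diversity and equity.", "Diversity makes our team stronger."]

changes = rate_changes(earlier, later)
# List the five words whose share of the text grew the most.
for token, delta in sorted(changes.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{token}: {delta:+.3f}")
```

Raw frequency differences are only a first descriptive pass; a real analysis would likely need stop-word handling and some control for document length and source.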

Predicting gene expression from sequencing in Alzheimer’s Disease

September 1, 2021 in Open Fall 2021, Open Flexible Timeline

The goal of this project is to evaluate algorithms that predict gene expression directly from sequencing data. With the availability of large-scale sequencing data in ADSP and progress in machine learning methods, it is possible to model long-range interactions in the DNA sequence to infer intermediate phenotypes such as gene expression. First, we will test a deep-learning-based method called Enformer, which integrates long-range interactions (such as promoter-enhancer interactions) in the genome and predicts gene expression from sequence. Using available RNA-sequencing data on a small number of samples (e.g., the ROSMAP cohort), we will optimize the algorithm to improve prediction accuracy. Second, inferred expression in the ADSP cohorts will be used to test for association with Alzheimer’s Disease and related endophenotypes. Finally, we will incorporate datasets that will become available in the future, such as cell-type-specific ATAC-seq and disease-specific gene expression, to re-train the models and improve gene-expression prediction directly from sequencing data.
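Sequence-based deep learning models generally take DNA as a numeric matrix, not as a string. The sketch below shows one common representation, one-hot encoding, in plain Python. The channel order (A, C, G, T), the all-zero row for unknown bases, and the function name are assumptions for illustration and are not taken from the project or from any specific library.

```python
# Assumed channel order; real pipelines should follow the model's documentation.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Map a DNA string to a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


# Encode a short made-up sequence and show one row per base.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```

A matrix like this, computed over a long window around a gene, is the kind of input from which a model can learn interactions between distant positions.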

Columbia Data Science Institute (DSI) Scholars Program