Integrated Traffic-Communication Simulator for COSMOS testbed

September 30, 2019 in Project Fall 2019

Vehicle-to-vehicle (V2V) communication has received increasing attention with the development of autonomous driving technology. Algorithms that combine information from multiple vehicles and multiple data sources are widely seen as the direction of autonomous driving. However, concerns about the stability and reliability of the communication stand in the way of widespread adoption of V2V-based transportation. Rigorous testing is required before V2V can actually hit the road. Compared with costly field tests, simulation tests are more economical and feasible. To simulate V2V communication and evaluate the robustness of current V2V-based algorithms, we are developing a simulation platform that integrates existing simulation software such as SUMO, Veins, and OMNeT++. The platform runs on an actual map of New York and simulates vehicular communication in different scenarios and platoon configurations. Our next step is to use this platform to test our own V2V-based algorithms. The research will eventually provide an open platform that automatically evaluates user-designed algorithms with minimal manual work.

Learning the most crucial features of the galaxy merger histories

September 30, 2019 in Project Fall 2019

Galaxies in our universe form hierarchically, continuously merging and absorbing smaller galaxies over cosmic time. In this project we aim to identify the most important features of the merger histories of galaxies, and to generate efficient new features from them: namely, features that predict (or, physically speaking, determine) the properties of galaxies, such as their shape or color. This will be done using the results from a large cosmological simulation, IllustrisTNG (www.tng-project.org). We will begin by identifying ways to represent the rich information in the merger history. We will then compare various ML methods oriented towards feature selection or importance analysis: random forests or gradient boosted trees, L1-regularized SVMs, and neural networks (through analysis of e.g. saliency maps). More advanced models can also be applied, such as neural network models designed for feature selection. Finally, we wish to apply or develop methods that can build ‘interpretable’ new features by constructing them as algebraic formulas from the original input features (inspired by e.g. https://science.sciencemag.org/content/324/5923/81). The overarching goal is to understand better what in the merger history is most crucial in determining a galaxy’s present-day properties, an answer that can be widely applicable to problems in galaxy formation.
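
As an illustration of the kind of importance analysis mentioned above, the following is a minimal sketch of permutation importance, a model-agnostic technique that is not named in the project description and is used here only as an example. The toy data, the stand-in `model` function, and the feature meanings in the comments are all hypothetical and are not derived from IllustrisTNG.

```python
import random

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in squared error when one feature column is shuffled."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the target.
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical toy data: three made-up merger-history features per galaxy
# (e.g. a merger count, a mass ratio, and an irrelevant noise feature).
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] + 0.5 * row[1] for row in X]  # feature 2 plays no role

def model(row):
    # Stand-in for a fitted model; a real study would train one on simulation data.
    return 3.0 * row[0] + 0.5 * row[1]

scores = permutation_importance(model, X, y)
for j, score in enumerate(scores):
    print(f"feature {j}: importance {score:.4f}")
```

Features whose shuffling degrades the prediction most are ranked as most important; a feature the model ignores gets an importance of zero.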

Measuring Liberal Arts: Creating an Index for Higher Education

September 30, 2019 in Project Fall 2019

This project works with a novel corpus of text-based school data to develop a multi-dimensional measure of the degree to which American colleges and universities offer a liberal arts education. We seek a data scientist for various tasks on a project that uses analysis of multiple text corpora to better understand the liberal arts. This is an ongoing three-year project with opportunities for future collaborations, academic publications, and developing and improving existing data science and machine learning skills. Tasks likely include: (1) Using Amazon Web Services to create and maintain cloud-based storage (SQL, S3 buckets) of the project’s expanding library of data. (2) Extracting information (named entities, times, places, books, and so on) from millions of plain-text syllabus records. (3) Merging multiple forms of data into a single dataset. (4) Scraping websites for relevant information (e.g., college course offerings, school rankings). Some pages may include dynamically created content that requires the use of a program such as Selenium.
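
Task (2) above can be illustrated with a very small sketch. It covers only one category of information (dates and clock times) and uses plain regular expressions; extracting named entities, places, or book titles would need a proper NLP pipeline, which is not shown. The patterns, the function name `extract_dates_and_times`, and the sample text are assumptions made for illustration and do not describe the project's actual pipeline or data.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# "October 17, 2019" or "December 9" (the year is optional).
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}(?:,\s*\d{{4}})?\b")
# "10:10 AM", "2:30pm"
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s*(?:AM|PM|am|pm)\b")

def extract_dates_and_times(text):
    """Return the date and time strings found in a plain-text record."""
    return {"dates": DATE_RE.findall(text), "times": TIME_RE.findall(text)}

# Hypothetical syllabus snippet.
sample = "Midterm on October 17, 2019 at 10:10 AM. Final paper due December 9."
found = extract_dates_and_times(sample)
print(found)
```

Applied to millions of records, a function like this would typically be mapped over the stored text and its output written back to the project's database alongside the record identifier.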

Single-Cell Transcriptome Profiling in Atherosclerosis

September 30, 2019 in Project Fall 2019

Atherosclerosis, a chronic inflammatory disease of the artery wall, is the underlying cause of human coronary heart disease. Cells within atherosclerotic lesions are heterogeneous and dynamic. Their pathological features have been characterized by histology and flow cytometry and, more recently, by bulk-tissue omics profiling. Despite this progress, our knowledge of cell types and their roles in atherogenesis remains incomplete, because genomic measurements at the bulk level mask differences across cells. Single-cell RNA sequencing (scRNA-seq) has catalyzed a revolution in the understanding of cellular heterogeneity in organ systems and diseases. This project applies scRNA-seq to define the genetic influences on cell subpopulations and functions in atherosclerotic lesions of mice that are transgenic for candidate risk genes of human coronary heart disease, as inspired by human genomic discoveries. The students involved in this project are expected to work on: (1) analysis of scRNA-seq data using R/Bioconductor packages; (2) interpretation of the data using pathway and network analysis. Some relevant workflows are available through the “Resources” page of our lab website at https://hanruizhang.github.io/zhanglab/.

Waymo & Lyft Driverless Car Data Analysis and Driving Modeling

September 30, 2019 in Project Fall 2019

Autonomous driving is developing rapidly, with many breakthroughs emerging in both academia and industry. However, traffic accidents involving autonomous vehicles (AVs) have also occurred and raised public concern about AV safety. To ensure safety and reliability, rigorous testing and simulation are required before AVs can drive on public roads. Realistic data is an essential component of AV testing and simulation. Comprehensive, multi-regime self-driving data in sufficient quantity would help AV development.

Project: A Data-driven Approach for Improving the User Experience of Internet Users

January 11, 2019 in Project 2019

Our lives are heavily reliant on Internet-connected devices and services. However, to deliver the desired user experience over the Internet, network operators need to detect and diagnose various network events (e.g., disruptions, outages, misconfigurations) and resolve them in real time. We have developed an Internet-wide measurement infrastructure that collects performance metrics (e.g., latency, jitter, throughput, packet loss rate, signal strength) at regular intervals from vantage points deployed by real users (mobile phones, WiFi access points, etc.).
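
To make the idea of detecting network events from periodic measurements concrete, here is a minimal sketch that flags latency samples deviating strongly from the recent median. The project description does not say which detection method is used; the rolling median/MAD rule, the function name `flag_anomalies`, the window and threshold values, and the sample series are all assumptions chosen for illustration.

```python
from statistics import median

def flag_anomalies(samples, window=5, threshold=3.0):
    """Return indices whose value is far from the median of the preceding window.

    Distance is measured in units of the window's median absolute deviation (MAD).
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        med = median(history)
        # Guard against a zero MAD when the window is constant.
        mad = median(abs(v - med) for v in history) or 1e-9
        if abs(samples[i] - med) / mad > threshold:
            flagged.append(i)
    return flagged

# Hypothetical latency series in milliseconds, sampled at regular intervals.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20, 19]
anomalies = flag_anomalies(latency_ms)
print(anomalies)
```

A median-based rule is used here because a single spike in the window barely shifts the median, so one outlier does not mask the samples that follow it.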

Project: Analysis and Prediction of Opioid Outbreak Clusters

January 11, 2019 in Project 2019

We are interested in investigating how deaths and hospitalizations resulting from opioid overdoses cluster across space and time in the US. This analysis will be conducted with the aid of two comprehensive databases: 1) detailed mortality data across the US; and 2) a stratified sample of all hospitalizations in the US, which can be subset to select for opioid overdoses. Analyses will be extended to drug type (prescription drugs, fentanyl etc.) and subject demographics (age, race, etc.). We have previously conducted similar cluster analysis for other health phenomena.
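
One classical way to test whether events cluster jointly in space and time is the Knox test, sketched below. The project description does not state which clustering method will be used, so this is only an example of the general idea. The event coordinates (assumed to be already projected to kilometers), the distance and time thresholds, and the toy data are hypothetical.

```python
import math
import random

def knox_statistic(events, max_km, max_days):
    """Count event pairs that are close in both space and time.

    events: list of (x_km, y_km, day) tuples.
    """
    count = 0
    for i in range(len(events)):
        xi, yi, ti = events[i]
        for j in range(i + 1, len(events)):
            xj, yj, tj = events[j]
            near = math.hypot(xi - xj, yi - yj) <= max_km
            if near and abs(ti - tj) <= max_days:
                count += 1
    return count

def knox_p_value(events, max_km, max_days, n_perm=999, seed=0):
    """Monte Carlo p-value: shuffle event times while keeping locations fixed."""
    rng = random.Random(seed)
    observed = knox_statistic(events, max_km, max_days)
    days = [t for _, _, t in events]
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(days)
        permuted = [(x, y, t) for (x, y, _), t in zip(events, days)]
        if knox_statistic(permuted, max_km, max_days) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical toy data: two tight clusters, far apart in both space and time.
events = [(0.1 * i, 0.0, i) for i in range(5)]
events += [(100 + 0.1 * i, 100.0, 50 + i) for i in range(5)]

observed, p_value = knox_p_value(events, max_km=1.0, max_days=5)
print(observed, p_value)
```

A small p-value indicates more space-time-close pairs than would be expected if event times were unrelated to event locations. The pairwise loop is quadratic in the number of events, so national-scale data would need spatial indexing or a different statistic.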

Columbia Data Science Institute (DSI) Scholars Program