Data For Good Project: African North Americans Database

This project is the first comprehensive examination of African North Americans who crossed one of the U.S.-Canada borders, in either direction, after the Underground Railroad, in the generation alive roughly 1865-1930. It analyzes census and other records to match individuals and families across the decades, despite changes or ambiguities in their names, ages, “color,” birthplace, or other details. The main difficulty in making these matches is that the census data for a person, even one whose identity is confirmed, does not stay uniform from decade to decade. Someone might be recorded under a nickname rather than their given name (Betsy for Elizabeth); women might marry or remarry and change their names; the racial classification recorded by a census taker might change (black to mulatto, or mulatto to white); someone might say they are from Canada even though they were born in Kentucky, depending on how the question was asked; and people who were estimating their ages might be recorded as 35 in 1870, 40 in 1880, and 50 in 1890, for example.
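The age and nickname problems described above can be made concrete with a short Python sketch. This is only an illustration under stated assumptions: the nickname table and the function names are hypothetical and do not come from the project's database.

```python
# Hypothetical nickname table for illustration only; a real project would
# need a much larger, period-appropriate list.
NICKNAMES = {"betsy": "elizabeth", "bess": "elizabeth", "lizzie": "elizabeth"}


def canonical_given_name(name: str) -> str:
    """Map a recorded given name to a canonical form, if a mapping is known."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def implied_birth_year(census_year: int, reported_age: int) -> int:
    """Birth year implied by the age reported in a given census year."""
    return census_year - reported_age


# The ages from the example above: 35 in 1870, 40 in 1880, 50 in 1890.
reports = [(1870, 35), (1880, 40), (1890, 50)]
birth_years = [implied_birth_year(year, age) for year, age in reports]

# How far apart the implied birth years are for what may be one person.
spread = max(birth_years) - min(birth_years)

print(birth_years, spread)
```

Under these assumptions, the three reports imply birth years of 1835, 1840, and 1840, so any automated comparison on birth year would need a tolerance of at least five years to keep this person's records together.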

To date, approximately 1,000 matches have been manually generated in a database of 50,000 records. Matches were made by looking first at the calculated birth year, then at the name given, location, place of birth, and sometimes household members. Finding an algorithmic way to predict and identify these matches will allow these records to be paired with other sources, such as government pension data, and will factor into research on migration patterns, specific families, and nodes, whether personal or geographic, that tie these African North American groups together.
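The manual procedure described above (birth year first, then name, location, and place of birth) could be approximated by a simple scoring function. The sketch below is one possible reading, not the project's actual method: the `Record` fields, the five-year tolerance, and the weights are all assumptions, and the sample records are invented for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical fields; the real database schema may differ.
    given_name: str
    surname: str
    census_year: int
    age: int
    location: str
    birthplace: str

    @property
    def birth_year(self) -> int:
        return self.census_year - self.age


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_score(a: Record, b: Record, year_tolerance: int = 5) -> float:
    """Score a candidate pair between 0.0 and 1.0 (weights are assumptions)."""
    # Step 1: the calculated birth year acts as a gate.
    if abs(a.birth_year - b.birth_year) > year_tolerance:
        return 0.0
    # Step 2: names, then location, then place of birth.
    score = 0.5 * name_similarity(a.surname, b.surname)
    score += 0.3 * name_similarity(a.given_name, b.given_name)
    score += 0.1 * float(a.location.lower() == b.location.lower())
    score += 0.1 * float(a.birthplace.lower() == b.birthplace.lower())
    return score


# Invented records for illustration only.
rec_1870 = Record("Elizabeth", "Johnson", 1870, 35, "Detroit", "Kentucky")
rec_1880 = Record("Betsy", "Johnson", 1880, 40, "Detroit", "Canada")
rec_other = Record("Elizabeth", "Johnson", 1880, 20, "Detroit", "Kentucky")

print(candidate_score(rec_1870, rec_1880))   # plausible pair, nonzero score
print(candidate_score(rec_1870, rec_other))  # birth years too far apart
```

Household members, which the manual process sometimes consulted, are left out of this sketch; they would need a comparison across groups of records rather than a single pair.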

Goals include:

Finding a way to predict or confirm matches in the data, likely using confidence factors based on name, birth year, family structure, and/or location. Those that have been confirmed manually can be used as a training set for an algorithm.
Expanding the dataset by scraping census data for the rest of the households of those in the database, either for those confirmed as matches or for the entire set. This will expand the reach of the database and allow for additional matches.
Lower-priority goals include additional visualizations and OCR conversions.
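As a rough illustration of the first goal, confidence factors could be estimated from the manually confirmed matches. The sketch below uses agreement weights in the style of the Fellegi-Sunter record linkage model; this is a swapped-in, standard technique, not something the project has committed to. The field names and the tiny training set are invented for illustration.

```python
import math

# Hypothetical comparison fields; each pair is reduced to agree/disagree flags.
FIELDS = ("name", "birth_year", "location")


def estimate_weights(labeled_pairs):
    """Estimate per-field (agree, disagree) log2 weights from labeled pairs.

    labeled_pairs: list of (agreement dict field -> bool, is_match bool).
    """
    weights = {}
    for field in FIELDS:
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        agree_m, total_m, agree_u, total_u = 1, 2, 1, 2
        for agreement, is_match in labeled_pairs:
            if is_match:
                total_m += 1
                agree_m += int(agreement[field])
            else:
                total_u += 1
                agree_u += int(agreement[field])
        m = agree_m / total_m  # P(field agrees | true match)
        u = agree_u / total_u  # P(field agrees | non-match)
        weights[field] = (math.log2(m / u), math.log2((1 - m) / (1 - u)))
    return weights


def match_weight(agreement, weights):
    """Sum the agree or disagree weight of every field for one pair."""
    return sum(
        weights[f][0] if agreement[f] else weights[f][1] for f in FIELDS
    )


# Invented training pairs: three confirmed matches, three confirmed non-matches.
training = [
    ({"name": True, "birth_year": True, "location": True}, True),
    ({"name": True, "birth_year": True, "location": False}, True),
    ({"name": False, "birth_year": True, "location": True}, True),
    ({"name": False, "birth_year": False, "location": True}, False),
    ({"name": True, "birth_year": False, "location": False}, False),
    ({"name": False, "birth_year": False, "location": False}, False),
]

weights = estimate_weights(training)
all_agree = {f: True for f in FIELDS}
all_disagree = {f: False for f in FIELDS}
print(match_weight(all_agree, weights), match_weight(all_disagree, weights))
```

A higher total weight suggests a more likely match. In practice the threshold for accepting a pair would have to be tuned against the confirmed matches, and non-matching pairs would need to be sampled from the database since only matches have been labeled so far.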

The Data For Good program is designed primarily for volunteers; however, one candidate will be selected as project coordinator and will receive a stipend via the Data For Good Scholars program. In addition to the responsibilities of a team member, the selected candidate will be responsible for keeping up-to-date notes on the project’s status, writing an end-of-period report, and attending bi-weekly meetings with a DFG program director. The project coordinator should strive to keep the group of volunteers in sync with the needs of the project owner.

Project Owner

Professor Adam Arenson, Manhattan College
Adam Arenson teaches the history and memory of North America and the global nineteenth century. His work has concentrated on the cultural and political history of slavery, the Civil War, and Reconstruction, as well as the development of cities, from California to the Yukon Territory, from the province of Ontario to St. Louis to El Paso.

Project timeline

Earliest starting date: 10/01/2019
End date: 12/20/2019
Number of hours per week of research expected during Fall 2019: ~6-8, ~10 for coordinator
Project is ongoing and will be reviewed for future directions at the end of the semester

Candidate requirements

Skill sets: Familiarity with the concepts of record linkage and de-duplication, in both unsupervised and supervised contexts, would be ideal; however, enough background in probability and maximum likelihood/machine learning to learn these topics is sufficient. The ability to program in a language such as R or Python that has record linkage software available is required.
Student eligibility: freshman, sophomore, junior, senior, master’s
For coordinator position, international students on F1 or J1 visa: eligible

Columbia Data Science Institute (DSI) Scholars Program