Data Science 2016 Final Project

Getting started and learning about our dataset

We started this project by looking for ways to work with an external collaborator and their dataset. Luckily, Olin’s professors have a breadth of data-rich research projects under way, and we were able to work with Professor Scott Hersey and the environmental data he collected in the low-income South African township of KwaDela. Because Scott works and teaches on campus, we were able to meet with him to understand the context of the data and to ask about the circumstances of data collection, odd points, and outliers. Before this project, we had to make up our own context for datasets and catch probable errors based on our limited knowledge of the subject field.

During our meeting with Scott, we identified tangible research questions for our project, learned how the data was expected to behave based on principles of chemistry and anthropology, and selected a handful of useful columns from the CSV file of more than 50 columns that Scott had originally provided. For example, we learned that homes in KwaDela usually have very poor insulation and rely on coal-burning stoves for warmth. These same stoves release large quantities of pollutants into the air and ultimately cause lung-related health problems for the inhabitants of KwaDela. By comparing trends in outdoor temperature, indoor temperature (which indicates stove usage), and pollutant concentrations, we should be able to see how closely stove use is connected to pollution and cold weather.

Challenge that was solved - Best Ways to Represent Data Over Time

In talking with Scott, we also identified a challenge within our dataset. Scott’s data is time course data, meaning that it consists of quantitative measurements taken repeatedly over time. As we learned through research and our conversation with Scott, with time course data it is not statistically sound to take one mean temperature or pollutant concentration for the whole winter and use only that measurement. Additionally, visualizations of the entire winter are noisy: they show season-wide trends but cannot accurately represent the daily trends that reflect the average life of someone living in KwaDela.

Season Wide Trends

Season-wide trends were initially a challenge because our dataset did not originally have a feature representing the time since data collection began. Plotting by ‘Day’ would show three data points for each day of the month (one data point per month). We needed to create a new column, DayCount, to represent the number of days that had passed since July 1st, approximately when data collection began that winter. Once this was resolved, we were able to create visualizations showing trends over the course of the entire winter. The code we used to get a running DayCount is below. The plot below is a graph of season-wide trends in indoor temperature (green), outdoor temperature (blue), and concentrations of particulate pollutant greater than 4 microns in diameter (red).
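One way to compute a running day count like this is sketched here under assumptions rather than copied from the project code. The function names `day_count` and `add_day_count`, the `Month` and `Day` keys, and the placeholder year are all hypothetical; the real column names in Scott’s file may differ.

```python
from datetime import date

# Assumption: days are counted from July 1st. The year is only a placeholder;
# it does not change the result for a July-to-September window.
START_MONTH, START_DAY = 7, 1
PLACEHOLDER_YEAR = 2001


def day_count(month, day, year=PLACEHOLDER_YEAR):
    """Return the whole days elapsed since July 1st of the same year."""
    start = date(year, START_MONTH, START_DAY)
    return (date(year, month, day) - start).days


def add_day_count(rows):
    """Return copies of the rows with a 'DayCount' key added.

    Each row is a dict with hypothetical integer 'Month' and 'Day' keys.
    """
    return [dict(row, DayCount=day_count(row["Month"], row["Day"])) for row in rows]


# The same day of the month in three different months now gets three
# distinct x-axis positions instead of stacking on one 'Day' value.
rows = [
    {"Month": 7, "Day": 15},
    {"Month": 8, "Day": 15},
    {"Month": 9, "Day": 15},
]
for row in add_day_count(rows):
    print(row["Month"], row["Day"], row["DayCount"])
```

Plotting against `DayCount` instead of `Day` places each measurement at a unique position along the season.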

Diurnal Plots

Once we had the season-wide data visualizations, we found that diurnal plots, which show how a measurement varies over an average day, are the best way to find the significant daily trends in the data. Right now, we’re looking at seasonal plots to understand overarching trends, then examining average days in general. Next, we want to look at averages for certain types of days (for example, days in the final, warmer third of the winter) and compare these plots. Here are some diurnal plots we have created. You can see that the spikes in pollutant concentration show up in the morning, when people in the household turn on their stoves at the beginning of the day, as well as in the early evening, when people are cooking dinner and the night starts getting colder. Comparing this to the plots of indoor and outdoor temperature, we see that at the times of day when pollutants spike, and presumably the stove is being turned on, indoor temperature starts decreasing at a much slower rate than outdoor temperature, and the difference between indoor and outdoor temperatures increases. Both these visualizations and the season-wide visualizations can be used to tell a qualitative story about our dataset.

Outdoor Temperature

Indoor Temperature

Pollutant Concentration
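The averaging behind a diurnal plot can be sketched as follows. This is a minimal illustration, assuming each reading has already been reduced to an (hour of day, value) pair; the function name `diurnal_profile` and the sample numbers are made up for the example and are not taken from the KwaDela data.

```python
from collections import defaultdict
from statistics import mean


def diurnal_profile(readings):
    """Average readings by hour of day.

    `readings` is an iterable of (hour, value) pairs, where hour is 0-23.
    Returns a dict mapping each observed hour to the mean value at that hour,
    with keys in ascending hour order.
    """
    by_hour = defaultdict(list)
    for hour, value in readings:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Made-up values, only to show the shape of the result.
sample = [(6, 10.0), (6, 14.0), (12, 4.0), (18, 20.0), (18, 22.0)]
profile = diurnal_profile(sample)
for hour, average in profile.items():
    print(hour, average)
```

Running the same function on a subset of days (for example, only the final third of the winter) would give a profile for that type of day.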

Statistical Analysis

We’ve begun researching statistical methods that are better suited to comparing time course data. Analysis of variance (ANOVA), for example, is used by biologists to compare the variance observed between groups with the variance observed within them for a particular variable. It can be thought of as a generalization of the t-test to more than two groups: it tests whether the group means differ, and it usually results in fewer false positives than running multiple two-sample t-tests. We plan on implementing ANOVA to compare different types of days (see above) and look for significant increases in stove use and pollutant creation.
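To make the idea concrete, here is a minimal sketch of the one-way ANOVA F statistic, written with only the Python standard library. The function name `one_way_anova_f` and the three groups of numbers are hypothetical stand-ins (for instance, daily pollutant averages from the early, middle, and late thirds of the winter), not results from our data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups.

    F = (between-group mean square) / (within-group mean square),
    with k - 1 and N - k degrees of freedom respectively.
    """
    k = len(groups)
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2
        for group, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2
        for group, m in zip(groups, group_means)
        for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up values for three hypothetical types of days.
early = [1.0, 2.0, 3.0]
middle = [2.0, 3.0, 4.0]
late = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([early, middle, late])
print(f_stat)
```

A large F value means the group means differ more than the spread within each group would suggest. Turning F into a p-value requires the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.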

Data Science 2016

Olin College of Engineering

Mackenzie Frackleton
Brenna Manning

Investigating KwaDela

Story 1: Project Update 4/5
