Data Science 2016 Final Project

Investigating KwaDela

Analyzing and Visualizing Thermal and Pollutant Data Over Time from the KwaDela Township in South Africa
Pre/Post Intervention

Poor air quality from aerosol pollution is the second leading cause of premature death in the world, with the burden of this pollution falling primarily on the poor. When governments seek to improve air quality, the easiest targets tend to be large point sources of pollutants (e.g. power generation facilities, refineries, and other industrial facilities). However, this strategy is ineffective in areas such as South Africa, where the vast majority (up to 80%) of particulate matter can be apportioned to domestic burning of coal and wood for heating and cooking in low-income areas.

To address this, a major chemical company in South Africa funded a multi-year pollution study in the small township of KwaDela during 2013-2014. The study consisted of one year (winter and summer) of air quality measurements to establish a pollution baseline. After year 1, low-cost modifications were made to township homes to improve their thermal efficiency, with the goal of reducing the quantity of fuel burned. In year 2, the same air quality measurements were repeated to determine the impact of these interventions on both indoor and ambient air quality.

For our Data Science final project, we have been working in collaboration with Olin Professor Scott Hersey, who has provided us with a dataset from this pollution study that describes indoor temperature, outdoor temperature, particulate matter pollutant concentration, and gas phase pollutant concentration during the winter of 2013 and the winter of 2014 in KwaDela, South Africa.

Google Earth: Bird's-eye view of KwaDela
Photo: Google Earth 2016

Final Analysis and Results

A Jupyter Notebook containing visualizations and a walkthrough of our analysis can be found here.

Our poster can be viewed here.

In our data, we observed clear patterns in every day. The late afternoon is the warmest part of the day outdoors, and as evening turns to night the interior of the house is warmed by the coal stove used to make a cooking fire. Pollutants spike in the morning and evening, indicating use of the stove for a combination of warmth and cooking. We know morning stove use is mostly for heating the home after a cold night (people often wake just after midnight to stoke the fire) and for heating water. In the evenings, residents of KwaDela use their stoves to cook dinner and keep the family warm through the night; the stove is used most heavily around dinnertime.

Observing differences between 2013 and 2014, before and after the intervention, we did see a decrease in particulate matter concentrations. The decrease in PM10 on cold winter days was statistically significant. We are continuing to investigate this further. All of this is discussed in more depth in our Jupyter notebook.

System Architecture

    Initial plots showing each data point over time were noisy and unclear. This is because many of the trends we were examining are periodic with a period of 24 hours (one day). By taking the average temperature/concentration of each day of the winter and plotting that, we were able to see trends over the course of the winter and remove the daily periodic fluctuations. However, we found that those daily fluctuations were more important to what we were trying to understand than season-wide trends.
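
    As a concrete illustration of this daily-averaging step, the sketch below uses pandas with a hypothetical time-indexed DataFrame and made-up file and column names ("outdoor_temp", "pm10"); the actual names in our dataset differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("kwadela_winter_2013.csv",
                 parse_dates=["timestamp"], index_col="timestamp")

# Average each calendar day to suppress the 24-hour oscillation and
# expose the season-long trend.
daily_means = df.resample("D").mean()

daily_means[["outdoor_temp", "pm10"]].plot(subplots=True)
plt.show()
```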

    Because the data we were analyzing tended to be periodic, influenced by daily patterns of both temperature and human behavior, we found that we got the best results by observing and analyzing diurnal profiles of the data. The pollution comes from coal burning within homes, which is driven by daily routines, so we see more meaningful trends when we examine the data on a daily basis (what an average day looks like). Diurnal profiles recover the periodic trends we smoothed out in the previous plots, restoring meaningful information that was lost in the smoothing.
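
    A minimal sketch of building a diurnal profile, continuing from the hypothetical DataFrame above: group the measurements by hour of day and average across the whole winter.

```python
# Collapse the whole winter into one "average day" by grouping every
# measurement by its hour of day (0-23) and averaging.
diurnal = df.groupby(df.index.hour).mean()
diurnal.index.name = "hour_of_day"

# Morning and evening burning should appear as spikes in PM10.
diurnal["pm10"].plot()
plt.show()
```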

    We were primarily interested in investigating the coldest days of the year, so we needed to separate the days into different temperature brackets. This was especially useful in comparing 2013 and 2014 data to analyze the impact of the intervention, because comparing an "average day" downplays the cold days on which large amounts of coal were burned. We set cutoff temperatures for these brackets so that 2013 and 2014 could be broken down into comparable chunks. To find these cutoffs, we computed the quartiles of daily outdoor temperature for each year and used the quartile cutoffs of the warmer year, so that both years have at least 25% of their days counted as cold days. 2014 was the warmer year, so cold days are defined as days in either 2013 or 2014 whose average temperature is colder than 25% of the days in 2014.
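
    A sketch of this bracketing step, assuming two time-indexed DataFrames named df_2013 and df_2014 (hypothetical names) structured like the one above:

```python
# Daily mean outdoor temperature for each winter.
t13 = df_2013["outdoor_temp"].resample("D").mean()
t14 = df_2014["outdoor_temp"].resample("D").mean()

# 2014 was the warmer winter, so its first quartile defines the shared
# "cold day" cutoff for both years.
cold_cutoff = t14.quantile(0.25)

cold_days_2013 = t13[t13 < cold_cutoff].index
cold_days_2014 = t14[t14 < cold_cutoff].index

# Keep only the raw measurements that fall on those cold days.
cold_2013 = df_2013[df_2013.index.normalize().isin(cold_days_2013)]
cold_2014 = df_2014[df_2014.index.normalize().isin(cold_days_2014)]
```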

    We used a Kruskal-Wallis test to compare the diurnal profile of outdoor temperature for cold days in 2013 with the diurnal profile of outdoor temperature for cold days in 2014, and found that they were not significantly different: we could not reject the null hypothesis at the 0.05 level. This means that if other features of the diurnal profiles turned out to be statistically significantly different, those differences probably weren't a result of outdoor temperature fluctuations.

    A Kruskal-Wallis H test is a one-way analysis of variance on ranks. We used this test to compare 2013 and 2014 to see how significantly the data differed between years (the test is available as scipy.stats' kruskal function: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.stats.kruskal.html). We examined how the outdoor temperatures varied (as a soft control), and then compared the years' pollutant concentrations to gauge how well the intervention reduced pollution. We did this again with diurnal plots to be sure. The Kruskal-Wallis test is well suited to comparing two or more series of one measurement taken at different time points to see whether there is significant variance between the groups. It is non-parametric and uses ranked means to measure how greatly the series differ. The same test can be applied to the 2013 vs. 2014 particulate matter data to test whether the differences in pollution before and after the intervention are statistically significant. The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the compared series are the same. We used a cutoff p-value of 0.05, so any test that returned a value lower than 0.05 indicated that the series were significantly different.
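
    As a hedged example of how such a comparison can be run with scipy (variable names carried over from the sketches above, not from our actual notebook):

```python
from scipy.stats import kruskal

# Hour-by-hour diurnal PM10 profiles for cold days before and after the
# intervention (cold_2013 / cold_2014 come from the bracketing sketch).
pm_2013 = cold_2013["pm10"].groupby(cold_2013.index.hour).mean()
pm_2014 = cold_2014["pm10"].groupby(cold_2014.index.hour).mean()

stat, p_value = kruskal(pm_2013.dropna(), pm_2014.dropna())

if p_value < 0.05:
    print(f"Significant difference between years (p = {p_value:.3f})")
else:
    print(f"Cannot reject the null hypothesis (p = {p_value:.3f})")
```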

    We ran more Kruskal-Wallis tests comparing particulate matter pollution data between 2013 and 2014 and found that some differences were statistically significant; other test results suggested differences, but we did not have enough evidence to reject the null hypothesis with 95% confidence.

    We used Spearman's rank correlation to track how well the trends in each pair of series correlated with one another. We chose a Spearman correlation because it doesn't rely on the assumptions that a Pearson correlation does. The pandas method DataFrame.corr let us choose Spearman, ignores NaNs (a must-have), and constructs a DataFrame of correlation coefficients that was useful for making matrix visualizations of the correlations. With these Spearman rank correlations, we had quantitative values to represent the correlations between the particulate matter pollutants, temperature trends, and gas-phase pollutants.
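
    A minimal sketch of that correlation step, again using the hypothetical column names from above ("so2" stands in for a gas-phase pollutant):

```python
import matplotlib.pyplot as plt

# Spearman rank correlation matrix; pandas drops NaN pairs automatically.
cols = ["outdoor_temp", "indoor_temp", "pm10", "so2"]
corr_matrix = df[cols].corr(method="spearman")
print(corr_matrix)

# One simple way to turn the matrix into a visualization.
plt.matshow(corr_matrix, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(cols)), cols, rotation=45)
plt.yticks(range(len(cols)), cols)
plt.colorbar()
plt.show()
```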

    The Kruskal tests and Spearman correlations were the most useful statistical methods for our project.