Investigating KwaDela

Analyzing and Visualizing Thermal and Pollutant Data Over Time from the KwaDela Township in South Africa
Pre/Post Intervention

Story 2: Project Update 4/22

More Informative & More Intuitive Plots

Plotting the difference between indoor and outdoor temperature over the course of a day shows more clearly when coal burning is influencing the temperature.
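The idea behind this plot can be sketched in a few lines of pandas. This is only an illustration on synthetic data, not our actual analysis code: the column names `indoor_temp` and `outdoor_temp`, the helper `diurnal_difference`, and the evening "heating" bump are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

def diurnal_difference(df, indoor_col="indoor_temp", outdoor_col="outdoor_temp"):
    """Mean indoor-minus-outdoor temperature for each hour of the day (0-23)."""
    diff = df[indoor_col] - df[outdoor_col]
    return diff.groupby(df.index.hour).mean()

# Synthetic stand-in data: two days of hourly readings (hypothetical values).
index = pd.date_range("2013-07-01", periods=48, freq=pd.Timedelta(hours=1))
hours = index.hour

# Outdoor temperature held constant; indoor is 2 degrees warmer, except
# during an assumed evening heating window (18:00-20:00) where it is 8 warmer.
outdoor = pd.Series(10.0, index=index)
indoor = pd.Series(np.where((hours >= 18) & (hours <= 20), 18.0, 12.0), index=index)
df = pd.DataFrame({"indoor_temp": indoor, "outdoor_temp": outdoor})

profile = diurnal_difference(df)
peak_hour = profile.idxmax()  # first hour of day with the largest indoor-outdoor gap
print(profile)
```

Plotting `profile` against hour of day gives the kind of curve described above: the hours where the gap widens are the hours where something other than the weather is warming the house.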

Separating Data By Temperature - Comparing Warm & Cold Days

We looked into several ways to divide our data so that we could compare and contrast the warmer and colder days of the winter: above average versus below average, standard deviations, and so on. We decided that it would be best to divide the days into three brackets using the quartiles that bound the interquartile range (IQR). The cold group is made up of the coldest 25% of days of the year, the middle group of the middle 50% of days, and the warm group of the warmest 25% of days. To create these groups, we separated the dataset by day, found the average outdoor temperature of each day, and used those daily averages to make the divisions. For this project, we are most interested in the colder days, since that is when people would be burning more coal to heat their homes.

When we compared 2013 to 2014, we used the same temperature cutoffs for both years so that the data would be more comparable. We chose the 2014 quartile cutoffs, since 2014 was the slightly warmer year, so that both data frames would have enough days in the temperature range below the cold cutoff.
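A minimal sketch of this bracketing, assuming hourly data with a datetime index and a column named `outdoor_temp` (both hypothetical, as are the helper names and the synthetic temperatures):

```python
import numpy as np
import pandas as pd

def daily_mean_temperature(df, temp_col="outdoor_temp"):
    """Average outdoor temperature for each calendar day."""
    return df[temp_col].groupby(df.index.normalize()).mean()

def quartile_cutoffs(daily_means):
    """25th and 75th percentiles of the daily means (the edges of the IQR)."""
    return daily_means.quantile(0.25), daily_means.quantile(0.75)

def label_days(daily_means, low, high):
    """Label each day 'cold', 'middle', or 'warm' using fixed cutoffs."""
    labels = pd.Series("middle", index=daily_means.index)
    labels[daily_means <= low] = "cold"
    labels[daily_means >= high] = "warm"
    return labels

def synthetic_year(start, day_temps):
    """Hypothetical hourly frame where each day has one constant temperature."""
    index = pd.date_range(start, periods=len(day_temps) * 24,
                          freq=pd.Timedelta(hours=1))
    return pd.DataFrame({"outdoor_temp": np.repeat(day_temps, 24)}, index=index)

df_2013 = synthetic_year("2013-06-01", np.arange(0.0, 9.0))   # daily means 0..8
df_2014 = synthetic_year("2014-06-01", np.arange(1.0, 10.0))  # daily means 1..9

daily_2013 = daily_mean_temperature(df_2013)
daily_2014 = daily_mean_temperature(df_2014)

# Cutoffs come from the warmer year (2014) and are reused for both years.
low, high = quartile_cutoffs(daily_2014)
labels_2013 = label_days(daily_2013, low, high)
labels_2014 = label_days(daily_2014, low, high)
print(labels_2013.value_counts())
```

Because the cutoffs are shared, the brackets no longer hold exactly 25/50/25 percent of the days in the other year. That is the trade-off for making the temperature ranges directly comparable.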

2013 vs 2014: Pre-Intervention Vs Post-Intervention

Between the winter of 2013 and the winter of 2014, low-cost modifications were made to township homes in KwaDela to improve thermal efficiency, with the goal of reducing the quantity of fuel burned. These modifications included painting walls darker and replacing roofs so that homes would be better insulated. Comparing data from the winter of 2013 to the winter of 2014 can show us the impact of this intervention.

To see if the years are similar enough to compare pollution data, we compare the outdoor temperature information from 2013 and 2014.

As a whole, the temperatures of each winter were close enough that we can compare the two years. This is especially clear in the diurnal plot of the temperature of an average day. While peak temperatures in 2014 were slightly warmer on average than in 2013, they are very close and follow the same trend.
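One way to build the average-day comparison described above is to group readings by hour of day and by year. The sketch below uses synthetic temperatures (a sinusoid, with 2014 shifted up by half a degree) purely as a stand-in; the helper name `diurnal_by_year` and all values are assumptions, not our measurements.

```python
import numpy as np
import pandas as pd

def diurnal_by_year(series):
    """Mean value per hour of day, with one column per calendar year."""
    grouped = series.groupby([series.index.hour, series.index.year]).mean()
    return grouped.unstack()  # rows: hour of day, columns: year

def synthetic_winter(start, offset):
    """Three days of hypothetical hourly temperatures with a daily cycle."""
    index = pd.date_range(start, periods=72, freq=pd.Timedelta(hours=1))
    hours = index.hour.to_numpy()
    values = 10.0 + 5.0 * np.sin((hours - 8) * np.pi / 12) + offset
    return pd.Series(values, index=index)

temps = pd.concat([synthetic_winter("2013-07-01", 0.0),
                   synthetic_winter("2014-07-01", 0.5)])

profile = diurnal_by_year(temps)
# Largest hour-by-hour gap between the two average days.
max_gap = (profile[2014] - profile[2013]).abs().max()
print(profile.head())
print(max_gap)
```

Plotting the two columns of `profile` on the same axes gives the overlaid diurnal curves, and `max_gap` is a quick numeric check on how far apart they get.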

Outdoor PM10 Concentration Over Average Day 2013 vs 2014

Outdoor PM2.5 Concentration Over Average Day 2013 vs 2014

Indoor PM4 Concentration Over Average Day 2013 vs 2014


Comparing the overall diurnal plots, we see that the peaks of outdoor PM10 and PM2.5 concentrations were lower after the intervention than before it, but the peaks of indoor PM4 concentration were higher. This is troubling, as it indicates that the interventions to make homes more thermally efficient may have also increased the concentration of PM4 inside the homes of the people those interventions were intended to help.

Our next step was to break the 2013 and 2014 dataframes into temperature brackets using the cutoffs determined above. We care most about the coldest days of the year, since that is when people tend to burn the most coal.

The plot below shows the difference between indoor and outdoor temperatures for each temperature bracket of each year.

Exploring just the cold days:
Medium temperature and warm days:

Statistics Decisions

We originally chose a one-way ANOVA to test whether population means differ (e.g., is average PM4 pollution on cold days significantly different from average PM4 on warm days?). We planned to use the SciPy function stats.f_oneway to compare quantities between years and quartiles, testing the null hypothesis that the means of all the populations are the same. In theory, a low p-value would have been evidence against that null hypothesis, suggesting that at least one population mean differs from the others. When we started implementing f_oneway, however, we started getting NaN p-values. Some additional research showed that one-way ANOVAs are parametric: they assume that each sample comes from a normally distributed population and that the population standard deviations of the tested groups are all equal. Our data is not normally distributed, nor does it have consistent standard deviations across the tested groups. So we looked for a nonparametric alternative.
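The two ANOVA assumptions mentioned above can be checked directly with SciPy. The sketch below uses `stats.shapiro` (normality) and `stats.levene` (equal variances) on synthetic, skewed data that stands in for PM4 readings; the group names and distributions are made up. It also shows one possible source of NaN p-values: a missing reading in the input. That is an assumption about what could have happened, not a confirmed diagnosis of our data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical skewed "concentration" samples with different spreads.
cold = rng.lognormal(mean=4.0, sigma=0.8, size=200)
warm = rng.lognormal(mean=3.5, sigma=0.4, size=200)

# Shapiro-Wilk: a low p-value is evidence against normality.
_, shapiro_p_cold = stats.shapiro(cold)
_, shapiro_p_warm = stats.shapiro(warm)

# Levene: a low p-value is evidence against equal variances.
_, levene_p = stats.levene(cold, warm)

# A single NaN in the input is enough to make f_oneway return NaN.
cold_with_gap = np.append(cold, np.nan)
_, p_with_nan = stats.f_oneway(cold_with_gap, warm)

# Dropping missing values first gives a finite p-value again.
cold_clean = cold_with_gap[~np.isnan(cold_with_gap)]
_, p_clean = stats.f_oneway(cold_clean, warm)

print(shapiro_p_cold, shapiro_p_warm, levene_p)
print(p_with_nan, p_clean)
```

Even with the NaN removed, the low Shapiro-Wilk and Levene p-values on data like this are a reason to distrust the ANOVA result and move to a rank-based test.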

We found the Kruskal-Wallis H-test, also called the one-way ANOVA on ranks, which is considered a nonparametric version of the one-way ANOVA. The test is available as the SciPy function stats.kruskal, so it was readily accessible for our project. It’s known as a one-way ANOVA on ranks because it works with the ranks of the data (lowest to highest) rather than the raw values, so it does not rely on the distributional assumptions required by f_oneway. At a high level, kruskal pools the given data into one group (size n) and rank-orders all of it from lowest (rank 1) to highest (rank n). The ranks are then returned to their original groups, and kruskal compares the mean rank of each group. The null hypothesis is that the mean ranks of the groups are not substantially different, so if the groups are similar, kruskal returns a high p-value. If at least one group differs significantly from the others, we get a low p-value. So far we’ve had a decent amount of success in applying kruskal to our data.
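To make the ranking steps above concrete, here is a small sketch that computes the H statistic by hand and compares it with stats.kruskal. The three groups are invented numbers with no tied values, and the hand-rolled version omits the tie correction that SciPy applies, so the two agree only when there are no ties.

```python
import numpy as np
from scipy import stats

def kruskal_h_by_hand(*groups):
    """Kruskal-Wallis H statistic, without the correction for tied values."""
    pooled = np.concatenate(groups)      # step 1: pool everything into one group
    ranks = stats.rankdata(pooled)       # step 2: rank from 1 (lowest) to n (highest)
    n_total = len(pooled)

    rank_term = 0.0
    start = 0
    for group in groups:                 # step 3: hand ranks back to their groups
        group_ranks = ranks[start:start + len(group)]
        rank_term += group_ranks.sum() ** 2 / len(group)
        start += len(group)

    # Step 4: H measures how far the groups' rank sums are from what
    # we would expect if all groups came from the same distribution.
    return 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Hypothetical PM4-like readings for three temperature brackets (no ties).
cold = [27.0, 35.5, 41.2, 58.9, 63.1]
middle = [18.4, 22.7, 30.3, 33.8, 39.6]
warm = [9.5, 12.1, 15.8, 20.2, 25.0]

h_manual = kruskal_h_by_hand(cold, middle, warm)
# Under the null hypothesis, H is approximately chi-squared with k - 1 degrees of freedom.
p_manual = stats.chi2.sf(h_manual, df=2)

h_scipy, p_scipy = stats.kruskal(cold, middle, warm)
print(h_manual, p_manual)
print(h_scipy, p_scipy)
```

In practice we would just call stats.kruskal; the manual version is only here to show that nothing beyond pooling, ranking, and comparing rank sums is going on.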

We’re also looking for a way to compare trends in measurements with different units (temperature and PM4 concentration, for example). We initially looked at Spearman’s correlation, but realized that it doesn’t take time into account as a dimension of the data. We’re currently looking at serial correlation, as discussed in Allen Downey’s ThinkStats2, to account for time as a factor in the correlation alongside the other variables.
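As a rough sketch of the difference, the code below computes a Spearman correlation between two series at a chosen time lag, plus a serial correlation (a series correlated with a shifted copy of itself) using pandas' `Series.autocorr`. The helper names, the two-hour lag, and the synthetic series are all assumptions for illustration; this is one possible way to bring time into the comparison, not a settled method.

```python
import numpy as np
import pandas as pd
from scipy import stats

def lagged_spearman(x, y, lag):
    """Spearman correlation between x at time t and y at time t + lag."""
    shifted = y.shift(-lag)                        # align y[t + lag] with x[t]
    paired = pd.concat([x, shifted], axis=1).dropna()
    rho, _ = stats.spearmanr(paired.iloc[:, 0], paired.iloc[:, 1])
    return rho

def serial_correlation(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    return series.autocorr(lag=lag)

# Four days of hypothetical hourly data with a clean 24-hour cycle.
index = pd.date_range("2014-07-01", periods=96, freq=pd.Timedelta(hours=1))
temperature = pd.Series(15.0 + 8.0 * np.sin(2 * np.pi * np.arange(96) / 24),
                        index=index)

# Made-up response: "PM4" mirrors temperature with a two-hour delay.
pm4 = (-temperature).shift(2)

rho_no_lag = lagged_spearman(temperature, pm4, lag=0)
rho_lag_2 = lagged_spearman(temperature, pm4, lag=2)
daily_cycle = serial_correlation(temperature, lag=24)
print(rho_no_lag, rho_lag_2, daily_cycle)
```

Spearman works on ranks, so the different units of the two series do not matter. Scanning over several lags shows at what delay the relationship is strongest, which a single unlagged correlation would miss.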

Initial Statistical Analysis Results

These results show that while the initial temperature data from the 2013 and 2014 dataframes are statistically different, the outdoor temperature data for 2013 cold days and 2014 cold days are very strongly correlated. In terms of diurnal temperature, the cold days of the two years are essentially the same.