HW 07 - Exploring poverty with PCA

Individual assignment

Due: Oct 23 at 10:05am

This data is based on real-life survey data collected by the World Bank under the umbrella of the Living Standards Measurement Study (LSMS) project. This large initiative has collected survey data from developing countries since the 1980s, using comparable survey instruments, to better understand questions of health, education, poverty, employment and other indicators of well-being around the world.

This exercise will guide you through the analysis of the well-being and livelihoods of individuals in 4 countries: Bulgaria, Tajikistan, Tanzania and Panama.

Packages

In this assignment we will work with the tidyverse, GGally, and knitr packages. These packages should already be installed in your project, and you can load them with the following:

library(tidyverse)
library(GGally)
library(knitr)

Creating a Poverty Index

In developing countries, income is not always a good measure of well-being. In agricultural societies, income is tied to harvest, and thus is sensitive to the season in which the survey is administered. Furthermore, income often does not capture in-kind revenue, and is a particularly poor measure for households that practice subsistence agriculture.

For this reason, development experts will often use indices of poverty based on household assets, to better understand well-being levels. Household asset ownership does not vary seasonally, and is not tied to payment-type, making it a more stable and often preferred measure of poverty.

The descriptive statistics of household assets showed variability in asset holdings across countries. How can we turn this into an index to determine the poverty level of households in our dataset?

Principal Component Analysis

Principal Component Analysis (PCA) is one of the main ways we can turn this data into an index. It allows us to reduce multidimensional data (in this case, all of our assets) to a single variable. The first component is a linear combination of our original asset variables, chosen so that it explains the largest share of the variation in households' asset holdings. This allows us to identify trends and compare across households and countries.

  1. Compute the correlation matrix for the assets. As a stretch goal, round the correlations to three decimal places, and try to print the table neatly with the kable() function.

Here is an example of what your correlation table can look like, but you should be creating your own.

               stove  refrigerator      tv    bike  motorbike  computer     car   video  stereo     sew
stove          1.000         0.302   0.374   0.004      0.058     0.180   0.167   0.247   0.046   0.111
refrigerator   0.302         1.000   0.538  -0.060      0.034     0.326   0.364   0.376  -0.017   0.161
tv             0.374         0.538   1.000  -0.017      0.054     0.274   0.242   0.371  -0.007   0.098
bike           0.004        -0.060  -0.017   1.000      0.122     0.061   0.021   0.062   0.235   0.037
motorbike      0.058         0.034   0.054   0.122      1.000     0.084   0.077   0.089   0.125   0.076
computer       0.180         0.326   0.274   0.061      0.084     1.000   0.376   0.345   0.137   0.121
car            0.167         0.364   0.242   0.021      0.077     0.376   1.000   0.288   0.036   0.146
video          0.247         0.376   0.371   0.062      0.089     0.345   0.288   1.000   0.277   0.106
stereo         0.046        -0.017  -0.007   0.235      0.125     0.137   0.036   0.277   1.000  -0.014
sew            0.111         0.161   0.098   0.037      0.076     0.121   0.146   0.106  -0.014   1.000
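
As a rough illustration of the mechanics only, here is a minimal sketch that computes and rounds a correlation matrix in R. It runs on simulated data: the data frame `assets_toy`, the latent `wealth` variable, and the thresholds are all made up for this example and are not the homework data. Base R's `cor()` and `round()` do the work, and `knitr::kable()` is only called if knitr happens to be installed.

```r
# Toy stand-in for the homework data: 200 households, 4 binary asset indicators.
# `assets_toy` and every value in it are hypothetical, for illustration only.
set.seed(1)
n <- 200
wealth <- rnorm(n)  # made-up latent wealth that drives asset ownership
assets_toy <- data.frame(
  stove        = as.integer(wealth + rnorm(n) > 0),
  refrigerator = as.integer(wealth + rnorm(n) > 0.5),
  tv           = as.integer(wealth + rnorm(n) > 0.2),
  bike         = as.integer(rnorm(n) > 0)  # unrelated to wealth by construction
)

# Pairwise Pearson correlations, rounded to three decimal places.
cor_mat <- round(cor(assets_toy), 3)
print(cor_mat)

# Optional pretty-printing, only if the knitr package is available.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(cor_mat))
}
```

With real data you would pass your own data frame of asset columns to `cor()` in place of `assets_toy`.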
  1. Run the Principal Component Analysis using the prcomp() function, make the scree plot, and interpret it.

Notice that the goal here is not minimizing the loss of relevant information in the dataset, but rather finding a concise measure of wealth/poverty.
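
The following sketch shows one way the PCA and scree plot step might look on simulated data. The data frame `assets_toy` is hypothetical, and the choice to standardize the variables (`scale. = TRUE`) is an assumption made for this example; the assignment does not say which scaling to use. `prcomp()` and `screeplot()` are both part of base R's stats package.

```r
# Toy stand-in for the homework data (names and values are hypothetical).
set.seed(2)
n <- 300
wealth <- rnorm(n)  # made-up latent wealth that drives asset ownership
assets_toy <- data.frame(
  stove        = as.integer(wealth + rnorm(n) > 0),
  refrigerator = as.integer(wealth + rnorm(n) > 0.5),
  tv           = as.integer(wealth + rnorm(n) > 0.2),
  bike         = as.integer(rnorm(n) > 0)
)

# Assumption: center and standardize the assets before PCA.
pca_toy <- prcomp(assets_toy, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component.
prop_var <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
print(round(prop_var, 3))

# Scree plot: component variances in decreasing order.
screeplot(pca_toy, type = "lines", main = "Scree plot (toy data)")
```

A sharp drop after the first component in the scree plot suggests that a single index captures much of the shared variation, which is the property this exercise relies on.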

Remember that the first principal component \(y_1\) is a linear combination of the original variables \((x_1, \ldots, x_{10})\): \[ y_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{j=1}^{10} a_{1,j} x_j \] To interpret the results, we need to look at the coefficients \(a_{1,j}\) with \(j = 1, 2, \ldots, 10\).

##        stove refrigerator           tv         bike    motorbike 
##  -0.31889692  -0.54314008  -0.53022169  -0.01383517  -0.02382546 
##     computer          car        video       stereo          sew 
##  -0.21839976  -0.29290376  -0.40271529  -0.09493231  -0.12933171
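
To connect the formula for \(y_1\) with what `prcomp()` returns, this sketch rebuilds the first component score by hand on simulated data and checks it against `prcomp()`'s own scores. The data frame `assets_toy` is hypothetical. One detail to keep in mind: with `center = TRUE`, the linear combination is applied to the mean-centered variables rather than the raw ones.

```r
# Toy stand-in for the homework data (names and values are hypothetical).
set.seed(3)
n <- 250
wealth <- rnorm(n)  # made-up latent wealth that drives asset ownership
assets_toy <- data.frame(
  stove        = as.integer(wealth + rnorm(n) > 0),
  refrigerator = as.integer(wealth + rnorm(n) > 0.5),
  tv           = as.integer(wealth + rnorm(n) > 0.2)
)

# PCA on centered, unscaled data (scaling choice is an assumption here).
pca_toy <- prcomp(assets_toy, center = TRUE, scale. = FALSE)

# Coefficients a_{1,j} of the first component (first column of the rotation).
a1 <- pca_toy$rotation[, 1]
print(a1)

# Manual score: sum over j of a_{1,j} times the centered x_j.
x_centered <- scale(assets_toy, center = TRUE, scale = FALSE)
y1_manual <- as.numeric(x_centered %*% a1)

# Should agree with the scores stored by prcomp() up to rounding error.
print(all.equal(y1_manual, unname(pca_toy$x[, 1])))
```

The coefficient vector has unit length, so the relative sizes of the entries in `a1` indicate how much weight each asset receives in the index.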
  1. How can we interpret the first component of our PCA, looking at the values listed above? (Hint: Is it a weighted average or a weighted difference?) Which assets are the most important in the first component of our PCA, that is, which ones explain the most variance? (Hint: Check which coefficients are larger than 0.15 in absolute value.) As a stretch goal, think about how our PCA might help us explain wealth levels across countries. With this in mind, why might the variable motorbike seem to be less important (in terms of contribution to total variance) than, for example, refrigerator? Go back to your asset-holding descriptive statistics from class to help answer this question.

The next goal is to find the drivers of poverty and wealth in different countries. First, we want the index expressed in terms of wealth: the higher the value, the richer the household. Because the coefficients are all negative, we need to multiply the principal component by \(-1\).
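
The sign of a principal component is arbitrary, so whether a flip is needed depends on what `prcomp()` happens to return. The sketch below, on simulated data, flips the first component only when its coefficients sum to a negative number. The names `assets_toy`, `sign_flip`, and `wealth_index` are hypothetical, and the sum-based rule is an assumption chosen for this example; in the assignment, where all coefficients are negative, it reduces to multiplying by \(-1\).

```r
# Toy stand-in for the homework data (names and values are hypothetical).
set.seed(4)
n <- 300
wealth <- rnorm(n)  # made-up latent wealth that drives asset ownership
assets_toy <- data.frame(
  stove        = as.integer(wealth + rnorm(n) > 0),
  refrigerator = as.integer(wealth + rnorm(n) > 0.5),
  tv           = as.integer(wealth + rnorm(n) > 0.2)
)

pca_toy <- prcomp(assets_toy, center = TRUE, scale. = FALSE)
loadings1 <- pca_toy$rotation[, 1]

# Flip the sign when the coefficients point in the "fewer assets" direction.
sign_flip <- if (sum(loadings1) < 0) -1 else 1
wealth_index <- sign_flip * pca_toy$x[, 1]

# Sanity check: the index should rise with the number of assets owned.
asset_count <- rowSums(assets_toy)
print(cor(wealth_index, asset_count))
```

Checking the index against a simple asset count is a quick way to confirm that higher values really do correspond to richer households.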

  1. Interpret the following histogram and boxplot by country. Which are the richest countries and which are the poorest? Which country has the most inequality?
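
For reference, a histogram and a by-country boxplot of a wealth index could be produced along the following lines. This is a sketch on simulated data using base R graphics: the data frame `toy`, the country labels "A", "B", and "C", and the index values are all invented, and the median and interquartile range are just one possible way to summarize level and spread.

```r
# Simulated wealth index for three hypothetical countries (not the real data).
set.seed(5)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 100),
  wealth_index = c(rnorm(100, mean = -1, sd = 0.5),
                   rnorm(100, mean = 0,  sd = 1.5),
                   rnorm(100, mean = 1,  sd = 0.8))
)

# Overall distribution of the index.
hist(toy$wealth_index, breaks = 30,
     main = "Wealth index (toy data)", xlab = "Wealth index")

# One box per country: compare levels (medians) and spread (box heights).
boxplot(wealth_index ~ country, data = toy,
        main = "Wealth index by country (toy data)",
        xlab = "Country", ylab = "Wealth index")

# Numeric summaries to back up the visual reading.
med <- tapply(toy$wealth_index, toy$country, median)
spread <- tapply(toy$wealth_index, toy$country, IQR)
print(round(med, 2))
print(round(spread, 2))
```

A higher median points to a richer country on this index, while a wider box or larger interquartile range points to more dispersion in wealth within the country.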
