The data for this exercise come from real household surveys collected by the World Bank under the umbrella of the Living Standards Measurement Study (LSMS) project. Since the 1980s, this large initiative has collected survey data from developing countries, using comparable survey instruments, to better understand health, education, poverty, employment and other indicators of well-being around the world.
This exercise will guide you through the analysis of the well-being and livelihoods of individuals in 4 countries: Bulgaria, Tajikistan, Tanzania and Panama.
In this assignment we will work with the tidyverse, GGally, and knitr packages. These packages should already be installed in your project, and you can load them with the following:
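The loading code did not survive in this copy of the assignment. A minimal sketch, assuming the three packages named above are installed under their usual names, would be:

```r
# Load the three packages named above (assumes they are already installed).
library(tidyverse)  # data wrangling and plotting
library(GGally)     # extensions to ggplot2, e.g. pairwise plots
library(knitr)      # kable() for formatted tables
```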
In developing countries, income is not always a good measure of well-being. In agricultural societies, income is tied to the harvest, and is therefore sensitive to the season in which the survey is administered. Furthermore, income often does not capture in-kind revenue, and is a particularly poor measure for households that practice subsistence agriculture.
For this reason, development experts will often use indices of poverty based on household assets, to better understand well-being levels. Household asset ownership does not vary seasonally, and is not tied to payment-type, making it a more stable and often preferred measure of poverty.
The descriptive statistics of household assets showed variability in asset holdings across countries. How can we turn this into an index to determine the poverty level of households in our dataset?
Principal Component Analysis (PCA) is one of the main ways we can turn this data into an index. It allows us to reduce multidimensional data (in this case, all of our assets) to a single variable. The first principal component is a linear combination of our original asset variables, chosen so that it explains the largest possible share of the variation in households' asset holdings. Each household's score on this component can serve as an index, which allows us to identify trends and compare across households and countries.
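To make the idea concrete, here is a minimal sketch in base R on simulated data. The `assets` data frame, its four columns and the `latent` variable are invented for illustration and are not the assignment data. Standardizing the variables with `scale. = TRUE` is also an assumption, not necessarily the choice made in this assignment.

```r
# Toy data: 500 simulated households with four binary asset indicators.
# 'latent' is a made-up wealth variable used only to generate correlated assets.
set.seed(1)
n <- 500
latent <- rnorm(n)
asset_names <- c("stove", "refrigerator", "tv", "car")
assets <- as.data.frame(
  sapply(asset_names, function(a) as.integer(latent + rnorm(n) > 0))
)

# PCA on the centered and standardized asset indicators.
pca <- prcomp(assets, center = TRUE, scale. = TRUE)

# Each household's score on the first principal component is the asset index.
asset_index <- pca$x[, 1]
summary(asset_index)
```

Every household gets one number, so ten (or here four) asset variables collapse into a single column that can be compared across households.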
kable() function. Here is an example of what your correlation table can look like, but you should create your own.
| | stove | refrigerator | tv | bike | motorbike | computer | car | video | stereo | sew |
---|---|---|---|---|---|---|---|---|---|---|
stove | 1.000 | 0.302 | 0.374 | 0.004 | 0.058 | 0.180 | 0.167 | 0.247 | 0.046 | 0.111 |
refrigerator | 0.302 | 1.000 | 0.538 | -0.060 | 0.034 | 0.326 | 0.364 | 0.376 | -0.017 | 0.161 |
tv | 0.374 | 0.538 | 1.000 | -0.017 | 0.054 | 0.274 | 0.242 | 0.371 | -0.007 | 0.098 |
bike | 0.004 | -0.060 | -0.017 | 1.000 | 0.122 | 0.061 | 0.021 | 0.062 | 0.235 | 0.037 |
motorbike | 0.058 | 0.034 | 0.054 | 0.122 | 1.000 | 0.084 | 0.077 | 0.089 | 0.125 | 0.076 |
computer | 0.180 | 0.326 | 0.274 | 0.061 | 0.084 | 1.000 | 0.376 | 0.345 | 0.137 | 0.121 |
car | 0.167 | 0.364 | 0.242 | 0.021 | 0.077 | 0.376 | 1.000 | 0.288 | 0.036 | 0.146 |
video | 0.247 | 0.376 | 0.371 | 0.062 | 0.089 | 0.345 | 0.288 | 1.000 | 0.277 | 0.106 |
stereo | 0.046 | -0.017 | -0.007 | 0.235 | 0.125 | 0.137 | 0.036 | 0.277 | 1.000 | -0.014 |
sew | 0.111 | 0.161 | 0.098 | 0.037 | 0.076 | 0.121 | 0.146 | 0.106 | -0.014 | 1.000 |
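A table like the one above can be produced roughly as follows. This is a sketch on simulated data: the `assets` data frame and its columns are invented stand-ins for your own asset variables. The `requireNamespace()` guard is only there so the snippet also runs where knitr is missing; inside your R Markdown document, calling `kable(cor_table)` directly should be enough.

```r
# Toy stand-in for the asset data (simulated, not the assignment data).
set.seed(2)
n <- 300
latent <- rnorm(n)
asset_names <- c("stove", "refrigerator", "tv", "bike")
assets <- as.data.frame(
  sapply(asset_names, function(a) as.integer(latent + rnorm(n) > 0))
)

# Pairwise Pearson correlations, rounded to three decimals as in the example.
cor_table <- round(cor(assets), 3)

# Format the matrix as a table; fall back to plain printing without knitr.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(cor_table))
} else {
  print(cor_table)
}
```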
prcomp() function, make the scree plot, and interpret it. Note that the goal here is not to minimize the loss of relevant information in the dataset, but rather to find a concise measure of wealth/poverty.
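As a rough illustration of this step, the sketch below runs `prcomp()` on simulated data and draws a scree plot with the base `screeplot()` function. The `assets` data frame is invented for the example, and standardizing the variables (`scale. = TRUE`) is an assumption rather than a requirement of the assignment.

```r
# Toy stand-in for the asset data (simulated, not the assignment data).
set.seed(3)
n <- 400
latent <- rnorm(n)
asset_names <- c("stove", "refrigerator", "tv", "car")
assets <- as.data.frame(
  sapply(asset_names, function(a) as.integer(latent + rnorm(n) > 0))
)

pca <- prcomp(assets, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)

# Scree plot: component variances in decreasing order.
screeplot(pca, type = "lines", main = "Scree plot (simulated assets)")
```

When reading the plot, look at how much of the variance the first component captures relative to the rest. Because only one index is needed here, a sharp drop after the first component is reassuring but not strictly required.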
Remember that the first principal component \(y_1\) is a linear combination of the original variables \((x_1,\ldots,x_{10})\): \[\begin{align*} y_1 = \mathbf{a_1}^T \mathbf{x} = \sum_{j=1}^{10} a_{1,j} x_j \end{align*}\] In order to interpret the results we need to look at the coefficients \(a_{1,j}\) with \(j = 1,2,\ldots,10\).
## stove refrigerator tv bike motorbike
## -0.31889692 -0.54314008 -0.53022169 -0.01383517 -0.02382546
## computer car video stereo sew
## -0.21839976 -0.29290376 -0.40271529 -0.09493231 -0.12933171
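The formula for \(y_1\) can be checked numerically. The sketch below uses simulated data (the `assets` data frame is invented for illustration) and confirms that the scores returned by `prcomp()` equal the centered and scaled data multiplied by the first column of `rotation`, which holds the coefficients \(a_{1,j}\).

```r
# Toy stand-in for the asset data (simulated, not the assignment data).
set.seed(4)
n <- 200
latent <- rnorm(n)
asset_names <- c("stove", "refrigerator", "tv", "car")
assets <- as.data.frame(
  sapply(asset_names, function(a) as.integer(latent + rnorm(n) > 0))
)

pca <- prcomp(assets, center = TRUE, scale. = TRUE)

# Coefficients a_{1,j} of the first principal component.
a1 <- pca$rotation[, 1]

# Apply the same centering and scaling that prcomp() used, then form
# y_1 = sum_j a_{1,j} x_j for every household.
X <- scale(assets, center = pca$center, scale = pca$scale)
y1_manual <- as.vector(X %*% a1)

# The manual scores match the ones stored by prcomp().
all.equal(y1_manual, unname(pca$x[, 1]))

# The coefficient vector has unit length.
sum(a1^2)
```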
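Next, we want to find the drivers of poverty and wealth in different countries. First, we want the index expressed in terms of wealth: the higher, the richer. Because the coefficients are all negative, a higher score on the first principal component currently means fewer assets, so we multiply the principal component by \(-1\). The sign of a principal component is arbitrary, so this flip does not change the information the index carries.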
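A minimal sketch of the sign flip, again on simulated data: the `assets` data frame, the `country` labels and the rule "flip when the coefficients sum to a negative number" are all assumptions made for illustration. In the assignment the coefficients are all negative, so a plain multiplication by -1 does the same thing.

```r
# Toy stand-in for the asset data, with two hypothetical country labels.
set.seed(5)
n <- 400
latent <- rnorm(n)
asset_names <- c("stove", "refrigerator", "tv", "car")
assets <- as.data.frame(
  sapply(asset_names, function(a) as.integer(latent + rnorm(n) > 0))
)
country <- sample(c("country_A", "country_B"), n, replace = TRUE)

pca <- prcomp(assets, center = TRUE, scale. = TRUE)
a1 <- pca$rotation[, 1]

# Orient the index so that owning more assets gives a higher score.
orientation <- if (sum(a1) < 0) -1 else 1
wealth_index <- orientation * pca$x[, 1]

# Sanity check: the index should rise with the number of assets owned.
cor(wealth_index, rowSums(assets))

# Average wealth index by country.
aggregate(wealth_index ~ country,
          data = data.frame(wealth_index, country),
          FUN = mean)
```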
The World Bank, Living Standards Measurement Study LSMS (2007). Bulgaria Multitopic Household Survey 2007 [BGR_2007_MTHS_v01_M]. Retrieved from http://microdata.worldbank.org/index.php/catalog/2273/study-description
The World Bank, Living Standards Measurement Study - Integrated Surveys on Agriculture (2010-2011). Tanzania - National Panel Survey 2010-2011, Wave 2 [TZA_2010_NPS-R2_v01_M]. Retrieved from http://microdata.worldbank.org/index.php/catalog/1050
The World Bank, Living Standards Measurement Study LSMS (2008). Panama - Encuesta de Niveles de Vida 2008 [PAN_2008_ENV_v01_M]. Retrieved from http://microdata.worldbank.org/index.php/catalog/70
Tajikistan Statistical Agency, Living Standards Measurement Study LSMS (2009). Tajikistan - Living Standards Survey 2009 [TJK_2009_TLSS_v01_M]. Retrieved from http://microdata.worldbank.org/index.php/catalog/73