Independent vs. dependent

Independent variable:

Variable that is hypothesized to influence another variable
Usually graphed on the x-axis

Dependent variable:

Variable that we think changes in response to the independent variable
Usually graphed on the y-axis

What we define as independent or dependent variable may depend on the context of the study.

Difference between association and causation

association: values of two variables are linked
causation: changes in the values of one variable directly produce changes in the other (there is not a confounding factor responsible for the change in both variables)
Association is only circumstantial evidence for causation; it does not prove it.

Sampling

population: a whole set of things, people, etc. that we want to describe or discover something about

The sample is a subset of the population that we are able to observe, experiment on, etc.; we can use it to infer facts about the population. We use a sample because it would be difficult or impossible to make observations about each member of the population.

A simple random sample is a special kind of sample. Each member of the population has an equal chance to be part of the sample; whether or not a member becomes a part of the sample is determined randomly. Once I have chosen one or more members to be part of the sample, that doesn't affect the probability of choosing other members of the population (except that no member of the population can be chosen twice). See text, pg. 340.
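
As a minimal sketch of drawing a simple random sample, the Python example below uses random.sample, which selects without replacement so that no member can be chosen twice. The population of 500 ID numbers, the sample size of 25, and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers 1..500 (an assumption for illustration).
population = list(range(1, 501))

# A fixed seed makes the draw repeatable; drop it for a fresh sample each run.
rng = random.Random(42)

# random.sample draws without replacement, so every member has the same
# chance of selection and nobody appears twice.
sample = rng.sample(population, k=25)

print(sorted(sample))
```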

Descriptive vs. inferential statistics

descriptive statistics: Methods to summarize the information that we know about a population or sample

inferential statistics: Methods to allow us to use what we know about the sample and try to generalize this information for the whole population. We use the rules of probability to help us make sound inferences.

How and why we summarize data

When you have a large amount of data, it's important to be able to get a "general picture" of the population quickly and easily.

Graphical displays include stem-and-leaf plots, bar charts, and histograms
Summary statistics include measures of central tendency, variability, and association between variables

Bar graphs

One way to represent nominal or ordinal data is to use a bar graph

A bar is drawn for each category
Height of bar represents number of members of that class
Sometimes width of the bar is set and the area of the bar represents the relative frequency for that category
Total area of the bars equals N (number of data points) or 100%
Bars should not be drawn as though they touch each other
If the data are ordinal, arrange the bars in the natural order of the categories
If the data are nominal, it may be best to arrange the bars so the categories are in alphabetical order
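
The counting behind a bar graph can be sketched in a few lines of Python. The survey responses and the category order below are invented for illustration; the bars are printed as rows of # characters, one row per category, in the order given for ordinal data.

```python
from collections import Counter

# Hypothetical ordinal survey responses (made up for illustration).
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# For ordinal data, list the categories in their natural order.
order = ["disagree", "neutral", "agree"]

counts = Counter(responses)  # number of members in each category
n = len(responses)           # N, the total number of data points

for category in order:
    count = counts[category]
    # Bar length shows the count; the percentage is the relative frequency.
    print(f"{category:<10}{'#' * count} {count} ({count / n:.0%})")
```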

Histograms

A histogram is often a good way to represent quantitative data.

Height of bar represents number of members of that class
Very similar to bar graph, but bars on the histogram are drawn contiguously
If the histogram is drawn such that the area of all the bars together is 1 (100%), then it can be useful to visualize a continuous curve that interpolates between the midpoints of the bars.
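
To make the two ways of scaling a histogram concrete, here is a rough Python sketch. The scores, the class width of 10, and the helper name bin_counts are assumptions chosen for illustration. It first counts the members of each class, then rescales the heights so that the total area of the bars is 1.

```python
def bin_counts(data, low, width, n_bins):
    """Count the values in each class [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        i = int((x - low) // width)
        if 0 <= i < n_bins:  # values outside the classes are ignored
            counts[i] += 1
    return counts

# Hypothetical exam scores (made up for illustration).
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 79, 82, 85, 91]
width = 10

counts = bin_counts(scores, low=50, width=width, n_bins=5)

# Rescale so that bar area (height * width) is the relative frequency;
# the areas of all the bars together then add up to 1.
heights = [c / (len(scores) * width) for c in counts]
total_area = sum(h * width for h in heights)

print(counts)
print(total_area)
```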

Numerical summaries of data

Three characteristics to summarize:

Central tendency: Where is the "center" of the data?
Variability: How spread out are the data points?
Shape: If we make a histogram of the data, how will it be shaped?

If we understand these 3 characteristics, we can get a good understanding of how the data are distributed.

Ways to measure central tendency

Three commonly used statistics are:

Mean (average): Think of it like the balance point of your histogram
Median (50th percentile): What is the middle score?
Mode: Which value occurs most often in the data set?
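
As a quick illustration, all three statistics can be computed with Python's standard statistics module. The data set below is made up for the example.

```python
import statistics

# Hypothetical data set (made up for illustration).
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of data / number of data points
median = statistics.median(data)  # middle value; with an even count, the
                                  # average of the two middle values
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```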

Mean

Sum of data / total number of data points
All the data is taken into account
Influenced greatly by extreme values
Mathematically very useful - use it to calculate variance, standard deviation, etc.

Median

Middle value of the data set
Not all values are taken into account (except insofar as the general ordering of the values goes)
Not influenced by extreme values
Particularly useful when a distribution is skewed
Not as useful mathematically as the mean
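
The different sensitivity of the mean and the median to extreme values can be seen in a small Python sketch. The incomes below (in thousands) are invented for illustration; a single extreme value is added and both statistics are recomputed.

```python
import statistics

# Hypothetical incomes in thousands (made up for illustration).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]  # add one extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean is pulled strongly toward the extreme value;
# the median barely moves.
print(mean_before, mean_after)
print(median_before, median_after)
```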

Mode

Most frequently occurring value in the data set
Can be thought of as a typical point in the data set
Not all values are taken into account
In some cases, not as intuitively "central" as other measures
Not as useful mathematically as the mean

Measuring dispersion

How much variation is there in the data? How are the data spread out across the different possible values?

Range
Inter-quartile range
Variance
Standard deviation (SD)

Range

Distance between the highest data value and lowest data value
Easy to calculate!
Easily affected by extreme values
Doesn't use all the data

Inter-quartile range (IQR)

Quartiles are the data points that divide an ordered data set into quarters. So, the 1st quartile (Q1) is the data value that separates the bottom fourth of the data from the remainder; the 3rd quartile (Q3) separates the top fourth of the data from the remainder.

Distance between 3rd quartile and 1st quartile
Not as influenced by extreme values
Tells where the middle half of the data is located
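
The range and the IQR can both be sketched in Python as follows. The data set is made up, and the quartiles come from statistics.quantiles (Python 3.8 or later) with its default method. Several conventions for computing quartiles exist, so a hand calculation that follows a textbook's rule may give slightly different values.

```python
import statistics

# Hypothetical data set (made up for illustration).
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

# Range: distance between the highest and lowest data values.
data_range = max(data) - min(data)

# Cut points that divide the ordered data into quarters.
# The default method is one of several quartile conventions.
q1, q2, q3 = statistics.quantiles(data, n=4)

# IQR: distance between the 3rd and 1st quartiles.
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```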

Standard deviation (SD)

Measures how far away the numbers in a list are from their average
More difficult to calculate by hand than range or IQR
Very useful mathematically
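
The hand calculation can be sketched step by step in Python. This example assumes the SD is the root-mean-square deviation from the average (dividing by the number of data points); some texts divide by one less than the number of data points when working with a sample, and these notes do not say which convention is intended. The data set is made up.

```python
import math
import statistics

# Hypothetical data set (made up for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the average.
mean = sum(data) / len(data)

# Step 2: squared deviation of each number from the average.
squared_devs = [(x - mean) ** 2 for x in data]

# Step 3: variance = average squared deviation (dividing by N, an assumption).
variance = sum(squared_devs) / len(data)

# Step 4: SD = square root of the variance.
sd = math.sqrt(variance)

# The standard library's pstdev uses the same divide-by-N convention.
sd_check = statistics.pstdev(data)

print(mean, variance, sd)
```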

If a distribution is mound-shaped and approximately symmetric, then we can use the following approximations:

About 68% of the data points will fall within 1 SD on either side of the average
About 95% of the data points will fall within 2 SD on either side of the average
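
These approximations can be checked informally on simulated data. The sketch below generates 10,000 values from a normal (mound-shaped, symmetric) distribution and counts how many fall within 1 and 2 SDs of the average. The center of 100, the spread of 15, the seed, and the helper name fraction_within are all arbitrary choices for illustration.

```python
import random
import statistics

# Simulated mound-shaped data (parameters are arbitrary assumptions).
rng = random.Random(0)
data = [rng.gauss(100, 15) for _ in range(10_000)]

avg = statistics.mean(data)
sd = statistics.pstdev(data)

def fraction_within(k):
    """Fraction of the data points within k SDs of the average."""
    inside = [x for x in data if abs(x - avg) <= k * sd]
    return len(inside) / len(data)

within_1 = fraction_within(1)
within_2 = fraction_within(2)

print(f"within 1 SD: {within_1:.1%}")
print(f"within 2 SD: {within_2:.1%}")
```

The printed fractions should land close to 68% and 95%, though not exactly, since the data are random.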
