Independent vs. dependent
Independent variable:
- Variable that is hypothesized to influence another variable
- Usually graphed on the x-axis
Dependent variable:
- Variable that we think changes in response to the independent variable
- Usually graphed on the y-axis
What we define as independent or dependent variable may depend on the
context of the study.
Difference between association and causation
- association: values of two variables are linked
- causation: changes in one variable directly produce changes in
the other (there is not a confounding factor responsible for the
change in both variables)
- Association is only circumstantial evidence for causation; it
does not prove it.
Sampling
population: a whole set of things, people, etc. that we want to
describe or discover something about
The sample is a subset of the population that we are able to
observe, experiment on, etc.; we can use it to infer facts about
the population. We use a sample because it would be
difficult or impossible to make observations about each member of the
population.
A simple random sample is a special kind of sample. Each member
of the population has an equal chance to be part of the sample; whether
or not a member becomes a part of the sample is determined randomly.
Once I have chosen one or more members to be part of the sample, that
doesn't affect the probability of choosing other members of the
population (except that no member of the population can be chosen
twice). See text, pg. 340.
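The description above can be sketched in a few lines of Python. This is only an illustration: the population of ID numbers, the function name simple_random_sample, and the seed are assumptions made for the example, not part of the text.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    # sample() draws without replacement, so no member can be chosen twice
    return rng.sample(list(population), n)


population = list(range(1, 101))  # hypothetical population: ID numbers 1..100
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Because the draw is without replacement, the sample always contains 10 different members of the population.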
Descriptive vs. inferential statistics
descriptive statistics: Methods to summarize the information
that we know about a population or sample
inferential statistics: Methods to allow us to use what we know
about the sample and try to generalize this information for the whole
population. We use the rules of probability to help us make sound
inferences.
How and why we summarize data
When you have a large amount of data, it's important to be able to
get a "general picture" of the population quickly and easily.
- Graphical displays include stem-and-leaf plots, bar charts, and
histograms
- Summary statistics include measures of central tendency,
variability, and association between variables
Bar graphs
One way to represent nominal or ordinal data is to use a bar graph
- A bar is drawn for each category
- Height of bar represents number of members of that class
- Sometimes width of the bar is set and the area of the bar
represents the relative frequency for that category
- Total area of the bars equals N (number of data points) or
100%
- Bars should not be drawn as though they touch each other
- If the data are ordinal, arrange the bars in the order of the categories
- If the data are nominal, it may be best to arrange the bars so
categories are in alphabetical order
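As a rough sketch of the counting behind a bar graph, the Python below tallies one count per category and prints a text bar for each. The survey responses and the helper name category_counts are hypothetical, chosen only for illustration.

```python
from collections import Counter


def category_counts(values, order=None):
    """Return (category, count) pairs, one per bar.

    For ordinal data pass the category order explicitly; otherwise the
    categories are listed alphabetically, as suggested for nominal data.
    """
    counts = Counter(values)
    categories = order if order is not None else sorted(counts)
    return [(category, counts[category]) for category in categories]


# Hypothetical ordinal survey responses
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]
scale = ["disagree", "neutral", "agree"]

bars = category_counts(responses, order=scale)
for category, count in bars:
    # Bar length stands in for bar height: one '#' per member of the category
    print(f"{category:<10} {'#' * count}")
```

The counts add up to the number of data points, which matches the statement that the total over all bars equals N.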
Histograms
A histogram is often a good way to represent quantitative data.
- Height of bar represents number of members of that class
- Very similar to bar graph, but bars on the histogram are
drawn contiguously
- If the histogram is drawn such that the area of all the bars
together is 1 (100%), then it can be useful to visualize a
continuous curve that interpolates between the midpoints of the bars.
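A minimal sketch of how histogram bars can be computed, assuming equal-width classes. The scores, the class limits, and the function name histogram are made up for the example. Dividing each count by (N x class width) gives bar heights whose total area is 1.

```python
def histogram(data, low, high, n_bins):
    """Count the data in n_bins equal-width classes between low and high.

    Returns the counts, the heights that make the total area 1, and the
    class width.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        if not low <= x <= high:
            raise ValueError(f"{x} lies outside [{low}, {high}]")
        # A value equal to `high` is placed in the last class
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    densities = [count / (len(data) * width) for count in counts]
    return counts, densities, width


# Hypothetical exam scores
scores = [52, 58, 61, 64, 67, 70, 71, 75, 78, 83, 88, 95]
counts, densities, width = histogram(scores, 50, 100, 5)

print(counts)                                      # [2, 3, 4, 2, 1]
print(sum(height * width for height in densities))  # total area, approximately 1
```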
Numerical summaries of data
Three characteristics to summarize:
- Central tendency: Where is the "center" of the data?
- Variability: How spread out are the data points?
- Shape: If we make a histogram of the data, how will it be shaped?
If we understand these 3 characteristics, we can get a good
understanding of how the data are distributed.
Ways to measure central tendency
Three commonly used statistics are:
- Mean (average): Think of it like the balance point of your histogram
- Median (50th percentile): What is the middle score?
- Mode: Which value occurs most often in the data set?
Mean
- Sum of data / total number of data points
- All the data is taken into account
- Influenced greatly by extreme values
- Mathematically very useful - use it to calculate variance,
standard deviation, etc.
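The calculation, and the pull of a single extreme value, can be illustrated with a short Python sketch. The income figures are hypothetical.

```python
def mean(data):
    """Sum of the data divided by the number of data points."""
    return sum(data) / len(data)


incomes = [30, 32, 35, 38, 40]   # hypothetical incomes, in thousands
with_outlier = incomes + [400]   # the same data plus one extreme value

print(mean(incomes))        # 35.0
print(mean(with_outlier))   # about 95.8
```

One extreme value moves the mean from 35 to roughly 96, well above five of the six data points.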
Median
- Middle value of the data set
- Not all values are taken into account (except insofar as the
general ordering of the values goes)
- Not influenced by extreme values
- Particularly useful when a distribution is skewed
- Not as useful mathematically as the mean
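A small Python sketch of the median follows. It assumes the common convention of averaging the two middle values when the number of data points is even, which these notes do not spell out. The income figures are hypothetical.

```python
def median(data):
    """Middle value of the ordered data.

    Assumption: with an even number of points, average the two middle values.
    """
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


incomes = [30, 32, 35, 38, 40]   # hypothetical incomes, in thousands
with_outlier = incomes + [400]   # the same data plus one extreme value

print(median(incomes))        # 35
print(median(with_outlier))   # 36.5
```

Adding the extreme value shifts the median only from 35 to 36.5, which is why the median is preferred for skewed distributions.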
Mode
- Most frequently occurring value in the data set
- Can be thought of as a typical point in the data set
- Not all values are taken into account
- In some cases, not as intuitively "central" as other measures
- Not as useful mathematically as the mean
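A sketch of finding the mode in Python. Returning every tied value is a design choice made for this example, since a data set can have more than one most frequent value. The data are hypothetical.

```python
from collections import Counter


def modes(data):
    """Return all values that occur most often (there may be ties)."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 3, 4, 9, 9]))   # [9]
print(modes([1, 1, 2, 2, 3]))      # [1, 2]
```

In the first list the mode is 9 even though most of the values are much smaller, which shows how the mode can fail to look "central".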
Measuring dispersion
How much variation is there in the data? How are the data spread out
across the different possible values?
- Range
- Inter-quartile range
- Variance
- Standard deviation (SD)
Range
- Distance between the highest data value and lowest data value
- Easy to calculate!
- Easily affected by extreme values
- Doesn't use all the data
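The range takes one line of Python. The sketch below uses hypothetical data to show how strongly one extreme value affects it.

```python
def data_range(data):
    """Distance between the highest and the lowest data value."""
    return max(data) - min(data)


incomes = [30, 32, 35, 38, 40]   # hypothetical incomes, in thousands
with_outlier = incomes + [400]   # the same data plus one extreme value

print(data_range(incomes))        # 10
print(data_range(with_outlier))   # 370
```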
Inter-quartile range (IQR)
Quartiles are the data points that divide an ordered data set
into quarters. So, the 1st quartile (Q1) is the data value that
separates the bottom fourth of the data from the remainder; the 3rd
quartile (Q3) separates the top fourth of the data from the remainder.
- Distance between 3rd quartile and 1st quartile
- Not as influenced by extreme values
- Tells where the middle half of the data is located
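Textbooks and software use several slightly different rules for locating quartiles. The Python sketch below assumes one simple convention: Q1 is the median of the lower half of the ordered data and Q3 is the median of the upper half, leaving out the middle value when the count is odd. The function names and data are hypothetical.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]   # skips the middle value when the count is odd
    return _median(lower), _median(upper)


def iqr(data):
    """Distance between the 3rd quartile and the 1st quartile."""
    q1, q3 = quartiles(data)
    return q3 - q1


values = [1, 3, 4, 6, 7, 8, 10, 12]
print(quartiles(values))   # (3.5, 9.0)
print(iqr(values))         # 5.5

# Replacing the largest value with an extreme one leaves the IQR unchanged
print(iqr([1, 3, 4, 6, 7, 8, 10, 120]))   # 5.5
```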
Standard deviation (SD)
- Measures how far away the numbers in a list are from their average
- More difficult to calculate by hand than range or IQR
- Very useful mathematically
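A sketch of the calculation in Python. It assumes the version of the formulas that divides by the number of data points N; some texts divide by N - 1 when working with a sample, and these notes do not say which is intended. The data are hypothetical.

```python
import math


def variance(data):
    """Average squared distance from the mean (dividing by N, an assumption)."""
    average = sum(data) / len(data)
    return sum((x - average) ** 2 for x in data) / len(data)


def standard_deviation(data):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(data))


values = [2, 4, 4, 4, 5, 5, 7, 9]   # the mean is 5
print(variance(values))             # 4.0
print(standard_deviation(values))   # 2.0
```

Unlike the range and the IQR, every data point enters the calculation.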
If a distribution is mound-shaped and approximately symmetric, then we
can use the following approximations:
- About 68% of the data points will fall within 1 SD on either
side of the average
- About 95% of the data points will fall within 2 SD on either
side of the average
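These approximations can be checked on any data set by counting the points that lie within k SDs of the average. The Python sketch below does this for a made-up, mound-shaped, symmetric data set (the values 0 to 10 repeated according to binomial coefficients). The data set and the function name fraction_within are assumptions for illustration, and the SD divides by N.

```python
import math


def fraction_within(data, k):
    """Fraction of data points lying within k SDs of the average."""
    average = sum(data) / len(data)
    sd = math.sqrt(sum((x - average) ** 2 for x in data) / len(data))
    inside = sum(1 for x in data if abs(x - average) <= k * sd)
    return inside / len(data)


# Made-up mound-shaped data: the value v appears C(10, v) times (1024 points)
data = [v for v in range(11) for _ in range(math.comb(10, v))]

within_1 = fraction_within(data, 1)
within_2 = fraction_within(data, 2)
print(f"within 1 SD: {within_1:.1%}")
print(f"within 2 SD: {within_2:.1%}")
```

For this data set the fractions come out near 66% and 98%, close to the rule of thumb but not identical, which is why the percentages above are stated as "about".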