3.5 The Back and Forth Between Data and the DGP (Continued)

Examining Variation Across Samples
Let’s take a simulated sample of 12 die rolls (sampling with replacement from the six possible outcomes) and save it in a vector called sample1.
Then let’s save that vector as a variable in a data frame called dicerolls.
Add some code to create a density histogram with six bins so we can examine the distribution of our sample.
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
# The six possible outcomes of a die roll
model.pop <- c(1,2,3,4,5,6)
# custom_seed() is assumed to be a seeding helper supplied by the course environment;
# elsewhere, base R's set.seed(4) can fix the seed
custom_seed(4)
# Simulate 12 die rolls (sampling with replacement) and save them in a data frame
sample1 <- resample(model.pop, 12)
dicerolls <- data.frame(sample1)
# Create a relative frequency (density) histogram
# Remember to put in bins as an argument
gf_histogram(..density..~sample1, data=dicerolls, bins=6)
Your random sample will look different from ours (after all, random samples differ from one another) but here is one of the random samples we generated.
(Just a reminder: we might fancy up our histograms with colors, labels, and the like. Feel free to add those bells and whistles yourself as well.) Notice that this doesn’t look very much like the uniform distribution we would expect based on our knowledge of the DGP!
Let’s take a larger sample—24 die rolls. Modify the code below to simulate 24 die rolls, save it as a vector called sample2, and then put this vector in a data frame. Will the distribution of this sample be perfectly uniform?
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
model.pop <- c(1,2,3,4,5,6)
custom_seed(5)
# Starting code: modify this from 12 dice rolls to 24 dice rolls
sample2 <- resample(model.pop, 12)
dicerolls <- data.frame(sample2)
# This will create a histogram
gf_histogram(..density..~sample2, data=dicerolls, color="darkgray", fill="springgreen", bins=6)
# Solution: the same code with 24 dice rolls
sample2 <- resample(model.pop, 24)
dicerolls <- data.frame(sample2)
# This will create a histogram
gf_histogram(..density..~sample2, data=dicerolls, color="darkgray", fill="springgreen", bins=6)
Notice that our randomly generated sample distribution is also not perfectly uniform. In fact, this doesn’t look very uniform to our eyes at all! You might even be asking yourself, is this really a random process?
Simulate a few more samples of 24 die rolls (we will call them sample3, sample4, and sample5) and plot them on histograms. This time, add a density plot on top of your histograms (using gf_density()). Do any of these look exactly uniform?
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
custom_seed(7)
model.pop <- c(1,2,3,4,5,6)
# create samples #3, #4, #5 of 24 dice rolls
sample3 <- resample(model.pop, 24)
sample4 <- resample(model.pop, 24)
sample5 <- resample(model.pop, 24)
# add these new samples to the dicerolls data frame
dicerolls <- data.frame(sample3, sample4, sample5)
# this will create a histogram of your sample3
# with a density plot added on top
gf_histogram(..density..~ sample3, data = dicerolls, color="darkgray", fill="springgreen", bins=6) %>%
  gf_density()
# create histograms of sample4 and sample5 with density plots
gf_histogram(..density..~sample4, data=dicerolls, color="darkgray", fill="springgreen", bins=6) %>%
  gf_density()
gf_histogram(..density..~sample5, data=dicerolls, color="darkgray", fill="springgreen", bins=6) %>%
  gf_density()
Wow, these look crazy and they certainly do not look uniform. They also don’t even look similar to each other. What is going on here?
The fact is, these samples were, indeed, generated by a random process: simulated die rolls. And we assure you, at least here, there is no error in the programming. The important point to understand is that sample distributions can vary, even a lot, from the underlying population distribution from which they are drawn. This is what we call sampling variation. Small samples will not necessarily look like the population they are drawn from, even if they are randomly drawn.
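To put rough numbers on sampling variation, here is a minimal sketch that draws five samples of 24 rolls and tabulates the proportion of each face in each sample. It uses base R's sample() with replace = TRUE as a stand-in for resample(), and set.seed() rather than the course's seeding helper. The seed value and the names n_samples, n_rolls, and props are our own choices for illustration.

```r
# A minimal sketch using base R only (no course packages needed)
set.seed(20)      # arbitrary seed, chosen only so the result is repeatable
model.pop <- c(1, 2, 3, 4, 5, 6)
n_samples <- 5    # hypothetical: how many samples to compare
n_rolls <- 24     # rolls per sample

# Each column holds the proportion of 1s, 2s, ..., 6s in one sample
props <- sapply(1:n_samples, function(i) {
  rolls <- sample(model.pop, n_rolls, replace = TRUE)
  # factor() with fixed levels keeps faces that never came up (count 0)
  as.vector(table(factor(rolls, levels = model.pop))) / n_rolls
})
rownames(props) <- model.pop
colnames(props) <- paste0("sample", 1:n_samples)

# A perfectly uniform sample would show 1/6 (about 0.17) in every cell
round(props, 2)
```

Reading across a row shows how much the share of a single face moves from one sample to the next, even though every sample comes from the same DGP.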
Large Samples Versus Small Samples
Even though small samples will often look really different from the population they were drawn from, larger samples usually will not.
For example, if we ramp up the number of die rolls to 10,000, we will see a more uniform distribution. Run the code below to see.
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
model.pop <- c(1,2,3,4,5,6)
custom_seed(7)
# create a sample with 10000 rolls of a die
largesample <- resample(model.pop, 10000)
# add largesample to the dicerolls data frame
dicerolls <- data.frame(largesample)
# this will create a histogram of your largesample
gf_histogram(..density..~ largesample, data = dicerolls, color="darkgray", fill="springgreen", bins=6)
Wow, a large sample looks a lot more like what we expect the distribution of die rolls to look like! This is also why we make a distinction between the DGP and the population. When you run a DGP (such as resampling from the numbers 1 to 6) for a long time (like 10,000 times), you end up with a distribution that we can start to call a population.
Even though small samples can be unreliable and sometimes misleading, large samples tend to look like the parent population they were drawn from. This is true even when you have a weird population. For example, we made up a simulated population that kind of has a “W” shape. We put it in a variable called W.pop (it’s in the weird data frame). Here’s a density histogram of the population.
gf_histogram(..density.. ~ W.pop, data = weird, color = "black", bins = 6)
Now try drawing a relatively small sample (n = 24) from W.pop (with replacement) and save it as smallsample. Put your small sample in the weird data frame. Let’s observe whether the small sample looks like this weird W-shape.
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
model.pop <- c(1,2,3,4,5,6)
custom_seed(10)
# The simulated W-shaped population
W.pop <- c(rep(1,5), 2, rep(3,10), rep(4,10), 5, rep(6,5))
weird <- data.frame(W.pop)
# Create a sample that draws 24 times from W.pop
smallsample <- resample(W.pop, 24)
# Add smallsample to the weird data frame
weird <- data.frame(smallsample)
# This will create a histogram of your smallsample
gf_histogram(..density..~ smallsample, data = weird, color="darkgray", fill="mistyrose", bins=6)
Now try drawing a large sample (n = 10,000) and save it as largesample. Will this one look more like the weird population it came from than the small sample?
require(mosaic)
require(tidyverse)
require(supernova)
require(Lock5withR)
model.pop <- c(1,2,3,4,5,6)
custom_seed(7)
# The simulated W-shaped population
W.pop <- c(rep(1,5), 2, rep(3,10), rep(4,10), 5, rep(6,5))
# create a sample that draws 10000 times from W.pop
largesample <- resample(W.pop, 10000)
# add largesample to the weird data frame
weird <- data.frame(largesample)
# this will create a histogram of your largesample
gf_histogram(..density..~ largesample, data = weird, color="darkgray", fill="mistyrose", bins=6)
That looks very close to the W-shape of the simulated population we started off with.
This pattern that large samples tend to look like the populations they came from is so reliable in statistics that it is referred to as a law: the law of large numbers. This law says that, in the long run, by either collecting lots of data or doing a study many times, we will get closer to understanding the true population and DGP.
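One way to watch this law at work is to measure how far a sample's proportions stray from the 1/6 we expect for each face, at several sample sizes. The sketch below is only an illustration: it uses base R's sample() in place of resample(), and the seed, the sample sizes, and the name max_gap are our own assumptions.

```r
# A minimal sketch using base R only
set.seed(1)       # arbitrary seed, for repeatability
model.pop <- 1:6
sizes <- c(24, 240, 2400, 24000)   # hypothetical sample sizes to compare

# For each sample size, find the largest distance between an
# observed face proportion and the expected proportion of 1/6
max_gap <- sapply(sizes, function(n) {
  rolls <- sample(model.pop, n, replace = TRUE)
  p <- as.vector(table(factor(rolls, levels = model.pop))) / n
  max(abs(p - 1/6))
})

data.frame(sizes, max_gap)
```

The gap will not necessarily shrink at every step, because each sample is random, but it tends to get smaller as the sample size grows.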
Lessons Learned
In the case of die rolls (or even in the weird W-shaped population), we know what the true DGP looks like because we made it up ourselves. Then we generated random samples. What we learned is that smaller samples will vary, very few of them looking exactly like the process that we know generated them. But a very large sample will look more like the population.
In fact, it is unusual in real research to know what the true DGP looks like. Also, we rarely have the opportunity to collect truly large samples! In the typical case, we only have access to relatively small sample distributions, and usually only one sample distribution. The realities of sampling variation, which you have now seen up close, make our job very challenging. It means we cannot just look at a sample distribution and infer, with confidence, what the parent population and DGP look like.
On the other hand, if we think we have a good guess as to what the DGP looks like, we shouldn’t be too quick to give up our theory just because the sample distribution doesn’t appear to support it. In the case of die rolls, this is easy advice to take: even if something really unlikely happens in a sample (e.g., 24 die rolls in a row all come up 5), we will probably stick with our theory! After all, rolling a 5 twenty-four times in a row can still happen by random chance, although it is very unlikely.
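To get a feel for just how unlikely that is, we can work out the probability directly. This is a small sketch that assumes a fair die and independent rolls, so each roll comes up 5 with probability 1/6; the name p_all_fives is ours.

```r
# Probability that 24 independent rolls of a fair die all come up 5
p_all_fives <- (1/6)^24
p_all_fives
# roughly 2.1e-19
```

That is about 2 chances in 10 quintillion: possible, but far too rare to expect in any real run of rolls.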
But when we are dealing with real-life variables, variables for which the true DGP is fuzzy and unknown, it is more difficult to know if we should dismiss a sample as mere sampling variation just because the sample is not consistent with our theory. In these cases, it is important that we have a way to look at our sample distribution and ask: how reasonable is it to assume that the data we have could have been generated by our current theory of the DGP?
Simulations can be really helpful in this regard. By looking at what a variety of random samples look like, we can get a sense as to whether our particular sample looks like natural variation, or if, instead, it sticks out as wildly different. If the latter, we may need to revise our understanding of the DGP.
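As a rough sketch of this idea, suppose we had a sample of 24 rolls in which one face came up 10 times (a made-up observation, used here only for illustration). We can simulate many samples of 24 rolls from our theorized DGP, a fair die, and see how often the most common face appears at least that many times. The code uses base R's sample() in place of resample(); the seed, the number of simulations, and the names n_sims, observed_max, and sim_max are our own assumptions.

```r
# A minimal sketch using base R only
set.seed(3)          # arbitrary seed, for repeatability
model.pop <- 1:6
n_rolls <- 24
n_sims <- 5000       # hypothetical number of simulated samples
observed_max <- 10   # hypothetical: count of the most common face in our sample

# For each simulated sample, record the count of its most common face
sim_max <- replicate(n_sims, {
  rolls <- sample(model.pop, n_rolls, replace = TRUE)
  max(table(rolls))
})

# Proportion of simulated samples at least as lopsided as the observed one
mean(sim_max >= observed_max)
```

If that proportion is tiny, our sample sticks out from what the theorized DGP usually produces. If it is not, the sample looks like ordinary sampling variation.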