9.6 Reasoning With Sampling Distributions
Now that you have an idea of what sampling distributions are, let’s find out how we can use them to reason about data. There are two moves we can make: the first, which we will call reasoning forward, is easier to understand. The second, reasoning backward, is harder to understand, but much, much more useful. Fortunately, understanding the first move will help you understand the second one.
Both reasoning forward and reasoning backward rely critically on the use of “IF…”. Using the word IF is the most important part of using sampling distributions to evaluate estimates and models.
Reasoning Forward
For example, we can ask, if the mean thumb length of the population is 60.1 mm, how likely is it that we would randomly draw a sample of n=157 with a mean higher than 61 mm? (Where did we get 61 mm? We were just curious about it, that’s all. You could wonder about any mean.)
To answer this question, we need a sampling distribution. Why? Because our question is about the likelihood of getting a mean greater than 61 mm, and so the likelihood needs to be assessed in reference to the sampling distribution of means.
We can simulate a sampling distribution, as we did before, assuming a normal distribution with \(\mu=60.1\) and \(\sigma=8.73\). We’ll use this R code to generate a sampling distribution of 10,000 means of randomly selected samples of n=157, and then plot the sampling distribution in a histogram:
SDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))
gf_histogram(~ mean, data=SDoM, bins=100, fill = "blue")
[Figure L_Ch9_Reasoning_1: histogram of the 10,000 simulated sample means]
Look at the sampling distribution, and now go back to the question we were trying to answer: If the population has a mean of 60.1 and a standard deviation of 8.73, what is the likelihood of getting a random sample of n=157 with a mean greater than 61?
It’s easier to visualize the answer to this question if you shade, in a different color, all of the simulated samples with means greater than or equal to 61. R provides an easy way to do this: add another argument, fill = ~(mean >= 61), to the gf_histogram() function, like this:
gf_histogram(~ mean, data=SDoM, fill = ~(mean >= 61), bins=100)
This addition says: check each mean to see if it is greater than or equal to 61. If it is, fill with one color; if it isn’t, fill with another color.
Like the previous one, this histogram represents 10,000 simulated samples of n=157 from a population with mean of 60.1 and standard deviation of 8.73. But now all of the simulated samples with means of greater than or equal to 61 have been shaded green. So, if you can eyeball the proportion of the histogram that is shaded green, you can get a sense of how likely it would be to draw a sample of 157 thumbs with an average length greater than 61 mm.
[Figure L_Ch9_Reasoning_2: the same histogram, with means greater than or equal to 61 shaded in a different color]
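To see what the expression inside fill is doing, it can help to try the same comparison on a handful of numbers. The sketch below uses a small made-up vector (hypothetical values, not taken from SDoM) and base R only:

```r
# a few made-up sample means (hypothetical values, for illustration only)
example_means <- c(59.8, 60.4, 61.0, 61.7)

# the comparison is applied to each mean in turn and returns TRUE or FALSE
is_at_least_61 <- example_means >= 61
print(is_at_least_61)

# counting the TRUEs gives the number of means that are >= 61
n_at_least_61 <- sum(is_at_least_61)
print(n_at_least_61)
```

When the same comparison is passed to fill, each mean is assigned to the TRUE group or the FALSE group, and the two groups are drawn in different colors.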
To get the exact number of means of the 10,000 in our simulated sampling distribution that are greater than or equal to 61, we can use the tally() function.
tally(~ mean >= 61, data = SDoM)
[Figure L_Ch9_Reasoning_3: output of tally() for means greater than or equal to 61]
Because we conveniently simulated 10,000 samples, it is easy to read the count as a proportion: .0946 (or about .10) of the means are greater than or equal to 61. But we can also modify tally() to format the output as a proportion rather than a frequency. Modify the following code to return proportions.
require(mosaic)
require(ggformula)
require(supernova)
require(tidyverse)
Fingers <- supernova::Fingers
Thumb.stats <- favstats(~ Thumb, data = Fingers)
# fix the random seed so the simulation is reproducible
set.seed(2)
# this creates the distribution of means
SDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))
# starting code: returns a frequency
tally(~ mean >= 61, data = SDoM)
# modified to return a proportion
tally(~ mean >= 61, data = SDoM, format = "proportion")
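A proportion is just the count divided by the number of simulated samples, so the result can also be checked by hand. The sketch below is a base R approximation of the simulation, not the course's own code: it uses replicate() in place of do(), plugs in the values 60.1 and 8.73 from the text in place of Thumb.stats, and uses an arbitrary seed, so its proportion should land near the one reported above without matching it exactly.

```r
# assumed values, taken from the text: population mean, SD, and sample size
mu <- 60.1
sigma <- 8.73
n <- 157

set.seed(2)  # arbitrary seed, only so that this sketch is reproducible

# simulate 10,000 sample means (a base R stand-in for do(10000) * mean(...))
sim_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# count of means >= 61, then the same information expressed as a proportion
count_61 <- sum(sim_means >= 61)
prop_61 <- count_61 / length(sim_means)

# the mean of a TRUE/FALSE vector is also the proportion of TRUEs
prop_61_alt <- mean(sim_means >= 61)

print(count_61)
print(prop_61)
```

Both routes to the proportion give the same number, which is what format = "proportion" reports directly.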
Using the same sampling distribution (SDoM), see if you can answer a new question: If the population distribution is normal with a mean of 60.1 and standard deviation of 8.73, what are the chances of randomly selecting a sample of n=157 with a mean of greater than or equal to 65 mm?
First, modify the code below to answer this question visually using a shaded histogram.
# starting code: shades means >= 61
gf_histogram(~ mean, data = SDoM, fill = ~(mean >= 61), bins = 100)
# modified to shade means >= 65
gf_histogram(~ mean, data = SDoM, fill = ~(mean >= 65), bins = 100)
[Figure L_Ch9_Reasoning_4: histogram of the simulated sample means, with means greater than or equal to 65 shaded]
Now write a tally() command in the window below to find out exactly what proportion of the samples in your simulated sampling distribution (SDoM) had means greater than or equal to 65.
# tally the proportion of simulated means that are >= 65
tally(~ mean >= 65, data = SDoM, format = "proportion")
Just as we recommend that you visually explore your data before you model it, it’s also a good idea to create a shaded histogram when reasoning forward from a sampling distribution. Even experienced statisticians make mistakes. Looking at the histogram gives us a way to check whether the output we get from R (e.g., from the tally() function) is reasonable.
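Another possible check, offered as an optional cross-check and not as part of the simulation approach used here: if the population really is normal, the sampling distribution of the mean is also normal, with standard deviation \(\sigma/\sqrt{n}\). The sketch below assumes the values 60.1, 8.73, and n=157 from the text and uses base R's pnorm() to compute the two upper-tail probabilities directly.

```r
# assumed values, taken from the text
mu <- 60.1
sigma <- 8.73
n <- 157

# standard deviation of the sampling distribution of the mean
se <- sigma / sqrt(n)

# probability of a sample mean above 61, and above 65, under these assumptions
p_above_61 <- pnorm(61, mean = mu, sd = se, lower.tail = FALSE)
p_above_65 <- pnorm(65, mean = mu, sd = se, lower.tail = FALSE)

print(se)
print(p_above_61)
print(p_above_65)
```

The first probability should come out close to the simulated proportion of about .10, and the second should be vanishingly small, which agrees with a histogram in which almost nothing is shaded at 65 and above.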
Try asking one more question of SDoM. What is the likelihood of randomly selecting a sample of n=157 from the same DGP with a mean that is smaller than 59 mm?
Use the code window below to calculate the exact proportion of the 10,000 simulated samples with means smaller than 59 mm, and then create a shaded histogram to confirm your result.
# what proportion of simulated means are less than 59 mm?
tally(~ mean < 59, data = SDoM, format = "proportion")
# make a histogram to eyeball whether your answer seems reasonable
gf_histogram(~ mean, data = SDoM, fill = ~(mean < 59))
In reasoning forward, we:
1. Hypothesize a DGP and/or population distribution.
2. Generate a sampling distribution from the assumed DGP/population.
3. Use the sampling distribution to calculate the likelihood of getting certain sample means if the assumptions about the DGP/population are true.
In reasoning forward, the distribution triad is ordered like this: (1) DGP/population, (2) sampling distribution, (3) sample. You can arrive at a hypothetical DGP/population in a number of ways: you can think about the process itself (as with die rolls), you can pick any number you want, or you can use information from your sample to help you estimate what the population might be. The important point is that you start by assuming something about the population, and you end with possible samples.
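The three steps of reasoning forward can be collected into one small function. This is a minimal sketch in base R, assuming a normal population; the name reason_forward() and its arguments are hypothetical, made up for this illustration, and the seed is arbitrary.

```r
# reason_forward(): a hypothetical helper, not part of any package
reason_forward <- function(mu, sigma, n, cutoff, n_sims = 10000) {
  # Step 1: the IF. Hypothesize a normal population with mean mu and SD sigma.
  # Step 2: generate a sampling distribution of means from that population.
  sim_means <- replicate(n_sims, mean(rnorm(n, mean = mu, sd = sigma)))
  # Step 3: proportion of simulated sample means at or above the cutoff.
  mean(sim_means >= cutoff)
}

set.seed(2)  # arbitrary seed, only so that this sketch is reproducible

# IF the population mean is 60.1 and the SD is 8.73,
# how likely is a sample of n = 157 with a mean of 61 or more?
p_61 <- reason_forward(mu = 60.1, sigma = 8.73, n = 157, cutoff = 61)
print(p_61)
```

Changing the IF (for example, hypothesizing a different mu) changes the sampling distribution, and with it the likelihood of any particular sample mean.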