
Finding the 95% Confidence Interval with Simulation

Now all we need to do to construct the 95% confidence interval around our estimate of mean thumb length is find the critical distance (also called margin of error). We can do this in several ways; let’s start by using simulation.

Simulating a Sampling Distribution Centered at the Sample Mean

We just established in the previous section that the sampling distribution of means centered around our sample mean can be used to calculate the critical distance. Let’s start by simulating the sampling distribution, and then use that sampling distribution to find the critical distance.

Starting with the mean and standard deviation estimated from our sample, and the assumption that the population is normally distributed, we can simulate a sampling distribution of means for samples of n=157, centered at the estimated population mean (see picture). We’ll call the resulting data frame simSDoM to stand for “simulated Sampling Distribution of Means”.

simSDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))

This R code produces a data frame with 10,000 simulated sample means. The mean of this sampling distribution will be very close to 60.1, which is our estimate of \(\mu\) based on our sample.
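If you want to check this for yourself, here is one quick sketch (not part of the exercises) using favstats(), which we have been using to summarize variables. The mean column of the output should be very close to 60.1, though the exact value will vary a little from run to run.

# summarize the 10,000 simulated sample means; the mean should be close to 60.1
favstats(~ mean, data = simSDoM)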

To find the critical distance, let’s find the range within which 95% of sample means would lie, assuming that \(\mu\) is 60.1. To find this range, we need to find the point below which the lowest 2.5% of simulated sample means fall, and the point above which the highest 2.5% fall.

A simple way to do this is just to arrange our 10,000 simulated means in order from lowest to highest. Given that we simulated 10,000 means, we know that the lowest 250 means will be below the 2.5% cutoff for the lower bound of the confidence interval, and the highest 250 means, above the 2.5% cutoff for the upper bound.


We’ll use arrange() to put the means in simSDoM in descending order.

arrange(simSDoM, desc(mean))

Save the arranged data frame back into the same data frame, simSDoM. Then use head() and tail() to look at the first and last six lines of the newly arranged data frame.

# arrange simSDoM in descending order by mean
simSDoM <- arrange(simSDoM, desc(mean))

# print first 6 lines of simSDoM
head(simSDoM)

# print last 6 lines of simSDoM
tail(simSDoM)

Notice that the means are in descending order, so as you go down the column labeled mean, the numbers get smaller. If we want to find the cutoff point for the highest 2.5% of the distribution, we simply need to identify the 250th mean from the top. We will call this the upper cutoff. To find the lower cutoff, we look at the 250th mean from the bottom.

We can use the following R code to find the 250th mean from the top.

simSDoM$mean[250] 


Type in the code to find the upper cutoff (the 250th mean from the top) and the lower cutoff (the 9750th mean).

# print the 250th mean (the upper cutoff)
simSDoM$mean[250]

# print the 9750th mean (the lower cutoff)
simSDoM$mean[9750]


We’ve provided the code to find the critical distance from the upper cutoff to the mean. Write code to find the critical distance from the mean to the lower cutoff.

# the critical distance from the upper cutoff to the mean
simSDoM$mean[250] - Thumb.stats$mean

# the critical distance from the mean to the lower cutoff
Thumb.stats$mean - simSDoM$mean[9750]

We find that the critical distance above the mean and below the mean is roughly 1.4 millimeters.
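As an aside, R’s quantile() function offers a shortcut that skips the arranging and counting. This sketch (not part of the exercises) should produce cutoffs, and critical distances, very close to the ones we found by indexing.

# the 2.5th and 97.5th percentiles of the simulated means (the two cutoffs)
quantile(simSDoM$mean, c(.025, .975))

# critical distances above and below the sample mean (both roughly 1.4)
quantile(simSDoM$mean, .975) - Thumb.stats$mean
Thumb.stats$mean - quantile(simSDoM$mean, .025)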


We know from the Central Limit Theorem that the shape of this sampling distribution is approximately normal, and thus symmetrical. We can see this in the simulated distribution of means: there seem to be as many means left of the center as right of the center. Therefore, once we find out where the lower cutoff is, we know where the upper one will be.
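We can also see this symmetry directly. The sketch below (using the same graphing functions as before, and assuming simSDoM has already been arranged in descending order) plots the simulated sampling distribution with vertical lines at the two cutoffs; the lines should sit about the same distance from the center.

# histogram of the simulated sampling distribution with both cutoffs marked
# (assumes simSDoM has been arranged in descending order, as above)
gf_histogram(~ mean, data = simSDoM, bins = 100) %>%
  gf_vline(xintercept = simSDoM$mean[250], color = "darkorange3") %>%
  gf_vline(xintercept = simSDoM$mean[9750], color = "darkorange3")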

Constructing the Confidence Interval from the Critical Distance

We used the simulated sampling distribution centered on the sample mean in order to calculate the critical distance. It is important to remember, however, that our real interest is in the range of possible population means from which our sample could have been drawn.

Looking again at the picture we presented earlier, we can see that the critical distance (1.4 mm) from the sample mean to each of its 2.5% tails is exactly the same distance as from the sample mean to the lower and upper bounds of the 95% confidence interval.

If we remove the sampling distribution centered around the sample mean from the picture, we can clearly see that the critical distance from the lower bound up to its upper 2.5% cutoff ends precisely at the sample mean. And the same distance from the upper bound down to its lower 2.5% cutoff also ends at the sample mean.

The key point is this: if the true population mean were at the lower bound, there would be only a 2.5% chance of drawing a random sample with a mean as high as the one we observed (60.1 mm) or higher. And if the true population mean were at the upper bound, there would be only a 2.5% chance of drawing a sample with a mean as low as the one we observed or lower.

Based on this simulation, we can say that the 95% confidence interval—the range within which we are 95% confident the true population mean lies—is 60.1 plus or minus 1.4, or from 58.7 to 61.5.
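In code, constructing the interval is just addition and subtraction (here we use the rounded critical distance of 1.4; you could instead use the exact cutoff values from simSDoM).

# lower and upper bounds of the 95% confidence interval
Thumb.stats$mean - 1.4
Thumb.stats$mean + 1.4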

A Little Excursion to Test Our Logic

We constructed our 95% confidence interval by simulating a sampling distribution of 10,000 means, centered at our observed sample mean. But our real question was not about the tails of this distribution, but about the tails of the sampling distributions that would be centered on different possible population means.

For example, if our lower boundary of the 95% confidence interval is correct, then we would reason that there would be a less than 2.5% chance of selecting a sample with a mean as high as 60.1 (the one we observed) if the true population mean were 58.7 mm (60.1 - 1.4). This makes sense logically. Let’s see if we can test it with a simulation.

Let’s use R to generate a simulation of 10,000 means from samples of 157 drawn randomly from a normally distributed population with a mean of 58.7. We’ll continue to use the standard deviation from our sample as an estimate of the population standard deviation. Plot this sampling distribution of means as a histogram and include a vertical line at the population mean (58.7) and at the sample mean (60.1).

# generate 10,000 sample means from a population with a mean of 58.7
simSDoMlower <- do(10000) * mean(rnorm(157, 58.7, Thumb.stats$sd))

# histogram of this sampling distribution, with a vertical line at the
# hypothesized population mean (58.7) and another at our sample mean (60.1)
gf_histogram(~ mean, data = simSDoMlower) %>%
  gf_vline(xintercept = 58.7, color = "darkorange3") %>%
  gf_vline(xintercept = Thumb.stats$mean, color = "black") %>%
  gf_labs(title = "Lower Bound Sampling Distribution of the Mean")

Now the question is: where does the sample mean fall relative to the highest 2.5% of simulated sample means? We can sort the means in descending order and use R to color any means higher than the 250th simulated mean (mean > mean[250]) differently. We’ll also mark the sample mean with a vertical line.

# sort the simulated means in descending order
simSDoMlower <- arrange(simSDoMlower, desc(mean))

# color means above the 250th (the highest 2.5%) and mark the sample mean
gf_histogram(~ mean, data = simSDoMlower, fill = ~ mean > mean[250], bins = 100) %>%
  gf_vline(xintercept = Thumb.stats$mean)

Just as we predicted, our sample mean (60.1) falls right on the upper cutoff of this sampling distribution! For reasons you may now be starting to understand, if we move our sampling distribution so that its mean is about 1.4 mm (the critical distance) to the left of our observed mean, then our sample mean becomes the upper cutoff.

We can do the same thing in the opposite direction. If we move the simulated distribution 1.4 millimeters to the right of our sample mean (\(60.1+1.4 = 61.5\)), our sample mean would be at the cutoff point for the lower 2.5% of means that could be generated by such a population.
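If you want to try this yourself, here is a sketch that mirrors the code above (the data frame name simSDoMupper is just our own label): it simulates a sampling distribution centered at the upper bound, 61.5, and colors the lowest 2.5% of simulated means. The sample mean should land right around the lower cutoff of this distribution, just as it landed on the upper cutoff of the distribution centered at 58.7.

# simulate a sampling distribution of means centered at the upper bound (61.5)
simSDoMupper <- do(10000) * mean(rnorm(157, 61.5, Thumb.stats$sd))
simSDoMupper <- arrange(simSDoMupper, desc(mean))

# color the lowest 250 means (the lower 2.5% tail) and mark the sample mean
gf_histogram(~ mean, data = simSDoMupper, fill = ~ mean < mean[9750], bins = 100) %>%
  gf_vline(xintercept = Thumb.stats$mean)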

Putting all this together, we get the picture below. What it reveals is that the 95% confidence interval for the true population mean extends from 1.4 millimeters below to 1.4 millimeters above the mean of our sample.


Interpreting the Simulated Confidence Interval

We have developed the argument that we can be 95% confident that the true population mean lies somewhere within our 95% confidence interval. Let’s think about what this means.


If we are 95% confident in this interval, how should we think about the other 5%? The remaining 5% is the chance of error: the chance that the true population mean lies either below the lower bound of the confidence interval or above the upper bound. If our confidence interval for mean thumb length runs from 58.7 to 61.5, there is a 5% chance that the true population mean falls outside this interval (e.g., at 57 or 62 or some other number).
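One way to make this concrete is with another sketch (not part of the exercises; the data frame name checkSDoM is our own): pretend for a moment that the true population mean really is 60.1, simulate 10,000 sample means, and see what proportion land within the critical distance of 1.4 mm of the truth. Roughly 95% should; the other roughly 5% are exactly the samples whose confidence intervals would miss the true mean.

# pretend the true population mean is 60.1 and simulate 10,000 sample means
checkSDoM <- do(10000) * mean(rnorm(157, 60.1, Thumb.stats$sd))

# proportion of simulated sample means within 1.4 mm of the true mean (about .95)
sum(abs(checkSDoM$mean - 60.1) < 1.4) / 10000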

Let’s flesh this out by thinking concretely about how we might be wrong. Again, we are using the “If…” move: What if the real population parameter were 58.5, a number slightly outside our confidence interval?

simSDoMoutside <- do(10000) * mean(rnorm(157, 58.5, Thumb.stats$sd))
simSDoMoutside <- arrange(simSDoMoutside, desc(mean))
gf_histogram(~ mean, data = simSDoMoutside, fill = ~ mean > mean[250], bins = 100) %>%
  gf_vline(xintercept = Thumb.stats$mean) %>%
  gf_labs(title = "Sampling distribution if we are wrong (population mean = 58.5)")

If the true population mean were 58.5, our observed mean (60.1, represented by the black line) would be slightly above the 2.5% cutoff. Could our sample have been an unlikely one? That is, could we have drawn an unusually high sample mean? Yes. It would only happen 2.5% of the time, but it certainly can happen. In fact, the same process that generates the likely samples, random chance, also generates the unlikely ones. It is also unlikely to roll a die 10 times and get a 6 every time, but it is still possible. Maybe that is what happened here: the population mean was 58.5, but we happened to draw an unlikely sample. If so, and we used that sample to calculate a confidence interval, we were misled.
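To put a number on the die example: the chance of rolling a 6 ten times in a row is \((1/6)^{10}\), about 1 in 60 million. A quick calculation in R confirms it is tiny, but not zero.

# probability of rolling a 6 on all 10 rolls of a fair die
(1/6)^10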

Based on our sample, we thought the mean of the population could not be lower than the lower bound in our confidence interval, but in fact, it could be. The 95% idea keeps us aware of the fact that we might be wrong.

No Such Thing as 100% Confident in Statistics

You might wonder: why not construct a 100% confidence interval? That way there would be a 0% chance of being wrong! Unfortunately, that is not possible; there is no such thing as 100% certainty in statistics.

You might think that if you shift the hypothesized population mean far enough to the left (as in the picture below), there would be a 0% chance of getting a sample mean as high as 60.1. But remember, the normal distribution, which is a pretty good model of the sampling distribution of means, has tails that go on forever. While the chance of an extreme value may approach 0, it will never actually be 0.
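We can see this with pnorm(), which returns probabilities from a normal distribution. In the sketch below, the hypothesized population mean of 50 is just an arbitrary value far to the left of our sample mean (not something estimated from our data), and we use the sample standard deviation divided by the square root of the sample size as the standard deviation of the sampling distribution. The result is astronomically small, but it is not zero.

# probability of a sample mean of 60.1 or higher if the population mean were 50
pnorm(60.1, mean = 50, sd = Thumb.stats$sd / sqrt(157), lower.tail = FALSE)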

When constructing confidence intervals we must first specify how confident we want to be in our estimate of the range of possible values of the true population mean—and 100% is not an option! A conventional starting point is 95%. But you can choose 90% or 99%, or any other number. Just not 100%.
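If you wanted a 90% or a 99% interval instead, the simulation approach barely changes: just move the cutoffs. This sketch uses quantile() on the simulated means; the distances from the sample mean to these cutoffs would be the 90% and 99% critical distances.

# cutoffs for a 90% confidence interval (middle 90% of simulated means)
quantile(simSDoM$mean, c(.05, .95))

# cutoffs for a 99% confidence interval (middle 99% of simulated means)
quantile(simSDoM$mean, c(.005, .995))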
