Introduction to Statistics: A Modeling Approach

Using Bootstrapping to Construct a Confidence Interval

By now you should have a sense of what a sampling distribution is, and some intuition as to why we need sampling distributions to construct confidence intervals. Most of our ideas about sampling distributions have been developed through simulations. Although simulations help us to understand how all this works, they may not be as practical in a data analysis situation.

Simulations require us to make some assumptions about the DGP. For example, we have assumed in previous sections that the population distribution is normal in shape, and that it has a particular standard deviation, which we have estimated from our sample. We generated sampling distributions based on these assumptions. But not all DGPs are normal, so there are going to be plenty of situations where we don’t want to make this assumption.

The rise of cheap and fast computers has made popular an alternative approach to creating sampling distributions. This approach is called resampling, or bootstrapping. One cool thing about resampling techniques is that they use only the data you collect in your sample. So, unlike simulation, they don’t require you to make up anything else.

How Bootstrapping Works

We have already used the resample() function in R. We used it previously to create a sampling distribution of the means of samples of 24 die rolls. We created a vector with the numbers 1 to 6, then rigged up a DGP in which each number had an equal chance of being resampled.

Bootstrapping works the same way, but instead of making up the distribution to sample from, we resample from the actual data we have collected. Let’s start by explaining the process using our TinyFingers data set of six thumb lengths. Then we will apply the same techniques to our actual sample of 157 thumb lengths.

Here are the six thumb lengths from the TinyFingers data set again:

TinyFingers$Thumb

Let’s start by reviewing what the resample() command does.

resample(TinyFingers$Thumb, 6)

Here is the output of this command.

What resample() does here is take a new random sample of six observations from our data set of six observations. It samples with replacement, meaning that when it samples the first number, it then puts it back so it can be sampled again.

You can see that all of the numbers in the resampled data came from the original data set. But, some of the numbers in the original data were not selected (e.g. 68), while others (e.g. 60) were selected twice. It’s a random process, meaning each number has an equal chance of being selected each time a number is selected.
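
The resampling step described above can be sketched in base R. This is only an illustration: it uses sample() with replace = TRUE rather than the resample() function from the text, and the six values are hypothetical stand-ins, not the actual TinyFingers data.

```r
# Hypothetical stand-in for six thumb lengths (mm); not the actual TinyFingers values.
thumb <- c(56, 60, 61, 63, 64, 68)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Draw six values from the six observations, putting each one back after it is drawn.
boot_sample <- sample(thumb, size = length(thumb), replace = TRUE)
boot_sample

# Every resampled value comes from the original data...
all(boot_sample %in% thumb)

# ...but some originals may appear more than once and others not at all.
table(factor(boot_sample, levels = thumb))
```

The table in the last line counts how many times each original value was drawn, which makes the repeated and skipped values easy to spot.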

Bootstrapping a Sampling Distribution from our Fingers data set

Let’s now use resample() to bootstrap a sampling distribution of means we can use to help us interpret the mean thumb length we observed in our sample of 157 students.

Let’s start by creating just five bootstrapped sample means by running this code. We’ll save the bootstrapped means in a new data frame called bootSDoM. Note that all of the five means are based on our original data points.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    custom_seed(31)

    # This selects a bootstrapped sample and calculates the mean 5 times.
    # These 5 means are saved in bootSDoM.
    bootSDoM <- do(5) * mean(resample(Fingers$Thumb, 157))

    # This prints bootSDoM.
    bootSDoM

Modify the code now to create a sampling distribution of 10,000 means, then plot the means of this bootstrapped sampling distribution as a histogram.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    custom_seed(31)

    # Modify to bootstrap 10000 means (the starting code used do(5)).
    bootSDoM <- do(10000) * mean(resample(Fingers$Thumb, 157))

    # Make a histogram of these bootstrapped means.
    gf_histogram(~ mean, data = bootSDoM, fill = "darkblue") %>%
      gf_labs(title = "Bootstrapped Sampling Distribution of Means (n = 157)")
Don't forget to update the do() function!

Comparing the Bootstrapped Sampling Distribution with the Simulated Sampling Distribution

The histogram of our bootstrapped distribution of 10,000 means certainly looks like a sampling distribution. It’s normal in shape, and is centered on our observed sample mean of 60.1. This makes sense given that all the numbers sampled to go into these means came from our sample with a mean of 60.1.
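
To see why the bootstrapped distribution ends up centered on the observed sample mean, here is a minimal base R sketch of the same procedure. It is not the course's code: it swaps do() and resample() for replicate() and sample(), and because the Fingers data are not reproduced here, it generates a hypothetical stand-in sample of 157 values.

```r
set.seed(31)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the 157 thumb lengths (the real Fingers data are not used here).
thumb <- rnorm(157, mean = 60.1, sd = 8.73)

# One bootstrapped mean: resample all values with replacement, then average them.
boot_mean <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# Repeat 10,000 times to build the bootstrapped sampling distribution of means.
boot_means <- replicate(10000, boot_mean(thumb))

mean(thumb)       # the observed sample mean
mean(boot_means)  # the center of the bootstrapped distribution

hist(boot_means, main = "Bootstrapped Sampling Distribution of Means (n = 157)")
```

The two printed means should be very close to each other, since every bootstrapped sample is built only from the values in thumb.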

Let’s compare this bootstrapped distribution to a simulated distribution of 10,000 means centered on the same mean and based on estimates derived from the same sample of 157 students: a mean of 60.1 (Thumb.stats$mean) and a standard deviation of 8.73 (Thumb.stats$sd). Save the resulting sampling distribution in a data frame called simSDoM. (You have done this before; we are just asking you to do it again.)

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    custom_seed(31)
    Thumb.stats <- favstats(~ Thumb, data = Fingers)

    # This generates and plots a sampling distribution of 10000 bootstrapped means.
    bootSDoM <- do(10000) * mean(resample(Fingers$Thumb, 157))
    gf_histogram(~ mean, data = bootSDoM, fill = "darkblue") %>%
      gf_labs(title = "Bootstrapped Sampling Distribution of Means (n = 157)")

    # Modify this code to generate and plot a sampling distribution of 10000 simulated means.
    simSDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))
    gf_histogram(~ mean, data = simSDoM, fill = "blue") %>%
      gf_labs(title = "Simulated Sampling Distribution of Means (n = 157)")
Think about the difference between a sampling distribution of bootstrapped means and a sampling distribution of simulated means. What function could you use to simulate a distribution?

Now compare the two sampling distributions of means of samples of n=157. On the left is our bootstrapped sampling distribution. On the right is the simulated sampling distribution.

Go ahead and get the favstats() for Fingers$Thumb (the original variable with 157 thumb lengths), bootSDoM$mean, and simSDoM$mean.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    custom_seed(31)
    Thumb.stats <- favstats(~ Thumb, data = Fingers)
    bootSDoM <- do(10000) * mean(resample(Fingers$Thumb, 157))
    simSDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))

    # favstats for Thumb
    favstats(~Thumb, data = Fingers)

    # favstats for bootstrapped means
    favstats(~mean, data = bootSDoM)

    # favstats for simulated means
    favstats(~mean, data = simSDoM)
You'll want to use Fingers, bootSDoM, and simSDoM

Using the Bootstrapped Sampling Distribution to Construct the 95% Confidence Interval

We have established that the bootstrapped sampling distribution is nearly identical to the simulated sampling distribution. Although we used different methods to create them, we end up in the same place. Just based on the fact that the two distributions have the same standard errors, we can guess that the confidence interval we would construct from the bootstrapped distribution would be roughly the same as what we got from the simulated distribution.
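
As a rough check on the claim about standard errors, the sketch below builds both kinds of sampling distribution in base R and compares their standard deviations with the usual formula for the standard error of the mean, s / sqrt(n). The data are a hypothetical stand-in for the 157 thumb lengths, and replicate() and sample() are substitutes for the do() and resample() functions used in the text.

```r
set.seed(31)  # arbitrary seed so the sketch is repeatable

n <- 157
# Hypothetical stand-in sample (the real Fingers data are not used here).
thumb <- rnorm(n, mean = 60.1, sd = 8.73)

# Bootstrapped means: resample the observed values with replacement.
boot_means <- replicate(10000, mean(sample(thumb, size = n, replace = TRUE)))

# Simulated means: draw from a normal DGP with the sample's mean and SD.
sim_means <- replicate(10000, mean(rnorm(n, mean(thumb), sd(thumb))))

# Compare the two standard errors with the formula-based value.
c(bootstrap  = sd(boot_means),
  simulation = sd(sim_means),
  formula    = sd(thumb) / sqrt(n))
```

The three numbers should agree closely, which is why the two approaches lead to similar confidence intervals.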

The code below finds the confidence interval using the simulated sampling distribution centered at the sample mean. As before, we sort the means in descending order and then print the 250th mean, which is the upper bound of the 95% confidence interval (roughly 2.5% of the 10,000 means lie above it). We also print the 9,750th mean, which is the lower bound.

Add code to find the confidence interval using the bootstrapped distribution. See how the confidence intervals compare using the two different distributions.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    custom_seed(31)
    Thumb.stats <- favstats(~ Thumb, data = Fingers)
    bootSDoM <- do(10000) * mean(resample(Fingers$Thumb, 157))
    simSDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd))

    # This calculates the critical distance from a simulated sampling distribution.
    simSDoM <- arrange(simSDoM, desc(mean))
    simSDoM$mean[250]
    simSDoM$mean[9750]

    # Modify to find the confidence interval from the bootstrapped sampling distribution (bootSDoM).
    bootSDoM <- arrange(bootSDoM, desc(mean))
    bootSDoM$mean[250]
    bootSDoM$mean[9750]
Have you changed the variable to bootSDoM?

The two confidence intervals are very close! With simulation, we get a confidence interval of 58.75 to 61.47; with bootstrapping, we get 58.75 to 61.42.
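
Sorting the means and reading off the 250th and 9,750th values amounts to taking the 97.5th and 2.5th percentiles of the sampling distribution. The base R sketch below shows both routes side by side. It runs on a hypothetical stand-in sample rather than the Fingers data, so its interval will not match the numbers reported above.

```r
set.seed(31)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the 157 thumb lengths.
thumb <- rnorm(157, mean = 60.1, sd = 8.73)

# Bootstrapped sampling distribution of 10,000 means.
boot_means <- replicate(10000, mean(sample(thumb, replace = TRUE)))

# Route 1: sort in descending order and index, as in the text.
sorted <- sort(boot_means, decreasing = TRUE)
upper <- sorted[250]
lower <- sorted[9750]
c(lower = lower, upper = upper)

# Route 2: ask for the 2.5th and 97.5th percentiles directly.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

The two routes can differ slightly in the last decimal places because quantile() interpolates between neighboring values, but for 10,000 means the difference is negligible.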
