Course Outline

list Introduction to Statistics: A Modeling Approach

Simulating Samples to Create a Sampling Distribution

Simulating Some Samples of Thumb Lengths

Now let’s extend our simulation to illustrate the concept of a sampling distribution. We will assume the same things about the DGP as in the previous simulation: normal distribution with a \(\mu\) of 60.1 and a \(\sigma\) of 8.73.

L_Ch9_More_9

Let’s start by generating a single sample of n=157 thumb lengths from the same assumed DGP. We could use this code:

rnorm(157, 60.1, 8.73)

But we will often choose to write this code:

rnorm(157, Thumb.stats$mean, Thumb.stats$sd)

Mostly we use the latter because we find it annoying to have to remember what the numbers are. Also, an added benefit is that when we just use Thumb.stats, we don’t have to worry about rounding our numbers; R will just use the super-long decimal because computers are good at remembering all the digits.

Why, you may ask, are you simulating a sample of 157 thumb lengths? The reason is that 157 is the exact size of the sample we collected for our study. Simulation of samples of n=157 lets us explore the kind of variation we might see among random samples of 157 thumbs.

So we’ll run this code rnorm(157, Thumb.stats$mean, Thumb.stats$sd) and put it into a data frame and all that in order to plot the simulated thumb lengths in a histogram. We’ll put simulated Thumb data in green (specifically “springgreen3”) so you don’t confuse it for real data.

L_Ch9_More_10

Let’s generate two more samples, and look at all three simulated samples next to each other.

L_Ch9_More_11

Now look at the histogram below. This is our actual distribution (again) of the thumb lengths produced by our sample of 157 students.

Compare the actual distribution with the three simulated sample distributions. Notice that our real sample basically doesn’t look that different from the simulated samples. The simulated samples all vary to some degree and our sample could blend in with them.

Already, just by looking at some simulated samples of the same size we studied, we can get a sense of what our data could have looked like if we had selected a different sample. It certainly seems reasonable, based on just looking at histograms, that our sample could have been generated with the DGP we assumed for our simulations: a normal distribution with a mean of 60.1 and a standard deviation of 8.73.

Simulating a Sampling Distribution of Mean Thumb Lengths

Up to this point we have generated some samples of n=157 and examined their distributions visually. To construct a sampling distribution, however, we need to compute a sample statistic such as the mean, and do it for a large number of simulated samples.

We learned before how to calculate the mean of 24 randomly generated die rolls.

mean(resample(1:6, 24))

Then we learned how to generate individual scores from a hypothetical normal distribution.

rnorm(1, Thumb.stats$mean, Thumb.stats$sd)

Let’s try to put these two ideas together. Modify this code to get the mean of 157 randomly generated scores from our assumed DGP.

require(mosaic) require(ggformula) require(supernova) require(tidyverse) Fingers <- supernova::Fingers Thumb.stats <- favstats(~ Thumb, data = Fingers) custom_seed(1) # modify this code to calculate the mean of 157 Thumb lengths from the normal distribution we have been exploring mean(resample(1:6, 24)) # this sets the random numbers set.seed(1) # modify this calculatethe mean of 157 Thumb lengths from the normal distribution we have been exploring mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) test_function("mean") test_error()
DataCamp: ch9-15

It is important to note that this 60.86912 is not an individual thumb length but the average of 157 randomly generated thumb lengths.

Now let’s generate some more samples of n=157. Modify the code below using the do() function to generate 20 samples of n=157.

require(mosaic) require(ggformula) require(supernova) require(tidyverse) Fingers <- supernova::Fingers Thumb.stats <- favstats(~ Thumb, data = Fingers) custom_seed(2) # add do() to generate 20 means from samples of n=157 Thumb lengths mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) # this sets the random numbers set.seed(2) # add do() to generate 20 means from samples of n=157 Thumb lengths do(20) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) test_function("do") test_function("mean") test_error()
DataCamp: ch9-16

L_Ch9_More_12

Notice that the individual thumb lengths have some extreme thumbs like 74 mm and 40 mm. But the means of samples of n=157 are less extreme.

L_Ch9_More_13

In the previous section, we generated 10,000 individual thumb lengths to create a simulated population. Now let’s create a simulated sampling distribution. Modify the code we provided to calculate 10,000 means from samples of n=157 from the same hypothesized DGP as above. We’ll save the resulting 10,000 sample means in a data frame called SDoM, short for Sampling Distribution of Means. Then print out the first six rows of SDoM.

require(mosaic) require(ggformula) require(supernova) require(tidyverse) Fingers <- supernova::Fingers Thumb.stats <- favstats(~ Thumb, data = Fingers) custom_seed(2) # modify this code to save 10000 sample means SDoM <- do(20) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) # print the first 6 lines of SDoM # this sets the random numbers set.seed(2) # modify this code to save 10000 sample means SDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) # print the first 6 lines of SDoM head(SDoM) test_function("do") test_function("mean") test_function("head") test_error()
DataCamp: ch9-17

L_Ch9_More_14

Use the DataCamp window below to make a histogram of the variable mean in SDoM. Also calculate the favstats() for this distribution of means.

require(mosaic) require(ggformula) require(supernova) require(tidyverse) Fingers <- supernova::Fingers Thumb.stats <- favstats(~ Thumb, data = Fingers) custom_seed(2) SDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) # make a histogram of mean from SDoM # calculate the favstats of SDoM # make a histogram of mean from SDoM gf_histogram(~ mean, data = SDoM) # calculate the favstats of SDoM favstats(~ mean, data = SDoM) test_function("gf_histogram") test_function("favstats") test_error()
DataCamp: ch9-18

The resulting distribution is not a distribution of individual thumb lengths—it’s a distribution of means of samples of n=157. We can see from the histogram that even though we know that the population mean is 60.1, it’s still possible that a randomly selected sample of n=157 could have a mean as large as 62 mm or as small as 58 mm. Most of the means, however, seem to cluster around 60.1, the mean of the population from which the samples were drawn.

L_Ch9_More_15

Here the two simulated distributions are shown side by side: the population (on the left) and the sampling distribution (on the right). On the left, each data point represents the thumb length of an individual person. On the right, each data point represents the mean thumb length of a randomly sampled group of 157 people.

At first glance, they look fairly similar: normal shape, centered somewhere around 60. What about spread? Do these distributions have similar variation?

L_Ch9_More_16

The sampling distribution (in blue) is actually less spread out. But it’s hard to see this from the side-by-side histograms. The reason for this is that R, which is trying to help you make a pleasing graph, adjusts the scaling of the x-axis so as to use the available space.

If we want to produce histograms that are easier to compare visually, we can tell R how to scale the x-axis, and scale it the same for the two histograms. So, for both histograms, let’s chain on this additional code after the gf_histogram() function:

gf_lims(x = c(25,95))

We have added the code below to set the x-axis to go from 25 to 95 for the histogram of the simulated population (simThumb from simpop). Modify the code below that to make a histogram of the sampling distribution (mean from SDoM) scaled in the same way. Feel free to cut and paste! Cutting and pasting is an important part of coding!

require(mosaic) require(ggformula) require(supernova) require(tidyverse) Fingers <- supernova::Fingers Thumb.stats <- favstats(~ Thumb, data = Fingers) custom_seed(1) simThumb <- rnorm(20, Thumb.stats$mean, Thumb.stats$sd) simpop <- data.frame(simThumb) custom_seed(2) SDoM <- do(10000) * mean(rnorm(157, Thumb.stats$mean, Thumb.stats$sd)) # this makes the Simulated Population histogram scaled from 25 to 95 gf_histogram(~ simThumb, data = simpop, fill = "darkorange2", bins = 100) %>% gf_labs(title = "Simulated Population Distribution") %>% gf_lims(x = c(25,95)) # add code to scale the Sampling Distribution histogram from 25 to 95 gf_histogram(~ mean, data = SDoM, fill = "blue", bins = 100) %>% gf_labs(title = "Sampling Distribution (n = 157)") # this makes the Simulated Population histogram scaled from 25 to 95 gf_histogram(~ simThumb, data = simpop, fill = "darkorange2", bins = 100) %>% gf_labs(title = "Simulated Population Distribution") %>% gf_lims(x = c(25,95)) # add code to scale the Sampling Distribution histogram from 25 to 95 gf_histogram(~ mean, data = SDoM, fill = "blue", bins = 100) %>% gf_labs(title = "Sampling Distribution (n = 157)") %>% gf_lims(x = c(25, 95)) test_function("gf_histogram", index = 1) test_function("gf_labs", index = 1) test_function("gf_lims", index = 1) test_function("gf_histogram", index = 2) test_function("gf_labs", index = 2) test_function("gf_lims", index = 2) test_error()
DataCamp: ch9-19

Now when we put the two histograms side by side on the same scale we can clearly see that the sampling distribution of means for samples of n=157 is quite a bit less spread out than the simulated population distribution.

L_Ch9_More_17

When calculating probabilities from distributions it is critical to pay attention to the exact question being asked. Questions about individual scores are answered in reference to the distribution they come from—the population distribution. Questions about means of samples are answered in reference to sampling distributions, because sampling distributions are the ones that generate sample means.

Responses