Introduction to Statistics: A Modeling Approach

Exploring the Variation in an Estimate

To explore the concept of variation in estimates, let’s go back to an example we explored in Chapter 3: the throwing of a die. We start there because we know the data generating process (DGP), and knowing it lets us investigate the variation across samples using simulation.

We know, and we confirmed earlier, that the long-run population of die rolls has a uniform distribution. If we roll a die one time, we don’t know if it will come out a 1, 2, 3, 4, 5, or 6. But if we roll a die 100,000 times (or roll 100,000 dice all at the same time), we should end up with a uniform distribution. Note that we picked 100,000 simply because it is a really big number; any big number would do (e.g., 10,000 or 259,240 or 17,821).

Using the resample() function, we simulated 100,000 die rolls.

giantsample <- resample(1:6, 100000)

Then we plotted this distribution in a histogram (below). You can see that the distribution is almost perfectly uniform, as we would expect. But it’s not perfect; there is some tiny variation across the six possible outcomes.
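One way to check how close to uniform the simulated rolls are, without a plot, is to tabulate them. The sketch below is only an illustration: it uses base R’s sample() with replace = TRUE as a stand-in for resample() so that it runs without loading mosaic, and the seed value and the name face_props are arbitrary choices, not part of the original code.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# Stand-in for resample(1:6, 100000): draw with replacement from 1 to 6
giantsample <- sample(1:6, 100000, replace = TRUE)

# Proportion of rolls landing on each face; each should be near 1/6
face_props <- table(giantsample) / length(giantsample)
round(face_props, 3)
```

Each proportion should land close to 1/6 (about 0.167), with small differences between faces reflecting the tiny variation described above.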

We learned in Chapter 3, though, that if we take a smaller sample of die rolls (let’s say n=24), we get a lot more variation among samples. And, perhaps most important, none of them look much like what we know the population looks like.

Here’s code we used to randomly generate a sample of n=24 die rolls.

sample <- resample(1:6, 24)

Each time we ran this function, we got a different looking distribution. Here are a few samples of 24 die rolls. (We put density plots on them so you could appreciate how different the overall shapes were.)
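To see this variation as numbers rather than shapes, the following sketch draws three samples of 24 rolls and counts how often each face appears in each one. It assumes base R’s sample() with replace = TRUE in place of resample(); the seed, the choice of three samples, and the names three_samples and face_counts are made up for illustration.

```r
set.seed(2)  # arbitrary seed for repeatability

# Draw three separate samples of 24 die rolls (one column per sample)
three_samples <- replicate(3, sample(1:6, 24, replace = TRUE))

# Count how often each face (1 to 6) appears in each sample;
# factor() keeps a row for a face even if it never came up
face_counts <- apply(three_samples, 2, function(rolls) {
  table(factor(rolls, levels = 1:6))
})
face_counts
```

In a perfectly uniform sample every count would be 4; in practice the columns typically differ from that and from each other.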

From Sample to Estimate

Up to this point we have discussed sampling variation in a qualitative way. We can see that the samples are different, and also that none of them look very much like what we know the population looks like.

But now we want to go further. Imagine that we didn’t know the DGP of this distribution, and we were analyzing one sample in order to find out.

We know a lot more about how to do this than we did in Chapter 3! Use the DataCamp window to generate a sample of n=24 die rolls and save it as sample1. We’ve added additional code to print out the 24 numbers that result and then calculate the mean of the 24 numbers (our parameter estimate).

require(mosaic)
require(ggformula)
require(supernova)
set.seed(4)

# Simulate a random sample of 24 die rolls
sample1 <- resample(1:6, 24)

# This will print the 24 numbers in sample1
sample1

# This will return the mean of these numbers
mean(sample1)

Our sample of n=24 had a mean of 3.875. If this were our only sample, we’d use 3.875 as our best estimate of the population mean. But because we know what the DGP looks like, we can know, in this case, what the mean of the population should be.


Let’s calculate the expected mean in two ways. First, let’s simulate 100,000 die rolls like we did above, and then calculate their mean. Modify the code in the DataCamp window to do this.

require(mosaic)
require(ggformula)
require(supernova)
set.seed(10)

# Generate a giant sample of die rolls
giantsample <- resample(1:6, 100000)

# Calculate the mean of the giant sample
mean(giantsample)

Here’s what we got for the mean of our giant sample of 100,000:

It’s very close to 3.5. Another way to get the expected mean is to calculate the mean of the equally likely outcomes of a die roll: 1, 2, 3, 4, 5, 6. Use the DataCamp window below to do this.

require(mosaic)
require(ggformula)
require(supernova)

# These are the possible outcomes of a die roll
diceoutcomes <- c(1, 2, 3, 4, 5, 6)

# Find the mean of the possible outcomes
mean(diceoutcomes)

Now you get exactly 3.5, which is what the population mean should be. It’s the exact middle of the numbers 1 to 6, and if each is equally likely, 3.5 would have to be the mean.
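The reasoning that equally likely outcomes force the mean to 3.5 can also be written as a probability-weighted sum. This is a minimal sketch in base R; the names probabilities and expected_mean are introduced here for illustration and do not appear in the original exercise.

```r
diceoutcomes <- c(1, 2, 3, 4, 5, 6)

# Each face of a fair die has probability 1/6
probabilities <- rep(1 / 6, 6)

# Expected value: add up each outcome times its probability
expected_mean <- sum(diceoutcomes * probabilities)
expected_mean
```

Because every outcome carries the same weight, this weighted sum agrees with mean(diceoutcomes).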

But recall our small sample (sample1, n=24), which had a mean of 3.875. Because we know the true population mean is 3.5, it’s easy in this case to quantify how far our sample mean is from the true mean of the population. The difference is 3.875 − 3.5, or 0.375.


Let’s try it. It’s easy enough to simulate another random sample of n=24. Let’s save it as sample2 and see what the mean is.

require(mosaic)
require(ggformula)
require(supernova)

# custom_seed() is not defined in this snippet; it appears to be a
# seeding helper provided by the course environment
custom_seed(12)

# Simulate another random sample of 24 die rolls
sample2 <- resample(1:6, 24)

# This will return the mean of those numbers
mean(sample2)

We know, in this case, that the mean of the population is 3.5. But if we were trying to estimate the mean based on our samples, we would have a problem. The two random samples of n=24 we have generated so far produced two different estimates of the population mean: 3.875 and 2.875 (sample1 and sample2, respectively).

Let’s look at a few more samples. But let’s not bother saving the result of each die roll in each sample. Instead, let’s just simulate a random sample of 24 die rolls and calculate the mean, all in one line of R code.

mean(resample(1:6, 24))

Try running the code to see what the mean is for a third random sample of 24 die rolls.

require(mosaic)
require(ggformula)
require(supernova)

# custom_seed() is not defined in this snippet; it appears to be a
# seeding helper provided by the course environment
custom_seed(13)

# Run this code
mean(resample(1:6, 24))

So far we have generated a diverse group of sample means, and no two are alike: 3.875, 2.875, 3.167.
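Repeating the one-line simulation by hand gets tedious, so here is a hedged sketch of how several sample means could be generated in one go. It uses base R’s replicate() and sample() with replace = TRUE rather than resample(); the seed, the choice of ten samples, and the name sample_means are arbitrary and not from the original text.

```r
set.seed(3)  # arbitrary seed for repeatability

# Ten samples of 24 die rolls each, keeping only the mean of each sample
sample_means <- replicate(10, mean(sample(1:6, 24, replace = TRUE)))

# The ten estimates of the population mean
round(sample_means, 3)

# Smallest and largest of the ten estimates
range(sample_means)
```

Every value is an estimate of the same population mean of 3.5, yet the estimates spread out around it.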


This line of reasoning raises a question: Are some sample means more likely than others?
