Course Outline

list Introduction to Statistics: A Modeling Approach

Sampling Distribution of PRE and F

A Sampling Distribution of PRE

As it turns out, we can construct a sampling distribution of any statistic we calculate from our sample, including our estimate of PRE. We are also going to introduce a new function called PRE(). This function gives us another way to calculate PRE from our sample.

PRE(Tip ~ Condition, data = TipExperiment)

We can also save this PRE in an R object (we’ll just call it samplePRE). Write some code to print out the samplePRE. Try running all of it in the DataCamp window below.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) # calculates PRE (of a model explaining variation in Tip with Condition) # based on our TipExperiment data samplePRE <- PRE(Tip ~ Condition, data = TipExperiment) # write code to print out the samplePRE # calculates PRE (of a model explaining variation in Tip with Condition) # based on our TipExperiment data samplePRE <- PRE(Tip ~ Condition, data = TipExperiment) # write code to print out the samplePRE samplePRE test_object("samplePRE") test_output_contains("samplePRE") test_error() success_msg("Nice!")
DataCamp: ch11-6

Notice that we just get the same PRE we got in the ANOVA table (from the function supernova()).

Now we will we use our “If…” thinking to imagine a DGP with no relationship between Condition and Tip. The shuffle() function will help us break whatever relationship exists between these two variables. This technique is sometimes called randomization (it is also sometimes called the permutation approach).

Below is the R code necessary to calculate the PRE of the Condition model for one shuffled (or randomized) set of data. This code will 1) shuffle around the values of Condition, 2) create a model of Tip using Condition as an explanatory variable, then 3) calculate the PRE of that complex model.

PRE(Tip ~ shuffle(Condition), data = TipExperiment)

Modify the code written in the DataCamp window below to include the idea of “shuffling the values of Condition.”

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(20) # modify this to calculate PRE based on shuffled values of Condition PRE(Tip ~ Condition, data = TipExperiment) PRE(Tip ~ shuffle(Condition), data = TipExperiment) test_function("shuffle") test_function("PRE") test_error() success_msg("That was a tough one. Way to think through it!")
DataCamp: ch11-7

We got a very small PRE when we shuffled the values of Condition around.

L_Ch11_SamplingPRE_1

Now let’s extend this code with the function do() to simulate 10,000 PREs. Modify the bit of code in the DataCamp window to calculate 10,000 PREs from randomized data. Save them into a data frame called SDoPRE (Sampling Distribution of PREs).

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(21) # modify this to calculate 10,000 PREs based on shuffled data SDoPRE <- PRE(Tip ~ shuffle(Condition), data = TipExperiment) # print the first 6 lines of SDoPRE SDoPRE <- do(10000) * PRE(Tip ~ shuffle(Condition), data = TipExperiment) head(SDoPRE) test_function("do") test_function("PRE") test_object("SDoPRE") test_function("head") test_error() success_msg("Great thinking!")
DataCamp: ch11-8

This code is using a randomization process to generate the sampling distribution. It starts with the data points collected by the researchers, but then re-assigns each data point, randomly, to simulated groups (Smiley Face and Control).

Up to this point, it’s very similar to the process we used above to create a sampling distribution of the differences of means. But this time, once we create the two simulated groups, instead of calculating their means and taking the difference, we fit the linear model and calculate PRE. And, we do this 10,000 times.

The resulting sampling distribution is plotted in the histogram below. We’ve also printed out the favstats.

gf_histogram(~ PRE, data = SDoPRE, fill = "orange", alpha = .0)favstats(~ PRE, data = SDoPRE)

L_Ch11_SamplingPRE_2

The shape of this sampling distribution of PRE is quite different from the sampling of means or of the differences between means. It is bunched up on the left, with a long tail on the right.

If a sampling distribution was normally distributed, we would calculate the probabilities in each of the two tails, adding them up to get the overall probability of a score or mean or mean difference as extreme as the one we observed. With the sampling distribution of PRE, we only need to look at one tail of the distribution. We know that PRE can never be less than 0. We just want to know how likely it is to get a PRE as high as the one we observed.

In the Smiley Face/Tipping study, the researchers observed a sample PRE of .0729. See if you can modify the histogram to color the PREs that are greater than the sample PRE in a different color. Also, find the proportion of the 10,000 randomized PREs in the sampling distribution that are greater than our sample PRE, .0729. Use the DataCamp window below (where we have already entered the code to save the samplePRE, and create the sampling distribution in the data frame SDoPRE).

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(213) samplePRE <- PRE(Tip ~ Condition, data = TipExperiment) SDoPRE <- do(10000) * PRE(Tip ~ shuffle(Condition), data = TipExperiment) # modify this to fill the PREs greater than the samplePRE in a different color gf_histogram(~ PRE, data = SDoPRE) # what is the proportion of randomized PREs that are greater than .0729? samplePRE <- PRE(Tip ~ Condition, data = TipExperiment) SDoPRE <- do(10000) * PRE(Tip ~ shuffle(Condition), data = TipExperiment) # modify this to fill the PREs greater than the samplePRE in a different color gf_histogram(~ PRE, data = SDoPRE, fill = ~PRE > samplePRE) # what is the proportion of randomized PREs that are greater than .0729? tally(~ PRE>samplePRE, data = SDoPRE, format = "proportion") test_object("SDoPRE") test_function("gf_histogram", args = c("data", "fill")) test_function("tally", args = c("data", "format")) test_error() success_msg("Nice work!")
DataCamp: ch11-9

In our sampling distribution, only .08 of the randomized samples had PREs greater than our sample PRE.

L_Ch11_SamplingPRE_3

Randomization is a DGP that makes any difference between group means meaningless. This DGP is emulating a world where the true PRE is actually 0. This sampling distribution of PREs shows us that even if the true DGP were a PRE of 0, there would still be a .08 chance of obtaining the observed difference between the Smiley Face and Control groups. So, just like before, because there is a greater than .05 chance that the observed difference could have been obtained just by chance, we will not reject the simple (or empty) model.

A Sampling Distribution of F

Let’s try the same approach, but this time let’s simulate a sampling distribution of F ratios instead of PREs.

First, let’s save our sample F as an R object using a new function called fVal().

sampleF <- fVal(Tip ~ Condition, data = TipExperiment)

Try printing out the sampleF in the DataCamp window.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) sampleF <- fVal(Tip ~ Condition, data = TipExperiment) # write code to print out the sampleF sampleF <- fVal(Tip ~ Condition, data = TipExperiment) # write code to print out the sampleF sampleF test_output_contains("sampleF") test_error() success_msg("Great thinking!")
DataCamp: ch11-10

L_Ch11_SamplingF_1

Like we did with the sampling distribution of PRE, we are going to imagine the same DGP where any difference between groups is just due to randomization. Here is the code for randomizing the same data 10,000 times, but this time saving the F values in a sampling distribution, and then plotting a histogram of the sampling distribution.

SDoF  <-  do(10000) * fVal(Tip ~ shuffle(Condition), data = TipExperiment)
gf_histogram(~ fVal, data = SDoF, fill = "firebrick", alpha = .9)

We display a histogram of the simulated sampling distribution of F (right) next to the one for PRE (left) so that you can compare them.

L_Ch11_SamplingF_2

The sample F observed in the actual study was 3.305. In the DataCamp window, modify the histogram to color the fVals that are greater than the sample F in a different color. Also, find the proportion of the 10,000 randomized Fs in the sampling distribution that are greater than our sample F. What do you think this proportion will be?

Use the DataCamp window below (where we have already entered the code to save the sampleF), and create the sampling distribution in the data frame SDoF.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(20) sampleF <- fVal(Tip ~ Condition, data = TipExperiment) SDoF <- do(10000) * fVal(Tip ~ shuffle(Condition), data = TipExperiment) sampleF <- fVal(Tip ~ Condition, data = TipExperiment) SDoF <- do(10000) * fVal(Tip ~ shuffle(Condition), data = TipExperiment) # modify this to fill the Fs greater than the sampleF in a different color gf_histogram(~ fVal, data = SDoF) # what is the proportion of randomized Fs that are greater than 3.305? sampleF <- fVal(Tip ~ Condition, data = TipExperiment) SDoF <- do(10000) * fVal(Tip ~ shuffle(Condition), data = TipExperiment) # modify this to fill the Fs greater than the sampleF in a different color gf_histogram(~ fVal, data = SDoF, fill = ~fVal > sampleF) # what is the proportion of randomized Fs that are greater than 3.305? tally(~ fVal>sampleF, data = SDoF, format = "proportion") test_object("SDoF") test_function("gf_histogram", args = c("data", "fill")) test_function("tally", args = c("data", "format")) test_error() success_msg("Great thinking!")
DataCamp: ch11-11

In the table below we have pasted in the supernova(Condition.model) result again. The PRE was .0729. The F was 3.305. We can use tally() to calculate the number of randomized samples with F ratios greater than 3.305 (out of 1000) just as we did for PRE. We have put the results below, again for both our randomized sampling distributions of PRE and F.

As you can see, we get almost exactly the same results: .08 out of 10,000 randomized samples have Fs greater than the observed F, compared with .08 of 10,000 randomized PREs. This value corresponds to the “p” column at the very end of the ANOVA table (.0762). This is often called the p-value.

L_Ch11_SamplingF_3

Responses