Course Outline

list Introduction to Statistics: A Modeling Approach

Model Comparison with a Quantitative Explanatory Variable

Using the F test to compare models with a quantitative explanatory variable turns out to be a simple extension. As long as we’re talking about thumb length, let’s take the case where Height (measured in inches), a quantitative variable, is used to explain thumb length. We learned in Chapter 8 how to visualize this model, and how to fit the model using lm().

gf_point(Thumb ~ Height, size=4, data=Fingers) %>%
gf_lm(color = "orange")

We visualize the relationship between Height and Thumb using a scatter plot, and add the best-fitting regression line to represent the model, \(Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\).

L_Ch11_Model_1

Remember that for Height to explain Thumb, knowing someone’s height would help you make a better guess about their thumb length than if you didn’t know their height. If Height did not explain any of the variation in Thumb (notice the “If..” thinking), we would model it with the empty model (\(Y_{i}=b_{0}+e_{i}\)). Under the empty model, every person’s thumb length would be predicted by the mean of Thumb (and random sampling). We could represent this by simply overlaying a horizontal line on the scatter plot, indicating that no matter your height, we would predict your thumb length as the same: the mean thumb length.

The two scatter plots above show the comparison we want to make. The points are the same in both plots. But the left panel shows the complex model, in which Height predicts Thumb, and the right panel shows the empty model, in which there is no effect of the Height parameter (it’s not even there because it is as if \(b_1 = 0\).)

Randomizing the Sampling Distribution of F

In order to compare these two models, we want to know if the F statistic we observe on fitting the complex model (left) could possibly have occured by chance if the empty model were true, meaning that there was a zero slope predicting Thumb with Height.

First, let’s find the F value we get by fitting the complex model to our data. We can do this in a number of ways. We can look at the supernova table for the Height.model. We can also use the fVal() function like this:

fVal(Thumb ~ Height, data = Fingers)

We can also use the fVal() function on a model like this:

fVal(Height.model)

Try using the fVal() function to get the F ratio for the Height model in the DataCamp window below.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5withR) require(Lock5Data) require(fivethirtyeight) Height.model <- lm(Thumb ~ Height, data = Fingers) # use fVal to get the F ratio for Height model # use fVal to get the F ratio for Height model fVal(Height.model) test_function("fVal") test_error() success_msg("Nice work!")
DataCamp: ch11-20

This seems like a pretty large F value (certainly larger than 4), meaning that it’s very likely we would reject the empty model in favor of our complex model. But just to make sure, let’s see where an F of 27.98 would fall in the sampling distribution if there were, in fact, no effect of Height on Thumb.

We have a couple of ways to get this sampling distribution. One way is to use do(), fVal(), and shuffle() to get a distribution of F values that result from repeated random re-shuffling of our Thumb scores and our Height scores. In other words, by randomly pairing each thumb length in our sample with a height in our sample, and by doing this many times (like 10,000 times), we end up with a distribution of samples that could occur, by chance, if the empty model were true.

Write code to create a randomized sampling distribution (called SDoF) under the assumption that Height has no relationship with Thumb. Print the first six lines of SDoF.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(51) # create a sampling distribution of F generated from the empty model # print the first 6 rows # create a sampling distribution of F generated from the empty model SDoF <- do(10000) * fVal(Thumb ~ shuffle(Height), data = Fingers) # print the first 6 rows head(SDoF) test_object("SDoF") test_function("head") test_error() success_msg("Great work!")
DataCamp: ch11-21

L_Ch11_RandomizingF_1

Use the DataCamp window below to make a histogram of the randomized F ratios.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",") Fingers$Height3Group <- factor(ntile(Fingers$Height, 3), levels = c(1:3), labels = c("short","medium","tall")) custom_seed(51) # SDoF has been created for you SDoF <- do(10000) * fVal(Thumb ~ shuffle(Height3Group), data = Fingers) # plot these randomized F values in a histogram # SDoF has been created for you SDoF <- do(10000) * fVal(Thumb ~ shuffle(Height3Group), data = Fingers) # plot these randomized F values in a histogram gf_histogram(~ fVal, data = SDoF, fill = "firebrick2", alpha = .9) test_function("gf_histogram") test_error() success_msg("Keep it up!")
DataCamp: ch11-22

Above is the randomized sampling distribution of F values that we get from 10,000 re-shufflings of Thumb and Height. It’s pretty obvious that our observed F of 27.98 would be highly unlikely to have occured just by chance if there were, in fact, no relationship between Height and Thumb. We can check that with the tally() function.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",") Fingers$Height3Group <- factor(ntile(Fingers$Height, 3), levels = c(1:3), labels = c("short","medium","tall")) custom_seed(51) # SDoF and sampleF has been created for you SDoF <- do(10000) * fVal(Thumb ~ shuffle(Height3Group), data = Fingers) sampleF <- fVal(Thumb ~ Height, data = Fingers) # how many of the randomized F values are greater than our sampleF? # SDoF has been created for you SDoF <- do(10000) * fVal(Thumb ~ shuffle(Height3Group), data = Fingers) sampleF <- fVal(Thumb ~ Height, data = Fingers) # how many of the randomized F values are greater than our sampleF? tally(~ fVal>sampleF, data = SDoF) test_function("tally") test_error() success_msg("Great job!")
DataCamp: ch11-23

If we run tally on our 10,000 randomized Fs, we can see that none of the randomly generated samples had Fs greater than 27.98.

Just for fun, let’s make a sampling distribution with only 1,000 shuffled Fs and save it in an R data frame called SDoF1000. Plot those randomized fVals in a histogram.

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(88) # create a sampling distribution with 1000 Fs from shuffled data SDoF1000 <- # depict the sampling distribution in a histogram # create a sampling distribution with 1000 Fs from shuffled data SDoF1000 <- do(1000) * fVal(Thumb ~ shuffle(Height), data = Fingers) # depict the sampling distribution in a histogram gf_histogram(~ fVal, data = SDoF1000, fill = "firebrick3", alpha = .9) test_object("SDoF1000") test_function("gf_histogram") test_error() success_msg("Keep up the good work!")
DataCamp: ch11-24

L_Ch11_RandomizingF_2

We can see from the histogram of 1,000 randomized Fs that there weren’t any in which samples were as high as the one we observed (27.98). Here is the tally illustrating the same idea.

In both sampling distributions of random re-shuffles (the one with 1,000 Fs and 10,000 Fs), we still did not get a single F larger than the one we observed. When we let the DGP run longer and got more shuffled samples (10,000 iterations), we did get a few extreme Fs (closer to 15), but these were still rare.

You can think of these extreme Fs in these sampling distributions as situations where just by random chance we happened to generate explanatory relationships between the variables. These situations are as if we kept throwing the die 24 times—generating more and more samples—all 24 rolls might come up 6! But that sample is also generated by a random process (just as all the others that “look” more random to us).

From our explorations of randomized Fs, there is very little chance that there is no relationship between Thumb and Height. We surely should reject the simple model in favor of the more complex one.

Using the F Distribution to Calculate the Probability

As you know by now, we don’t have to randomize to create the sampling distribution of F. We can run supernova() and get the probability of getting an F statistic greater than 27.98 based on the mathematical probability distribution.

supernova(Height.model)

We of course get the same F as we get using the fVal() function (it’s our sample F), and that hasn’t changed. The function supernova() only calculates p-value (p stands for probability) to four decimal places. The p-value in the ANOVA table (.0000) means that there is a less than 1 in 10,000 probability of getting an F this high if there is a 0 slope of the line predicting Thumb from Height. That’s pretty much what we found from our randomized sampling distributions of F.

L_Ch11_Using_1

In the DataCamp window below, try shuffling to produce a sampling distribution of PREs. How many are greater than our sample PRE?

#load packages require(ggformula) require(mosaic) require(supernova) require(Lock5Data) require(Lock5withR) require(okcupiddata) custom_seed(15) # create a randomized sampling distribution of PREs SDoPRE <- # how many of our randomized PREs are greater than our sample PRE samplePRE <- PRE(Thumb ~ Height, data = Fingers) tally() # create a randomized sampling distribution of PREs SDoPRE <- do(10000) * PRE(Thumb ~ shuffle(Height), data = Fingers) # how many of our randomized PREs are greater than our sample PRE samplePRE <- PRE(Thumb ~ Height, data = Fingers) tally(~PRE>samplePRE, data = SDoPRE) test_object("SDoPRE") test_object("samplePRE") test_function("tally") test_error() success_msg("Great thinking!")
DataCamp: ch11-25

L_Ch11_Using_2

When we shuffled Thumb length and Height, the resulting Fs and PREs are no where near as large as our sample F (27.98) and sample PRE (.15). This suggests that these sample statistics would be difficult to generate from the null model.

If the null model probably did not generate our data, then \(\beta_1\) is probably not 0. If we ran confint(Height.model), we would observe that the confidence interval for \(\beta_1\) would not include 0 and it would be centered around the best fitting estimate (\(b_1=.96\)). The 95% confidence interval is centered at the best fitting estimate because the interval extends from the highest \(\beta_1\) that this \(.96\) could have reasonably come from and the lowest \(\beta_1\) that our sample estimate could have come from.

It turns out that we did this just a chapter ago, so here is the R output for confint(Height.model).

Just as we predicted, the confidence interval (.60, 1.32) does not include 0 and is centered at .96!

If we know something about the results of the hypothesis test, we can predict something about the confidence interval around \(\beta_1\). For example, if we reject the null model because the p-value is greater than .05, then we know the 95% CI for \(\beta_1\) will not include 0. (Note: The confidence interval is always centered on the sample estimate.) If the null model could have generated our sample, the p-value from hypothesis testing would be greater than .05 and the 95% CI for \(\beta_1\) would include 0.

Knowing the relationship between hypothesis testing and confidence intervals, we could also look at confidence intervals and perfectly predict the result of hypothesis testing.

Responses