13.6 Using `shuffle()` for Targeted Model Comparisons (Part 2)

Having saved the residuals from the Neighborhood model, let’s now see how we can use them, along with the shuffle() function, to create a sampling distribution for the unique effect of HomeSizeK.
Step Two: Create the Sampling Distribution of F for Home Size
A sampling distribution of Fs gives us a way to calculate how likely it would be for the simple model of the DGP (i.e., the one with no unique effect of HomeSizeK) to generate an F for HomeSizeK as large as or larger than the one found in the data (11.626).
Before we create the sampling distribution of F for the HomeSizeK effect, we will show you how to get the sample F for HomeSizeK. Our previous method, calling the f() function with only a model and a data frame, won’t work; it only gives us the overall F for the full model. To get the F for HomeSizeK, you can run this code:
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
The first part of this code creates a supernova table for the multivariate model using PriceK_N_resids as the outcome. The predictor = ~HomeSizeK argument then reads the sample F for HomeSizeK out of the table (without ever printing the table). We’ve put this code in the window below, so you have this F available.
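To make it concrete what an F for a single predictor measures, here is a minimal base R sketch. It is not the course's method: it assumes a small simulated data frame called homes (hypothetical values standing in for Smallville) and uses anova() on two nested models as a stand-in for f() with the predictor argument. For a model without interactions, the F from that nested comparison should match the F on the predictor's row of the ANOVA table.

```r
# Simulated stand-in for the Smallville data (hypothetical values, not the real data)
set.seed(1)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 0.8, max = 3)
)
homes$PriceK <- 200 + 80 * (homes$Neighborhood == "A") +
  25 * homes$HomeSizeK + rnorm(n, sd = 40)

# Save the residuals from the Neighborhood model
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Fit the residuals with and without HomeSizeK
reduced <- lm(PriceK_N_resids ~ Neighborhood, data = homes)
full <- lm(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = homes)

# Row 2 of the comparison table holds the F for adding HomeSizeK
sample_f <- anova(reduced, full)$F[2]
print(sample_f)
```

Because HomeSizeK uses one degree of freedom, this F is also the square of the t statistic for HomeSizeK in the full model.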
In the window below, modify the code where indicated, using the shuffle() function, to produce a single F for HomeSizeK that assumes a DGP with 0 effect of home size. Run the code a few times just to see what it does.
require(coursekata)
# load the Smallville data and convert the categorical variables to factors
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# this code prints sample F for HomeSizeK
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# modify the code below to produce the F when residuals are shuffled
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# solution: wrap the outcome in shuffle()
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
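For readers who want to see what a single shuffle does without the coursekata functions, the following is a rough base R sketch. Everything in it is an assumption for illustration: the data frame homes is simulated, home_size_f() is a hypothetical helper, and base R's sample() stands in for shuffle(). Shuffling keeps the same residual values but assigns them to homes at random, which breaks any real link between the residuals and HomeSizeK.

```r
# Simulated stand-in for the Smallville data (hypothetical values, not the real data)
set.seed(2)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 0.8, max = 3)
)
homes$PriceK <- 200 + 80 * (homes$Neighborhood == "A") +
  25 * homes$HomeSizeK + rnorm(n, sd = 40)
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Hypothetical helper: F for HomeSizeK, given any outcome vector
home_size_f <- function(outcome) {
  reduced <- lm(outcome ~ Neighborhood, data = homes)
  full <- lm(outcome ~ Neighborhood + HomeSizeK, data = homes)
  anova(reduced, full)$F[2]
}

# F for HomeSizeK from the residuals in their original order
sample_f <- home_size_f(homes$PriceK_N_resids)

# sample() with no other arguments returns the same values in random order
shuffled_resids <- sample(homes$PriceK_N_resids)
shuffled_f <- home_size_f(shuffled_resids)

print(c(sample = sample_f, shuffled = shuffled_f))
```

Rerunning the last three statements gives a different shuffled F each time, which is the behavior the exercise above asks you to observe.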
Now let’s add some code to create a sampling distribution of 1000 Fs for HomeSizeK, assuming no effect of home size in the DGP. Save these Fs into a data frame called HomeSizeK_sdof.
require(coursekata)
# load the Smallville data and convert the categorical variables to factors
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This code generates one shuffled HomeSizeK F
# Modify it to make a sampling distribution of 1000 shuffled Fs
# Save them in a data frame called HomeSizeK_sdof
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# solution: repeat the shuffled F 1000 times with do() and save the results
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code will put these Fs into a histogram
gf_histogram(~ f, data = HomeSizeK_sdof) %>%
  gf_labs(title = "shuffled HomeSizeK Fs")
Below we have graphed the sampling distribution of 1000 shuffled Fs for the HomeSizeK effect. We have also added to the plot, as a black dot, the sample F for the HomeSizeK row of the ANOVA table (11.63). We’ll save this value as HomeSizeK_f. As you can see, the sample F is far out in the tail of the sampling distribution.
To calculate the exact p-value for the HomeSizeK F, we can use tally().
Try copying and pasting the appropriate code into the code block below. Also generate an ANOVA table to check whether the p-value obtained from tally() is similar to the p-value for HomeSizeK in the ANOVA table.
require(coursekata)
# load the Smallville data and convert the categorical variables to factors
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This saves the sample HomeSizeK F
HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code generates a sampling distribution of shuffled HomeSizeK Fs
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# Paste in the code for tallying the p-value for HomeSizeK
# solution: proportion of shuffled Fs larger than the sample F
tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")
# Modify the code below to generate an ANOVA table from the multivariate model
# solution: wrap the lm() call in supernova()
supernova(lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville))
The p-value we got from tally() is close to the p-value reported on the HomeSizeK row of the multivariate ANOVA table: 0.0019.
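The comparison between the shuffled p-value and the ANOVA table p-value can also be sketched in base R. This is an illustration under stated assumptions, not the course's code: homes is a simulated data frame, home_size_f() is a hypothetical helper, replicate() with sample() stands in for do() with shuffle(), and mean() of a logical vector stands in for tally() with format = "proportion".

```r
# Simulated stand-in for the Smallville data (hypothetical values, not the real data)
set.seed(3)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 0.8, max = 3)
)
homes$PriceK <- 200 + 80 * (homes$Neighborhood == "A") +
  25 * homes$HomeSizeK + rnorm(n, sd = 40)
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Hypothetical helper: F for HomeSizeK, given any outcome vector
home_size_f <- function(outcome) {
  reduced <- lm(outcome ~ Neighborhood, data = homes)
  full <- lm(outcome ~ Neighborhood + HomeSizeK, data = homes)
  anova(reduced, full)$F[2]
}

# Sample F, then 1000 Fs from shuffled residuals
sample_f <- home_size_f(homes$PriceK_N_resids)
shuffled_fs <- replicate(1000, home_size_f(sample(homes$PriceK_N_resids)))

# Proportion of shuffled Fs larger than the sample F
p_shuffle <- mean(shuffled_fs > sample_f)

# p-value for HomeSizeK from the F distribution, using PriceK as the outcome
p_table <- anova(
  lm(PriceK ~ Neighborhood, data = homes),
  lm(PriceK ~ Neighborhood + HomeSizeK, data = homes)
)$`Pr(>F)`[2]

print(c(shuffle = p_shuffle, table = p_table))
```

The two p-values should be close but not identical, since the shuffled one depends on which 1000 shuffles happened to be drawn.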