Course Outline

segmentGetting Started (Don't Skip This Part)

segmentStatistics and Data Science: A Modeling Approach

segmentPART I: EXPLORING VARIATION

segmentChapter 1  Welcome to Statistics: A Modeling Approach

segmentChapter 2  Understanding Data

segmentChapter 3  Examining Distributions

segmentChapter 4  Explaining Variation

segmentPART II: MODELING VARIATION

segmentChapter 5  A Simple Model

segmentChapter 6  Quantifying Error

segmentChapter 7  Adding an Explanatory Variable to the Model

segmentChapter 8  Models with a Quantitative Explanatory Variable

segmentPART III: EVALUATING MODELS

segmentChapter 9  The Logic of Inference

segmentChapter 10  Model Comparison with F

segmentChapter 11  Parameter Estimation and Confidence Intervals

segmentPART IV: MULTIVARIATE MODELS

segmentChapter 12  Introduction to Multivariate Models

segmentChapter 13  Multivariate Model Comparisons

13.5 Using `shuffle()` for Targeted Model Comparisons (Part 1)

segmentFinishing Up (Don't Skip This Part!)

segmentResources
list College / Advanced Statistics and Data Science (ABCD)
13.5 Using shuffle()
for Targeted Model Comparisons (Part 1)
Using the shuffle()
function to generate sampling distributions for targeted model comparisons is a little more complicated than for the overall model. As an example, let’s see how we could use shuffle()
to generate a sampling distribution to help us interpret HomeSizeK
F in the multivariate model. Again, these are the two models we are comparing:
Model  GLM Notation 

Complex  \(PriceK_i = \beta_0 + \beta_1NeighborhoodEastside_i + \beta_2HomeSizeK_i + \epsilon_i\) 
Simple  \(PriceK_i= \beta_0 + \beta_1NeighborhoodEastside_{i} + \colorbox{yellow}{(0)}HomeSizeK_{i} + \epsilon_i\) 
Our question is: what is the probability (the pvalue) that the additional variability in PriceK
explained by HomeSizeK
over and above that explained by Neighborhood
could simply be due to random sampling variation? In other words, could the F for HomeSizeK
have resulted from the simple model of the DGP?
Let’s first recall how we used shuffle()
to mimic the DGP for the empty model:
shuffle(PriceK) ~ Neighborhood + HomeSizeK
This code effectively broke the relationship between both of the predictor variables and PriceK
. When we shuffle PriceK
we are shuffling not only the part related to home size, but also the part related to neighborhood. Is there a way to first take out the effect of Neighborhood
and look only at the parts of PriceK
that might be predicted by HomeSizeK
?
We actually have learned a way to isolate just the part of PriceK
that is related to Neighborhood
, and that is to fit the neighborhood model and save the leftover variation (i.e., residuals).
These residuals (region A+D in the Venn diagram above) represent just the part of PriceK
that might be uniquely predicted by HomeSizeK
. If we shuffle them, we are mimicking the simple model in which HomeSizeK
has no effect beyond that accounted for by Neighborhood
.
Model  R Code  GLM Notation 

Simple 
shuffle(residuals) ~ Neighborhood + HomeSizeK

\(PriceK_i= \beta_0 + \beta_1NeighborhoodEastside_{i} + \colorbox{yellow}{(0)}HomeSizeK_{i} + \epsilon_i\) 
The twostep method experts recommend for using shuffle()
to create a sampling distribution to test just the effect of HomeSizeK
is based on this idea. Step one is to fit the Neighborhood
model of PriceK
and save the residuals. Step two is to shuffle these residuals and use them as the outcome variable in a regression with both Neighborhood
and HomeSizeK
as predictors.
If we shuffle this new outcome variable (the residuals) we will be mimicking a DGP in which HomeSizeK
has no effect on the leftover variation, but still maintains the relationship between the two predictors (HomeSizeK
and Neighborhood
).
If we fit a regression model to the the shuffled residuals many times and record the F for HomeSizeK
for each iteration we will get a sampling distribution assuming a DGP in which HomeSizeK
has no effect over and above that already accounted for by Neighborhood
. The sample F we will use for comparison is the one on the HomeSizeK
row of the supernova table for the multivariate model.
Let’s go through this process one step at a time, and spend some time thinking about the results at each step of the way.
Step One: Save Residuals from the Neighborhood Model
In the code window below we’ve put in some code to fit the neighborhood model of price (PriceK ~ Neighborhood
). Write some code to save the residuals from the Neighborhood_model
as PriceK_N_resid
).
require(coursekata)
# delete when coursekatar updated
Smallville < read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood < factor(Smallville$Neighborhood)
Smallville$HasFireplace < factor(Smallville$HasFireplace)
# this fits and saves the neighborhood model
Neighborhood_model < lm(PriceK~ Neighborhood, data = Smallville)
# write code to save the residuals of the neighborhood model
PriceK_N_resid <
# this will print a few of these residuals
head(PriceK_N_resid)
Neighborhood_model < lm(PriceK~ Neighborhood, data = Smallville)
PriceK_N_resid < resid(Neighborhood_model)
head(PriceK_N_resid)
# temporary SCT
ex() %>% check_error()
Is the Effect of Home Size Still There in the Residuals?
Now that we have saved the residuals from the singlepredictor neighborhood model, we can fit the multivariate model using residuals from the neighborhood model as the outcome variable (PriceK_N_resid = Neighborhood + HomeSizeK
) to make sure the effect of HomeSizeK
is unchanged.
require(coursekata)
# delete when coursekatar updated
Smallville < read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood < factor(Smallville$Neighborhood)
Smallville$HasFireplace < factor(Smallville$HasFireplace)
# this fits and saves the neighborhood model
Neighborhood_model < lm(PriceK~ Neighborhood, data = Smallville)
# this saves the residuals of the neighborhood model
PriceK_N_resid < resid(Neighborhood_model)
# write code to fit the multivariate model again, this time
# using residuals from the neighborhood model as the outcome variable
PriceK_N_resid_multi_model < lm()
# this will print the ANOVA table for this new model
supernova(PriceK_N_resid_multi_model)
PriceK_N_resid < resid(Neighborhood_model)
PriceK_N_resid_multi_model < lm(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville)
supernova(PriceK_N_resid_multi_model)
# temporary SCT
ex() %>% check_error()
We have printed out two ANOVA tables below. The first is the one we just got by fitting the multivariate model to the residuals of the neighborhood model (PriceK_N_resids
), the second, the original multivariate model with PriceK
as the outcome variable.
It is useful to compare these two tables and try to understand their similarities and differences. Notice that the model of residuals has a much smaller SS Total than the original model of PriceK
.
Analysis of Variance Table (Type III SS)
Model: PriceK_N_resid ~ Neighborhood + HomeSizeK
SS df MS F PRE p
        
Model (error reduced)  42003.677 2 21001.838 5.813 0.2862 .0075
Neighborhood  8237.712 1 8237.712 2.280 0.0729 .1419
HomeSizeK  42003.677 1 42003.677 11.626 0.2862 .0019
Error (from model)  104774.465 29 3612.913
        
Total (empty model)  146778.142 31 4734.779
Analysis of Variance Table (Type III SS)
Model: PriceK ~ Neighborhood + HomeSizeK
SS df MS F PRE p
        
Model (error reduced)  124403.028 2 62201.514 17.216 0.5428 .0000
Neighborhood  27758.259 1 27758.259 7.683 0.2094 .0096
HomeSizeK  42003.677 1 42003.677 11.626 0.2862 .0019
Error (from model)  104774.465 29 3612.913
        
Total (empty model)  229177.493 31 7392.822
The important thing to notice about these two ANOVA tables is that the rows for HomeSizeK
and SS Error are identical. We now have a model that perfectly calculates the size of the HomeSizeK
effect while controlling for neighborhood. The effect of home size (SS, F, PRE) is identical in the two tables. Notice that the F for the unique effect of HomeSizeK
is 11.626.
Now, because the neighborhood effect has already been taken out, we can shuffle the residuals to break only the unique connection of home size to the outcome. By doing this repeatedly, we can create a sampling distribution of F for HomeSizeK
that assumes no effect of home size, yet still controls for neighborhood.