## 13.6 Using shuffle() for Targeted Model Comparisons (Part 2)

Having saved the residuals from the Neighborhood model, let’s now see how we can use them, along with the shuffle() function, to create a sampling distribution for the unique effect of HomeSizeK.

### Step Two: Create the Sampling Distribution of F for Home Size

A sampling distribution of Fs gives us a way to calculate how likely it would be for the simple model of the DGP (i.e., the one with no unique effect of HomeSizeK) to generate an F for HomeSizeK as large as or larger than the one found in the data (11.626).

Before we create the sampling distribution of F for the HomeSizeK effect, we will show you how to get the sample F for HomeSizeK. Our previous method, using the f() function, won’t work; it only gives us the overall F for the full model. To get the F for HomeSizeK you can run this code:

    f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

The first part of this code creates a supernova table for the multivariate model using PriceK_N_resids as the outcome. The predictor = ~HomeSizeK argument then reads the sample F for HomeSizeK out of the table (without ever printing the table). We’ve put this code in the window below, so you have this F available.
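To see where an F like this comes from, here is a minimal base R sketch. It uses a small made-up data frame called homes (hypothetical, not the Smallville data), and it assumes that the F for a single predictor is the one that compares the full model with the model that leaves that predictor out, which is what base R's drop1() computes. It is an illustration of the idea, not a description of how f() is implemented.

```r
# Hypothetical stand-in data; NOT the Smallville data set from the text
set.seed(1)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = round(runif(n, 0.8, 3.0), 2)
)
homes$PriceK <- 150 + 60 * (homes$Neighborhood == "B") +
  40 * homes$HomeSizeK + rnorm(n, sd = 30)

# Save the residuals from the neighborhood-only model
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Fit the multivariate model with the residuals as the outcome
full_model <- lm(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = homes)

# drop1() compares the full model with the model lacking each term in turn
f_table <- drop1(full_model, test = "F")

# Read the F for HomeSizeK out of the table
home_size_f <- f_table["HomeSizeK", "F value"]
home_size_f
```

Printing f_table instead of home_size_f shows the whole comparison table, including the row for Neighborhood.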

In the window below, modify the code where indicated, using the shuffle() function, to produce a single F for HomeSizeK that assumes a DGP with 0 effect of home size. Run the code a few times just to see what it does.

    require(coursekata)

    # delete when coursekata-r updated
    Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
    Smallville$Neighborhood <- factor(Smallville$Neighborhood)
    Smallville$HasFireplace <- factor(Smallville$HasFireplace)

    # don't delete this part
    # code to fit neighborhood model and save residuals
    Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
    Smallville$PriceK_N_resids <- resid(Neighborhood_model)

    # this code prints sample F for HomeSizeK
    f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

    # modify the code below to produce the F when residuals are shuffled
    f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

Solution:

    f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
    f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

Now let’s add some code to create a sampling distribution of 1000 Fs for HomeSizeK assuming no effect of home size in the DGP. Save these Fs into a data frame called HomeSizeK_sdof.

    require(coursekata)

    # delete when coursekata-r updated
    Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
    Smallville$Neighborhood <- factor(Smallville$Neighborhood)
    Smallville$HasFireplace <- factor(Smallville$HasFireplace)

    # don't delete
    # code to fit neighborhood model and save residuals
    Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
    Smallville$PriceK_N_resids <- resid(Neighborhood_model)

    # This code generates one shuffled HomeSizeK F
    # Modify it to make a sampling distribution of 1000 shuffled Fs
    # Save them in a data frame called HomeSizeK_sdof
    f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

    # This code will put these Fs into a histogram
    gf_histogram(~ f, data = HomeSizeK_sdof) %>%
      gf_labs(title = "shuffled HomeSizeK Fs")

Solution:

    HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
    gf_histogram(~ f, data = HomeSizeK_sdof) %>%
      gf_labs(title = "shuffled HomeSizeK Fs")
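The same shuffling idea can be sketched in base R, which may help show what is happening on each of the 1000 rounds. In this sketch, sample() stands in for shuffle() and replicate() stands in for do(). The data frame homes and the helper shuffled_f() are hypothetical names made up for this illustration, and the F is again computed with drop1() under the assumption that it matches the F for a single predictor.

```r
# Hypothetical stand-in data; NOT the Smallville data set from the text
set.seed(2)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = round(runif(n, 0.8, 3.0), 2)
)
homes$PriceK <- 150 + 60 * (homes$Neighborhood == "B") +
  40 * homes$HomeSizeK + rnorm(n, sd = 30)
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Hypothetical helper: shuffle the residuals once, refit, return the F for HomeSizeK
shuffled_f <- function(data) {
  # sample() with no other arguments returns a random reordering
  data$shuffled_resids <- sample(data$PriceK_N_resids)
  fit <- lm(shuffled_resids ~ Neighborhood + HomeSizeK, data = data)
  drop1(fit, test = "F")["HomeSizeK", "F value"]
}

# Repeat the shuffle 1000 times and save the Fs in a data frame
sdof <- data.frame(f = replicate(1000, shuffled_f(homes)))

# Most shuffled Fs should be small, because shuffling breaks any
# relationship between the residuals and home size
summary(sdof$f)
```

Because each shuffle breaks the link between the residuals and HomeSizeK, the Fs in sdof show what the F could look like when there is no effect of home size in the DGP.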

We have graphed the sampling distribution of 1000 shuffled Fs for the HomeSizeK effect. We have also added to the plot, as a black dot, the sample F for the HomeSizeK row of the ANOVA table (11.63). We’ll save this value as HomeSizeK_f. As you can see, the sample F is far out in the tail of the sampling distribution.

To calculate the exact p-value for the HomeSizeK F, we can use tally():

    tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")

Try copying and pasting the appropriate code into the code block below. Also generate an ANOVA table to check whether the p-value obtained from tally() is similar to the p-value for HomeSizeK in the ANOVA table.

    require(coursekata)

    # delete when coursekata-r updated
    Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
    Smallville$Neighborhood <- factor(Smallville$Neighborhood)
    Smallville$HasFireplace <- factor(Smallville$HasFireplace)

    # don't delete
    # code to fit neighborhood model and save residuals
    Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
    Smallville$PriceK_N_resids <- resid(Neighborhood_model)

    # This saves the sample HomeSizeK F
    HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

    # This code generates a sampling distribution of shuffled HomeSizeK Fs
    HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

    # Paste in the code for tallying the p-value for HomeSizeK

    # Modify the code below to generate an ANOVA table from the multivariate model
    lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)

Solution:

    HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
    HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
    tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")
    supernova(lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville))

The p-value we got from tally() is close to the p-value reported on the HomeSizeK row of the multivariate ANOVA table: 0.0019.
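The comparison between the shuffled p-value and the table p-value can also be sketched in base R. The data frame homes and the helper home_size_row() are hypothetical names made up for this illustration (this is not the Smallville data, so the numbers will differ from those above). Here mean() of a TRUE/FALSE comparison plays the role of tally() with format = "proportion", and drop1() is assumed to give the same F and p-value as the HomeSizeK row of the ANOVA table.

```r
# Hypothetical stand-in data; NOT the Smallville data set from the text
set.seed(3)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = round(runif(n, 0.8, 3.0), 2)
)
homes$PriceK <- 150 + 60 * (homes$Neighborhood == "B") +
  40 * homes$HomeSizeK + rnorm(n, sd = 30)
homes$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = homes))

# Hypothetical helper: fit the multivariate model to a given outcome
# and return the HomeSizeK row (F and p-value) of the drop1() table
home_size_row <- function(data, outcome) {
  data$y <- outcome
  fit <- lm(y ~ Neighborhood + HomeSizeK, data = data)
  drop1(fit, test = "F")["HomeSizeK", c("F value", "Pr(>F)")]
}

# Sample F, computed from the residuals
sample_f <- home_size_row(homes, homes$PriceK_N_resids)[["F value"]]

# Sampling distribution of 1000 shuffled Fs
shuffled_fs <- replicate(
  1000,
  home_size_row(homes, sample(homes$PriceK_N_resids))[["F value"]]
)

# Proportion of shuffled Fs larger than the sample F
p_shuffle <- mean(shuffled_fs > sample_f)

# p-value for HomeSizeK from the multivariate model of PriceK itself
p_table <- home_size_row(homes, homes$PriceK)[["Pr(>F)"]]

c(shuffle = p_shuffle, table = p_table)
```

In this sketch the F for HomeSizeK comes out the same whether the outcome is PriceK or the saved residuals, because the residuals differ from PriceK only by the neighborhood means, which the Neighborhood term in the model already accounts for.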