12.9 Using the Sampling Distribution of F
We look at the x-axis (which represents values of f) to figure out where the sample F would fall. The sample F of 17 would be pretty far away from the bulk of the Fs in the sampling distribution (which are mostly between 0 and 5).
Let’s calculate the p-value from the shuffled sampling distribution of F using tally(). It should result in a number similar to the p-value in the Model row of the ANOVA table.
require(coursekata)
# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# this calculates sample_f
sample_f <- f(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
# this generates a sampling distribution of fs
sdof <- do(1000) * f(shuffle(PriceK) ~ Neighborhood + HomeSizeK, data = Smallville)
# use tally to calculate the p-value from the sdof
# (format = "proportion" reports proportions instead of counts)
tally(~ f > sample_f, data = sdof, format = "proportion")
f > sample_f
TRUE FALSE
0 1
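To see what the tally() call is counting, the same logic can be sketched in base R without the coursekata helpers. This is only an illustrative sketch under stated assumptions: the data frame homes is simulated and is not the Smallville data, f_stat() is a hypothetical helper name, and base R's sample() stands in for shuffle().

```r
# Simulated stand-in data (an assumption; NOT the real Smallville data)
set.seed(42)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("Downtown", "Eastside"), each = n / 2)),
  HomeSizeK = runif(n, min = 0.8, max = 2.5)
)
homes$PriceK <- 180 - 60 * (homes$Neighborhood == "Eastside") +
  65 * homes$HomeSizeK + rnorm(n, sd = 25)

# Hypothetical helper: F for the two-predictor model versus the empty model
f_stat <- function(outcome, data) {
  fit <- lm(outcome ~ Neighborhood + HomeSizeK, data = data)
  unname(summary(fit)$fstatistic["value"])
}

# F from the unshuffled sample
sample_f <- f_stat(homes$PriceK, homes)

# Shuffle the outcome 1000 times to mimic a DGP in which the empty model is true
shuffled_fs <- replicate(1000, f_stat(sample(homes$PriceK), homes))

# Proportion of shuffled Fs greater than the sample F (the estimated p-value)
p_value <- mean(shuffled_fs > sample_f)
p_value
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is the same quantity that the TRUE column of the tally() output reports.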
Finding the p-value in the ANOVA Table
If we check our ANOVA table (printed below), the value we got from tally() (0) corresponds to the first row of the p column.
Analysis of Variance Table (Type III SS)
Model: PriceK ~ Neighborhood + HomeSizeK

                                    SS  df         MS       F     PRE      p
----------------------------------------------------------------------------
Model        (error reduced)  22254.020   2  11127.010  21.364  0.5957  .0000
Neighborhood                  16832.423   1  16832.423  32.319  0.5271  .0000
HomeSizeK                     10471.705   1  10471.705  20.106  0.4094  .0001
Error        (from model)     15103.892  29    520.824
----------------------------------------------------------------------------
Total        (empty model)    37357.912  31   1205.094
You might have noticed there are a few different p-values in this ANOVA table. How do we know to check the first row for the p-value? The Model row corresponds to the model comparison between the multivariate model and the empty model. The other two p-values (in the Neighborhood and HomeSizeK rows) represent comparisons between the model with and without that variable. We’ll delve into those model comparisons in the next pages.
From the Model p-value, we see that our sample F is very unlikely to be generated by the empty model of the DGP. The p-value is so small that we would say p < .001.
This small p-value suggests that the empty model is unlikely to generate an F as extreme as the one fit from our multivariate model. Thus we reject the empty model (in which both \(\beta_1\) for Neighborhood and \(\beta_2\) for HomeSizeK are equal to 0) in favor of the multivariate model (in which \(\beta_1\) and \(\beta_2\) are not both 0).
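As a rough check on the Model row, the F, PRE, and p values can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below copies those numbers by hand; the variable names are illustrative, and pf() uses the mathematical F distribution rather than the shuffled one, so it should agree with the table but only approximately with a simulation.

```r
# Values copied from the Model and Error rows of the ANOVA table
ss_model <- 22254.020
ss_error <- 15103.892
df_model <- 2
df_error <- 29

ms_model <- ss_model / df_model            # mean square for the model
ms_error <- ss_error / df_error            # mean square for error
f_ratio  <- ms_model / ms_error            # the table reports 21.364
pre      <- ss_model / (ss_model + ss_error)  # the table reports 0.5957

# Upper-tail probability of an F this large under the empty model
p_model <- pf(f_ratio, df_model, df_error, lower.tail = FALSE)

round(c(F = f_ratio, PRE = pre), 4)
p_model < .001
```

Because the p column is printed to four decimal places, a probability this small displays as .0000, which is why we report it as p < .001 rather than as exactly 0.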
Confidence Intervals with the Multivariate Model
With mere sample data and the power of simulations, we have been able to rule out the empty model as a model of the DGP. After concluding that \(\beta_1\) and \(\beta_2\) are not both equal to 0, we might wonder: what are they?
This is where confidence intervals, also called parameter estimation, come in. Just as before, we can use confint() to estimate the lowest and highest \(\beta\)s that could still reasonably produce our sample. Try it in the code block below.
require(coursekata)
# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# this saves the multivariate model
multi_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
# calculate confidence intervals for all parameters
confint(multi_model)
                          2.5 %    97.5 %
(Intercept)            87.95358 266.54808
NeighborhoodEastside -115.07739 -17.35826
HomeSizeK              27.15159 108.54819
We can interpret these confidence intervals in the same way we did before. We imagined many different plausible alternative values of \(\beta_1\) that could, with 95% likelihood, have produced the sample \(b_1\). We can also do the same for \(\beta_2\) and any other \(\beta\)s that might be in our complex model.
We have previously specified our multivariate model as \(Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i\). The confidence intervals tell us a range of plausible values for \(\beta_0\), \(\beta_1\), and \(\beta_2\).
Because these confidence intervals do not include 0, we are 95% confident that there is some effect of Neighborhood and HomeSizeK on PriceK in the DGP. The fact that 0 is not included in the intervals tells us that there is some contribution, and the confidence intervals themselves suggest a range for how big that contribution might be.
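For readers curious about where the confint() limits come from, the sketch below rebuilds them as estimate plus or minus a critical t value times the standard error, then checks which intervals exclude 0. It runs on simulated data (an assumption; homes is not the Smallville data), so the numbers will differ from the output above.

```r
# Simulated stand-in data (an assumption; NOT the real Smallville data)
set.seed(7)
n <- 32
homes <- data.frame(
  Neighborhood = factor(rep(c("Downtown", "Eastside"), each = n / 2)),
  HomeSizeK = runif(n, min = 0.8, max = 2.5)
)
homes$PriceK <- 180 - 60 * (homes$Neighborhood == "Eastside") +
  65 * homes$HomeSizeK + rnorm(n, sd = 25)

fit <- lm(PriceK ~ Neighborhood + HomeSizeK, data = homes)

est    <- coef(fit)                          # b0, b1, b2
se     <- sqrt(diag(vcov(fit)))              # standard error of each estimate
t_crit <- qt(0.975, df = df.residual(fit))   # critical t for a 95% interval

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci

# Should be TRUE: the manual limits match confint()
all.equal(unname(manual_ci), unname(confint(fit)))

# TRUE for each parameter whose 95% interval does not contain 0
excludes_zero <- manual_ci[, "lower"] > 0 | manual_ci[, "upper"] < 0
excludes_zero
```

An interval excludes 0 when both limits share the same sign, which is the condition the last two lines test for each parameter.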