13.11 Error and Inference from Models with Multiple Quantitative Predictors
Unpacking the ANOVA Table for FEV ~ HEIGHT + AGE
As with all statistical models, this one produces a predicted value on the outcome variable for every data point. By subtracting each predicted value from the actual value in the data we get residuals, and from there we get sums of squares, PRE, and F. Everything works the same way here as with previous models.
Add some code to the window below to generate the ANOVA table for the FEV ~ HEIGHT + AGE model.
require(coursekata)

# delete when the coursekata package is updated
fevdata <- read.table('http://jse.amstat.org/datasets/fev.dat.txt')
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")
fevdata <- data.frame(fevdata)

# save the multivariate model
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)

# write code to produce the ANOVA table
supernova(multi_model)
Analysis of Variance Table (Type III SS)
Model: FEV ~ HEIGHT + AGE

                              SS  df      MS        F    PRE     p
 --------------------- | ------- --- ------- -------- ------ -----
 Model (error reduced) | 376.245   2 188.122 1067.956 0.7664 .0000
 HEIGHT                |  95.326   1  95.326  541.157 0.4539 .0000
 AGE                   |   6.259   1   6.259   35.532 0.0518 .0000
 Error (from model)    | 114.675 651   0.176
 --------------------- | ------- --- ------- -------- ------ -----
 Total (empty model)   | 490.920 653   0.752
There are many things you could have observed. The PRE for the whole model is .77 (rounded), so this model explains a lot of error. Height uniquely reduces error more than age does. And the Fs are huge in every row (Fs larger than 4 are generally worth talking about, and these are far bigger than that): for the degrees of freedom we spent, we have reduced a lot of error.
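It can help to see where PRE comes from by computing it directly from the sums of squares. Here is a minimal sketch; it uses R's built-in mtcars data (not the fev data, which requires a download), but the same steps apply to the FEV ~ HEIGHT + AGE model.

```r
# Computing PRE by hand: compare the model's error to the empty model's error.
# mtcars is a stand-in here; the fev data would work identically.
model <- lm(mpg ~ wt + hp, data = mtcars)
empty <- lm(mpg ~ 1, data = mtcars)          # the empty model

ss_error <- sum(resid(model)^2)              # SS Error (from model)
ss_total <- sum(resid(empty)^2)              # SS Total (empty model)
pre <- (ss_total - ss_error) / ss_total      # proportional reduction in error
```

For the whole-model row, this PRE is the same quantity that summary(model) reports as R-squared.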
Comparing Models of the DGP
We’ve been able to explain a lot of the variation in the data with this model. But is this a good model of the DGP? We need to engage in some model comparison to decide which model we will select as our best model of the DGP.
Just because the p-values are below our .05 cutoff for rejecting the simpler models, however, doesn't necessarily mean we should adopt the multivariate model as our preferred model of the DGP. In this case, it's also smart to look at the single-predictor models for HEIGHT and AGE, especially since there is apparently a lot of overlap between these predictors.
Below we have put the ANOVA tables for three models: the multivariate model, the height model, and the age model.
Model: FEV ~ HEIGHT + AGE

                              SS  df      MS        F    PRE     p
 --------------------- | ------- --- ------- -------- ------ -----
 Model (error reduced) | 376.245   2 188.122 1067.956 0.7664 .0000
 HEIGHT                |  95.326   1  95.326  541.157 0.4539 .0000
 AGE                   |   6.259   1   6.259   35.532 0.0518 .0000
 Error (from model)    | 114.675 651   0.176
 --------------------- | ------- --- ------- -------- ------ -----
 Total (empty model)   | 490.920 653   0.752

Model: FEV ~ HEIGHT

                              SS  df      MS        F    PRE     p
 --------------------- | ------- --- ------- -------- ------ -----
 Model (error reduced) | 369.986   1 369.986 1994.731 0.7537 .0000
 Error (from model)    | 120.934 652   0.185
 --------------------- | ------- --- ------- -------- ------ -----
 Total (empty model)   | 490.920 653   0.752

Model: FEV ~ AGE

                              SS  df      MS        F    PRE     p
 --------------------- | ------- --- ------- -------- ------ -----
 Model (error reduced) | 280.919   1 280.919  872.184 0.5722 .0000
 Error (from model)    | 210.001 652   0.322
 --------------------- | ------- --- ------- -------- ------ -----
 Total (empty model)   | 490.920 653   0.752
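Here is one way such a comparison could be run in R. As a stand-in (the fev data requires a download), this sketch uses the built-in mtcars data, where the predictors wt and hp overlap with each other much as HEIGHT and AGE do.

```r
# Fit the multivariate model and the two single-predictor models
multi   <- lm(mpg ~ wt + hp, data = mtcars)
wt_only <- lm(mpg ~ wt, data = mtcars)
hp_only <- lm(mpg ~ hp, data = mtcars)

# Quantify the overlap between the two predictors
cor(mtcars$wt, mtcars$hp)

# Base R's anova() compares two nested models with an F test, much like
# the rows of a supernova table; supernova(multi) would also work here
anova(wt_only, multi)
```

A small p-value in the anova() comparison says that adding the second predictor reduces error by more than we'd expect from spending one degree of freedom on a useless predictor.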
Same Model, Different Names
We have now learned how to fit models with quantitative outcome variables and various types and numbers of predictor variables (categorical, quantitative, or both). As we have seen, all of these models can be understood through the common framework of the General Linear Model.
Out in the world, however, people will often use specialized terms to refer to models with different numbers and types of variables. Here is a table with some of the examples we have looked at and the special names people give to those models.
 Example                                               | Description                                                          | Common Name
 ----------------------------------------------------- | -------------------------------------------------------------------- | ------------------------------------
 PriceK ~ Neighborhood (with 2 possible neighborhoods) | a model with a single two-group predictor variable                   | t-test
 PriceK ~ Neighborhood (3+ possible neighborhoods)     | a model with a single more-than-two-group predictor variable         | one-way ANOVA (Analysis of Variance)
 PriceK ~ HomeSizeK                                    | a model with a single quantitative predictor                         | simple regression
 PriceK ~ Neighborhood + HomeSizeK                     | a model with at least one categorical and one quantitative variable  | ANCOVA (Analysis of Covariance)
 tip_percent ~ condition + gender                      | a model with two categorical variables                               | two-way ANOVA
 FEV ~ HEIGHT + AGE                                    | a model with multiple quantitative variables                         | multiple regression
It’s good for you to become familiar with some of these names. However, the understanding that you have is much more powerful: you see that all of these are variations of one super useful idea – the General Linear Model. The reason these different names arose in the first place is that each technique was historically developed to solve a specific problem in statistics and data analysis. Only later did people discover how they were connected.
Although some people prefer the specialized names, even experts have a hard time keeping all these names straight. There are well-known “cheatsheets” (such as this one called Common Statistical Tests Are Linear Models) that help people remember what all these different models can be called. But you know the truth: they are all just variants of the General Linear Model.
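You can verify one case of this equivalence yourself. The sketch below runs a two-group comparison both as a classic t-test and as a linear model; the data are simulated (not from the text), using made-up neighborhood prices.

```r
# A two-group "t-test" is the same model as lm with one two-group predictor.
set.seed(1)
d <- data.frame(
  Neighborhood = rep(c("A", "B"), each = 20),
  PriceK       = c(rnorm(20, mean = 300, sd = 50),
                   rnorm(20, mean = 350, sd = 50))
)

# p-value from the classic equal-variance t-test
t_p  <- t.test(PriceK ~ Neighborhood, data = d, var.equal = TRUE)$p.value

# p-value for the Neighborhood coefficient in the linear model
lm_p <- summary(lm(PriceK ~ Neighborhood, data = d))$coefficients[2, 4]

# t_p and lm_p are the same number: the "t-test" is just this lm
```

The same kind of check works for the other rows of the table: one-way ANOVA, simple regression, and the rest all reduce to lm with different predictors.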