13.11 Error and Inference from Models with Multiple Quantitative Predictors
Unpacking the ANOVA Table for FEV ~ HEIGHT + AGE
As with all statistical models, this one produces a predicted value on the outcome variable for every data point. By subtracting each predicted value from the actual value in the data we get residuals, and from there we get sums of squares, PRE, and F. Everything works the same way here as with previous models.
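To make that chain concrete, here is a sketch of computing those quantities by hand in R. It loads the FEV data the same way as the code window below; the names `empty_model`, `residuals_multi`, `SS_error`, `SS_total`, `SS_model`, `PRE`, and `F_stat` are our own labels for illustration.

```r
# load the FEV data (same as the code window in this section)
fevdata <- read.table("http://jse.amstat.org/datasets/fev.dat.txt")
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")

# the multivariate model and the empty model
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)
empty_model <- lm(FEV ~ NULL, data = fevdata)

# residuals: actual value minus predicted value
residuals_multi <- fevdata$FEV - predict(multi_model)

# sums of squares
SS_error <- sum(residuals_multi^2)     # Error (from model)
SS_total <- sum(resid(empty_model)^2)  # Total (empty model)
SS_model <- SS_total - SS_error        # Model (error reduced)

# PRE: proportion of total error reduced by the model
PRE <- SS_model / SS_total

# F: mean square for the model over mean square error
# (the model spends 2 df, leaving n - 3 df for error)
n <- nrow(fevdata)
F_stat <- (SS_model / 2) / (SS_error / (n - 3))
```

These hand-computed values should match the ones `supernova()` prints in the ANOVA table.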
Add some code to the window below to generate the ANOVA table for the FEV ~ HEIGHT + AGE model.
require(coursekata)

# load the FEV data (delete when coursekata-r updated)
fevdata <- read.table("http://jse.amstat.org/datasets/fev.dat.txt")
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")

# save the multivariate model
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)

# produce the ANOVA table
supernova(multi_model)
Analysis of Variance Table (Type III SS)
Model: FEV ~ HEIGHT + AGE
SS df MS F PRE p
------ --------------- | ------- --- ------- -------- ------ -----
Model (error reduced) | 376.245 2 188.122 1067.956 0.7664 .0000
HEIGHT | 95.326 1 95.326 541.157 0.4539 .0000
AGE | 6.259 1 6.259 35.532 0.0518 .0000
Error (from model) | 114.675 651 0.176
------ --------------- | ------- --- ------- -------- ------ -----
Total (empty model) | 490.920 653 0.752
There are many things you could have observed. We notice that the PRE for the whole model is .77 (rounded), so this model explains a lot of error. We also notice that HEIGHT uniquely reduces more error than AGE does, and that the Fs are huge for every row (Fs larger than 4 are generally worth talking about, and these are far bigger than that): for the degrees of freedom we spent, we have reduced a lot of error.
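You can check the PRE and F in the top row directly from the other numbers in the table:

```r
# PRE for the whole model: SS Model divided by SS Total
376.245 / 490.920                  # about .77

# F for the whole model: MS Model divided by MS Error
(376.245 / 2) / (114.675 / 651)    # about 1068
```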
Comparing Models of the DGP
We’ve been able to explain a lot of the variation in the data with this model. But is this a good model of the DGP? We need to engage in some model comparison to decide which model we will select as our best model of the DGP.
Just because the p-values are below our .05 cutoff for rejecting the simpler models, however, doesn't necessarily mean we should adopt the multivariate model as our preferred model of the DGP. In this case, it's also smart to look at the single-predictor models for HEIGHT and AGE, especially since there is apparently a lot of overlap between these predictors.
Below we have put the ANOVA tables for three models: the multivariate model, the height model, and the age model.
Model: FEV ~ HEIGHT + AGE
SS df MS F PRE p
------ --------------- | ------- --- ------- -------- ------ -----
Model (error reduced) | 376.245 2 188.122 1067.956 0.7664 .0000
HEIGHT | 95.326 1 95.326 541.157 0.4539 .0000
AGE | 6.259 1 6.259 35.532 0.0518 .0000
Error (from model) | 114.675 651 0.176
------ --------------- | ------- --- ------- -------- ------ -----
Total (empty model) | 490.920 653 0.752
Model: FEV ~ HEIGHT
SS df MS F PRE p
----- --------------- | ------- --- ------- -------- ------ -----
Model (error reduced) | 369.986 1 369.986 1994.731 0.7537 .0000
Error (from model) | 120.934 652 0.185
----- --------------- | ------- --- ------- -------- ------ -----
Total (empty model) | 490.920 653 0.752
Model: FEV ~ AGE
SS df MS F PRE p
----- --------------- | ------- --- ------- ------- ------ -----
Model (error reduced) | 280.919 1 280.919 872.184 0.5722 .0000
Error (from model) | 210.001 652 0.322
----- --------------- | ------- --- ------- ------- ------ -----
Total (empty model) | 490.920 653 0.752
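One way to make these model comparisons concrete in R is with base R's `anova()`, which compares two nested models with an F test (a counterpart to `supernova()`). This sketch assumes the FEV data is loaded as in the earlier code window; the names `height_model` and `age_model` are our own labels.

```r
# load the FEV data (same as earlier in this section)
fevdata <- read.table("http://jse.amstat.org/datasets/fev.dat.txt")
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")

# the three models under comparison
height_model <- lm(FEV ~ HEIGHT, data = fevdata)
age_model    <- lm(FEV ~ AGE, data = fevdata)
multi_model  <- lm(FEV ~ HEIGHT + AGE, data = fevdata)

# does adding AGE to the height model reduce enough error?
anova(height_model, multi_model)

# does adding HEIGHT to the age model reduce enough error?
anova(age_model, multi_model)
```

The F for each comparison matches the F in the corresponding predictor's row of the multivariate ANOVA table above, because the Type III sum of squares for a predictor is exactly the error reduced by adding that predictor last.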
Same Model, Different Names
We have now learned how to fit models with quantitative outcome variables and various types and numbers of predictor variables (categorical, quantitative, or both). As we have seen, all of these models can be understood through the common framework of the General Linear Model.
Out in the world, however, people will often use specialized terms to refer to models with different numbers and types of variables. Here is a table with some of the examples we have looked at and the special names people give to those models.
| Example | Description | Common Name |
|---|---|---|
| PriceK ~ Neighborhood (with 2 possible neighborhoods) | a model with a single two-group predictor variable | t-test |
| PriceK ~ Neighborhood (3+ possible neighborhoods) | a model with a single more-than-two-group predictor variable | one-way ANOVA (Analysis of Variance) |
| PriceK ~ HomeSizeK | a model with a single quantitative predictor | simple regression |
| PriceK ~ Neighborhood + HomeSizeK | a model with at least one categorical and one quantitative predictor | ANCOVA (Analysis of Covariance) |
| tip_percent ~ condition + gender | a model with two categorical predictors | two-way ANOVA |
| FEV ~ HEIGHT + AGE | a model with multiple quantitative predictors | multiple regression |
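To see why these specialized names all refer to the same underlying model, here is a quick sketch showing that a two-group t-test and the corresponding `lm()` fit give the same result. The data are simulated (hypothetical, for illustration only); `var.equal = TRUE` makes the t-test use the same pooled-variance assumption as the GLM.

```r
# simulated two-group data (hypothetical, for illustration only)
set.seed(10)
group   <- rep(c("A", "B"), each = 25)
outcome <- c(rnorm(25, mean = 10), rnorm(25, mean = 12))

# the classic two-sample t-test (equal variances, to match the GLM)
t_result <- t.test(outcome ~ group, var.equal = TRUE)

# the same comparison fit as a linear model
lm_result <- summary(lm(outcome ~ group))

# the t for the group coefficient matches the t-test's t (up to sign),
# and the p-values are identical
t_result$p.value
lm_result$coefficients["groupB", "Pr(>|t|)"]
```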
It’s good for you to become familiar with some of these names. However, the understanding that you have is much more powerful: you see that all of these are variations of one super useful idea – the General Linear Model. These different names arose in the first place because each technique was historically developed to solve a specific problem in statistics and data analysis. Only later did people discover how they were connected.
Although some people prefer the specialized names, even experts have a hard time keeping all these names straight. There are well-known “cheatsheets” (such as this one called Common Statistical Tests Are Linear Models) that help people remember what all these different models can be called. But you know the truth: they are all just variants of the general linear model.