7.5 Quantifying Model Fit with Sums of Squares
In the empty model, you will recall, we used the mean as the model, that is, as the predicted score for every observation. We developed the intuition that the mean was a better-fitting model (that there was less error around the model) when the spread of the distribution was small than when it was large.
Calculating Sums of Squares: Empty Model (Review)
In the previous chapter, we quantified error using the sum of the squared deviations around the mean (SS, or Sum of Squares), a measure that is minimized precisely at the mean. Under the empty model, all of the variation is unexplained; that’s why it is called “empty.” But it does show us clearly how much variation there is left to explain, measured in sums of squares.
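Before reaching for anova(), it can help to verify the empty-model SS by hand. The sketch below uses only base R and the six TinyFingers thumb lengths; the variable names (empty.resid, SS.empty) are illustrative, not part of the exercise setup.

```r
# Empty model: the mean (62) is the prediction for every observation
Thumb <- c(56, 60, 61, 63, 64, 68)
empty.resid <- Thumb - mean(Thumb)   # deviations from the mean
SS.empty <- sum(empty.resid^2)       # square the deviations, then sum
SS.empty  # 82
```

This hand computation should match the residual SS that anova() reports for the empty model below.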
Remind yourself how to use the anova() function to get the SS left over after fitting the empty model for our TinyFingers thumb-length data.
require(mosaic)
require(ggformula)

# set up the tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)

# fit the Sex model and save its predictions and residuals
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.predicted <- predict(TinySex.model)
TinyFingers$Sex.resid <- TinyFingers$Thumb - TinyFingers$Sex.predicted
TinyFingers$Sex.resid2 <- resid(TinySex.model)  # same residuals via resid()

# fit the empty model and save its predictions
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Empty.pred <- predict(TinyEmpty.model)
# here is the code you wrote before
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
# write code to get the SS left over from TinyEmpty.model
anova(TinyEmpty.model)
Calculating Sums of Squares: Sex Model
How do we quantify the error around our new, more complex model, where sex is used to predict thumb length?
We quantify error around the more complex model in the same way we did for the empty model. We simply generate the residuals, square them, and then sum them to get the sum of squares left after fitting our model.
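That recipe (generate the residuals, square them, sum them) can be sketched directly in base R. Here ave() is used just to attach each group's mean to each score; the names (group.pred, SS.sex) are illustrative, not part of the exercise setup.

```r
# Sex model: each thumb length is predicted by its group mean
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
group.pred <- ave(Thumb, Sex)    # 59 for females, 65 for males
sex.resid <- Thumb - group.pred  # residuals around the group means
SS.sex <- sum(sex.resid^2)       # square, then sum
SS.sex  # 28
```

This hand computation should match the residual (leftover) SS that anova() reports for the Sex model.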
Go ahead and modify this code to get the SS left over for the TinySex.model.
# modify this code to find the SS of TinySex.model
anova(TinyEmpty.model)
# modify this code to find the SS of TinySex.model
anova(TinySex.model)
We now have calculated two leftover (or residual) sums of squares. The first, 82, is for the empty model. The second, 28, is for the Sex model.
The empty model has already minimized the sum of squares as much as possible without any explanatory variables. We can now take that SS as our starting point; it tells us how much total error we have to explain. As soon as we add an explanatory variable (in this case Sex) to the model, it can only decrease the sum of squares for error, never increase it. If the new variable had no predictive value at all, the sum of squares would stay the same, but it is rare for a variable to have literally no predictive value.
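As a minimal check of this claim, here is a sketch (base R only, outside the exercise setup) that fits both models and sums their squared residuals:

```r
# Compare leftover SS for the empty model vs the Sex model
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- factor(c("female", "female", "female", "male", "male", "male"))

empty.model <- lm(Thumb ~ NULL)  # intercept-only: predicts the mean
sex.model <- lm(Thumb ~ Sex)     # predicts each group's mean

sum(resid(empty.model)^2)  # 82: total error to explain
sum(resid(sex.model)^2)    # 28: error left after adding Sex
# Adding Sex reduced the SS by 82 - 28 = 54; it could not have raised it.
```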
Visualizing Sums of Squares
Let’s watch another video that explains where we are at this point. In her previous video in Chapter 6, Dr. Ji demonstrated the concept of sum of squares using our TinyFingers data set. We literally drew squares when we “squared the residuals.” She showed that the sum of squared deviations is minimized at the mean.
In this video, Dr. Ji shows us how we can visualize sum of squares from the Sex model, and also how we can compare the sum of squares from the Sex model against the empty model.
If you want to try it yourself, here is the data to copy and paste into the little “sample data” box in the applet:
Sex | Thumb
0 | 56
0 | 60
0 | 61
1 | 63
1 | 64
1 | 68