7.6 Quantifying Model Fit With Sums of Squares
In the empty model, you will recall, we used the mean as the model, i.e., as the predicted score for every observation. We developed the intuition that the mean was a better-fitting model (that is, there was less error around the model) when the spread of the distribution was small than when it was large.
Calculating Sums of Squares: Empty Model (Review)
In the previous chapter, we quantified error using the sum of the squared deviations (SS, or sum of squares) around the mean, a measure that is minimized precisely at the mean. Under the empty model, all of the variation is unexplained—that’s why it is called “empty.” But it does show us clearly how much variation there is left to explain, measured in sum of squares.
Remind yourself how to use the supernova() function to get the SS leftover after fitting the empty model (SS Total) for our TinyFingers thumb length data.
require(coursekata)

# setup: the TinyFingers data, both models, and saved predictions/residuals
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers <- TinyFingers %>% mutate(
  Sex_predicted = predict(Tiny_Sex_model),
  Sex_resid = Thumb - Sex_predicted,
  Sex_resid2 = resid(Tiny_Sex_model),
  empty_pred = predict(Tiny_empty_model)
)

# here is the code you wrote before
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)

# write code to get the SS leftover from Tiny_empty_model
supernova(Tiny_empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                            SS  df     MS   F PRE   p
----- ----------------- ------ --- ------ --- --- ---
Model (error reduced) |    --- ---    --- --- --- ---
Error (from model)    |    --- ---    --- --- --- ---
----- ----------------- ------ --- ------ --- --- ---
Total (empty model)   | 82.000   5 16.400
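As a check on the table, SS Total can also be reproduced by hand in base R. This is only a minimal sketch: the names `thumb`, `empty_prediction`, `deviations`, and `ss_total` are made up for this illustration and are not part of the course code.

```r
# Thumb lengths from TinyFingers (same six values as in the data frame)
thumb <- c(56, 60, 61, 63, 64, 68)

# Under the empty model, the prediction for everyone is the mean
empty_prediction <- mean(thumb)

# Residuals: each score minus the prediction
deviations <- thumb - empty_prediction

# Square the residuals and add them up
ss_total <- sum(deviations^2)
ss_total
```

The result should match the 82 shown in the Total (empty model) row.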
Calculating Sums of Squares: Sex Model
How do we quantify the error around our new—more complex—model, where sex is used to predict thumb length?
We quantify error around the more complex model in the same way we did for the empty model. We simply generate the residuals based on predictions of the Sex model, square them, and then sum them to get the sum of squares error from the model.
Go ahead and modify this code to get the SS Error from the Tiny_Sex_model.
require(coursekata)

# setup: the TinyFingers data, both models, and saved predictions/residuals
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers <- TinyFingers %>% mutate(
  Sex_predicted = predict(Tiny_Sex_model),
  Sex_resid = Thumb - Sex_predicted,
  Sex_resid2 = resid(Tiny_Sex_model),
  empty_pred = predict(Tiny_empty_model)
)

# modify this code to find the SS of Tiny_Sex_model
# (the starter call was supernova(empty_model); it now points at Tiny_Sex_model)
supernova(Tiny_Sex_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Sex

                            SS df     MS     F    PRE     p
----- ----------------- ------ -- ------ ----- ------ -----
Model (error reduced) | 54.000  1 54.000 7.714 0.6585 .0499
Error (from model)    | 28.000  4  7.000
----- ----------------- ------ -- ------ ----- ------ -----
Total (empty model)   | 82.000  5 16.400
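The SS Error in this table can be reproduced by hand as well: generate the residuals from the Sex model's predictions, square them, and sum them. The sketch below uses base R's `ave()` to get each person's group mean; the names `group_mean`, `sex_resid`, and `ss_error` are illustrative, not part of the course code.

```r
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# The Sex model predicts each person's own group mean
# (ave() returns the mean of Thumb within each level of Sex)
group_mean <- ave(TinyFingers$Thumb, TinyFingers$Sex)

# Residuals from the Sex model: score minus group mean
sex_resid <- TinyFingers$Thumb - group_mean

# Square the residuals and add them up
ss_error <- sum(sex_resid^2)
ss_error
```

The result should match the 28 shown in the Error (from model) row.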
We have now calculated two leftover (or residual) sums of squares: SS Total and SS Error. SS Total is the total error from the empty model (82); SS Error is the error left over from the Sex model (28).
SS Total is the smallest SS we could have without adding an explanatory variable to the model. It represents the total variation in the outcome variable that we would want to explain. Taking that as our starting point, we can reduce the error by adding an explanatory variable to the model (in this case, Sex).
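To make the size of that reduction concrete, the two sums of squares reported above can simply be subtracted. This is a small arithmetic sketch; the variable names are made up for the illustration, and the values are copied from the supernova output rather than recomputed.

```r
ss_total <- 82  # Total (empty model) row
ss_error <- 28  # Error (from model) row

# Error removed by adding Sex to the model
ss_reduced <- ss_total - ss_error
ss_reduced

# The same reduction expressed as a proportion of SS Total
ss_reduced / ss_total
```

The difference, 54, is the value in the Model (error reduced) row of the table, and the proportion (about 0.6585) appears to be what the PRE column reports.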
Adding an explanatory variable to our model can never increase the sum of squares for error; it can only decrease it or, at worst, leave it unchanged. If the new model does not make better predictions than the empty model, the sum of squares stays the same. But it’s rare for an explanatory variable to have no predictive value at all.
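One way to see the "stays the same" case is to invent an explanatory variable with no predictive value. In the sketch below, `Group` is a hypothetical variable (not in the course data) whose levels were chosen so that every group mean equals the grand mean of 62. The helper `ss_resid()` and the model names are also made up for this illustration.

```r
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Hypothetical grouping: a = (56, 68), b = (60, 64), c = (61, 63),
# so each group mean is 62, the same as the grand mean
TinyFingers$Group <- as.factor(c("a", "b", "c", "c", "b", "a"))

empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
group_model <- lm(Thumb ~ Group, data = TinyFingers)
sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# Sum of squared residuals for any fitted model
ss_resid <- function(model) sum(resid(model)^2)

ss_resid(empty_model)  # SS Total
ss_resid(group_model)  # Group predicts nothing, so no error is removed
ss_resid(sex_model)    # Sex has predictive value, so error goes down
```

Here the Group model should leave the same 82 as the empty model, while the Sex model should bring the error down to 28. Neither model ends up with more error than the empty model.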
Visualizing Sums of Squares
Let’s watch another video that explains where we are at this point. In her previous video in Chapter 6, Dr. Ji demonstrated the concept of sum of squares using our TinyFingers data set. We literally drew squares when we “squared the residuals.” She showed that the sum of squared deviations is minimized at the mean.
In this video, Dr. Ji shows us how we can visualize sum of squares from the Sex model, and also how we can compare the sum of squares from the Sex model against the empty model.
If you want to try out the app Dr. Ji uses in this video, you can click this link to the sum of squares applet. Copy/paste the data below into the little “sample data” box to reproduce Dr. Ji’s examples. (Here’s the full link in case that one doesn’t work: http://www.rossmanchance.com/applets/RegShuffle.htm)
Sex | Thumb |
0 | 56 |
0 | 60 |
0 | 61 |
1 | 63 |
1 | 64 |
1 | 68 |