Introduction to Statistics: A Modeling Approach
The Beauty of Sum of Squares
As it turns out, Sum of Squares has a special relationship to the mean. In the previous chapter we extolled the virtues of the mean. Now it’s time to start appreciating the beauty of Sum of Squares!
The most obvious advantage of SS as a measure of total error is that it is minimized exactly at the mean. And because our goal in statistical modeling is to reduce error, this is a good thing. In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other point. (Be sure to watch the video above for more explanation on this point.)
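If you would like to check this claim numerically, here is a minimal sketch. The six values in `thumb` are made up for illustration (they are an assumption, not data taken from the course), and `ss_around()` is a hypothetical helper written just for this example. The code computes SS around many candidate values and reports which candidate gives the smallest SS.

```r
# Six made-up thumb lengths in mm (illustrative values, not course data)
thumb <- c(56, 60, 61, 63, 64, 68)

# Hypothetical helper: sum of squared deviations of y around a candidate value
ss_around <- function(candidate, y) sum((y - candidate)^2)

# Try candidate values from 55 to 70 in steps of 0.5
candidates <- seq(55, 70, by = 0.5)
ss_values <- sapply(candidates, ss_around, y = thumb)

# Pick the candidate with the smallest SS and compare it to the mean
best <- candidates[which.min(ss_values)]
print(mean(thumb))
print(best)
print(min(ss_values))
```

For this vector, the candidate with the smallest SS is the mean itself. Moving the candidate away from the mean in either direction makes SS larger.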
(It is worth pointing out that this advantage of SS holds only if our model is the mean. If we were to choose another number, such as the median, as our model of a distribution, we would probably choose a different measure of error. But our focus in this course is primarily on the mean.)
There are other things about SS that have attracted statisticians over the years. Most of them will be hard to appreciate until you get further into the course. But trust us when we say that the Sum of Squares will prove its utility, not just because it is minimized at the mean, but because of the way it fits mathematically into the statistics landscape.
At first glance, many of the topics in statistics seem like part of some endless list of unrelated formulas: the mean, the sum of squares, linear models. But hopefully you are starting to see that these fit together. The relationship between the mean and the SS is just a first peek at the interlocking relationships among all these concepts. Squared deviations will link up with other ideas in statistics later in the course.
It is somewhat like the Pythagorean Theorem. You learned in school that the square of the hypotenuse of a right triangle is equal to the sum of the squares of the other two sides. Thus, \(a^2+b^2=c^2\). Squaring the sides makes everything add up and fit together. But if you don’t square them, the equality no longer holds: \(a+b\neq{c}\). By using Sum of Squares as a quantification of total error, lots of things will fit together that otherwise would not.
Finding Sum of Squares (SS)
Hopefully we have convinced you that SS goes hand in hand with the mean. Even more generally, it goes with the General Linear Model (GLM). So far, we have only explored one model—the empty model—in which \(b_0\) represents the sample mean (which is also our estimate of the parameter, the population mean).
\[Y_{i}=b_{0}+e_{i}\]
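As a quick illustration of this equation, the sketch below fits an empty model to a small made-up data frame and checks that \(b_0\) is the sample mean and that each residual \(e_i\) is \(Y_i - b_0\). The data frame `thumb_data` and its values are assumptions for illustration, not the course's data.

```r
# Made-up data frame with a single Thumb column (illustrative values only)
thumb_data <- data.frame(Thumb = c(56, 60, 61, 63, 64, 68))

# Fit the empty model: Thumb is modeled by a single number, b0
empty_model <- lm(Thumb ~ NULL, data = thumb_data)

# b0 is the model's only coefficient, the intercept
b0 <- coef(empty_model)[["(Intercept)"]]
print(b0)
print(mean(thumb_data$Thumb))

# Each residual e_i is the data point minus b0
print(resid(empty_model))
print(thumb_data$Thumb - b0)
```

The two pairs of printed values should agree, which is just the equation above rearranged as \(e_i = Y_i - b_0\).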
R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Remember we used lm() to create a model based on our TinyFingers data. We called that the TinyEmpty.model.
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
Once we have this model, we can use a function called anova() to look at the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means “to break down”, and later we will use this function to break down the variation into parts. But for now, we will use anova() just to figure out how much error there is around the model, measured in Sums of Squares.
anova(TinyEmpty.model)
There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the column labeled “Sum Sq”. We see the same value (82) that we previously calculated with the longer sequence of R commands in which we calculated the residuals, squared them, and then summed the squared residuals.
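If you want to pull the “Sum Sq” value out of the table rather than read it off the printout, something like the following sketch should work. The data frame here is made up: its six values were chosen so that the SS comes out to 82, the value reported above, but whether they match the actual TinyFingers data is an assumption.

```r
# Made-up stand-in for TinyFingers (values chosen so that SS is 82)
thumb_data <- data.frame(Thumb = c(56, 60, 61, 63, 64, 68))

# Fit the empty model and build its ANOVA table
empty_model <- lm(Thumb ~ NULL, data = thumb_data)
anova_table <- anova(empty_model)

# Extract the "Sum Sq" column from the table
ss_from_anova <- anova_table[["Sum Sq"]]

# The longer route: deviations from the mean, squared, then summed
ss_by_hand <- sum((thumb_data$Thumb - mean(thumb_data$Thumb))^2)

print(ss_from_anova)
print(ss_by_hand)
```

Both routes give the same number because the residuals from the empty model are exactly the deviations from the mean.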
Try creating a NULL or empty model of thumb length using the larger Fingers data frame, and then look at the SS by using anova().
require(mosaic)
require(ggformula)
require(supernova)
# create an empty model of Thumb length from Fingers
Empty.model <- lm(Thumb ~ NULL, data = Fingers)
# look at the SS in the anova() output
anova(Empty.model)
Let’s try calculating the Sum of Squares a different way, and see if we get the same result.
Try running this code, and see what the result is.
require(mosaic)
require(ggformula)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Empty.model <- lm(Thumb ~ NULL, data=Fingers)
# try running this code; will this result in the same SS?
sum( resid(Empty.model)^2 )
This lines up with the output we got from anova(). Notice, however, that the anova() output rounded off to the nearest whole number, whereas this alternative calculation included two places after the decimal point.
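The rounding happens only when the table is printed. The value stored inside the table keeps its full precision. Here is a minimal sketch using made-up values with decimals (an assumption for illustration, not the Fingers data). It compares the stored “Sum Sq” value with the residual-based calculation, and with `deviance()`, a base R function that returns the residual sum of squares for an `lm` model.

```r
# Made-up thumb lengths with decimals (illustrative values only)
thumb_data <- data.frame(Thumb = c(56.2, 60.35, 61.1, 63.4, 64.25, 68.7))

empty_model <- lm(Thumb ~ NULL, data = thumb_data)
anova_table <- anova(empty_model)

# The printed table shows a limited number of digits
print(anova_table)

# The stored value is not rounded
ss_stored <- anova_table[["Sum Sq"]]
ss_resid <- sum(resid(empty_model)^2)
ss_deviance <- deviance(empty_model)

print(ss_stored)
print(ss_resid)
print(ss_deviance)
```

All three values should agree. Only the printed table differs in how many digits it displays.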