Course Outline

list Introduction to Statistics: A Modeling Approach

Comparing Two Models: Proportional Reduction in Error

We have seen intuitively how we might compare two models (the empty model and the Sex model) using sums of squares. Now we are going to develop a table for summarizing this comparison, and a new statistic—Proportional Reduction in Error (PRE)—that tells us how much better the complex model fits compared with the simple, or in this case empty, model.

A New R Function: supernova()

Using the anova() function, we had to generate two separate tables to get all our sums of squares, one for the empty model and one for the Sex model. But what we really want to do is compare these two models. We want to see which one fits better, and how much better. We’ve created a new R function to help us do that, which we call supernova().

Let’s run supernova() on the TinyEmpty.model and see what the results look like. Press Run in the DataCamp window to run this code.

require(mosaic) require(ggformula) require(supernova) #set up tiny data set Thumb <- c(56, 60, 61, 63, 64, 68) Sex <- c("female","female","female","male","male","male") TinyFingers <- data.frame(Sex, Thumb) TinyFingers$Sex <- as.factor(TinyFingers$Sex) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers) TinyFingers$Sex.predicted <- predict(TinySex.model) TinyFingers$Sex.resid <- TinyFingers$Thumb - TinyFingers$Sex.predicted TinyFingers$Sex.resid2 <- resid(TinySex.model) TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers) TinyFingers$Empty.pred <- predict(TinyEmpty.model) # run this code supernova(TinyEmpty.model) # run this code supernova(TinyEmpty.model) test_function("supernova") test_function_result("supernova") test_error()
Just press Run
DataCamp: ch7-12

This output table has a lot of columns, most of which we haven’t discussed. So, don’t worry at this point that you don’t really know what they are. As we proceed through the course, we will be filling in the rest of this table, which will provide you a really useful way of comparing models.

For now, notice that the only row of the table with any numbers is the last row. You actually know what these numbers mean. First, under SS we see 82, which you probably recognize as the sum of squares for the empty model. They call it Total Sum of Squares.

The next column, labeled df, shows us degrees of freedom. So, for our tiny sample of 6, degrees of freedom is 5, or n-1.

The third column is labeled MS, which stands for Mean Square error. MS is calculated by dividing SS (82) by degrees of freedom (5), which results in 16.4.

L_Ch7_Comparing_1

SS divided by df is the formula for variance, which you learned in the previous chapter. Mean Square error is like saying the average squared error from the mean. That idea is called variance. It’s the same thing.

Using supernova() to Examine the Sex Model

Now modify the code in the DataCamp window below to run supernova() on the TinySex.model, and see what you get.

require(mosaic) require(ggformula) require(supernova) #set up tiny data set Thumb <- c(56, 60, 61, 63, 64, 68) Sex <- c("female","female","female","male","male","male") TinyFingers <- data.frame(Sex, Thumb) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers) # modify this code for TinySex.model supernova() test_error() success_msg("Wow! Great work.")
Make sure to specify the correct model
DataCamp: ch7-13

Looking at this table, we can see a combined view of all the information that we previously got from two anova() tables, one for the empty model and one for the Sex model. The bottom row, labelled Total, still shows us the variation around the empty model—we see the SS of 82.

The first two rows show us the result of fitting the Sex model. The row labelled Error shows us the SS (28) left after taking out the effect of Sex. We saw before that the Sex model fit better than the empty model, but now we can clearly see what that difference is in a single table. The SS left in the empty model was 82, but now that we have added Sex as an explanatory variable, the SS is only 28.

Partitioning Sums of Squares

Something new comes out in this table (reproduced again, below) that wasn’t so obvious before. Look just at the column labelled SS (see green box). If it looks like an addition problem to you, you are right. The two rows associated with the Sex model (Model and Error) add up to the Total SS (the empty model). So, 54 + 28 = 82.

It turns out this is true about sums of squares. The total sum of squares in the outcome variable—which is the sum of squared deviations around the empty model, or mean—can be equally partitioned into the part that is due to the model itself (we will come back to that in just a moment), and the part that is still left after the model is subtracted out.

We want to take a moment to appreciate why the ANOVA tables (printed by the functions anova() and supernova()) are called ANalysis Of VAriance tables. “Analyze” often means to break up into parts. Analysis of variance is when we break apart, or partition, variation. SS is one measure of variation that we are analyzing here.

So how do we interpret the SS Model (the 54, in this case)? We are going to think of it as the reduction in error (measured in sums of squares) due to the Sex model when compared to the empty model. Let’s use this diagram to help explain this idea.

We can think about this diagram from the outside in. The outer circle represents the total amount of error in the outcome variable, based on the distances between the scores and the empty model. (We saved these distances earlier in a variable called Empty.resid.)

The orange circle represents the part of this total error that is explained by the Sex model.

The area not covered by the orange circle represents the amount of error left unexplained from the Sex model. This leftover SS, called SS Error, is based on the distance of each score from the mean of its group (male or female). (We saved these distances earlier in a variable called Sex.resid.)

The orange circle, therefore, also represents the reduction in error** that has been achieved by the Sex model in comparison to the empty model.**

In our table, we can simply subtract the SS Error (error around the Sex model) from SS Total (error around the mean, or the empty model) to get the SS Model, i.e., the sum of squares that is explained by the Sex model. Just as DATA = MODEL + ERROR, SS Total = SS Model + SS Error.

L_Ch7_Comparing_2

If SS Total is based on Empty.resid, and SS Error is based on Sex.resid, what distances are used to calculate the SS Model? This is the sum of squared distances between the predictions of the Sex model and the predictions of the empty model.

L_Ch7_Comparing_3

You can think of SS Total = SS Model + SS Error in this more detailed way:

*

**SS(Thumb to Empty.model) = SS(Sex.model to Empty.model) + SS(Thumb to Sex.model)

To directly calculate SS Model, we need to figure out how much error has been reduced by the Sex model in comparison to the empty model. This reduced error is represented by the distance of each person’s predicted score under the Sex.model to their predicted score under the empty model.

To make this concrete, let’s use R to calculate this distance for each person in the TinyFingers data set. We can add a new variable called ErrorReduced in TinyFingers.

TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred

To calculate the sum of squares for error reduced, we square each person’s value for ErrorReduced, and sum these squares up across people.

sum(TinyFingers$ErrorReduced^2)

Let’s try this to see if we get 54 (the SS Model, error reduced, from our supernova table). Try putting all this together in the DataCamp window below.

require(mosaic) require(ggformula) require(supernova) Thumb <- c(56, 60, 61, 63, 64, 68) Sex <- c(1,1,1,2,2,2) TinyFingers <- data.frame(Sex, Thumb) TinyFingers$Sex <- factor(TinyFingers$Sex, levels = c(1,2), labels = c("female", "male")) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers) TinySex.fun <- makeFun(TinySex.model) TinyFingers$Sex.predicted <- predict(TinySex.model) TinyFingers$Sex.resid <- resid(TinySex.model) TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers) TinyFingers$Empty.pred <- predict(TinyEmpty.model) TinyFingers$Empty.resid <- resid(TinyEmpty.model) # This creates a column for error reduced by the Sex.model TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred # Modify this to sum up squared ErrorReduced # This creates a column for error reduced by the Sex.model TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred # Modify this to sum up squared ErrorReduced sum(TinyFingers$ErrorReduced^2) test_data_frame("TinyFingers") test_function_result("sum") test_error()
Look at the example code above if you need help
DataCamp: ch7-13a

Proportional Reduction in Error (PRE)

We have now quantified how much variation has been explained by our model: 54 square millimeters. Is that good? Is that a lot of explained variation? It would be easier to understand if we knew the proportion of error that has been reduced rather than the raw amount of error measured in \(mm^2\).

If you take another look at the supernova() table (reproduced below) for the TinySex.model, you will see a column labelled PRE. PRE stands for Proportional Reduction in Error.

PRE is calculated using the sums of squares. It is simply the sum of squares for Model (the sum of squares reduced by the model) divided by the SS Total (or, the total sum of squares in the outcome variable under the empty model). We can represent this in a formula:

\[PRE=\frac{SS_{Model}}{SS_{Total}}\]

PRE can be interpreted as the proportion of variation in the outcome variable that is explained by the explanatory variable. In that sense, it tells us something about the strength of our statistical model. For our tiny data set (TinyFingers), the effect of Sex on Thumb is fairly strong. Variation in sex accounts for .66 of the variation in thumb length.

Hold up. Why do we call this value PRE, Proportion Reduction in Error, when it seems like we should really call it Proportion of Error Explained? We didn’t have anything to do with the naming so we feel the same way. You can think about error being reduced if your reference point is the empty model’s total error. Then PRE is the proportion of error reduced from the empty model compared to the Sex model.

Just as a note—PRE is a statistic that goes by other names as well. In the ANOVA tradition (Analysis of Variance) it has been called \(\eta^2\), or eta squared. In Chapter 8 we will introduce the same concept in the context of regression, where it is called \(R^2\). For now all you need to know is: it’s all the same thing, in case anyone asks you.

L_Ch7_PRE_1

L_Ch7_PRE_3

Fitting the Sex Model to the Full Fingers Data Set

Before moving on, use the DataCamp window below to fit the Sex model to the full Fingers data set. Save the model in an R object called Sex.model. Print out the model (this will show you the values of \(b_{0}\) and \(b_{1}\)). Then run supernova() to produce the ANOVA table we have been discussing.

require(mosaic) require(ggformula) require(supernova) Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",") # fit Sex.model to show Thumb ~ Sex in the Fingers data frame Sex.model <- lm(formula = , data = ) # print out the model # print the supernova table for Sex.model # fit Sex.model to the Fingers data frame Sex.model <- lm(formula = Thumb ~ Sex, data = Fingers) # print out the model Sex.model # print the supernova table for Sex.model supernova(Sex.model) test_function("lm", args=c("formula", "data")) test_object("Sex.model") test_output_contains("Sex.model", incorrect_msg="Don't forget to print `Sex.model`") test_function_result("supernova")
Look at previous exercises to see how to fit Sex.model to the fingers dataframe
DataCamp: ch7-14

L_Ch7_PRE_2

Responses