Introduction to Statistics: A Modeling Approach

Examining Residuals From the Model

As we have said before, when we use a model to make predictions, our predictions are usually wrong. The residuals, the difference between each observed score and its predicted score, give us an indication of how wrong our predictions are for each person.

Calculating Residuals From the Model

We calculated residuals from the empty model by subtracting the mean from each score. We use the same approach for the more complex model.

Using the model, we assign a predicted score to each observation. This time, however, the predicted score is not the Grand Mean for everyone, but one mean for females and another for males.
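
As a compact restatement of the two paragraphs above, the residual for person i can be written as the observed score minus the predicted score. The symbols here (e for the residual, Y for the observed score, a hat for the prediction, a bar for a group mean) are notation assumed for this sketch, not taken from the text above.

```latex
\[
e_i = Y_i - \hat{Y}_i,
\qquad
\hat{Y}_i =
\begin{cases}
\bar{Y}_{\text{female}} & \text{if person } i \text{ is female} \\
\bar{Y}_{\text{male}} & \text{if person } i \text{ is male}
\end{cases}
\]
```

For the empty model the same subtraction applies, except that the predicted score is the Grand Mean for everyone.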

Notice that we use the same strategy to quantify the leftover error from any model. We can do the subtractions in R, just as we did for the empty model. We start with the TinyFingers data frame looking like this.
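
The data frame itself is not reproduced on this page, so here is a minimal sketch that rebuilds it. The six Thumb values and the Sex labels are taken from the exercise setup code later in this section; only base R is used. With these values the model predicts 59 for each female and 65 for each male, which are the two group means.

```r
# Tiny data set, using the values from the exercise setup later in this section
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)

# Fit the two-group model and save each person's predicted score
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.predicted <- predict(TinySex.model)

# Print the data frame: Sex, Thumb, and the model's prediction for each person
TinyFingers
```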

We then calculate the residual for each person by this subtraction: their observed score minus their score predicted by the model. We will save the result in a new variable in TinyFingers that we will call Sex.resid.

TinyFingers$Sex.resid <- TinyFingers$Thumb - TinyFingers$Sex.predicted
TinyFingers

Another, slightly easier, way to do the same thing is to use the resid() function, with TinySex.model as the argument. Let’s try that, and save the results in a new variable, Sex.resid2. Then let’s print the updated version of TinyFingers.

TinyFingers$Sex.resid2 <- resid(TinySex.model)
TinyFingers

Compare the two variables—Sex.resid and Sex.resid2—in the output. Notice that we get the same values for Sex.resid2 as we did for Sex.resid.
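
Rather than comparing the two columns by eye, the match can also be checked in code. The sketch below is one possible way to do it, using base R's all.equal(). The name `same` is made up for this illustration, and the data values come from the exercise setup later in this section.

```r
# Rebuild the tiny data set and the two-group model
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.predicted <- predict(TinySex.model)

# Method 1: subtract the predicted score from the observed score
TinyFingers$Sex.resid <- TinyFingers$Thumb - TinyFingers$Sex.predicted

# Method 2: let resid() do the subtraction
TinyFingers$Sex.resid2 <- resid(TinySex.model)

# as.numeric() drops any observation names so that only the values are compared;
# all.equal() allows for tiny floating-point differences
same <- all.equal(as.numeric(TinyFingers$Sex.resid),
                  as.numeric(TinyFingers$Sex.resid2))
same
```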

Finally, let’s compare the residuals from the TinySex.model to those from the TinyEmpty.model. Let’s first use the predict() function to add a variable called Empty.pred, short for “predicted from the empty model” (any name would work, but a meaningful one is easier to keep track of), and then print out TinyFingers again.

TinyFingers$Empty.pred <- predict(TinyEmpty.model)
TinyFingers

Now use the resid() function to create a new variable in the TinyFingers data frame, Empty.resid, and print out the updated version of TinyFingers.

require(mosaic)
require(ggformula)

# set up the tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)

# fit the Sex model and save its predictions and residuals
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.predicted <- predict(TinySex.model)
TinyFingers$Sex.resid <- TinyFingers$Thumb - TinyFingers$Sex.predicted
TinyFingers$Sex.resid2 <- resid(TinySex.model)

# fit the empty model and save its predictions
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Empty.pred <- predict(TinyEmpty.model)

# generate residuals from TinyEmpty.model
TinyFingers$Empty.resid <- resid(TinyEmpty.model)

# print TinyFingers
print(TinyFingers)
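
To make the comparison between the two sets of residuals easier to see, the sketch below puts them side by side using base R only. The data frame name `comparison` is made up for this illustration. In this tiny data set the Grand Mean is 62 and the group means are 59 and 65, so the empty-model residuals are each score minus 62, while the Sex-model residuals are each score minus that person's group mean.

```r
# Rebuild the tiny data set and fit both models
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# Side-by-side residuals from the two models (hypothetical helper data frame)
comparison <- data.frame(
  Sex = TinyFingers$Sex,
  Thumb = TinyFingers$Thumb,
  Empty.resid = as.numeric(resid(TinyEmpty.model)),
  Sex.resid = as.numeric(resid(TinySex.model))
)
comparison
```

Reading across a row shows how far off each model's prediction was for the same person.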

Graphing Residuals From the Model

You might wonder, why are we bothering to generate and save residuals? We will have a lot more to say about this later. But the short answer is: it helps us to understand the error around our model, and often suggests ways of improving the model.

Just as the first thing we do when looking at a data set is to examine the distributions of the variables, it is good to get in the habit of examining the distributions of residuals after we fit a new model.

Let’s go back to the full Fingers data frame. We fit the model lm(Thumb ~ Sex, data = Fingers) and saved the model in Sex.model. Using the resid() function, write some code to generate a new column in Fingers called Sex.resid (the residuals from the Sex.model).

require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Sex.model <- lm(Thumb ~ Sex, data = Fingers)

# generate residuals from Sex.model
Fingers$Sex.resid <- resid(Sex.model)

# this prints the first 10 rows of Thumb and Sex.resid
head(select(Fingers, Thumb, Sex.resid), 10)
Hint: Use the resid() function with Sex.model.

In the following window, we have provided the code to create histograms of Thumb in a facet grid by Sex. Try modifying it to generate histograms of Sex.resid in a facet grid by Sex. Compare the histograms of residuals from the Sex.model with the histograms of thumb length.

require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Sex.model <- lm(Thumb ~ Sex, data = Fingers)
Fingers$Sex.resid <- resid(Sex.model)

# provided code: histograms of Thumb in a grid by Sex
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)

# modified code: histograms of Sex.resid in a grid by Sex
gf_histogram(~ Sex.resid, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)
Hint: Change the first argument to ~ Sex.resid.

In the activity below, we’ve depicted the density histograms of Thumb by Sex (in black) next to the histograms of Sex.resid by Sex (in skyblue).

[Activity figure: density histograms of Thumb by Sex (black) next to histograms of Sex.resid by Sex (skyblue)]

We can add the means of Thumb for females and males to the Thumb histograms with some R code. First, we use favstats() to calculate summary statistics, including the mean Thumb length, for each Sex group and save them in an R object called Thumb.stats:

Thumb.stats <- favstats(Thumb ~ Sex, data = Fingers)

Then we chain (%>%) a vertical line onto our histogram with this code.

gf_vline(xintercept=~mean, data=Thumb.stats)

Here we have provided the code to add mean lines for each Sex group to the Thumb histograms. Modify the next chunk of code to add mean lines for each Sex group to the Sex.resid histograms.

require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Sex.model <- lm(Thumb ~ Sex, data = Fingers)
Fingers$Sex.resid <- resid(Sex.model)

# this code generates histograms with lines to represent the mean Thumb length of each Sex group
Thumb.stats <- favstats(Thumb ~ Sex, data = Fingers)
gf_histogram(..density.. ~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .) %>%
  gf_vline(xintercept = ~mean, color = "blue", data = Thumb.stats)

# modified code: lines now represent the mean Sex.resid of each Sex group
Sex.resid.stats <- favstats(Sex.resid ~ Sex, data = Fingers)
gf_histogram(..density.. ~ Sex.resid, data = Fingers, fill = "skyblue") %>%
  gf_facet_grid(Sex ~ .) %>%
  gf_vline(xintercept = ~mean, color = "blue", data = Sex.resid.stats)
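
Before looking at where the mean lines land on the residual histograms, it can help to check a small case by hand. The sketch below uses the six-person TinyFingers data from earlier in this section instead of Fingers (an assumption made so the example runs without downloading anything) and base R's tapply() in place of favstats(). The name `resid.means` is made up for this illustration.

```r
# Rebuild the tiny data set and the two-group model
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.resid <- as.numeric(resid(TinySex.model))

# Mean residual within each Sex group
resid.means <- tapply(TinyFingers$Sex.resid, TinyFingers$Sex, mean)

# round() hides tiny floating-point noise
round(resid.means, 10)
```

Because each residual is a deviation from that person's own group mean, the residuals within each group average to zero, so both mean lines on the residual histograms sit at 0.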