Generating Predictions From the Empty Model
We have just made the point that the mean of the sample distribution is our best, unbiased estimate of the mean of the population. For this reason, we use the mean as our model for the population. And, if we want to predict what the next randomly sampled observation might be, without any other information, we would use the mean.
In data analysis, we also use the word “predict” in another sense and for another purpose. Using our tiny data set, we found the mean thumb length to be 62 mm. So, if we were predicting what the seventh observation might be we’d go with 62 mm. But if we take the mean and look backwards at the data we have already collected, we could also generate a “predicted” thumb length for each of the data points we already have. This prediction answers the question: what would our model have predicted this thumb length to be if we hadn’t collected the data?
There’s an R function that will actually do this for you. Here’s how we can use it to generate the predicted thumb lengths for each of the six students in the tiny data set. Remember, we already fit the model and saved the results in TinyEmpty.model:
predict(TinyEmpty.model)
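As a rough, self-contained illustration, the sketch below rebuilds the tiny data set from the values used in this chapter and shows what predict() returns for the empty model: one prediction per student, each equal to the mean. It uses base R only, and the object name predicted is just an illustrative choice.

```r
# Rebuild the tiny data set used in this chapter
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)

# Fit the empty model (intercept only, no explanatory variable)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# One prediction per student; "predicted" is an illustrative name
predicted <- predict(TinyEmpty.model)
print(predicted)

# Under the empty model, every prediction equals the sample mean
print(mean(TinyFingers$Thumb))
```

Because the empty model has no explanatory variable, it has nothing to distinguish one student from another, so all six predictions are identical.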
Now, why would we want to create predicted thumb lengths for these six students when we already know their actual thumb lengths? We will go into this a lot more in the next chapter, but for now, the reason is so we can get a sense of how far each of our data points is from the prediction that our model would have made. In other words, it gives us a rough idea of how well our model fits our current data.
In order to use these predicted scores as a way of seeing how much error there is, we first need to save the prediction for each student in the data set. When there is only one prediction for everyone, as with the empty model, and when there are only six data points, as in our tiny data set, it seems like overkill to save the predictions.
But later you will see how useful it is to save the individual predicted scores. For example, if we save the predicted score for each student in a new variable called Prediction, we can then subtract each student’s predicted thumb length from their actual thumb length, resulting in a deviation from the prediction.
Use the function predict() and save the predicted thumb lengths for each of the six students as a new variable in the TinyFingers data set. Then, print the new contents of TinyFingers.
require(mosaic)
require(ggformula)
StudentID <- c(1,2,3,4,5,6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
# save the predictions from the TinyEmpty.model as a new variable
TinyFingers$Prediction <- predict(TinyEmpty.model)
# this prints TinyFingers
TinyFingers
Thinking About Error
We have developed the idea of the mean being the simplest (or empty) model of the distribution of a quantitative variable. If we connect this idea to our general formulation DATA = MODEL + ERROR, we can rewrite the statement as:
DATA = MEAN + ERROR
If this is true, then we can calculate error in our data set by just shuffling this equation around to get the formula:
ERROR = DATA - MEAN
Using this formula, if someone has a thumb length larger than the mean (e.g., 68), then their error is a positive number (in this case, +6). If they have a thumb length lower than the mean (e.g., 61), then their error is a negative number (in this case, -1).
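The arithmetic can be checked directly. The following is a minimal sketch in base R that applies ERROR = DATA - MEAN to the six thumb lengths from the tiny data set; the names model_mean and error are illustrative and do not appear elsewhere in this chapter.

```r
# The six thumb lengths from the tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)

# MEAN: the empty model's single prediction
model_mean <- mean(Thumb)

# ERROR = DATA - MEAN, computed for every student at once
error <- Thumb - model_mean
print(error)
# -6 -2 -1  1  2  6
```

The student with a thumb length of 68 gets an error of +6, and the student with 61 gets -1, matching the examples above.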
We generally call these calculated errors residuals, a name that indicates they are “leftovers” from our data once we take out the model.
To find these errors (or residuals) you can just subtract the mean from each data point. In R we could just run this code to get the residuals:
TinyFingers$Thumb - TinyFingers$Prediction
The numbers in the output indicate, for each student in the data frame, what their residual is after subtracting out the model (which is the mean in this case).
Modify the following code to save these residuals in a new variable called Residual in the TinyFingers data frame.
require(mosaic)
require(ggformula)
StudentID <- c(1,2,3,4,5,6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)
# save the residuals (data minus prediction) as a new variable
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction
# this prints TinyFingers
TinyFingers
These residuals (or “leftovers”) are so important in modeling that there is an even easier way to get them in R. Again, we will use the results of our model fit, which we saved in the R object TinyEmpty.model:
resid(TinyEmpty.model)
Notice that we get the same numbers. But instead of specifying the data and the model’s predictions, we just tell R which model to get the residuals from.
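To check that the two routes agree, here is a small sketch in base R that computes the residuals both ways for the tiny data set and compares them. The names manual and easy are illustrative only; unname() is used because predict() and resid() attach row labels that are irrelevant to the comparison.

```r
# Rebuild the tiny data set and fit the empty model
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# Route 1: subtract the model's predictions from the data
manual <- TinyFingers$Thumb - predict(TinyEmpty.model)

# Route 2: ask the fitted model for its residuals
easy <- resid(TinyEmpty.model)

# Compare the values, ignoring the row labels
print(all.equal(unname(manual), unname(easy)))
```

all.equal() compares with a small numerical tolerance, which is the safer choice for decimal arithmetic than testing exact equality with ==.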
Modify the following code to save the residuals that we get using the resid() function in the TinyFingers data frame. Give the resulting variable a new name: easyResidual.
require(mosaic)
require(ggformula)
StudentID <- c(1,2,3,4,5,6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction
# save the residuals (calculated the easy way)
TinyFingers$easyResidual <- resid(TinyEmpty.model)
# this prints TinyFingers
TinyFingers
Note that the variables Residual and easyResidual are identical. This makes sense; you just used different methods to get the residuals.
Here we’ve plotted histograms of the three variables: Thumb, Prediction, and Residual.
The distributions of the data and the residuals have the same shape. But the numbers on the x-axis differ across the two distributions. The distribution of Thumb is centered at the mean (62), whereas the distribution of Residual is centered at 0. Data that are smaller than the mean (such as a thumb length of 58) have negative residuals (-4) but data that are larger than the mean (such as 73) have positive residuals (11).
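The same point can be checked numerically. As a sketch, assuming the six thumb lengths from the tiny data set rather than the values shown in the histograms, subtracting the mean moves the center to 0 but leaves the spread unchanged:

```r
# The six thumb lengths from the tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)

# Residuals from the empty model: data minus the mean
Residual <- Thumb - mean(Thumb)

# The centers differ: the mean of Thumb versus 0 for Residual
print(c(mean(Thumb), mean(Residual)))

# The spread is the same: equal standard deviations and equal range widths
print(c(sd(Thumb), sd(Residual)))
print(c(diff(range(Thumb)), diff(range(Residual))))
```

Subtracting a constant from every value slides the whole distribution along the x-axis without stretching or squeezing it, which is why the two histograms have the same shape.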
Print out the favstats() for Thumb and Residual in the TinyFingers data frame.
require(mosaic)
require(ggformula)
StudentID <- c(1,2,3,4,5,6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction
# get the favstats for Thumb
favstats(~Thumb, data = TinyFingers)
# get the favstats for Residual
favstats(~Residual, data = TinyFingers)
R sometimes gives outputs like this value for the mean of the residuals: -1.421085e-14. This is scientific notation. The e-14 part means that the decimal point is shifted 14 places to the left, so the number is -.00000000000001421085, which is extremely close to zero. Tiny nonzero values like this come from rounding in R’s calculations. So, whenever you see scientific notation with a large negative number after the “e” (such as -14), you can read the value as “zero,” or very close to it.
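If you want to see such a number written out, or to check whether it is effectively zero, the base R sketch below shows a few options. The variable name x is illustrative, and the value is simply the one quoted above.

```r
# The value quoted above, typed in scientific notation
x <- -1.421085e-14

# Default printing keeps the scientific notation
print(x)

# Ask for the digits to be written out in full instead
print(format(x, scientific = FALSE))

# Exact comparison says x is not zero...
print(x == 0)

# ...but a comparison that allows a small tolerance treats it as zero
print(isTRUE(all.equal(x, 0)))

# Rounding to a sensible number of decimal places also gives zero
print(round(x, 10))
```

In practice, rounding or a tolerance-based comparison is usually the better way to ask whether a computed quantity such as the mean of the residuals is zero.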
The residuals (or error) around the mean always sum to 0. Therefore, the mean of the errors will also always be 0, because 0 divided by n equals 0.
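A short derivation shows why this must be so. The notation is an assumption made here for illustration: Y_i stands for each observation, \bar{Y} for the mean, and n for the number of observations.

```latex
\begin{align*}
\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)
  &= \sum_{i=1}^{n} Y_i - n\bar{Y} \\
  &= n\bar{Y} - n\bar{Y}
     \qquad \text{since } \bar{Y} = \tfrac{1}{n} \sum_{i=1}^{n} Y_i \\
  &= 0
\end{align*}
```

For the tiny data set, the six residuals are -6, -2, -1, 1, 2, and 6, which add up to 0.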
Now that you have looked in detail at the tiny data set, explore the predictions and residuals from the Empty.model that we fit earlier to the full Fingers data set. Add them as new variables (Prediction and Residual) to the Fingers data frame.
require(mosaic)
require(ggformula)
require(supernova)
# here is the code you wrote before
Empty.model <- lm(Thumb ~ NULL, data = Fingers)
# generate predictions from the Empty.model
Fingers$Prediction <- predict(Empty.model)
# generate residuals from the Empty.model
Fingers$Residual <- resid(Empty.model)
# this will print out a few lines of Fingers
head(select(Fingers, Thumb, Prediction, Residual), 10)
Then make histograms of the variables Thumb, Prediction, and Residual. Which two histograms will have a similar shape?
require(mosaic)
require(ggformula)
require(supernova)
Empty.model <- lm(Thumb ~ NULL, data=Fingers)
Fingers$Prediction <- predict(Empty.model)
Fingers$Residual <- resid(Empty.model)
# make histograms for Thumb, Prediction, and Residual
gf_histogram(~Thumb, data = Fingers)
gf_histogram(~Prediction, data = Fingers)
gf_histogram(~Residual, data = Fingers)