Introduction to Statistics: A Modeling Approach

Generating Predictions From the Empty Model

We have just made the point that the mean of the sample distribution is our best, unbiased estimate of the mean of the population. For this reason, we use the mean as our model for the population. And, if we want to predict what the next randomly sampled observation might be, without any other information, we would use the mean.

In data analysis, we also use the word “predict” in another sense and for another purpose. Using our tiny data set, we found the mean thumb length to be 62 mm. So, if we were predicting what the seventh observation might be we’d go with 62 mm. But if we take the mean and look backwards at the data we have already collected, we could also generate a “predicted” thumb length for each of the data points we already have. This prediction answers the question: what would our model have predicted this thumb length to be if we hadn’t collected the data?

There’s an R function that will actually do this for you. Here’s how we can use it to generate the predicted thumb lengths for each of the six students in the tiny data set. Remember, we already fit the model and saved the results in TinyEmpty.model:

predict(TinyEmpty.model)
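For the empty model, every prediction is the same number: the mean. The following is a minimal sketch, in base R, that rebuilds the tiny data set from the values given in this section so that it can run on its own. The name predicted is made up for this illustration.

```r
# Rebuild the tiny data set so this sketch runs on its own
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID = 1:6, Thumb = Thumb)

# Fit the empty model (no explanatory variables, only the mean)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# One prediction per student (hypothetical name: predicted)
predicted <- predict(TinyEmpty.model)
predicted

# Compare with the sample mean of Thumb
mean(TinyFingers$Thumb)
```

Each of the six predictions should match the mean of Thumb, because the empty model has no other information to work with.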

Now, why would we want to create predicted thumb lengths for these six students when we already know their actual thumb lengths? We will go into this a lot more in the next chapter, but for now, the reason is so we can get a sense of how far each of our data points is from the prediction that our model would have made. In other words, it gives us a rough idea of how well our model fits our current data.

In order to use these predicted scores as a way of seeing how much error there is, we first need to save the prediction for each student in the data set. When there is only one prediction for everyone, as with the empty model, and when there are only six data points, as in our tiny data set, it seems like overkill to save the predictions.

But later you will see how useful it is to save the individual predicted scores. For example, if we save the predicted score for each student in a new variable called Prediction, we can then subtract each student’s predicted thumb length from their actual thumb length, resulting in a deviation from the prediction.

Use the function predict() and save the predicted thumb lengths for each of the six students as a new variable in the TinyFingers data set. Then, print the new contents of TinyFingers.

require(mosaic)
require(ggformula)

# set up the tiny data set and fit the empty model
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# save the predictions from the TinyEmpty.model
TinyFingers$Prediction <- predict(TinyEmpty.model)

# this prints TinyFingers
TinyFingers

Hint: Use the predict() function on TinyEmpty.model.

Thinking About Error

We have developed the idea of the mean being the simplest (or empty) model of the distribution of a quantitative variable. If we connect this idea to our general formulation DATA = MODEL + ERROR, we can rewrite the statement as:

DATA = MEAN + ERROR

If this is true, then we can calculate the error in our data set by rearranging this equation to get the formula:

ERROR = DATA - MEAN

Using this formula, if someone has a thumb length larger than the mean (e.g., 68), then their error is a positive number (in this case, +6). If they have a thumb length lower than the mean (e.g., 61) then we can calculate their error as a negative number (e.g. -1).
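As a quick check of this arithmetic, here is a minimal sketch in base R that applies ERROR = DATA - MEAN to the six thumb lengths from the tiny data set. The names thumb_mean and error are made up for this illustration.

```r
# The six thumb lengths (in mm) from the tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)

# MEAN: the empty model's single prediction (hypothetical name: thumb_mean)
thumb_mean <- mean(Thumb)
thumb_mean

# ERROR = DATA - MEAN, computed for every data point at once
error <- Thumb - thumb_mean
error
```

Thumb lengths above the mean give positive errors and thumb lengths below it give negative errors, matching the two cases described above (68 gives +6, and 61 gives -1).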

We generally call these calculated errors the residuals, a name that indicates that they are “leftovers” from our data once we take out the model.

To find these errors (or residuals) you can just subtract the mean from each data point. In R we could just run this code to get the residuals:

TinyFingers$Thumb - TinyFingers$Prediction

The numbers in the output indicate, for each student in the data frame, what their residual is after subtracting out the model (which is the mean in this case).

Modify the following code to save these residuals in a new variable in TinyFingers called Residual.

require(mosaic)
require(ggformula)

# set up the tiny data set, fit the empty model, and save its predictions
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)

# save the residuals
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction

# this prints TinyFingers
TinyFingers

Hint: Subtract Prediction from Thumb.

These residuals (or “leftovers”) are so important in modeling that there is an even easier way to get them in R. Again, we will use the results of our model fit, which we saved in the R object TinyEmpty.model:

resid(TinyEmpty.model)

Notice that we get the same numbers. But instead of specifying the data and the model’s predictions, we just tell R which model to get the residuals from.

Modify the following code to save the residuals that we get using the resid() function in the TinyFingers data frame. Give the resulting variable a new name: easyResidual.

require(mosaic)
require(ggformula)

# set up the tiny data set, fit the empty model, and save its predictions
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction

# save the residuals (calculated the easy way)
TinyFingers$easyResidual <- resid(TinyEmpty.model)

# this prints TinyFingers
TinyFingers

Hint: Have you used the resid() function?

Note that the variables Residual and easyResidual are identical. This makes sense; you just used different methods to get the residuals.
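One way to confirm that the two variables match, rather than comparing them by eye, is sketched below in base R. It rebuilds the tiny data set so it runs on its own. The name same is made up for this illustration, and the choice of all.equal() for the comparison is an assumption of this sketch, not something the course prescribes.

```r
# Rebuild the tiny data set and the empty model
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID = 1:6, Thumb = Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# Residuals computed two ways
TinyFingers$Prediction <- predict(TinyEmpty.model)
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction
TinyFingers$easyResidual <- resid(TinyEmpty.model)

# all.equal() compares numbers within a small tolerance, which is safer
# than == for values produced by computer arithmetic.
# check.attributes = FALSE ignores any names attached to the vectors.
same <- isTRUE(all.equal(TinyFingers$Residual,
                         TinyFingers$easyResidual,
                         check.attributes = FALSE))
same
```

If same is TRUE, the by-hand subtraction and resid() agree for every student.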

Here we’ve plotted histograms of the three variables: Thumb, Prediction, and Residual.

The distributions of the data and the residuals have the same shape. But the numbers on the x-axis differ across the two distributions. The distribution of Thumb is centered at the mean (62), whereas the distribution of Residual is centered at 0. Data that are smaller than the mean (such as a thumb length of 58) have negative residuals (-4) but data that are larger than the mean (such as 73) have positive residuals (11).

Print out the favstats() for Thumb and Residual in the TinyFingers data frame.

require(mosaic)
require(ggformula)

# set up the tiny data set, fit the empty model, and save predictions and residuals
StudentID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
TinyFingers <- data.frame(StudentID, Thumb)
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
TinyFingers$Prediction <- predict(TinyEmpty.model)
TinyFingers$Residual <- TinyFingers$Thumb - TinyFingers$Prediction

# get the favstats for Thumb
favstats(~Thumb, data = TinyFingers)

# get the favstats for Residual
favstats(~Residual, data = TinyFingers)

Hint: Don't forget to use ~Thumb and ~Residual.

R is very precise, so sometimes it will give you outputs like this value for the mean of the residuals: -1.421085e-14. The e-14 part indicates that this is a number pretty close to zero: the -14 means that the decimal point is shifted to the left 14 places! So, the mean of Residual is actually -.00000000000001421085. Whenever you see this scientific notation with a large negative number after the “e”, you can just read it as “zero,” or pretty close to zero.
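To see the same number both ways, here is a minimal sketch in base R. The names tiny, shifted, and near_zero, and the cutoff of 1e-9 used to decide what counts as "close to zero", are assumptions made for this illustration.

```r
# The value from the text, typed in scientific notation
tiny <- -1.421085e-14

# Ask R to write it out without the "e" notation
format(tiny, scientific = FALSE)

# Shifting the decimal point 14 places to the left is dividing by 10^14
shifted <- -1.421085 / 10^14
isTRUE(all.equal(tiny, shifted))

# A simple check that the value is practically zero
# (1e-9 is an arbitrary cutoff chosen for this sketch)
near_zero <- abs(tiny) < 1e-9
near_zero
```

The value is not exactly 0, but it is far smaller than anything we could measure on a thumb, so treating it as 0 is reasonable.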

The residuals (or error) around the mean always sum to 0. Therefore, the mean of the errors will also always be 0, because 0 divided by n equals 0.
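This fact follows directly from the definition of the mean. A short derivation is sketched below. The notation is an assumption made here, since this section does not define symbols: $Y_i$ is the $i$-th data value, $\bar{Y}$ is the mean, and $n$ is the number of data values.

```latex
\begin{align*}
\bar{Y} &= \frac{1}{n}\sum_{i=1}^{n} Y_i
  \quad\Longrightarrow\quad \sum_{i=1}^{n} Y_i = n\bar{Y} \\
\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)
  &= \sum_{i=1}^{n} Y_i - n\bar{Y}
   = n\bar{Y} - n\bar{Y}
   = 0
\end{align*}
```

For the tiny data set, the residuals are -6, -2, -1, 1, 2, and 6, which add up to 0.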

Now that you have looked in detail at the tiny set of data, explore the predictions and residuals from the Empty.model fit earlier to the full set of Fingers data. Add them as new variables (Prediction and Residual) to the Fingers data frame.

require(mosaic)
require(ggformula)
require(supernova)

# here is the code you wrote before
Empty.model <- lm(Thumb ~ NULL, data = Fingers)

# generate predictions from the Empty.model
Fingers$Prediction <- predict(Empty.model)

# generate residuals from the Empty.model
Fingers$Residual <- resid(Empty.model)

# this will print out a few lines of Fingers
head(select(Fingers, Thumb, Prediction, Residual), 10)

Hint: Use the predict() and resid() functions.

Then make histograms of the variables Thumb, Prediction, and Residual. Which two histograms will have a similar shape?

require(mosaic)
require(ggformula)
require(supernova)

# fit the empty model and save its predictions and residuals
Empty.model <- lm(Thumb ~ NULL, data = Fingers)
Fingers$Prediction <- predict(Empty.model)
Fingers$Residual <- resid(Empty.model)

# make histograms for Thumb, Prediction, and Residual
gf_histogram(~Thumb, data = Fingers)
gf_histogram(~Prediction, data = Fingers)
gf_histogram(~Residual, data = Fingers)

Hint: Don't forget to make all three histograms!
