Introduction to Statistics: A Modeling Approach

Using the Regression Model to Make Predictions

The specific regression line we fit to our data, defined by its slope and intercept, is the one that fits our data best. By this we mean that the sum of squared deviations around this line is lower than it would be for any other line we could have used instead. We have reduced leftover error to the smallest level possible given our variables.
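To make the "lowest sum of squares" idea concrete, here is a small sketch. It does not use the real Fingers data: the data frame `sim` is simulated as a stand-in, and the helper `sse()` is a hypothetical name introduced only for this illustration. It compares the sum of squared deviations around the fitted line with the sum around two slightly different lines.

```r
# Simulated stand-in for the Fingers data (hypothetical values, not the real data set)
set.seed(10)
Height <- rnorm(150, mean = 66, sd = 4)
Thumb  <- -3.33 + 0.96 * Height + rnorm(150, sd = 8)
sim <- data.frame(Height, Thumb)

fit <- lm(Thumb ~ Height, data = sim)
b <- unname(coef(fit))   # b[1] = intercept, b[2] = slope

# Sum of squared deviations of the data around the line b0 + b1 * Height
sse <- function(b0, b1) sum((sim$Thumb - (b0 + b1 * sim$Height))^2)

sse_best    <- sse(b[1], b[2])          # the fitted regression line
sse_shifted <- sse(b[1] + 1, b[2])      # same slope, intercept nudged up
sse_tilted  <- sse(b[1], b[2] + 0.05)   # same intercept, slope nudged up

c(best = sse_best, shifted = sse_shifted, tilted = sse_tilted)
```

Whichever way the line is nudged, the sum of squares goes up, which is what "best fitting" means here.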

This regression model is also our best estimate of the relationship between height and thumb length in the population. As with other models of the population, we can use the regression model to predict future observations. To do so we must turn it into a function, one that will predict thumb length based on height.

Here is the fitted model for the complete Fingers data set:

\[Y_{i}= -3.33+.96 X_{i}+e_{i}\]


Remember, a function takes in some input and spits out a prediction based on the model. Here is the function we can use to predict future scores:

\[\hat{Y}_{i}= -3.33+.96 X_{i}\]

With the group model it was easy to make predictions from the model: no calculation was required to see that if the person was short, the prediction would be the mean for short people, and if the person was tall, the mean for tall people. But with the regression model, a calculator would be required to predict thumb length based on height in inches.

We could always use R like a regular calculator, but we would have to type in the numbers ourselves. For example, to predict the thumb length of someone who is 70 inches tall, we would type this into R:

-3.3295 + .9619 * 70
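Worked out by hand, using only the numbers in the line above, that calculation is:

\[-3.3295 + .9619 \times 70 = -3.3295 + 67.333 = 64.0035\]

So the model predicts a thumb length of about 64.0 for a person who is 70 inches tall.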

But it is easy to make mistakes when retyping the estimates we got from lm(). And because these estimates are rounded (to keep the output from looking too crazy), the prediction is less precise than it could be. If we let R make the calculation directly from the model, the prediction would be more precise.
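The effect of rounding can be sketched as follows. The data here are simulated as a stand-in (the real Fingers file is not used), so the estimates will not match the ones above, and rounding to two decimal places is an assumption chosen to mirror how the model equation is written.

```r
# Simulated stand-in for the Fingers data (hypothetical values, not the real data set)
set.seed(10)
Height <- rnorm(150, mean = 66, sd = 4)
Thumb  <- -3.33 + 0.96 * Height + rnorm(150, sd = 8)
sim <- data.frame(Height, Thumb)

fit <- lm(Thumb ~ Height, data = sim)

b_full    <- unname(coef(fit))   # estimates at full precision
b_rounded <- round(b_full, 2)    # what we might retype by hand

# Prediction for a person who is 70 inches tall, computed both ways
pred_full    <- b_full[1] + b_full[2] * 70
pred_rounded <- b_rounded[1] + b_rounded[2] * 70

c(full = pred_full, rounded = pred_rounded, difference = pred_full - pred_rounded)
```

The rounding error in the slope gets multiplied by the height, so even a small rounding difference in the estimates can show up in the prediction.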

To do this, we first need to make the fitted model into a function in R. We do this using the makeFun() function.

<- makeFun(Height.model)

Second, we call the function to make the prediction. Here, we put in the value for Height (70) and the function will return the prediction of thumb length for a person with this particular height.
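As a rough sketch of these two steps, the code below builds a prediction function from a fitted model and then calls it with a height of 70. Several things here are assumptions: the data are simulated rather than the real Fingers data, `predict_thumb` is a hypothetical base R stand-in that does the same job as the function makeFun() returns, and `thumb_fun` is a hypothetical name for the makeFun() result. The makeFun() part runs only if the mosaic package is installed.

```r
# Simulated stand-in for the Fingers data (hypothetical values, not the real data set)
set.seed(10)
Height <- rnorm(150, mean = 66, sd = 4)
Thumb  <- -3.33 + 0.96 * Height + rnorm(150, sd = 8)
sim <- data.frame(Height, Thumb)

fit <- lm(Thumb ~ Height, data = sim)

# Step 1 (base R stand-in): turn the fitted model into a function of Height
predict_thumb <- function(Height) {
  unname(predict(fit, newdata = data.frame(Height = Height)))
}

# Step 2: call the function with a height to get the predicted thumb length
predict_thumb(70)
predict_thumb(c(60, 65, 70))   # several heights at once

# The same two steps with makeFun(), if mosaic is available
if (requireNamespace("mosaic", quietly = TRUE)) {
  thumb_fun <- mosaic::makeFun(fit)   # hypothetical name for the new function
  thumb_fun(Height = 70)
}
```

Either way, the function uses the estimates stored in the model at full precision, so nothing has to be retyped.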

The function is fine for making individual predictions but to check our model against the data, we would want to generate predictions for each student in the Fingers data frame. As we’ve said before, we really don’t need predictions when we already know their actual thumb lengths. But this is a way to see how well (or how poorly) the model would have predicted the thumb lengths for the students in our data set.

We will use the predict() function, which you have used before, to make a new variable with the predictions based on Height.model. We’ll save those predictions as Height.pred.

Fingers$Height.pred <- predict(Height.model)

Then we’ll print out the first 10 rows of the data frame, showing only the variables Thumb, Height, and Height.pred (the predicted thumb length from the Height model).

head(select(Fingers, Thumb, Height, Height.pred), 10)
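Here is a self-contained sketch of the same idea on simulated data (a stand-in, not the real Fingers data). It uses base R column selection in place of select(), purely so the example runs without extra packages. It also checks two facts about the saved predictions: each one is the intercept plus the slope times that row's Height, and the gap between actual and predicted values is the model's residual.

```r
# Simulated stand-in for the Fingers data (hypothetical values, not the real data set)
set.seed(10)
Height <- rnorm(150, mean = 66, sd = 4)
Thumb  <- -3.33 + 0.96 * Height + rnorm(150, sd = 8)
sim <- data.frame(Height, Thumb)

fit <- lm(Thumb ~ Height, data = sim)

# One prediction per row, saved as a new variable
sim$Height.pred <- predict(fit)

# First 10 rows: actual thumb length, height, and predicted thumb length
head(sim[, c("Thumb", "Height", "Height.pred")], 10)

# Each prediction is b0 + b1 * Height for that row
b <- unname(coef(fit))
by_hand <- b[1] + b[2] * sim$Height

# Actual minus predicted is the leftover error (the residual)
leftover <- sim$Thumb - sim$Height.pred
```

Comparing the Thumb and Height.pred columns row by row shows how far off the model would have been for each person.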


We’ve added the code to calculate Height.pred in the DataCamp window below. Add code that uses gf_point() to create a scatter plot of Height.pred (y-axis) by Height (x-axis).

require(mosaic)
require(ggformula)
Fingers <- read.csv(file="", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
Height.model <- lm(Thumb ~ Height, data = Fingers)

# this creates predicted thumb lengths from Height.model
Fingers$Height.pred <- predict(Height.model)

# write code to create a scatter plot of Height.pred by Height
gf_point(Height.pred ~ Height, data = Fingers)