Course Outline

segmentGetting Started (Don't Skip This Part)

segmentStatistics and Data Science: A Modeling Approach

segmentPART I: EXPLORING VARIATION

segmentChapter 1  Welcome to Statistics: A Modeling Approach

segmentChapter 2  Understanding Data

segmentChapter 3  Examining Distributions

segmentChapter 4  Explaining Variation

segmentPART II: MODELING VARIATION

segmentChapter 5  A Simple Model

segmentChapter 6  Quantifying Error

segmentChapter 7  Adding an Explanatory Variable to the Model

7.2 Fitting a Model with an Explanatory Variable

segmentChapter 8  Models with a Quantitative Explanatory Variable

segmentPART III: EVALUATING MODELS

segmentChapter 9  Distributions of Estimates

segmentChapter 10  Confidence Intervals and Their Uses

segmentChapter 11  Model Comparison with the F Ratio

segmentChapter 12  What You Have Learned

segmentFinishing Up (Don't Skip This Part!)

segmentResources
list Introduction to Statistics: A Modeling Approach
7.2 Fitting a Model With an Explanatory Variable
Now that you have learned how to specify a model with an explanatory variable, let’s learn how to fit the model using R.
Fitting a model, as a reminder, simply means calculating the parameter estimates. We use the word “fitting” because we want to calculate the best estimate, the one that will result in the least amount of error. For the tiny data set, we could calculate the parameter estimates in our head—it’s just a matter of calculating the mean for males and the mean for females. But when the data set is larger, it is much easier to use R.
Using R, we will first fit the Sex
model to the tiny data set, just so you can see that R gives you the same parameter estimates you got before. After that we will fit it to the complete data set.
Note that the parts that are going to be different for each person (\(X_{i}\) and \(Y_{i}\)) are called variables (because they vary)! \(e_i\) also varies, but we typically reserve the label “variable” for outcome and explanatory variables. The parts that are going to be the same for each person (\(b_{0}\) and \(b_{1}\)) are called parameter estimates.
We do not need to estimate the variables. Each student in the data set already has a score for the outcome variable (\(Y_{i}\)) and the explanatory variable (\(X_{i}\)), and these scores vary across students. Notice that the subscript \(i\) is attached to the parts that are different for each person.
We do need to estimate the parameters because, as discussed previously, they are features of the population, and thus are unknown. The parameter estimates we calculate are those that best fit our particular sample of data. But we would have gotten different estimates if we had a different sample. Thus, it is important to keep in mind that these estimates are only that, and they are undoubtedly wrong. Calling them estimates keeps us humble!
Parameter estimates don’t vary from person to person, so they don’t carry the subscript \(i\).
Fitting the Sex Model to the Tiny Data Set
We will refer to this more complex model (more complex than the empty model, that is) as the Sex
model. It has one explanatory variable, Sex
. We will fit the model using R’s lm()
(linear model) function.
To fit the model we run this R code, and get the results below:
lm(Thumb ~ Sex, data=TinyFingers)
Call:
lm(formula = Thumb ~ Sex, data = TinyFingers)
Coefficients:
(Intercept) Sexmale
59 6
Note that the estimates are exactly what you should have expected: the first estimate, for \(b_{0},\) is 59 (the mean for females); the second, \(b_{1}\) is 6, which is the number of millimeters you need to add to the female average thumb length to get average male thumb length.
Notice that the estimate for \(b_{0}\) is labeled “intercept” in the output. You have encountered the concept of intercept before, when you studied the concept of a line in algebra. Remember the equation for a line? \(y=mx+b\). \(m\) represents the slope of the line, and \(b\), the yintercept. The General Linear Model notation is similar to this, though it includes error, whereas the equation for a line does not.
The reason the estimate for \(b_{0}\) is called Intercept is because it is the estimate for thumb length when \(X_{i}\) is equal to 0—in other words, when sex is female. The estimate that R called “Sexmale,” by this line of reasoning, is kind of like the slope of a line. It is the increment in thumb length for a unit increase in \(X_{i}\).
If you want—and it’s a good idea—you can save the results of this model fit in an R object. Here’s the code to save the model fit in an object called TinySex.model
:
TinySex.model < lm(Thumb ~ Sex, data=TinyFingers)
Once you’ve saved the model, If you want to see what the model estimates are, you can just type the name of the model and you will get the same output as above:
TinySex.model
Call:
lm(formula = Thumb ~ Sex, data = TinyFingers)
Coefficients:
(Intercept) Sexmale
59 6
Now that we have estimates for the two parameters, we can put them in our model statement to yield: \(Y_{i} = 59 + 6 X_{i}+e_{i}\).
How Does R Know to Make Female = 0?
The answer to this question is: R doesn’t know; it’s just taking whatever group comes first and making it 0, then taking the next group and making it 1. The female
group comes first in the variable Sex
because it is alphabetically before male
. It doesn’t really matter, though it will affect the interpretation of the parameter estimate.
Let’s say, just for fun, that you coded male as 0 and female as 1 in your data set. In this case, the best fitting value for \(b_0\) would be the mean for males, which is 65 mm. Because males are coded as 0, \(X_i\) would be equal to 0, so you wouldn’t add the second parameter estimate (\(b_1 * 0=0\)).
For females, \(X_i\) would be equal to 1, which means \(b_1\) would be added on to 65 to get the model prediction for females. Notice, however, that now the parameter estimate for \(b_1\) would be negative. 65 plus (6) would yield the mean for females, which is 59.
As long as R knows that a variable is categorical (e.g., a Factor), it doesn’t really care how you code it. You can code Sex
with any numbers you choose (e.g., 1 and 2 or 0 and 100), or with any words you choose (e.g, male and female). However you code it, internally R will recode the variable as 0 and 1.
Fitting the Sex Model to the Complete Data Set
Now that you have looked in detail at the tiny set of data, find the best estimates for our bigger set of data (found in the data frame called Fingers
) by modifying the code below.
require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)
# store the model where Sex predicts Thumb
Sex.model <
# this prints out the model estimates
Sex.model
Sex.model < lm(Thumb ~ Sex, data = Fingers)
Sex.model
ex() %>% check_object("Sex.model") %>% check_equal()
lm()
to create a model of Sex
from the Fingers
data frame
Call:
lm(formula = Thumb ~ Sex, data = Fingers)
Coefficients:
(Intercept) Sexmale
58.256 6.447