Introduction to Statistics: A Modeling Approach

Specifying the Model

Reviewing the Empty Model

In the previous modules we introduced the idea of a statistical model as a single number. We developed what we called the empty model, which we wrote like this (in GLM notation):

\[Y_{i} = b_{0} + e_{i}\]

In the case of thumb length, this model states that the DATA (each data point, represented as \(Y_{i}\), each person's thumb length) can be thought of as having two components: the mean thumb length for everyone, usually called the Grand Mean (the MODEL, represented as \(b_{0}\)), plus each person's residual from the model (the ERROR, represented by \(e_{i}\)).

(Note: now that we are going to have different means for males and females, for example, we will use the term Grand Mean to make clear when we are referring to the mean for everyone in the sample.)

When we use the notation of the General Linear Model, we must define the meaning of each symbol in context. \(Y_{i}\), for example, could mean lots of different things, depending on what our outcome variable is. But we will always use it to represent the outcome variable.

It’s useful to illustrate the null model (or empty model) with our tiny data set called TinyFingers. Remember, it is a data set with six people’s thumb lengths, three males and three females. It was randomly selected from our complete Fingers data set.


Let’s look at just these six thumb lengths the same way we did above, with two histograms (left panel, below). It’s actually a little clearer, with such a small data set, if we represent the same data as a jitter plot (right panel).


In the jitter plot, the Grand Mean of the distribution ignoring sex is represented by the blue line. Under the null model, each person’s score would be modeled with the Grand Mean (which is 62). So each person’s error from the model is represented by how far their thumb length is above or below 62.
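To make the arithmetic concrete, here is a quick sketch in Python (the course itself uses R). The six thumb lengths below are hypothetical values chosen only so that the Grand Mean comes out to 62, as in the text; the actual TinyFingers values are not reproduced here.

```python
# Hypothetical thumb lengths (mm): three males, then three females.
# Chosen so the Grand Mean is 62, matching the text; not the real TinyFingers data.
thumb = [63, 65, 67, 57, 59, 61]

grand_mean = sum(thumb) / len(thumb)      # b0 in the empty model
errors = [y - grand_mean for y in thumb]  # e_i = Y_i - b0

print(grand_mean)   # 62.0
print(errors)       # how far each thumb length is above or below 62
```

One useful check: residuals from the mean always sum to zero, no matter what the data are.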

Adding an Explanatory Variable to the Model

Now let’s add an explanatory variable, Sex, into the model. In the Sex model, which includes sex as an explanatory variable, we model the variation with two numbers: the mean for males (65), and the mean for females (59). So, everyone still gets one number as a model of his or her thumb length, but now males get a different number than females.

Error is still measured the same way, as the deviation of each person’s measured thumb length from their predicted thumb length. But this time, the error is calculated from each person’s group mean (male or female) instead of from the Grand Mean (see figure above).


Whereas the empty model was a one-parameter model (we only had to estimate one parameter, the mean), the Sex model is a two-parameter model. One of the parameters is the mean for males, the other is the mean for females.
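The same hypothetical six values (chosen to match the group means of 65 and 59 stated in the text) can illustrate the two-parameter model in a Python sketch (the course itself uses R):

```python
# Hypothetical thumb lengths (mm) consistent with the group means in the text.
males = [63, 65, 67]
females = [57, 59, 61]

male_mean = sum(males) / len(males)        # one parameter: 65.0
female_mean = sum(females) / len(females)  # the other parameter: 59.0

# Error is now each score's deviation from its OWN group's mean,
# not from the Grand Mean.
male_errors = [y - male_mean for y in males]
female_errors = [y - female_mean for y in females]
print(male_errors, female_errors)
```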

There are actually a few ways you could write this model; we will write it like this:

\[Y_{i} = b_{0} + b_{1}X_{i} + e_{i}\]
In this equation, \(Y_{i}\) is still the thumb length for person i (DATA), and \(e_{i}\) is still the error for person i (ERROR), i.e., the deviation of person i's thumb length from the thumb length predicted by the model. The part of the equation between DATA and ERROR (\(b_{0}+b_{1}X_{i}\)) is the MODEL part of DATA = MODEL + ERROR, and it requires a bit of unpacking.


Two Interpretations of the GLM Notation for a Two-Group Model

The model statement \(b_{0}+b_{1}X_{i}\) actually has two interpretations, each representing a different way of thinking about the two-parameter model.


Under both interpretations, \(b_{0}+b_{1}X_{i}\) needs to result in each person’s predicted thumb length—the number that when added to their error (\(e_{i}\)) will equal their actual score. In this two-group model, there should be two possible predicted thumb lengths: one for males and one for females.

Interpretation #1: Grand Mean Plus Deviation

Under the first interpretation, \(b_{0}\) represents the Grand Mean (62 mm); in other words, it has the same meaning as in the empty model, which includes only the mean. To go from the Grand Mean to the prediction based on Sex, \(b_{1}X_{i}\) will need to represent the deviation of the group mean from the Grand Mean.

To make the model work under this interpretation, \(X_{i}\), which represents the variable Sex, will have to be coded +1 for males and -1 for females. So, if someone is male, \(b_{1}\) represents how far the mean for males is above the Grand Mean; and if someone is female, \(b_{1}\) represents how far the mean for females is below the Grand Mean. Because there are only two groups, the means of the two groups (males and females) will be symmetrical around the Grand Mean.
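A least-squares fit with this +1/-1 coding recovers exactly these meanings. The sketch below uses Python and NumPy (the course itself uses R), with the same hypothetical thumb lengths chosen to match the means in the text:

```python
import numpy as np

# Hypothetical thumb lengths (mm) consistent with the text's means.
thumb = np.array([63, 65, 67, 57, 59, 61], dtype=float)
# Effect coding: +1 for the three males, -1 for the three females.
x = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# Design matrix for Y_i = b0 + b1*X_i + e_i
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, thumb, rcond=None)[0]

print(b0)                # 62.0 -- the Grand Mean
print(b1)                # 3.0  -- each group mean's deviation from the Grand Mean
print(b0 + b1, b0 - b1)  # 65.0 59.0 -- predicted male and female means
```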

Note that this does not mean that you have to code the variable in the data frame in any particular way. The software takes care of that for you. You can code Sex with any numbers you choose (e.g., 1 and 2 or 0 and 100). But if the data analysis software fits the model in this way, it will treat sex as coded +1 and -1.

Under this model, as expected, males would all end up with one predicted thumb length (65, the mean for males, or \(b_{0}+b_{1}\)), and females would all end up with a different predicted thumb length (59, the mean for females, or \(b_{0}-b_{1}\)).


We can rewrite the model statement substituting the parameter estimates for the parameters \(b_{0}\) and \(b_{1}\) like this:

\[Y_{i} = 62 + 3X_{i} + e_{i}\]
Interpretation #2: Group One Plus Increment

Argh, we can hear you say! Why do we need to know a second interpretation of the model statement \(b_{0}+b_{1}X_{i}\)? The answer is that it is good for you! One of the important advantages of mathematical notation is that it can be interpreted in different ways depending on what is most useful at the time. So, one of the things you need to learn is to use notation flexibly.

Under the second interpretation of this model statement, \(b_{0}\) does NOT represent the Grand Mean (62), but instead represents the mean for females, which is 59.


It works like this: \(X_{i}\) still represents Sex, but this time it is coded 0 for females and 1 for males (instead of -1 for females and +1 for males). So, if someone is female, she would be coded 0 for \(X_{i}\). With \(X_{i}\) coded as 0, \(b_{1}X_{i}\) would become 0, which means the model would simply assign her a predicted thumb length of \(b_{0}\), which is 59, the mean for females.

If someone is male, he would be coded 1 for \(X_{i}\).


Under this interpretation of the model, \(b_{1}\) would represent the difference, or increment, in means between the females and males. If \(b_{0}\) is the mean for females, \(b_{1}\) (which is multiplied by 1 if someone is male), would be the number that, when added to \(b_{0},\) would result in the mean for males.

If we use this second interpretation, here is how we would rewrite the model to include the estimated parameters:

\[Y_{i} = 59 + 6X_{i} + e_{i}\]
So, if someone is male, he would be coded 1 for \(X_{i}\). With \(X_{i}\) coded as 1, \(b_{1}X_{i}\) would be \(6 \times 1\), which means the model would simply assign him a predicted thumb length of \(b_{0}\) (which is 59), plus 6. This adds up to 65, the mean for males.
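Fitting the same hypothetical data with this 0/1 coding (a Python/NumPy sketch; the course itself uses R) recovers the second set of meanings:

```python
import numpy as np

# Hypothetical thumb lengths (mm) consistent with the text's means.
thumb = np.array([63, 65, 67, 57, 59, 61], dtype=float)
# Dummy coding: 1 for the three males, 0 for the three females.
x = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Design matrix for Y_i = b0 + b1*X_i + e_i
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, thumb, rcond=None)[0]

print(b0)       # 59.0 -- the female mean
print(b1)       # 6.0  -- the increment from the female mean to the male mean
print(b0 + b1)  # 65.0 -- the male mean
```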


Summary of the Two Interpretations

We have summarized these two interpretations of the model statement \(b_{0}+b_{1}X_{i}\) in the table below. Note the rows where the interpretations differ.

| | Interpretation #1 | Interpretation #2 |
|---|---|---|
| Meaning of \(b_{0}\) | Grand Mean (62) | Mean for females (59) |
| Coding of \(X_{i}\) | +1 for males, -1 for females | 1 for males, 0 for females |
| Meaning of \(b_{1}\) | Deviation of each group mean from the Grand Mean (3) | Increment from the female mean to the male mean (6) |
| Predicted thumb length for females | \(b_{0}-b_{1}=59\) | \(b_{0}=59\) |
| Predicted thumb length for males | \(b_{0}+b_{1}=65\) | \(b_{0}+b_{1}=65\) |

In general, for purposes of this course, we will go with the second interpretation, with \(b_{0}\) representing the mean of one of the groups. We do this because R’s lm() function is designed to work this way.

Although it’s important for you to understand why \(X_{i}\) would have to be coded differently under the two interpretations, you won’t really have to do that coding in this course. R will figure it out for you. You just need to understand which approach R is taking or else you will misinterpret the parameter estimates.
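The key point, sketched below in Python with the same hypothetical data (the course itself uses R), is that the two codings produce different parameter estimates but identical predicted scores. Which pair of estimates your software reports depends on the coding it uses, which is why you need to know which approach it takes:

```python
import numpy as np

# Hypothetical thumb lengths (mm) consistent with the text's means.
thumb = np.array([63, 65, 67, 57, 59, 61], dtype=float)

def fit(codes):
    """Fit Y_i = b0 + b1*X_i + e_i by least squares; return estimates and predictions."""
    X = np.column_stack([np.ones(6), codes])
    b = np.linalg.lstsq(X, thumb, rcond=None)[0]
    return b, X @ b

b_effect, pred_effect = fit(np.array([1, 1, 1, -1, -1, -1], dtype=float))
b_dummy, pred_dummy = fit(np.array([1, 1, 1, 0, 0, 0], dtype=float))

print(b_effect)  # [62.  3.] -- Grand Mean and deviation
print(b_dummy)   # [59.  6.] -- female mean and increment
print(np.allclose(pred_effect, pred_dummy))  # True: same predicted scores either way
```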

Note that we are broadening our definition of error from the way we thought of it for the empty model. For the empty model, error was the residual from the mean (i.e., the Grand Mean). Now we need to expand our thinking a bit, seeing error as the residual from the predicted score (\(\hat{Y}\)) under the model, not just as the residual from the mean (\(\bar{Y}\)).

Of course, under both models (empty and two-group) the error is the residual from the predicted score under the model. It just so happens that in the empty model, the predicted score was the Grand Mean. In the two-group model, the error is the residual from the male mean if you are male, or from the female mean if you are female.
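In code, this broadened definition looks the same under both models: subtract the predicted score, whatever the model predicts. A small Python sketch with the hypothetical data used above:

```python
# Hypothetical thumb lengths (mm) and sex labels, matching the means in the text.
thumb = [63, 65, 67, 57, 59, 61]
sex = ["M", "M", "M", "F", "F", "F"]

# Empty model: every prediction is the Grand Mean (62).
predicted_empty = [62.0 for _ in thumb]
# Sex model: each prediction is the person's group mean (65 or 59).
predicted_sex = [65.0 if s == "M" else 59.0 for s in sex]

# Error is always residual = observed - predicted, under either model.
resid_empty = [y - p for y, p in zip(thumb, predicted_empty)]
resid_sex = [y - p for y, p in zip(thumb, predicted_sex)]
print(resid_empty)  # deviations from the Grand Mean
print(resid_sex)    # deviations from each group's own mean
```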

No matter how complex our models become, error is always defined as the residual from the predicted score for each data point.