Introduction to Statistics: A Modeling Approach

Extending to a Three-Group Model

You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.

First, a New Two-Group Model

Let’s use a new explanatory variable to explain variation in thumb length: height. Of course height, in our data set, is a quantitative variable measured in inches. But let’s make a new variable that turns height into a categorical variable with two categories: short and tall. We can do this easily in R. Name the variable Height2Group.

We will use the ntile() function to cut a quantitative variable up into groups, and then use head() and select() to look at the first 10 rows of the relevant variables:

Fingers$Height2Group <- ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)
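To get a feel for what ntile() returns, here is a minimal sketch on six made-up heights. The `heights` vector is hypothetical and not taken from Fingers, and the sketch assumes the dplyr package (which provides ntile()) is installed.

```r
library(dplyr)  # ntile() comes from dplyr

# Hypothetical heights in inches (illustration only, not from Fingers)
heights <- c(62, 70, 65, 68, 60, 73)

# Rank the values and split them into 2 equal-sized groups:
# the three shortest get a 1, the three tallest get a 2
groups <- ntile(heights, 2)
groups
```

Each person keeps their original position in the vector, so the group codes line up row by row with the heights they came from.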

Use the factor() function to modify this variable so that the 1’s are labeled as “short” and the 2’s are labeled as “tall”. Just a reminder, use the larger data frame Fingers.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# a reminder of how we used factor() before:
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))

# label the 1s as "short" and the 2s as "tall"
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# this prints out 10 rows of Fingers for the selected columns
head(select(Fingers, Thumb, Height, Height2Group), 10)

Don't forget to select the Height2Group column from the Fingers data frame when labelling the levels.

Using the same approach we used for sex, we can write the model for Height2Group like this:

\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]
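As a rough illustration of how the two estimates in this model relate to the group means, here is a small sketch on made-up data. The six thumb lengths and the `toy` data frame are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical thumb lengths in mm for six people (illustration only)
toy <- data.frame(
  Thumb = c(55, 57, 59, 61, 63, 65),
  Height2Group = factor(rep(c("short", "tall"), each = 3),
                        levels = c("short", "tall"))
)

# Group means: short = 57, tall = 63
group_means <- tapply(toy$Thumb, toy$Height2Group, mean)

# Fit the two-group model; R dummy codes the factor (short = 0, tall = 1)
toy_model <- lm(Thumb ~ Height2Group, data = toy)
estimates <- coef(toy_model)

group_means
estimates  # b0 is the short mean; b1 is the tall mean minus the short mean
```

In this toy case the first estimate equals the mean of the short group and the second equals the difference between the two group means.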

Go ahead and fit the Height2Group model and take a look at the parameter estimates.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# label the groups, as in the previous step, so R treats Height2Group as categorical
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# fit a model for Thumb ~ Height2Group
Height2Group.model <- lm(formula = Thumb ~ Height2Group, data = Fingers)

# this prints out the estimates
Height2Group.model

Don't forget to set formula = Thumb ~ Height2Group and data = Fingers.

Now go ahead and run supernova() to print the complete ANOVA table for the Height2Group.model.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# label the groups, as in the earlier step, so R treats Height2Group as categorical
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
Height2Group.model <- lm(formula = Thumb ~ Height2Group, data = Fingers)

# run supernova to print the ANOVA table for Height2Group.model
supernova(Height2Group.model)

Use the supernova() function from previous activities.

Let’s now compare the Sex model with the Height2Group model. Both of these are two-group models, and both have the same outcome variable (Thumb). What differs is the explanatory variable (Sex vs. Height2Group). We’ve pasted in the supernova table for both models below:

A Three-Group Model

Now let’s try this same approach with three height groups: short, medium, and tall.

Revise the code below to make a new variable called Height3Group that divides height into three categories, each with an equal number of students. Label the levels (1,2,3) as short, medium, and tall.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")

# the Height2Group code, modified to create 3 Height groups with the labels "short", "medium", and "tall"
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))

# this code prints out 10 rows of Fingers for selected columns
head(select(Fingers, Thumb, Height, Height3Group), 10)

Don't forget to change the column name from Height2Group to Height3Group.

Calculate and print out the group means of Thumb length for the three height groups.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))

# use favstats() to print the group means of Thumb length for the three height groups
favstats(Thumb ~ Height3Group, data = Fingers)

Type ?favstats to check which arguments it takes.

Fitting the Height3Group Model

Now fit the Height3Group model to the data, and print out the model estimates.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))

# fit the model
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)

# this prints out the estimates
Height3Group.model

Have you created a model that shows Thumb as a function of Height3Group?

The three-group model is written like this using General Linear Model notation:

\[Y_{i}=b_{0}+b_{1}X_{1i}+b_{2}X_{2i}+e_{i}\]

Whereas fitting the two-group model involved estimating two parameters (\(b_{0}\) and \(b_{1}\)), the three-group model adds a third parameter (\(b_{2}\)).

Interpreting the Height3Group Model

\(b_{0}\) is the mean of the short group. \(b_{1}\) is the increment you have to add to the short group to get the mean of the medium group. And \(b_{2}\) is the increment you have to add to the short group to get the mean of the tall group.
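This interpretation can be checked on a small made-up example. The sketch below uses nine hypothetical thumb lengths (the `toy3` data frame is not part of Fingers) and rebuilds each group mean from the fitted estimates.

```r
# Hypothetical thumb lengths in mm, three people per height group
toy3 <- data.frame(
  Thumb = c(54, 56, 58, 59, 60, 61, 62, 64, 66),
  Height3Group = factor(rep(c("short", "medium", "tall"), each = 3),
                        levels = c("short", "medium", "tall"))
)

# Group means: short = 56, medium = 60, tall = 64
means3 <- tapply(toy3$Thumb, toy3$Height3Group, mean)

# Fit the three-group model and pull out b0, b1, and b2
b <- coef(lm(Thumb ~ Height3Group, data = toy3))

# Rebuild each group mean from the estimates
rebuilt <- c(short  = b[[1]],
             medium = b[[1]] + b[[2]],
             tall   = b[[1]] + b[[3]])

means3
rebuilt
```

Because short is the first level of the factor, it serves as the reference group, and both increments are measured from its mean.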

We can substitute the parameter estimates into the model, like this:

\[Y_{i}=56.07+4.15X_{1i}+8.02X_{2i}+e_{i}\]

Just as before, it is useful to think through exactly how the X variables are coded. Notice, first, that we now have two of these in the model: \(X_{1i}\) and \(X_{2i}\). The sub-1 and sub-2 just distinguish between the two variables; instead of giving them different names, we call them X-sub-1 and X-sub-2.

The sub-i indicates these are not parameters, but variables, which means that each individual in the data set will have their own scores on the two variables. As before, it’s a little tricky to figure out what all the possible scores are on these two variables, and also how scores are assigned for each individual.

R doesn’t necessarily use the same numbers you do to code a variable. For the Height3Group model we put in a single categorical explanatory variable (Height3Group, with levels 1 representing short, 2 representing medium, and 3 representing tall). But R turns this one variable into two new variables, \(X_{1}\) and \(X_{2}\), both of which are “dummy coded,” which means each can have a value of either 0 or 1 for each person in the data set.

So here is how it works: For someone in the short group, the model needs to assign them a score of 56.07, the mean for the short group. You can think of \(X_{1}\) as a variable asking, “Is this person medium?” and 0 means no and 1 means yes. By the same reasoning, \(X_{2}\) represents whether someone is tall or not. For short people, \(X_{1}\) and \(X_{2}\) are both 0 because they are not medium and not tall.

For the people in the medium group, \(X_{1}\) should be 1 (because they are in the medium group), and \(X_{2}\) should be 0 (because they are not in the tall group). So the model will give them a predicted thumb length of 56.07 + 4.15 which is equal to 60.22 mm.

And notice from favstats that the average thumb length of the medium group is 60.22!
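The same arithmetic can be written out for all three groups. In this sketch, `predict_thumb()` is a hypothetical helper (not a function from any package) that plugs dummy codes into the fitted model, leaving out the error term.

```r
# Parameter estimates as reported in the text
b0 <- 56.07
b1 <- 4.15
b2 <- 8.02

# Hypothetical helper: the model's prediction for given dummy codes
predict_thumb <- function(x1, x2) b0 + b1 * x1 + b2 * x2

short_pred  <- predict_thumb(0, 0)  # 56.07
medium_pred <- predict_thumb(1, 0)  # 56.07 + 4.15 = 60.22
tall_pred   <- predict_thumb(0, 1)  # 56.07 + 8.02 = 64.09

c(short = short_pred, medium = medium_pred, tall = tall_pred)
```

Each prediction uses at most one increment, because a person can belong to only one height group.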

Dummy coding takes categorical variables and turns them into a series of binary codes. As you can see from the table below, just giving each person a 0 or 1 on \(X_{1}\) and \(X_{2}\) can uniquely categorize them as short, medium, or tall.

Group    X1   X2
short    0    0
medium   1    0
tall     0    1
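One way to look at the codes R generates is base R's model.matrix() function. The sketch below uses a hypothetical three-person factor, one person per group, rather than the Fingers data.

```r
# One person from each group (illustration only)
height_group <- factor(c("short", "medium", "tall"),
                       levels = c("short", "medium", "tall"))

# model.matrix() shows the columns R builds for a model formula:
# an intercept column plus one 0/1 column per non-reference level
codes <- model.matrix(~ height_group)
codes
```

The column ending in "medium" plays the role of \(X_{1}\) and the column ending in "tall" plays the role of \(X_{2}\); the short person has 0 in both.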

You may wonder why you need to go through all the details of how R assigns dummy codes for the categorical explanatory variable. It’s important because it gives you a very concrete understanding of how to interpret the model parameters. In this course, we don’t often ask you to calculate numbers on your own. Instead, we want you to focus on thinking about what, exactly, a number means. This will help you do that.

Examining the Three-Group Model Fit

You have already done the following: created the Height3Group categorical explanatory variable; examined the mean thumb lengths of students in each of the three groups; fit the Height3Group model using lm() and interpreted the model parameter estimates; and learned how to represent the three-group model using notation of the GLM.

The final step is to take a look at the ANOVA table so you can compare the fit of the Height3Group model to the empty model. Of course, you know how to do this using supernova(). Go ahead and get the ANOVA table for the Height3Group model.

require(mosaic)
require(ggformula)
require(supernova)
require(lsr)

Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)

# use supernova() to print the ANOVA table for Height3Group.model
supernova(Height3Group.model)

Have you used the supernova() function on Height3Group.model?

Here’s the ANOVA table for the Height3Group model. Just for comparison, we pasted in the table for the Height2Group model right above it.

In more advanced classes you will learn how to compare these two models directly. But for our class, we will only compare each model to the empty model.
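To make the comparison with the empty model concrete, here is a minimal sketch in base R rather than supernova(). It uses hypothetical data (the `toy3` data frame is made up) and computes the sums of squares by hand, along with the proportional reduction in error that supernova() reports in its PRE column.

```r
# Hypothetical thumb lengths in mm, three people per height group
toy3 <- data.frame(
  Thumb = c(54, 56, 58, 59, 60, 61, 62, 64, 66),
  Height3Group = factor(rep(c("short", "medium", "tall"), each = 3),
                        levels = c("short", "medium", "tall"))
)

# Empty model: one parameter, the grand mean
empty_model <- lm(Thumb ~ 1, data = toy3)

# Three-group model: b0, b1, and b2
group_model <- lm(Thumb ~ Height3Group, data = toy3)

# Leftover error under each model
ss_total <- sum(resid(empty_model)^2)  # error from the empty model
ss_error <- sum(resid(group_model)^2)  # error from the group model

# Error reduced by the group model, as a proportion of the total
ss_model <- ss_total - ss_error
pre <- ss_model / ss_total

c(SS_Total = ss_total, SS_Model = ss_model, SS_Error = ss_error, PRE = pre)
```

The same calculation applies to the two-group model; only the formula passed to lm() changes.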
