7.8 Extending to a Three-Group Model
You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.
First, a New Two-Group Model
Let’s use a new explanatory variable to explain variation in thumb length: height. Of course height, in our data set, is a quantitative variable measured in inches. But let’s make a new variable that turns height into a categorical variable with two categories: short and tall. We can do this easily in R. Name the variable Height2Group.
We will use the ntile() function to cut a quantitative variable up into groups, and then use head() and select() to look at the first 10 rows of the relevant variables:
Fingers$Height2Group <- ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)
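To get a feel for what ntile() is doing, here is a minimal base-R sketch of the same idea: rank the values, then cut the ranks into equally sized buckets. The helper name equal_groups and the six heights are made up for illustration. This is not dplyr's actual implementation, and it may handle ties or leftover cases differently.

```r
# Hypothetical helper: split x into n groups of (roughly) equal size by rank.
equal_groups <- function(x, n) {
  r <- rank(x, ties.method = "first")   # 1 = smallest value
  ceiling(r * n / length(x))            # map ranks onto group numbers 1..n
}

# Made-up heights in inches (not the Fingers data)
heights <- c(70, 62, 66, 73, 64, 68)
groups <- equal_groups(heights, 2)

# Each height next to its group number (1 = shorter half, 2 = taller half)
data.frame(heights, groups)
```

The three smallest heights land in group 1 and the three largest in group 2, which is the kind of split we rely on when we create Height2Group.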
Use the factor() function to modify this variable so that the 1’s are labeled as “short” and the 2’s are labeled as “tall”. Just a reminder, use the larger data frame Fingers.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile() and select()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# a reminder of how we used factor() before:
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
# label the 1s as "short" and the 2s as "tall"
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# this prints out 10 rows of Fingers for the selected columns
head(select(Fingers, Thumb, Height, Height2Group), 10)
Using the same approach we used for sex, we can write the model for Height2Group like this:
\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]
Go ahead and fit the Height2Group model and take a look at the parameter estimates.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# label the groups as before so R treats Height2Group as categorical
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# fit a model for Thumb ~ Height2Group
Height2Group.model <- lm(formula = Thumb ~ Height2Group, data = Fingers)
# this prints out the estimates
Height2Group.model
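To connect the two estimates to something familiar, here is a small sketch using made-up thumb lengths (not the Fingers data; the names toy and toy.model are only for illustration). With a two-level factor as the explanatory variable, the first estimate should match the mean of the first group, and the second should match the difference between the two group means.

```r
# Made-up thumb lengths in mm (not the Fingers data)
toy <- data.frame(
  Thumb = c(55, 57, 59, 61, 63, 65),
  Height2Group = factor(c("short", "short", "short", "tall", "tall", "tall"),
                        levels = c("short", "tall"))
)

toy.model <- lm(Thumb ~ Height2Group, data = toy)
group.means <- tapply(toy$Thumb, toy$Height2Group, mean)

coef(toy.model)   # b0 and b1
group.means       # mean of each group
# b0 matches the short mean; b0 + b1 matches the tall mean
```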
Now go ahead and run supernova() to print the complete ANOVA table for Height2Group.model.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# label the groups as before so R treats Height2Group as categorical
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
Height2Group.model <- lm(formula = Thumb ~ Height2Group, data = Fingers)
# run supernova to print the ANOVA table for Height2Group.model
supernova(Height2Group.model)
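If you are curious where the sums of squares in an ANOVA table come from, the sketch below computes them by hand in base R for a made-up data set (not the Fingers data). We are assuming the usual decomposition, SS Total = SS Model + SS Error, with PRE as the proportion of the empty model's error that the group model removes; the variable names are ours, not supernova's.

```r
# Made-up data (not the Fingers data)
toy <- data.frame(
  Thumb = c(55, 57, 59, 61, 63, 65),
  Height2Group = factor(rep(c("short", "tall"), each = 3))
)
toy.model <- lm(Thumb ~ Height2Group, data = toy)

# Error from the empty model: squared deviations from the grand mean
ss.total <- sum((toy$Thumb - mean(toy$Thumb))^2)
# Error left over after using the group means as predictions
ss.error <- sum(resid(toy.model)^2)
# Error reduced by the model
ss.model <- ss.total - ss.error
# PRE: proportion of the empty model's error that the group model removes
pre <- ss.model / ss.total

c(SS.Total = ss.total, SS.Model = ss.model, SS.Error = ss.error, PRE = pre)
```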
Let’s now compare the Sex model with the Height2Group model. Both of these are two-group models, and both have the same outcome variable (Thumb). What differs is the explanatory variable (Sex vs. Height2Group). We’ve pasted in the supernova table for both models below:
A Three-Group Model
Now let’s try this same approach with three height groups: short, medium, and tall.
Revise the code we used for Height2Group to make a new variable called Height3Group that divides height into three categories, each with an equal number of students. Label the levels (1,2,3) as short, medium, and tall.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile() and select()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
# create 3 Height groups with the labels "short", "medium", and "tall"
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
# this code prints out 10 rows of Fingers for selected columns
head(select(Fingers, Thumb, Height, Height3Group), 10)
Calculate and print out the group means of Thumb length for the three height groups.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
# use favstats() to print the group means of Thumb length for the three height groups
favstats(Thumb ~ Height3Group, data = Fingers)
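For comparison, here is a rough base-R way to get a few of the same per-group summaries without favstats(). The data are made up (not the Fingers data), and the sketch only computes the mean, standard deviation, and count, not everything favstats() reports.

```r
# Made-up thumb lengths in mm (not the Fingers data)
toy <- data.frame(
  Thumb = c(54, 56, 58, 59, 60, 61, 62, 64, 66),
  Height3Group = factor(rep(c("short", "medium", "tall"), each = 3),
                        levels = c("short", "medium", "tall"))
)

# Mean, standard deviation, and count of Thumb within each height group
group.stats <- sapply(split(toy$Thumb, toy$Height3Group),
                      function(x) c(mean = mean(x), sd = sd(x), n = length(x)))

# One row per group, one column per statistic
t(group.stats)
```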
Fitting the Height3Group Model
Now fit the Height3Group model to the data, and print out the model estimates.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
# fit the model
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)
# this prints out the estimates
Height3Group.model
The three-group model is written like this using General Linear Model notation:
\[Y_{i}=b_{0}+b_{1}X_{1i}+b_{2}X_{2i}+e_{i}\]
Whereas fitting the two-group model involved estimating two parameters (\(b_{0}\) and \(b_{1}\)), the three-group model adds a third parameter (\(b_{2}\)).
Interpreting the Height3Group Model
\(b_{0}\) is the mean of the short group. \(b_{1}\) is the increment you have to add to the mean of the short group to get the mean of the medium group. And \(b_{2}\) is the increment you have to add to the mean of the short group to get the mean of the tall group.
We can substitute the parameter estimates into the model, like this:
\[Y_{i}=56.07+4.15X_{1i}+8.02X_{2i}+e_{i}\]
Just as before, it is useful to think through exactly how the X variables are coded. Notice, first, that we now have two of these in the model: \(X_{1i}\) and \(X_{2i}\). The sub-1 and sub-2 just distinguish between the two variables; instead of giving them different names, we call them X-sub-1 and X-sub-2.
The sub-i indicates these are not parameters, but variables, which means that each individual in the data set will have their own scores on the two variables. As before, it’s a little tricky to figure out what all the possible scores are on these two variables, and also how scores are assigned for each individual.
R doesn’t necessarily use the same numbers you do to code a variable. For the Height3Group model we put in a single categorical explanatory variable (Height3Group, with levels 1 representing short, 2 representing medium, and 3 representing tall). But R turns this one variable into two new variables, \(X_{1}\) and \(X_{2}\), both of which are “dummy coded,” which means they can either have a value of 0 or 1 for each person in the data set.
So here is how it works: For someone in the short group, the model needs to assign them a score of 56.07, the mean for the short group. You can think of \(X_{1}\) as a variable asking, “Is this person medium?” and 0 means no and 1 means yes. By the same reasoning, \(X_{2}\) represents whether someone is tall or not. For short people, \(X_{1}\) and \(X_{2}\) are both 0 because they are not medium and not tall.
For the people in the medium group, \(X_{1}\) should be 1 (because they are in the medium group), and \(X_{2}\) should be 0 (because they are not in the tall group). So the model will give them a predicted thumb length of 56.07 + 4.15, which is equal to 60.22 mm.
And notice from favstats that the average thumb length of the medium group is 60.22!
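Plugging the 0s and 1s for each group into the fitted model makes the pattern explicit. The short and medium lines restate the values already discussed; the tall line is simply the arithmetic the same estimates imply.

\[\begin{aligned}
\text{short: } & 56.07 + 4.15(0) + 8.02(0) = 56.07 \\
\text{medium: } & 56.07 + 4.15(1) + 8.02(0) = 60.22 \\
\text{tall: } & 56.07 + 4.15(0) + 8.02(1) = 64.09
\end{aligned}\]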
Dummy coding takes categorical variables and turns them into a series of binary codes. As you can see from the table below, just giving each person a 0 or 1 on \(X_{1}\) and \(X_{2}\) can uniquely categorize them as short, medium, or tall.
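The coding described here (\(X_{1}\) is 1 only for medium, \(X_{2}\) is 1 only for tall, and both are 0 for short) can be inspected directly. The sketch below uses base R's model.matrix() on a made-up data set (not the Fingers data) to show the columns R builds from a single three-level factor, and then checks that the fitted estimates line up with the group means. The names toy, codes, and toy.model are only for illustration.

```r
# Made-up data (not the Fingers data): two students per height group
toy <- data.frame(
  Thumb = c(55, 57, 59, 61, 63, 65),
  Height3Group = factor(rep(c("short", "medium", "tall"), each = 2),
                        levels = c("short", "medium", "tall"))
)

# The columns R builds from the single factor: an intercept plus two 0/1 dummies
codes <- model.matrix(~ Height3Group, data = toy)
unique(codes)

# Fitting the model: b0 is the short mean, b1 and b2 are increments from it
toy.model <- lm(Thumb ~ Height3Group, data = toy)
coef(toy.model)
```

Because short is the first level of the factor, it becomes the reference group: it gets no dummy variable of its own, and the other two groups are described as increments from it.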
You may wonder why you need to go through all the details of how R assigns dummy codes for the categorical explanatory variable. It’s important because it gives you a very concrete understanding of how to interpret the model parameters. In this course, we don’t often ask you to calculate numbers on your own. Instead, we want you to focus on thinking about what, exactly, a number means. This will help you do that.
Examining the Three-Group Model Fit
You have already done the following: created the Height3Group categorical explanatory variable; examined the mean thumb lengths of students in each of the three groups; fit the Height3Group model using lm() and interpreted the model parameter estimates; and learned how to represent the three-group model using GLM notation.
The final step is to take a look at the ANOVA table so you can compare the fit of the Height3Group model to the empty model. Of course, you know how to do this using supernova(). Go ahead and get the ANOVA table for the Height3Group model.
require(mosaic)
require(ggformula)
require(supernova)
require(lsr)
require(dplyr) # provides ntile()
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)
# use supernova() to print the ANOVA table for Height3Group.model
supernova(Height3Group.model)
Here’s the ANOVA table for the Height3Group model. Just for comparison, we pasted in the table for the Height2Group model right above it.
In more advanced classes you will learn how to compare these two models directly. But for our class, we will only compare each model to the empty model.