High School / Advanced Statistics and Data Science I (ABC)
8 – Digging Deeper into Group Models
8.1 Extending to a Three-Group Model
You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.
First, a New Two-Group Model
Let’s use a new explanatory variable to explain variation in thumb length: Height. Height, in our data set, is a quantitative variable measured in inches. But we can make a new variable that turns Height into a categorical variable with two categories: short and tall.
We can do this using the ntile() function in R. The code below will cut the sample up into two equal-sized groups based on Height and save the result into a new variable called Height2Group.
Fingers$Height2Group <- ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)
We used head() and select() to look at the first 10 rows of the relevant variables: Thumb, Height, and Height2Group.
   Thumb Height Height2Group
1  66.00   70.5            2
2  64.00   64.8            1
3  56.00   64.0            1
4  58.42   70.0            2
5  74.00   68.0            2
6  60.00   68.0            2
7  70.00   69.0            2
8  55.00   65.7            2
9  60.00   62.5            1
10 52.00   63.4            1
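To build some intuition for how a sample gets cut into equal-sized groups, here is a minimal base R sketch. The helper name equal_groups is hypothetical, and its rank-based formula is an assumption about the general idea, not the actual implementation of ntile(). It uses only the ten Height values printed above, so the cut point (and the group for a value like 65.7) can differ from the full-sample grouping.

```r
# Hypothetical helper (not part of dplyr or coursekata): assign each value
# to one of n roughly equal-sized groups based on its rank.
equal_groups <- function(x, n) {
  r <- rank(x, ties.method = "first")   # 1 = smallest value
  floor(n * (r - 1) / length(x)) + 1    # group number from 1 to n
}

# The ten Height values shown in the output above
height <- c(70.5, 64.8, 64.0, 70.0, 68.0, 68.0, 69.0, 65.7, 62.5, 63.4)

groups <- equal_groups(height, 2)
table(groups)                 # how many values landed in each group
data.frame(height, groups)    # each height next to its group number
```

Because the groups are formed from ranks, the dividing line depends on which heights are in the sample, not on a fixed number of inches.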
In the code window below, use the factor() function to add labels to Height2Group so that the 1s are labeled as short and the 2s are labeled as tall.
require(coursekata)

# this creates Height2Group, a numeric variable
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# this is how we used factor() before:
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))

# modify this line so that 1s are labeled as "short" and 2s are labeled as "tall"
Fingers$Height2Group <- factor()

# this prints out 10 rows of Fingers for the selected columns
head(select(Fingers, Thumb, Height, Height2Group), 10)

# solution: label the 1s as "short" and the 2s as "tall", then print 10 rows
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = 1:2, labels = c("short", "tall"))
head(select(Fingers, Thumb, Height, Height2Group), 10)
   Thumb Height Height2Group
1  66.00   70.5         tall
2  64.00   64.8        short
3  56.00   64.0        short
4  58.42   70.0         tall
5  74.00   68.0         tall
6  60.00   68.0         tall
7  70.00   69.0         tall
8  55.00   65.7         tall
9  60.00   62.5        short
10 52.00   63.4        short
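The labeling step can also be tried on its own with a small vector, which makes it easier to see how levels and labels pair up. This is a minimal base R sketch; the vector name codes and its values are made up for illustration.

```r
# Made-up numeric group codes, like the ones ntile() returns
codes <- c(2, 1, 1, 2, 2)

# levels lists the codes that exist; labels names each code, in the same order
labeled <- factor(codes, levels = c(1, 2), labels = c("short", "tall"))

labeled           # prints the labels instead of the numbers
levels(labeled)   # the label order follows the order given in levels
table(labeled)    # count of cases per label
```

The first label goes with the first level, the second label with the second level, and so on, so swapping the order of the labels would swap which group is called short.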
Using the same approach we used for Sex, we can write the model for Height2Group like this:
\[\text{Thumb}_i=b_0+b_1\text{Height2Group}_i+e_i\]
Go ahead and fit the Height2Group model, and print out the parameter estimates and ANOVA table for the model.
require(coursekata)

Fingers <- Fingers %>% mutate(
  Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)

# fit a model for Thumb ~ Height2Group
Height2Group_model <-

# this prints out the estimates
Height2Group_model

# solution: fit the model with lm(), then print the estimates and the ANOVA table
Height2Group_model <- lm(formula = Thumb ~ Height2Group, data = Fingers)
Height2Group_model
supernova(Height2Group_model)
Call:
lm(formula = Thumb ~ Height2Group, data = Fingers)

Coefficients:
     (Intercept)  Height2Grouptall
          57.818             4.601
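It can help to put these estimates back into the model equation. This assumes R's default coding for a two-level factor, in which \(\text{Height2Group}_i\) is 0 for the short group and 1 for the tall group:

\[\text{Thumb}_i = 57.818 + 4.601\,\text{Height2Group}_i + e_i\]

Under that assumption, the model's prediction for each group works out to:

\[\text{short: } 57.818 + 4.601(0) = 57.818 \qquad \text{tall: } 57.818 + 4.601(1) = 62.419\]

So \(b_0\) is the predicted thumb length for the short group, and \(b_1\) is the amount added to get the prediction for the tall group.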
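Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group

                               SS  df      MS      F    PRE     p
----- --------------- | --------- --- ------- ------ ------ -----
Model (error reduced) |   830.880   1 830.880 11.656 0.0699 .0008
Error (from model)    | 11049.331 155  71.286
----- --------------- | --------- --- ------- ------ ------ -----
Total (empty model)   | 11880.211 156  76.155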
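As a check on how to read this table, the PRE and F values can be recomputed by hand from the SS and MS columns. The numbers below are the rounded values printed in the table, so the results agree only to the displayed precision:

\[SS_\text{Total} = SS_\text{Model} + SS_\text{Error} = 830.880 + 11049.331 = 11880.211\]

\[\text{PRE} = \frac{SS_\text{Model}}{SS_\text{Total}} = \frac{830.880}{11880.211} \approx 0.0699\]

\[F = \frac{MS_\text{Model}}{MS_\text{Error}} = \frac{830.880}{71.286} \approx 11.656\]

In other words, splitting students into short and tall groups reduces the error from the empty model by about 7%.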
A Three-Group Model
Now let’s try this same approach with three height groups: short, medium, and tall.
Revise the code below to make a new variable called Height3Group that divides the sample into three categories based on Height, each with an equal number of students. Label the levels (1, 2, 3) as short, medium, and tall.
require(coursekata)

Fingers <- Fingers %>% mutate(
  Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
Height2Group_model <- lm(Thumb ~ Height2Group, data = Fingers)

# modify these two lines of code to create 3 Height groups with the labels "short", "medium", and "tall"
# make sure you save to a new variable in Fingers called Height3Group
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# this prints out 10 rows of Fingers for selected columns
head(select(Fingers, Thumb, Height, Height3Group), 10)

# solution: cut Height into 3 groups and save the labeled result as Height3Group
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
head(select(Fingers, Thumb, Height, Height3Group), 10)
   Thumb Height Height3Group
1  66.00   70.5         tall
2  64.00   64.8       medium
3  56.00   64.0        short
4  58.42   70.0         tall
5  74.00   68.0         tall
6  60.00   68.0         tall
7  70.00   69.0         tall
8  55.00   65.7       medium
9  60.00   62.5        short
10 52.00   63.4        short
Calculate and print out the group means of Thumb for the three height groups.
require(coursekata)

Fingers <- Fingers %>% mutate(
  Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
  Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)

# use favstats() to print the group means of Thumb length for the three height groups you created earlier
favstats()

# solution: summarize Thumb separately for each level of Height3Group
favstats(Thumb ~ Height3Group, data = Fingers)
  Height3Group   min    Q1 median    Q3   max     mean       sd  n missing
1        short 39.00 51.00     55 58.42 79.00 56.07113 7.499937 53       0
2       medium 45.00 55.00     60 64.00 86.36 60.22375 8.490406 52       0
3         tall 44.45 59.75     64 68.25 90.00 64.09365 8.388113 52       0
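If you only want the group means, the same calculation can be sketched in base R without any extra packages. The example below is a small illustration that uses just the ten rows of Thumb and Height3Group printed earlier, so its means will not match the full-sample means in the table above; the variable names thumb, height3, and group_means are made up for this sketch.

```r
# First ten rows of Thumb and Height3Group, copied from the earlier output
thumb <- c(66.00, 64.00, 56.00, 58.42, 74.00, 60.00, 70.00, 55.00, 60.00, 52.00)
height3 <- factor(
  c("tall", "medium", "short", "tall", "tall",
    "tall", "tall", "medium", "short", "short"),
  levels = c("short", "medium", "tall")
)

# Mean thumb length within each height group
group_means <- tapply(thumb, height3, mean)
group_means
```

Setting levels explicitly keeps the groups in the order short, medium, tall rather than alphabetical order.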
Here is a jitter plot that shows the distribution of thumb lengths for each of the three height groups and the mean of each group. On the next page, we’ll learn how to create a model of thumb length based on the three height groups.