Quantitative Explanatory Variables
Okay, let’s go back to where we were, explaining the variation in thumb length using the variable Sex.
THUMB LENGTH = SEX + OTHER STUFF
Let’s look again at the histograms and scatterplot for this word equation, which showed that the overall variation in thumb length could be partially explained by taking sex into account.
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = "orange") %>%
gf_facet_grid(Sex ~ .)
gf_point(Thumb ~ Sex, data = Fingers, color = "orange", size = 5, alpha = .5)
Let’s now see if we can take the same approach for a different explanatory variable: Height. First, let’s write a word equation to represent the relationship we want to explore:
THUMB LENGTH = HEIGHT + OTHER STUFF
We actually could use the same approach with Height as we did with Sex. But, we would first need to recode Height as a categorical variable. Let’s try constructing a new variable by cutting up Height into two categories—short and tall.
Write some code to cut Height from the data frame Fingers into 2 equal-sized categories: people below the median height, and people above the median height. Save these categories into a new variable called Height2Group, and label the two categories “short” and “tall.”
# load packages
library(ggformula)
library(mosaic)
# import the fingers data frame
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
# clean up str
Fingers$RaceEthnic <- as.factor(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.factor(Fingers$Year)
Fingers$Job <- as.factor(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)
# label a few factors
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels = c(1,2,3,4,5), labels = c("White","African American","Asian","Latino","Other"))
Fingers$Job <- factor(Fingers$Job, levels = c(0,1,2), labels = c("not working", "part-time job", "full-time job"))
Fingers$Year <- factor(Fingers$Year, levels = c(1,2,3,4), labels = c("freshman", "sophomore", "junior", "senior"))
# --- starter code (to be completed) ---
# write code to cut Height from Fingers into 2 categories
# save this as a new variable: Height2Group
Fingers$Height2Group <-
# modify this to label Fingers$Height2Group appropriately
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
# this prints select variables in Fingers
head(select(Fingers, Thumb, Height, Height2Group))
# --- solution ---
# write code to cut Height from Fingers into 2 categories
# save this as a new variable: Height2Group
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# modify this to label Fingers$Height2Group appropriately
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# this prints select variables in Fingers
head(select(Fingers, Thumb, Height, Height2Group))
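To see what ntile() is doing, it can help to try it on a handful of values first. The sketch below uses six made-up heights (they are not from Fingers) and loads dplyr directly, on the assumption that this is where ntile() comes from when mosaic is loaded.

```r
library(dplyr)  # provides ntile(); the course code gets it by loading mosaic

# Hypothetical heights in inches, made up for illustration (not from Fingers)
height <- c(70, 62, 66, 64, 72, 68)

# ntile() ranks the values and splits them into 2 equal-sized groups:
# 1 for the lower half of the ranks, 2 for the upper half
group <- ntile(height, 2)

# Turn the integer codes into a labeled factor
group <- factor(group, levels = c(1, 2), labels = c("short", "tall"))

# Each height next to its group, then the number of cases per group
data.frame(height, group)
table(group)
```

With six values, each group ends up with three cases: the three shortest are labeled "short" and the three tallest "tall".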
Now we can try looking at the data the same way as we did for Sex, which also had two levels.
Create histograms in a grid and a scatterplot to look at variability in Thumb based on Height2Group.
require(mosaic)
require(ggformula)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# --- starter code (to be completed) ---
# create histograms of Thumb in a grid by Height2Group
# create a scatterplot of Thumb by Height2Group
# --- solution ---
# create histograms of Thumb in a grid by Height2Group
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Height2Group ~ .)
# create a scatterplot of Thumb by Height2Group
gf_point(Thumb ~ Height2Group, data = Fingers)
# use this code if you want your plot to look like ours
gf_histogram(~ Thumb, data = Fingers, fill = "darkolivegreen4") %>%
  gf_facet_grid(Height2Group ~ .)
gf_point(Thumb ~ Height2Group, data = Fingers, alpha = .5, size = 3, color = "darkolivegreen4")
Similar to what we found for Sex, where there was a lot of variability within the female group and the male group, there is a lot of variability within the short and tall groups. But there is less variability within each group than there is overall when we combine the groups together. Again, it is useful to think about this within-group variation as the leftover variation after explaining some of the variation with Height2Group.
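One rough way to check this idea with numbers is to compare the spread of Thumb overall with its spread inside each group. The sketch below uses a small made-up data frame called toy (not the Fingers data) and the standard deviation as the measure of spread; both are assumptions chosen only for illustration.

```r
library(dplyr)  # for ntile()

# Made-up toy data for illustration only (not the Fingers data)
toy <- data.frame(
  Height = 60:71,
  Thumb  = c(52, 56, 54, 58, 57, 61, 59, 63, 62, 66, 64, 68)
)

# Cut Height into two equal-sized groups, as in the exercise above
toy$Height2Group <- factor(ntile(toy$Height, 2),
                           levels = c(1, 2), labels = c("short", "tall"))

# Spread of Thumb when everyone is combined
sd_overall <- sd(toy$Thumb)

# Spread of Thumb within each height group
sd_within <- tapply(toy$Thumb, toy$Height2Group, sd)

sd_overall
sd_within
```

In this toy data both within-group standard deviations come out smaller than the overall one. Within each group there is still variation, but less than when the groups are combined.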
Let’s also try looking at these distributions with boxplots.
require(mosaic)
require(ggformula)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# --- starter code (to be completed) ---
# create boxplots of Thumb by Height2Group
# --- solution ---
# create boxplots of Thumb by Height2Group
gf_boxplot(Thumb ~ Height2Group, data = Fingers)
# if you want your code to produce the output below
gf_boxplot(Thumb ~ Height2Group, data = Fingers, color = "darkolivegreen4")
See if you can break height into three categories (let’s call it Height3Group) and then compare the distribution of thumb length across all three categories with a scatterplot. Create boxplots as well.
require(mosaic)
require(ggformula)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# --- starter code (to be modified) ---
# modify this code to break Height into 3 categories: "short", "medium", and "tall"
Fingers$Height3Group <- ntile(Fingers$Height, 2)
Fingers$Height3Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# create a scatterplot of Thumb by Height3Group
# create boxplots of Thumb by Height3Group
# --- solution ---
# modify this code to break Height into 3 categories
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
# create a scatterplot of Thumb by Height3Group
gf_point(Thumb ~ Height3Group, data = Fingers)
# create boxplots of Thumb by Height3Group
gf_boxplot(Thumb ~ Height3Group, data = Fingers)
Looking at these two boxplots, we have an intuition that the three-group version of Height explains more variation in thumb length than does the two-group version. Although there is still a lot of variation within each group in the three-group version, the within-group variation appears smaller in the three-group than in the two-group model. Or, to put it another way, there is less variation left over after taking out the variation due to height.
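If you want to go beyond intuition, here is one possible way to put a number on "leftover variation": add up the squared deviations of each thumb length from its own group's mean. The sketch below is not the course's method at this point. The toy data frame and the helper functions ss() and leftover() are made up for illustration.

```r
library(dplyr)  # for ntile()

# Made-up toy data for illustration only (not the Fingers data)
toy <- data.frame(
  Height = 60:71,
  Thumb  = c(52, 56, 54, 58, 57, 61, 59, 63, 62, 66, 64, 68)
)

# Hypothetical helper: sum of squared deviations from the mean
ss <- function(x) sum((x - mean(x))^2)

# Hypothetical helper: variation left over within groups,
# measured from each group's own mean and added up across groups
leftover <- function(y, group) sum(tapply(y, group, ss))

total <- ss(toy$Thumb)                              # no grouping at all
left2 <- leftover(toy$Thumb, ntile(toy$Height, 2))  # short / tall
left3 <- leftover(toy$Thumb, ntile(toy$Height, 3))  # short / medium / tall

c(total = total, two_groups = left2, three_groups = left3)
```

In this toy data the leftover variation shrinks as we move from no groups to two groups to three groups. That matches the pattern in the boxplots, but a different data set could behave differently.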
From Categorical to Quantitative Explanatory Variables
Up to this point we have been using Height as though it were a categorical variable. First we divided it into two categories, then three.
When we do this we are throwing away some of the information we have in our data. We know exactly how many inches tall each person is. Why not use that information instead of just categorizing people as either tall or short?
Let’s try another approach, a scatterplot of Thumb length by Height. Try using gf_point() with Height rather than Height2Group or Height3Group. Note: when making scatterplots, the convention is to put the outcome variable on the y-axis, the explanatory variable on the x-axis.
require(mosaic)
require(ggformula)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
Fingers$Height.2group <- ntile(Fingers$Height, 2)
Fingers$Height.2group <- factor(Fingers$Height.2group, levels = c(1,2), labels = c("short", "tall"))
# --- starter code (to be completed) ---
# create a scatterplot of Thumb by Height
# --- solution ---
# create a scatterplot of Thumb by Height
gf_point(Thumb ~ Height, data = Fingers)
The same relationship we spotted in the boxplots when we divided Height into three categories can be seen in the scatterplot. In the image below, we have overlaid boxes at three different intervals along the distribution of Height.
Each box corresponds to one of the three groups of our Height3Group variable. On the x-axis you can see the range in height, measured in inches, for each of the three groups.
Remember that we used ntile() to divide our sample into three groups of equal sizes. Because most people in the sample are clustered around the average height, it makes sense that the box in the middle is the narrowest. There aren’t that many people taller than 70 inches, so getting a tall group that is exactly one-third of the sample means we have to include a wider range of heights.
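The sketch below shows how equal-sized groups can cover unequal ranges. It uses fifteen made-up heights that bunch up around the middle (they are not from Fingers). The names counts and width are ours, and dplyr is loaded directly for ntile().

```r
library(dplyr)  # for ntile()

# Hypothetical heights in inches, clustered around the middle
# (made up for illustration, not from Fingers)
height <- c(60, 63, 64, 65, 65, 66, 66, 66, 67, 67, 68, 69, 71, 72, 75)

# Three equal-sized groups, labeled as in Height3Group
group <- factor(ntile(height, 3),
                levels = c(1, 2, 3), labels = c("short", "medium", "tall"))

# How many people land in each group
counts <- table(group)

# How many inches each group spans (tallest minus shortest)
width <- tapply(height, group, function(h) diff(range(h)))

counts
width
```

Every group holds five people, but the medium group spans the fewest inches because so many of the heights sit close together in the middle.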
The heights of the boxes represent the middle half of the Thumb distribution for that third of the sample, just like in a boxplot. So, the bottom of the box is Q1 and the top is Q3. You can see that the thumb lengths of people who are taller tend to be longer. You can also see that height explains only some of the variation in thumb length. Within each band of Height, there is variation in thumb length (look up and down within each box).
So, just as when we measured height as a categorical variable, although there appears to be some variation in Thumb that is explained by Height, there is also variation left over after we have taken out the variation due to Height.
We can try to explain variation with categorical explanatory variables (such as Sex and Height3Group) but we can also try to explain variation with quantitative explanatory variables (such as Height).
Let’s stretch our thinking further. What if you wanted to have two explanatory variables for thumb length? For example, if we wanted to think about how variation in Thumb might be explained by variation in both Sex and Height, we could represent this idea as a word equation like this.
THUMB LENGTH = SEX + HEIGHT + OTHER STUFF
The variation in thumb length is the same whether we try to explain it with Sex, Height, or both! The total variation in Thumb doesn’t change. But how about that unexplained variation? The better the job done by the explanatory variables, the less leftover variation.
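As a rough preview of this idea, the sketch below measures leftover variation three ways on a small made-up data set: using Sex alone, Height alone, and both together. It borrows base R's lm() only as a convenient way to compute what is left over; lm() is not used in this section. The data frame toy and the helper leftover() are hypothetical.

```r
# Made-up toy data for illustration only (not the Fingers data)
toy <- data.frame(
  Sex    = factor(rep(c("female", "male"), each = 6)),
  Height = c(61, 63, 64, 65, 66, 68, 66, 68, 69, 70, 72, 74),
  Thumb  = c(54, 57, 55, 59, 58, 62, 60, 63, 61, 66, 65, 69)
)

# Total variation in Thumb: the same no matter what we use to explain it
ss_total <- sum((toy$Thumb - mean(toy$Thumb))^2)

# Hypothetical helper: sum of squared leftovers (residuals) after
# fitting the explanatory variable(s) named in the formula
leftover <- function(formula) sum(resid(lm(formula, data = toy))^2)

left_sex    <- leftover(Thumb ~ Sex)           # SEX + OTHER STUFF
left_height <- leftover(Thumb ~ Height)        # HEIGHT + OTHER STUFF
left_both   <- leftover(Thumb ~ Sex + Height)  # SEX + HEIGHT + OTHER STUFF

c(total = ss_total, sex = left_sex, height = left_height, both = left_both)
```

The total stays fixed. With this way of measuring, the leftover variation when both variables are used can never be larger than the leftover variation when either one is used alone.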