Explaining One Variable With Another
Let’s start by looking at the distribution of Thumb.
Write code to draw a histogram of Thumb from the Fingers data frame. Feel free to play around with features like labels (gf_labs()) or with arguments like color, fill, bins, or binwidth.
# Load packages
require(mosaic)
require(tidyverse)
# Import Fingers data frame
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
Fingers <- data.frame(Fingers)
# clean up str
Fingers$Sex <- as.factor(Fingers$Sex)
Fingers$RaceEthnic <- as.factor(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.factor(Fingers$Year)
Fingers$Job <- as.factor(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)
# label a few factors
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels = c(1,2,3,4,5), labels = c("White","African American","Asian","Latino","Other"))
Fingers$Job <- factor(Fingers$Job, levels = c(0,1,2), labels = c("not working", "part-time job", "full-time job"))
Fingers$Year <- factor(Fingers$Year, levels = c(1,2,3,4), labels = c("freshman", "sophomore", "junior", "senior"))
# Write code to draw a histogram of Thumb from the Fingers dataset
gf_histogram(~Thumb, data=Fingers)
# Another solution
hist(Fingers$Thumb)
We’ve seen this distribution a few times now. Most of the thumbs run between 40 and 80 mm, the center of the distribution is somewhere around 60 mm, and the distribution is roughly bell-shaped: most of the observations cluster around the middle, with just a few in the outer tails.
When we want to explain variation in one variable, a starting place is to think about other variables that might be meaningfully related to it.
One variable that might be meaningfully related to thumb length is Sex. You might intuitively sense that male and female thumb lengths might differ, or vary. But then again, even among a bunch of females, their thumb lengths vary too.
Unfortunately, the variable Sex is not included in our previous histogram. But we can visualize the relationship between Thumb and Sex in a few ways. One way is by coloring or filling in the data in the histogram by Sex, assigning females one color and males another.
To do this we use the fill = argument, but instead of putting in a color we put a tilde (~) and then the name of a variable: fill = ~Sex.
gf_histogram(~ Thumb, data = Fingers, fill = ~Sex)
When you color the data by an explanatory variable, the colors are picked for you by default. The default scheme works well here, but it is useful to know how to change it. To do so, chain on one additional (slightly complicated) line of code using %>% and supply the color names you want for the different values of your explanatory variable. For example, here’s the R code to change the colors of this histogram.
gf_histogram(~ Thumb, data = Fingers, fill = ~Sex) %>%
gf_refine(scale_fill_manual(values = c("purple", "orange")))
If you wanted to change the default outline colors, you would specify scale_color_manual() and chain on this code instead.
gf_refine(scale_color_manual(values = c("purple", "orange")))
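With an unnamed vector of colors, the first color goes to the first level of the factor, the second to the second, and so on, which is easy to get backwards. As a hedged sketch, ggplot2's scale_fill_manual() also accepts a named vector, so each color can be tied to a level by name. The data frame sim_fingers and the vector sex_colors below are made up for illustration; the thumb lengths are simulated, not taken from Fingers.

```r
library(ggformula)  # also attaches ggplot2, which provides scale_fill_manual()

# Simulated stand-in for Fingers (hypothetical values, not the real data)
set.seed(1)
sim_fingers <- data.frame(
  Thumb = c(rnorm(80, mean = 58, sd = 8), rnorm(40, mean = 65, sd = 8)),
  Sex   = factor(rep(c("female", "male"), times = c(80, 40)))
)

# Naming each color ties it to a level, regardless of the level order
sex_colors <- c(female = "purple", male = "orange")

p <- gf_histogram(~ Thumb, data = sim_fingers, fill = ~Sex) %>%
  gf_refine(scale_fill_manual(values = sex_colors))
p
```

The same named vector could be passed to scale_color_manual() if you are changing outline colors instead of fill colors.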
Try changing the colors used for the different values of Sex (female and male) in the histogram we made before.
# Change the default colors for the different values of the explanatory variable
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~Sex)
# One possible solution
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~Sex) %>%
  gf_refine(scale_fill_manual(values = c("red", "blue")))
Another way is to split the histogram we made into two, one for females and another for males. We can chain on (using %>%) the command gf_facet_grid() after gf_histogram(). This will put the histogram of Thumb for females and the one for males in a grid.
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(. ~ Sex)
Remember that putting a variable after the ~ places it on the x-axis. gf_facet_grid() works the same way. Putting the variable Sex after the ~ puts these two graphs in a row along the x-axis. Putting Sex before the ~ puts these two graphs in a column along the y-axis.
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .)
This is more helpful because it’s easier to compare where the distributions are along the same Thumb axis. It seems that the distribution of Thumb lengths for males is shifted higher relative to the female distribution.
Also, it is immediately apparent that there are fewer males than females. This is when a measure like density (rather than count) comes in handy.
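To see why density helps when group sizes differ, here is a small base R sketch. It uses simulated thumb lengths (hypothetical values, not the Fingers data) for 80 females and 40 males, and hist() with plot = FALSE to get the bar heights without drawing anything. The names female_thumb, male_thumb, and edges are made up for this example.

```r
# Simulated thumb lengths (hypothetical values): 80 females, 40 males
set.seed(2)
female_thumb <- rnorm(80, mean = 58, sd = 8)
male_thumb   <- rnorm(40, mean = 65, sd = 8)

# Use the same 5 mm bin edges for both groups so the bars are comparable
all_thumb <- c(female_thumb, male_thumb)
edges <- seq(floor(min(all_thumb) / 5) * 5,
             ceiling(max(all_thumb) / 5) * 5, by = 5)
h_female <- hist(female_thumb, breaks = edges, plot = FALSE)
h_male   <- hist(male_thumb,   breaks = edges, plot = FALSE)

# Counts add up to the group size, so the larger group gets taller bars
sum(h_female$counts)  # 80
sum(h_male$counts)    # 40

# Densities are rescaled so that total bar area (density * bin width) is 1
# in each group, whatever the group size
sum(h_female$density * diff(edges))
sum(h_male$density * diff(edges))
```

Because each group's bars have the same total area on the density scale, differences in bar height reflect differences in shape and location rather than in sample size.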
Adjust the following code to recreate these histograms as density histograms.
# Modify this code to create density histograms
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)
# One possible solution: put ..density.. before the ~
gf_histogram(..density.. ~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)
Another way of thinking about Sex explaining variation in Thumb is to say that Thumb is really made up of two different distributions, one for males and one for females. Although both histograms are roughly normal in shape, the average male thumb is longer than the average female thumb. It almost seems like the whole male distribution is shifted higher along the x-axis. Not only does the center differ across the two groups, but the variation (or spread) within each of the two histograms is also smaller than it is in the combined distribution.
This isn’t to say that just because we know someone’s sex we definitely know their thumb length. After all, there are both males and females with longer thumbs and both males and females with shorter thumbs. This variation among members of the same group is called within-group variation.
When we combine all the Thumbs together in a single histogram, we are able to see how spread out the overall distribution is. This gives us an idea of the total variation. When we divide the distribution up and look separately at the two histograms, we can see the within-group variation.
Notice that these group-specific histograms tend to have less variation than the single histogram. It’s as if some of the variation in Thumb has been accounted for by Sex. Because we can only see the within-group variation after we divide the distribution up by Sex, another name for within-group variation is leftover variation.
Even though there is still a lot of variation in thumb length left over after taking out Sex, it is still true that if we know someone’s sex we can be a little better at predicting their thumb length. A little better may not be great, but it is better than nothing.
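The idea of total versus leftover variation can be sketched numerically. The example below is only an illustration under stated assumptions: the data are simulated (not the Fingers data), and it measures variation with a sum of squared deviations, which is one possible choice and not something this section has introduced. The names sim_fingers, total_ss, and leftover_ss are made up for the example.

```r
# Simulated stand-in for Fingers (hypothetical values, not the real data)
set.seed(3)
sim_fingers <- data.frame(
  Thumb = c(rnorm(80, mean = 58, sd = 8), rnorm(40, mean = 65, sd = 8)),
  Sex   = factor(rep(c("female", "male"), times = c(80, 40)))
)

# Prediction 1: ignore Sex and guess the overall mean for everyone
grand_mean <- mean(sim_fingers$Thumb)
total_ss   <- sum((sim_fingers$Thumb - grand_mean)^2)

# Prediction 2: guess the mean of each person's own group
# (ave() returns the group mean for every row; its default FUN is mean)
group_mean  <- ave(sim_fingers$Thumb, sim_fingers$Sex)
leftover_ss <- sum((sim_fingers$Thumb - group_mean)^2)

# Measured this way, leftover (within-group) variation can never
# exceed the total variation
c(total = total_ss, leftover = leftover_ss)

# Share of the total variation accounted for by Sex in this simulated sample
(total_ss - leftover_ss) / total_ss
```

The gap between the two numbers is the part of the variation that knowing Sex accounts for; how large it is depends entirely on the data, and here the data are invented.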
There are some cool things you can do with this grid of histograms. A lot of what you already know about histograms can be added here. You can adjust bins, you can add labels, you can chain on density plots.
gf_histogram(..density.. ~ Thumb, data = Fingers, bins = 10) %>%
gf_facet_grid(Sex ~ .) %>%
gf_density()
You can adjust color and fill as usual.
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = "orange", color = "gray") %>%
gf_facet_grid(Sex ~ .)
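Labels can be chained on in the same way. The sketch below assumes ggformula's gf_labs(), which passes its arguments on to ggplot2's labs(). It uses simulated data (hypothetical values, not the Fingers data) so that it runs on its own; sim_fingers and p_labeled are names made up for this example.

```r
library(ggformula)

# Simulated stand-in for Fingers (hypothetical values, not the real data)
set.seed(4)
sim_fingers <- data.frame(
  Thumb = c(rnorm(80, mean = 58, sd = 8), rnorm(40, mean = 65, sd = 8)),
  Sex   = factor(rep(c("female", "male"), times = c(80, 40)))
)

# Histogram with adjusted bins and colors, split by Sex, then labeled
p_labeled <- gf_histogram(~ Thumb, data = sim_fingers, bins = 10,
                          fill = "orange", color = "gray") %>%
  gf_facet_grid(Sex ~ .) %>%
  gf_labs(title = "Thumb length by sex (simulated data)",
          x = "Thumb length (mm)", y = "Count")
p_labeled
```

Each chained step adds one thing to the plot, so you can build it up a line at a time and check the result as you go.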
Make a grid of histograms to investigate the association between thumb length and a variable of your choice. (Make sure to use a categorical variable.)
# Write code to create paneled histograms of Thumb to explore variables that would do a poor job of explaining variation (categorical variables)
gf_histogram(..density.. ~ Thumb, data = Fingers, fill=~RaceEthnic) %>%
gf_facet_grid(RaceEthnic ~ .)
Here are a few example histograms you could have made. The first, in gray, makes a grid of thumb length based on year in college. The second, in color, is based on Race/Ethnicity.
gf_histogram( ~ Thumb, data = Fingers) %>%
gf_facet_grid(Year ~ .)
gf_histogram(..density.. ~ Thumb, data = Fingers, fill=~RaceEthnic) %>%
gf_facet_grid(RaceEthnic ~ .)