Introduction to Statistics: A Modeling Approach
4.4 Even More Ways: Putting These Plots Together
Both gf_point() and gf_jitter() are useful. They emphasize that data are made up of individual numbers, and yet they help us to notice clusters of those individual points. There are times, however, when we want to transcend the individual data points and focus only on where the clusters are.
Boxplots, which we have seen before, are helpful in this regard, and are especially useful for comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.
Here’s how we would create a boxplot of thumb length broken down by sex.
gf_boxplot(Thumb ~ Sex, data = Fingers)
In making boxplots we can play with the arguments color and fill much like we did before.
gf_boxplot(Thumb ~ Sex, data = Fingers, color = "orange")
Recall that the rectangle at the center of the boxplot shows us where, on the scale of the outcome variable, the middle 50% of the data fall. The thick line inside the box is the median.
Think back to the five-number summary. We can get the five-number summary for Thumb broken down by Sex by modifying how we previously used favstats().
favstats(Thumb ~ Sex, data = Fingers)
The big box with the thick line would contain half of the data points, the half that are closest to the middle of the distribution.
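To make the "middle half" idea concrete, here is a minimal base R sketch. The thumb vector is made up for illustration (it is not the Fingers data), and quantile() is used with its default settings, which may differ slightly from how a particular plotting function computes the box edges.

```r
# Hypothetical thumb lengths in mm (made-up values, not the Fingers data)
thumb <- c(52, 55, 57, 58, 60, 61, 63, 64, 66, 70, 72, 79)

# Five-number summary: minimum, Q1, median, Q3, maximum
five <- quantile(thumb, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)

# Proportion of observations that fall inside the box (between Q1 and Q3)
in_box <- thumb >= five[["25%"]] & thumb <= five[["75%"]]
prop_in_box <- mean(in_box)
print(prop_in_box)
```

With these made-up numbers, six of the twelve values land between Q1 and Q3, which is the half of the data closest to the middle of the distribution.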
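We can also overlay the individual data points on a boxplot by chaining gf_jitter() onto gf_boxplot() with the %>% operator. In ggformula, when we chain multiple functions, the later functions assume the same variables and data frame, so we don’t need to type those in again. Handy!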
gf_boxplot(Thumb ~ Sex, data = Fingers, color = "orange") %>%
  gf_jitter()
We can also add in any arguments to modify gf_jitter().
gf_boxplot(Thumb ~ Sex, data = Fingers, color = "orange") %>%
  gf_jitter(height = 0, color = "gray", alpha = .5, size = 3)
In this situation where we are looking at the variation in Thumb length by Sex, the boxes are in different vertical positions. The male box is higher than the female box.
Instead of an explanatory variable like Sex, let’s try one that is unlikely to help us explain the variation in thumb length.
Modify this code to depict a boxplot for Thumb length by Job (instead of by Sex).
# Setup: load packages and read in the Fingers data
library(mosaic)
library(ggformula)
library(tidyverse)
Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")

# Make sure the measurement and rating variables are stored as numbers
numeric_vars <- c("SSLast", "MathAnxious", "Interest", "GradePredict", "Thumb", "Index", "Middle", "Ring", "Pinkie", "Height", "Weight")
Fingers[numeric_vars] <- lapply(Fingers[numeric_vars], as.numeric)

# Label the levels of the coded categorical variables
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels = c(1, 2, 3, 4, 5), labels = c("White", "African American", "Asian", "Latino", "Other"))
Fingers$Year <- factor(Fingers$Year, levels = c(1, 2, 3, 4), labels = c("freshman", "sophomore", "junior", "senior"))
Fingers$Job <- factor(Fingers$Job, levels = c(0, 1, 2), labels = c("no job", "part-time", "full-time"))
# Modify this boxplot to look at Thumb length by Job
gf_boxplot(Thumb ~ Sex, data = Fingers, color = ~Sex) %>%
  gf_jitter(height = 0, color = "gray", alpha = .5, size = 3)
# Solution: boxplot of Thumb length by Job, with the data points overlaid
gf_boxplot(Thumb ~ Job, data = Fingers) %>%
  gf_jitter(height = 0, color = "gray", alpha = .5, size = 3)
Notice that in this boxplot, the boxes are at approximately the same vertical position and are about the same size. The one exception is the box for the full-time level of Job.
The full-time box includes only one student, so we wouldn’t want to draw any conclusions about the relationship between working full time and thumb length. Most of the students in the Fingers data frame either work part-time or not at all. The thumbs of students with no job are not much longer or shorter than the thumbs of students with part-time jobs. But within each group, thumb lengths vary a lot. There are long-thumbed and short-thumbed students with part-time jobs and with no jobs.
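Before reading much into any one box, it can help to check how many observations sit behind it. The sketch below uses base R on a small made-up data frame (the values and the name d are hypothetical, not the Fingers data) to count the students at each level of Job and to find the median of Thumb within each group.

```r
# Hypothetical data (made-up values, not the Fingers data)
d <- data.frame(
  Thumb = c(58, 61, 64, 55, 66, 60, 63, 59, 70),
  Job = factor(
    c("no job", "no job", "no job", "no job",
      "part-time", "part-time", "part-time", "part-time",
      "full-time"),
    levels = c("no job", "part-time", "full-time")
  )
)

# Number of observations behind each box
group_n <- table(d$Job)
print(group_n)

# Median of Thumb within each group (the thick line in each box)
group_median <- tapply(d$Thumb, d$Job, median)
print(group_median)
```

In this made-up example the full-time group has a single observation, so its "box" collapses to one value and tells us very little about full-time workers in general.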
Now let’s turn our attention to the whiskers (the lines) that extend out from the box. The whiskers are drawn in relation to the IQR, the interquartile range.
In gf_boxplot(), outliers, defined as observations more than 1.5 IQRs above or below the box, are represented with dots. The ends of the whiskers (the lines that extend above and below the box) represent the maximum and minimum observations that are not defined as outliers.
Any data points that fall beyond the ends of the whiskers are depicted in a boxplot as individual points. By convention, these can be considered outliers.
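The 1.5 IQR rule can be worked through by hand. The following base R sketch uses a made-up thumb vector (not the Fingers data) and the default quantile() method, so the exact cutoffs may differ slightly from those a plotting function draws. The names lower_fence and upper_fence are just labels chosen for this illustration.

```r
# Hypothetical thumb lengths in mm (made-up values, not the Fingers data)
thumb <- c(52, 55, 57, 58, 60, 61, 63, 64, 66, 70, 72, 95)

# Edges of the box and the interquartile range
q1 <- quantile(thumb, 0.25, names = FALSE)
q3 <- quantile(thumb, 0.75, names = FALSE)
iqr <- q3 - q1

# Cutoffs 1.5 IQRs below and above the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations beyond the cutoffs are treated as outliers
is_outlier <- thumb < lower_fence | thumb > upper_fence
outliers <- thumb[is_outlier]
print(outliers)

# The whiskers end at the most extreme observations that are not outliers
whisker_ends <- range(thumb[!is_outlier])
print(whisker_ends)
```

Here the box runs from 57.75 to 67, so the IQR is 9.25 and the cutoffs are 43.875 and 80.875. Only the value 95 falls outside them, and the whiskers stop at 52 and 72.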