## Course Outline

• Getting Started (Don't Skip This Part)
• Introduction to Statistics: A Modeling Approach
• PART I: EXPLORING VARIATION
• Chapter 1 - Welcome to Statistics: A Modeling Approach
• Chapter 2 - Understanding Data
• Chapter 3 - Examining Distributions
• Chapter 4 - Explaining Variation
• PART II: MODELING VARIATION
• Chapter 5 - A Simple Model
• Chapter 6 - Quantifying Error
• Chapter 7 - Adding an Explanatory Variable to the Model
• Chapter 8 - Models with a Quantitative Explanatory Variable
• PART III: EVALUATING MODELS
• Chapter 9 - Distributions of Estimates
• Chapter 10 - Confidence Intervals and Their Uses
• Chapter 11 - Model Comparison with the F Ratio
• Chapter 12 - What You Have Learned
• Resources

## Explaining One Variable With Another

Let’s start by looking at the distribution of Thumb.

Write code to draw a histogram of Thumb from the Fingers data frame. Feel free to play around with features like labels (gf_labs()) or with arguments like color, fill, bins, or binwidth.

```r
# Load packages
require(mosaic)
require(tidyverse)

# Import Fingers data frame
Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# Clean up variable types
Fingers$Sex <- as.factor(Fingers$Sex)
Fingers$RaceEthnic <- as.factor(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.factor(Fingers$Year)
Fingers$Job <- as.factor(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)

# Label a few factors
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels = c(1,2,3,4,5), labels = c("White","African American","Asian","Latino","Other"))
Fingers$Job <- factor(Fingers$Job, levels = c(0,1,2), labels = c("not working", "part-time job", "full-time job"))
Fingers$Year <- factor(Fingers$Year, levels = c(1,2,3,4), labels = c("freshman", "sophomore", "junior", "senior"))

# Write code to draw a histogram of Thumb from the Fingers dataset
gf_histogram(~ Thumb, data = Fingers)

# Another solution
hist(Fingers$Thumb)
```

Hint: Use gf_histogram().

We’ve seen this distribution a few times now.
It looks like most of the thumbs run between 40 and 80 mm; the center of the distribution is somewhere around 60 mm; and the distribution is roughly bell-shaped, with most of the observations clustered around the middle and just a few observations in the outer tails.

When we want to explain variation in one variable, a starting place is to think about other variables that might be meaningfully related to it.

One variable that might be meaningfully related to thumb length is Sex. You might intuitively sense that male and female thumb lengths differ. But then again, even among a bunch of females, thumb lengths vary too.

Unfortunately, the variable Sex is not included in our previous histogram. But we can visualize the relationship between Thumb and Sex in a few ways.

One way is by coloring or filling in the data in the histogram by Sex, assigning females one color and males another. To do this we use the fill = argument, but instead of putting in a color we put a tilde (~) and then the name of a variable: fill = ~Sex.

```r
gf_histogram(~ Thumb, data = Fingers, fill = ~Sex)
```

When you color data by an explanatory variable, it’s a bit of a pain to change the default colors. Thankfully, the default color scheme works nicely in this particular situation. Still, it is useful to be able to change the colors. You have to chain on one additional (slightly complicated) line of code (using %>%) and substitute the color names you want for the different values of your explanatory variable. For example, here’s the R code to change the colors of this histogram.

```r
gf_histogram(~ Thumb, data = Fingers, fill = ~Sex) %>%
  gf_refine(scale_fill_manual(values = c("purple", "orange")))
```

If you wanted to change the default outline colors, you would specify scale_color_manual() and chain on this code instead.
```r
gf_refine(scale_color_manual(values = c("purple", "orange")))
```

Try changing the colors used for the different values of Sex (female and male) in the histogram we made before.

```r
# Change the default colors for the different values of the explanatory variable
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~Sex)

# One possible solution
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~Sex) %>%
  gf_refine(scale_fill_manual(values = c("red", "blue")))
```

Hint: The explanatory variable is Sex.

Another way is to split up the histogram we made into two: one for females and another for males. We can chain on (using %>%) the command gf_facet_grid() after gf_histogram(). This will put the histogram of Thumb for females and the one for males in a grid.

```r
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(. ~ Sex)
```

Remember that putting something after the ~ means something gets changed on the x-axis. gf_facet_grid() works the same way. Putting the variable Sex after the ~ puts these two graphs in a row along the x-axis. Putting Sex before the ~ puts these two graphs in a column along the y-axis.

```r
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)
```

This is more helpful because it’s easier to compare where the distributions are along the same Thumb axis. It seems that the distribution of Thumb lengths for males is shifted higher relative to the female distribution. Also, it is immediately apparent that there are fewer males than females. This is when a measure like density (rather than count) comes in handy.

Adjust the following code to re-create these histograms as density histograms.
```r
# Modify this code to create density histograms
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)

# Solution: put ..density.. before the ~ so the y-axis shows density
gf_histogram(..density.. ~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ .)
```
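To see why density helps when the groups differ in size, here is a minimal sketch that uses base R's hist() (not gf_histogram()) on simulated data. The simulated values, the group sizes of 100 and 40, and the names thumb_female and thumb_male are all made up for illustration; they are not the Fingers data.

```r
# Hypothetical simulated thumb lengths (NOT the real Fingers data),
# with deliberately unequal group sizes
set.seed(1)
thumb_female <- rnorm(100, mean = 58, sd = 8)
thumb_male   <- rnorm(40,  mean = 65, sd = 8)

# Use the same 5 mm bins for both groups so they are comparable
all_thumbs <- c(thumb_female, thumb_male)
breaks <- seq(floor(min(all_thumbs) / 5) * 5,
              ceiling(max(all_thumbs) / 5) * 5,
              by = 5)

# plot = FALSE returns the bin counts and densities without drawing
h_female <- hist(thumb_female, breaks = breaks, plot = FALSE)
h_male   <- hist(thumb_male,   breaks = breaks, plot = FALSE)

# Counts add up to the group size, so the larger group has taller bars
count_totals <- c(female = sum(h_female$counts),
                  male   = sum(h_male$counts))

# Densities are scaled so the total bar area is 1 in each group
density_areas <- c(female = sum(h_female$density * diff(h_female$breaks)),
                   male   = sum(h_male$density * diff(h_male$breaks)))

print(count_totals)
print(density_areas)
```

Because each density histogram has a total area of 1 no matter how many people are in the group, the shapes and positions of the two distributions can be compared without the larger group dominating the picture.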
Another way of thinking about Sex explaining variation in Thumb is to say that Thumb is really made up of two different distributions, one for males and one for females. Although both histograms are roughly normal in shape, the average male thumb is longer than the average female thumb. It almost seems like the whole male distribution is shifted higher along the x-axis.

Not only is the center of the distribution different across the two groups, but the variation (or spread) within each of the two histograms is also smaller than it is in the combined distribution.

This isn’t to say that just because we know someone’s sex we definitely know their thumb length. After all, there are both males and females with longer thumbs and both males and females with shorter thumbs. This variation among members of the same group is called within-group variation.

When we combine all the Thumbs together in a single histogram, we are able to see how spread out the overall distribution is. This gives us an idea of the total variation. When we divide the distribution up and look separately at the two histograms, we can see the within-group variation. Notice that these group-specific histograms tend to have less variation than the single histogram. It’s as if some of the variation in Thumb has been accounted for by Sex. Because we can only see the within-group variation after we divide the distribution up by Sex, another name for within-group variation is leftover variation.
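The idea of total versus within-group variation can be sketched with numbers as well as pictures. The example below uses a tiny made-up data frame (thumb_toy is hypothetical, not the Fingers data) and uses the standard deviation as one possible measure of spread; that choice of measure is an assumption for illustration.

```r
# Hypothetical toy data: five females and five males (NOT the Fingers data)
thumb_toy <- data.frame(
  Sex   = factor(rep(c("female", "male"), each = 5)),
  Thumb = c(52, 56, 58, 60, 64,   # females
            60, 64, 66, 68, 72)   # males
)

# Total variation: spread of all thumbs combined
total_sd <- sd(thumb_toy$Thumb)

# Within-group variation: spread computed separately for each sex
within_sd <- tapply(thumb_toy$Thumb, thumb_toy$Sex, sd)

# Group centers, to see the shift between the two distributions
group_means <- tapply(thumb_toy$Thumb, thumb_toy$Sex, mean)

print(total_sd)     # about 5.96
print(within_sd)    # about 4.47 for each group
print(group_means)  # 58 for females, 66 for males
```

In this toy example the spread inside each group is smaller than the spread of the combined data, because part of the combined spread comes from the difference between the two group centers.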
Even though there is still a lot of variation in thumb length left over after taking out Sex, it is still true that if we know someone’s sex we can be a little better at predicting their thumb length. A little better may not be great, but it is better than nothing.

There are some cool things you can do with this grid of histograms. A lot of what you already know about histograms applies here. You can adjust bins, add labels, and chain on density plots.

```r
gf_histogram(..density.. ~ Thumb, data = Fingers, bins = 10) %>%
  gf_facet_grid(Sex ~ .) %>%
  gf_density()
```

You can adjust color and fill as usual.

```r
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = "orange", color = "gray") %>%
  gf_facet_grid(Sex ~ .)
```

Make a grid of histograms to investigate the association of the variable you chose with thumb length. (Make sure to use a categorical variable.)

```r
# Write code to create paneled histograms of Thumb to explore variables that would do a poor job of explaining variation (categorical variables)

# One possible solution, using RaceEthnic
gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~RaceEthnic) %>%
  gf_facet_grid(RaceEthnic ~ .)
```
Hint: Use str() to look up the variables in the data frame to figure out whether they are categorical or quantitative.
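As a rough sketch of that check, the code below builds a small made-up data frame (vars_toy and its values are hypothetical, not the Fingers data), prints its structure with str(), and then lists the variables stored as factors. Treating "stored as a factor" as "categorical" is an assumption that only holds when the categorical variables have already been converted to factors.

```r
# Hypothetical data frame with a mix of variable types (NOT the Fingers data)
vars_toy <- data.frame(
  Thumb  = c(58.2, 61.0, 66.5, 70.1),
  Sex    = factor(c("female", "male", "female", "male")),
  Year   = factor(c("freshman", "senior", "junior", "freshman")),
  Height = c(64, 70, 66, 72)
)

# str() shows each variable with its type: num for quantitative, Factor for categorical
str(vars_toy)

# Pick out the names of the variables that are stored as factors
is_categorical <- sapply(vars_toy, is.factor)
categorical_vars <- names(vars_toy)[is_categorical]
print(categorical_vars)  # "Sex" "Year"
```

Any of the variables listed this way could go on the left of the ~ in gf_facet_grid() to make a grid of histograms.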

Here are a few example histograms you could have made. The gray histograms on the left make a grid of thumb length based on year in college. The colorful histograms are based on Race/Ethnicity.

```r
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Year ~ .)

gf_histogram(..density.. ~ Thumb, data = Fingers, fill = ~RaceEthnic) %>%
  gf_facet_grid(RaceEthnic ~ .)
```