Introduction to Statistics: A Modeling Approach

Quantitative Explanatory Variables

Okay, let’s go back to where we were, explaining the variation in thumb length using the variable Sex.

THUMB LENGTH = SEX + OTHER STUFF

Let’s look at the histograms and scatterplot that go with this word equation. They showed that the overall variation in thumb length could be partially explained by taking sex into account.

gf_histogram(..density.. ~ Thumb, data = Fingers, fill = "orange") %>%
gf_facet_grid(Sex ~ .)
gf_point(Thumb ~ Sex, data = Fingers, color = "orange", size = 5, alpha = .5)

Let’s now see if we can take the same approach with a different explanatory variable: Height. First, let’s write a word equation to represent the relationship we want to explore:

THUMB LENGTH = HEIGHT + OTHER STUFF

We actually could use the same approach with Height as we did with Sex. But, we would first need to recode Height as a categorical variable. Let’s try constructing a new variable by cutting up Height into two categories—short and tall.

Write some code to cut Height from the data frame Fingers into 2 equal-sized categories: people below the median height, and people above the median height. Save these categories into a new variable called Height2Group, and label the two categories “short” and “tall.”

# load packages
library(ggformula)
library(mosaic)
library(dplyr)  # provides ntile() and select()

# import the fingers data frame
Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# clean up str
Fingers$RaceEthnic <- as.factor(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.factor(Fingers$Year)
Fingers$Job <- as.factor(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)

# label a few factors
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels = c(1,2,3,4,5), labels = c("White","African American","Asian","Latino","Other"))
Fingers$Job <- factor(Fingers$Job, levels = c(0,1,2), labels = c("not working", "part-time job", "full-time job"))
Fingers$Year <- factor(Fingers$Year, levels = c(1,2,3,4), labels = c("freshman", "sophomore", "junior", "senior"))

# cut Height from Fingers into 2 categories
# save this as a new variable: Height2Group
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# label Fingers$Height2Group appropriately
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# this prints select variables in Fingers
head(select(Fingers, Thumb, Height, Height2Group))

Hint: use ntile() with Fingers$Height and 2 as the arguments.
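To see what ntile() is doing, it can help to try it on a handful of values first. The sketch below uses six made-up heights (they are not taken from Fingers) and assumes the dplyr package, which provides ntile(), is installed. The names height and group are just for illustration.

```r
library(dplyr)  # ntile() is a dplyr function

# Hypothetical heights in inches (made up for illustration, not from Fingers)
height <- c(70, 61, 66, 73, 64, 67)

# ntile() ranks the values and deals them into 2 buckets of equal size:
# bucket 1 holds the lower half of the ranks, bucket 2 the upper half
group <- ntile(height, 2)

# Attach readable labels, as in the exercise
group <- factor(group, levels = c(1, 2), labels = c("short", "tall"))

# Look at each height next to its category, then count the categories
data.frame(height, group)
table(group)
```

With an even number of people, each category ends up with exactly half of the sample, which is why the split falls at the median.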

Now we can try looking at the data the same way as we did for Sex, which also had two levels.

Create histograms in a grid and a scatterplot to look at variability in Thumb based on Height2Group.

library(mosaic)
library(ggformula)
library(dplyr)  # provides ntile()

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# create histograms of Thumb in a grid by Height2Group
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Height2Group ~ .)

# create a scatterplot of Thumb by Height2Group
gf_point(Thumb ~ Height2Group, data = Fingers)

# use this code if you want your plots to look like ours
gf_histogram(~ Thumb, data = Fingers, fill = "darkolivegreen4") %>%
  gf_facet_grid(Height2Group ~ .)
gf_point(Thumb ~ Height2Group, data = Fingers, alpha = .5, size = 3, color = "darkolivegreen4")

Hint: don't forget to use gf_facet_grid() to put your histograms of Thumb in a grid by Height2Group.

Similar to what we found for Sex, where there was a lot of variability within the female group and within the male group, there is a lot of variability within the short and tall groups. But there is less variability within each group than there is overall, when the groups are combined. Again, it is useful to think about this within-group variation as the leftover variation after explaining some of the variation with Height2Group.
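One rough way to check this impression with numbers is to compare the spread of thumb length overall with the spread within each group. The sketch below uses the standard deviation as the measure of spread, which is our choice for illustration, and a small made-up sample. The values and the names thumb and group are hypothetical, not taken from Fingers.

```r
# Hypothetical thumb lengths in mm for 6 "short" and 6 "tall" people
# (made-up values for illustration only)
thumb <- c(52, 55, 57, 58, 60, 62,   # short group
           59, 61, 63, 65, 66, 70)   # tall group
group <- factor(rep(c("short", "tall"), each = 6))

# Spread of thumb length when everyone is lumped together
sd_overall <- sd(thumb)

# Spread of thumb length within each group separately
sd_within <- tapply(thumb, group, sd)

sd_overall
sd_within
```

In this toy sample both within-group standard deviations come out smaller than the overall one. The same two lines (sd() and tapply()) could be pointed at Fingers$Thumb and Fingers$Height2Group to check the real data.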

Let’s also try looking at these distributions with boxplots.

library(mosaic)
library(ggformula)
library(dplyr)  # provides ntile()

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# create boxplots of Thumb by Height2Group
gf_boxplot(Thumb ~ Height2Group, data = Fingers)

# use this code if you want your plot to look like ours
gf_boxplot(Thumb ~ Height2Group, data = Fingers, color = "darkolivegreen4")

Hint: use gf_boxplot().

See if you can break Height into three categories (let’s call the new variable Height3Group) and then compare the distribution of Thumb across all three categories with a scatterplot. Create boxplots as well.

library(mosaic)
library(ggformula)
library(dplyr)  # provides ntile()

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# break Height into 3 categories: "short", "medium", and "tall"
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))

# create a scatterplot of Thumb by Height3Group
gf_point(Thumb ~ Height3Group, data = Fingers)

# create boxplots of Thumb by Height3Group
gf_boxplot(Thumb ~ Height3Group, data = Fingers)

Hint: don't forget to set the number of categories to 3.

Looking at the two sets of boxplots, we have an intuition that the three-group version of Height explains more variation in thumb length than does the two-group version. Although there is still a lot of variation within each group in the three-group version, the within-group variation appears smaller in the three-group version than in the two-group version. Or, to put it another way, there is less variation left over after taking out the variation due to height.
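If you want to put rough numbers on that intuition, one option is to compare the within-group standard deviations for the two-group and three-group versions. The sketch below does this on a small made-up sample (the values are hypothetical, not from Fingers). To stay free of package dependencies it uses a helper we have called equal_groups(), which is our own stand-in for ntile() and is only meant to work when the sample divides evenly into the groups.

```r
# Hypothetical heights (inches) and thumb lengths (mm) for 12 people
# (made-up values for illustration only)
height <- c(60, 62, 63, 64, 65, 66, 66.5, 67, 68, 69, 71, 74)
thumb  <- c(52, 55, 57, 58, 60, 62, 59, 61, 63, 65, 66, 70)

# Hypothetical stand-in for ntile(): rank the values, then deal the
# ranks into n buckets of equal size
equal_groups <- function(x, n) {
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

height2 <- factor(equal_groups(height, 2), levels = 1:2,
                  labels = c("short", "tall"))
height3 <- factor(equal_groups(height, 3), levels = 1:3,
                  labels = c("short", "medium", "tall"))

# Spread of thumb length within each group, for each version
sd_two   <- tapply(thumb, height2, sd)
sd_three <- tapply(thumb, height3, sd)

sd_two
sd_three
```

In this toy sample every one of the three-group standard deviations is smaller than either of the two-group ones, which matches what the boxplots suggest. Real data will usually be less tidy than this.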

From Categorical to Quantitative Explanatory Variables

Up to this point we have been using Height as though it were a categorical variable. First we divided it into two categories, then three.

When we do this we are throwing away some of the information we have in our data. We know exactly how many inches tall each person is. Why not use that information instead of just categorizing people as either tall or short?

Let’s try another approach: a scatterplot of Thumb length by Height. Try using gf_point() with Height rather than Height2Group or Height3Group. Note: when making scatterplots, the convention is to put the outcome variable on the y-axis and the explanatory variable on the x-axis.

library(mosaic)
library(ggformula)
library(dplyr)  # provides ntile()

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# create a scatterplot of Thumb by Height
gf_point(Thumb ~ Height, data = Fingers)

Hint: use gf_point() with Thumb ~ Height in the Fingers data frame.

The same relationship we spotted in the boxplots when we divided Height into three categories can be seen in the scatterplot. In the image below, we have overlaid boxes at three different intervals along the distribution of Height.

Each box corresponds to one of the three groups of our Height3Group variable. On the x-axis you can see the range in height, measured in inches, for each of the three groups.

Remember that we used ntile() to divide our sample into three groups of equal sizes. Because most people in the sample are clustered around the average height, it makes sense that the box in the middle is the narrowest. There aren’t that many people taller than 70 inches, so getting a tall group that is exactly one-third of the sample means we have to include a wider range of heights.

The heights of the boxes represent the middle 50% of the Thumb distribution for that third of the sample, just like in a boxplot: the bottom of the box is Q1 and the top is Q3. You can see that the thumb lengths of people who are taller tend to be longer. You can also see that height explains only some of the variation in thumb length. Within each band of Height, there is variation in thumb length (look up and down within each box).
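The edges of boxes like these can be computed directly: the left and right edges are the smallest and largest heights in each third of the sample, and the bottom and top edges are Q1 and Q3 of thumb length for that third. The sketch below works this out for a small made-up sample (the values are hypothetical, not from Fingers). The helper equal_groups() is our own stand-in for ntile(), and box_edges is just a name chosen for illustration.

```r
# Hypothetical heights (inches) and thumb lengths (mm) for 12 people
# (made-up values for illustration only)
height <- c(60, 62, 63, 64, 65, 66, 66.5, 67, 68, 69, 71, 74)
thumb  <- c(52, 55, 57, 58, 60, 62, 59, 61, 63, 65, 66, 70)

# Hypothetical stand-in for ntile(): deal the ranks into n equal buckets
equal_groups <- function(x, n) {
  ceiling(rank(x, ties.method = "first") * n / length(x))
}
height3 <- factor(equal_groups(height, 3), levels = 1:3,
                  labels = c("short", "medium", "tall"))

# One column per group, one row per edge of the box
box_edges <- rbind(
  left   = sapply(split(height, height3), min),   # shortest person in group
  right  = sapply(split(height, height3), max),   # tallest person in group
  bottom = sapply(split(thumb, height3), quantile, probs = 0.25),  # Q1 of Thumb
  top    = sapply(split(thumb, height3), quantile, probs = 0.75)   # Q3 of Thumb
)
box_edges

# Width of each box along the x-axis (range of heights in the group)
box_edges["right", ] - box_edges["left", ]
```

In this toy sample the middle box is the narrowest, because the made-up heights are bunched in the middle, and the tall box sits higher on the y-axis than the short box.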

So, just as when we measured height as a categorical variable, although there appears to be some variation in Thumb that is explained by Height, there is also variation left over after we have taken out the variation due to Height.

We can try to explain variation with categorical explanatory variables (such as Sex and Height3Group), but we can also try to explain variation with quantitative explanatory variables (such as Height).

Let’s stretch our thinking further. What if you wanted to have two explanatory variables for thumb length? For example, if we wanted to think about how variation in Thumb might be explained by variation in both Sex and Height, we could represent this idea as a word equation like this.

THUMB LENGTH = SEX + HEIGHT + OTHER STUFF

The variation in thumb length is the same whether we try to explain it with Sex, Height, or both! The total variation in Thumb doesn’t change. But how about that unexplained variation? The better the job done by the explanatory variables, the less leftover variation.
