Correlation
You might have heard of Pearson’s r, often referred to as a “correlation coefficient.” Correlation is just a special case of regression in which both the outcome and explanatory variables are transformed into z scores prior to analysis.
Let’s see what happens when we transform the two variables we have been working with: Thumb length and Height. Because both variables are transformed into z scores, the mean of each distribution will be 0, and the standard deviation will be 1. The function zscore() will convert all the values in a variable to z scores.
require(mosaic)
require(ggformula)
require(supernova)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
# this is for measurement section
#Fingers <- arrange(Fingers, desc(Sex))
#Fingers$FamilyMembers[1] <- 2
#Fingers$Height[1] <- 62
#Fingers$Sex <- recode(Fingers$Sex, '1' = "female", '2' = "male")
Fingers <- data.frame(Fingers)
# clean up str
Fingers$Sex <- as.factor(Fingers$Sex)
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.numeric(Fingers$Year)
Fingers$Job <- as.numeric(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)
Fingers <- filter(Fingers, Thumb >= 33 & Thumb <= 100)
set.seed(2)
Height.model <- lm(Thumb ~ Height, data = Fingers)
Fingers$Height.resid <- resid(Height.model)
Fingers$zThumb <- zscore(Fingers$Thumb)
Fingers$zHeight <- zscore(Fingers$Height)
# this transforms all Thumb lengths into zscores
Fingers$zThumb <- zscore(Fingers$Thumb)
# modify this to do the same for Height
Fingers$zHeight <-
# this transforms all Thumb lengths into zscores
Fingers$zThumb <- zscore(Fingers$Thumb)
# modify this to do the same for Height
Fingers$zHeight <- zscore(Fingers$Height)
test_data_frame("Fingers")
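The z-score transformation described above can also be sketched by hand in base R. This is a minimal illustration, assuming that zscore() computes (value - mean) / standard deviation; the thumb vector below is simulated stand-in data, not the Fingers data set.

```r
# Hypothetical stand-in data (NOT the Fingers data set)
set.seed(1)
thumb <- rnorm(50, mean = 60, sd = 9)

# z score by hand: distance from the mean, in standard deviation units
z_by_hand <- (thumb - mean(thumb)) / sd(thumb)

# After the transformation the mean is 0 and the standard deviation is 1,
# up to floating-point rounding
mean(z_by_hand)
sd(z_by_hand)

# Base R's scale() performs the same centering and scaling
z_scale <- as.numeric(scale(thumb))
all.equal(z_by_hand, z_scale)
```

Whatever the original units were, the transformed values are expressed in standard deviations from the mean.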
Let’s make a scatter plot of zThumb and zHeight and look at the distribution. Then make a scatter plot of Thumb and Height again, and compare the two scatter plots.
Make two scatter plots by modifying the code below.
require(mosaic)
require(ggformula)
require(supernova)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
# this is for measurement section
#Fingers <- arrange(Fingers, desc(Sex))
#Fingers$FamilyMembers[1] <- 2
#Fingers$Height[1] <- 62
#Fingers$Sex <- recode(Fingers$Sex, '1' = "female", '2' = "male")
Fingers <- data.frame(Fingers)
# clean up str
Fingers$Sex <- as.factor(Fingers$Sex)
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$SSLast <- as.numeric(Fingers$SSLast)
Fingers$Year <- as.numeric(Fingers$Year)
Fingers$Job <- as.numeric(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$GradePredict <- as.numeric(Fingers$GradePredict)
Fingers$Thumb <- as.numeric(Fingers$Thumb)
Fingers$Index <- as.numeric(Fingers$Index)
Fingers$Middle <- as.numeric(Fingers$Middle)
Fingers$Ring <- as.numeric(Fingers$Ring)
Fingers$Pinkie <- as.numeric(Fingers$Pinkie)
Fingers$Height <- as.numeric(Fingers$Height)
Fingers$Weight <- as.numeric(Fingers$Weight)
Height.model <- lm(Thumb ~ Height, data = Fingers)
Fingers$Height.resid <- resid(Height.model)
Fingers$zThumb <- zscore(Fingers$Thumb)
Fingers$zHeight <- zscore(Fingers$Height)
# this makes a scatterplot of the raw scores
# size makes the points bigger or smaller
gf_point(Thumb ~ Height, data = Fingers, size = 4, color = "black")
# modify this to make a scatterplot of the zscores
# feel free to change the colors
gf_point( , data = Fingers, size = 4, color = "firebrick")
# this makes a scatterplot of the raw scores
# size makes the points bigger or smaller
gf_point(Thumb ~ Height, data = Fingers, size = 4, color = "black")
# modify this to make a scatterplot of the zscores
# feel free to change the colors
gf_point(zThumb ~ zHeight, data = Fingers, size = 4, color = "firebrick")
ex() %>% check_function("gf_point", index = 1) %>% check_arg("object") %>% check_equal()
ex() %>% check_function("gf_point", index = 1) %>% check_arg("data") %>% check_equal()
ex() %>% check_function("gf_point", index = 2) %>% check_arg("object") %>% check_equal()
ex() %>% check_function("gf_point", index = 2) %>% check_arg("data") %>% check_equal()
ex() %>% check_error()
Fitting the Regression Model to the Two Distributions
In the DataCamp window below we’ve provided the code to fit a regression line for Thumb based on Height. Below that code, fit a regression model to the two transformed variables, using zThumb as the outcome variable (instead of Thumb) and zHeight as the explanatory variable (instead of Height). Save the model in an R object called zHeight.model.
Then, print the model estimates for both the zHeight.model and the Height.model.
require(mosaic)
require(supernova)
Fingers <- supernova::Fingers
Fingers$zThumb <- zscore(Fingers$Thumb)
Fingers$zHeight <- zscore(Fingers$Height)
# this fits a regression model of Thumb by Height
Height.model <- lm(Thumb ~ Height, data = Fingers)
# modify this to fit a regression model predicting zThumb with zHeight
zHeight.model <- lm()
# this prints the estimate
Height.model
zHeight.model
zHeight.model <- lm(zThumb ~ zHeight, data = Fingers)
test_object("zHeight.model")
Next, redo the two scatter plots, this time overlaying the best-fitting regression line for each one.
require(mosaic)
require(supernova)
Fingers <- supernova::Fingers
Fingers$zThumb <- zscore(Fingers$Thumb)
Fingers$zHeight <- zscore(Fingers$Height)
# this overlays the best fitting regression model on this scatter plot
gf_point(Thumb ~ Height, data = Fingers, size = 4, color = "black") %>%
gf_lm()
# modify this to overlay the best fitting regression model on this scatter plot
gf_point(zThumb ~ zHeight, data = Fingers, size = 4, color = "firebrick")
# this overlays the best fitting regression model on this scatter plot
gf_point(Thumb ~ Height, data = Fingers, size = 4, color = "black") %>%
gf_lm()
# modify this to overlay the best fitting regression model on this scatter plot
gf_point(zThumb ~ zHeight, data = Fingers, size = 4, color = "firebrick") %>%
gf_lm()
test_function("gf_point", index = 1)
test_function("gf_point", index = 2)
test_function("gf_lm", index = 1)
test_function("gf_lm", index = 2)
test_error()
Below we’ve organized the results of all this in a table: the two scatter plots, best-fitting regression lines, and estimated model parameters.
Note that R will sometimes express parameter estimates in scientific notation. Thus, 1.801e-16 means that the decimal point is shifted 16 digits to the left. So, the actual y-intercept of the best-fitting regression line is 0.00000000000000018, which is, for all practical purposes, 0.
We know from earlier that the best-fitting regression line passes through the mean of both the outcome and explanatory variables. Note that in the case of zThumb and zHeight, the middle of the scatter plot is at 0 on both the x- and y-axes, because both means are 0. The y-intercept is 0 in this model because the line passes through that point: when x is 0, y is also 0.
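The points about the intercept can be checked with a small sketch on simulated data. The variables below are hypothetical stand-ins for Thumb and Height, and the z scores are computed by hand in base R rather than with zscore(). The sketch also illustrates the link to Pearson’s r: when both variables are z scores, the fitted slope matches the value returned by cor().

```r
# Hypothetical stand-in data (NOT the Fingers data set)
set.seed(2)
height <- rnorm(100, mean = 66, sd = 4)           # simulated heights, inches
thumb  <- 10 + 0.8 * height + rnorm(100, sd = 8)  # simulated thumb lengths, mm

# Standardize both variables by hand
z_height <- (height - mean(height)) / sd(height)
z_thumb  <- (thumb - mean(thumb)) / sd(thumb)

# Regression on the standardized variables
z_model <- lm(z_thumb ~ z_height)
b0 <- unname(coef(z_model)[1])  # intercept
b1 <- unname(coef(z_model)[2])  # slope

# The intercept is a tiny number that R may print in scientific notation;
# all.equal() treats it as 0 within numerical tolerance
b0
isTRUE(all.equal(b0, 0))

# The slope of the standardized regression agrees with Pearson's r
b1
cor(thumb, height)
isTRUE(all.equal(b1, cor(thumb, height)))
```

Comparing with all.equal() rather than == avoids being misled by rounding noise such as the 1.801e-16 intercept.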
Comparing the Fit of the Two Models
Let’s now run supernova() on the two models, and compare their fit to the data.
require(mosaic)
require(supernova)
Fingers <- supernova::Fingers
Fingers$zThumb <- zscore(Fingers$Thumb)
Fingers$zHeight <- zscore(Fingers$Height)
Height.model <- lm(Thumb ~ Height, data = Fingers)
zHeight.model <- lm(zThumb ~ zHeight, data = Fingers)
# this quantifies error from Height.model
supernova(Height.model)
# modify this to quantify error from zHeight.model
supernova()
The fit of the models is identical because all we have changed is the unit in which we measure the outcome and explanatory variables. We saw when we first introduced z scores that transforming an entire distribution into z scores did not change the shape of the distribution, but only the mean and standard deviation (to 0 and 1).
The same thing is true when we transform both the outcome and explanatory variables. The z transformation does not change the shape of the bivariate distribution, as represented in the scatter plot, at all. It simply changes the scale on both axes to standard deviations instead of inches.
Unlike PRE, which is a proportion of the total, SS are expressed in the (squared) units of the measurement. So if we converted the mm (for Thumb length) and inches (for Height) into cm, feet, etc., the SS would change to reflect those new units.
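The contrast between PRE and SS can be sketched with simulated data. The variables are hypothetical stand-ins for Thumb and Height, and fit_summary() is an illustrative helper written for this sketch in base R; it is not part of supernova.

```r
# Hypothetical stand-in data (NOT the Fingers data set)
set.seed(3)
height <- rnorm(100, mean = 66, sd = 4)           # simulated heights, inches
thumb  <- 10 + 0.8 * height + rnorm(100, sd = 8)  # simulated thumb lengths, mm

# Illustrative helper: SS Total, SS Error, and PRE for a fitted lm
fit_summary <- function(model) {
  y <- model.response(model.frame(model))
  ss_total <- sum((y - mean(y))^2)   # error from the empty model
  ss_error <- sum(resid(model)^2)    # error left over after the regression
  c(SS_Total = ss_total,
    SS_Error = ss_error,
    PRE = (ss_total - ss_error) / ss_total)
}

# The same data expressed in three different units
z_thumb  <- as.numeric(scale(thumb))
z_height <- as.numeric(scale(height))
thumb_cm <- thumb / 10

results <- rbind(
  raw = fit_summary(lm(thumb ~ height)),      # mm predicted from inches
  z   = fit_summary(lm(z_thumb ~ z_height)),  # z scores on both variables
  cm  = fit_summary(lm(thumb_cm ~ height))    # cm predicted from inches
)
round(results, 4)
```

In the printed table the PRE column is the same in every row, while the SS columns depend on the units of the outcome variable.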