## Course Outline

• Getting Started (Don't Skip This Part)
• Introduction to Statistics: A Modeling Approach
• PART I: EXPLORING VARIATION
• Chapter 1 - Welcome to Statistics: A Modeling Approach
• Chapter 2 - Understanding Data
• Chapter 3 - Examining Distributions
• Chapter 4 - Explaining Variation
• PART II: MODELING VARIATION
• Chapter 5 - A Simple Model
• Chapter 6 - Quantifying Error
• Chapter 7 - Adding an Explanatory Variable to the Model
• Chapter 8 - Models with a Quantitative Explanatory Variable
• PART III: EVALUATING MODELS
• Chapter 9 - Distributions of Estimates
• Chapter 10 - Confidence Intervals and Their Uses
• Chapter 11 - Model Comparison with the F Ratio
• Chapter 12 - What You Have Learned
• Resources

## Fitting a Regression Model

### Using lm() to Fit the Height Model to TinyFingers

Now you can begin to see the power you’ve been granted by the General Linear Model! Fitting the regression model (that is, estimating its parameters) is accomplished the same way as fitting the group model. It’s all done using the lm() function in R.

The lm() function is smart enough to know that if the explanatory variable is quantitative, the model to estimate is the regression model. If the explanatory variable is categorical (e.g., defined as a factor in R), lm() will fit a group model.
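
As a quick illustration of that rule, the sketch below fits both kinds of model to the six TinyFingers rows, with values copied from the exercise setup and base R only. The names `regression_fit` and `group_fit` are illustrative and not part of the course code.

```r
# Six-row data set; values copied from the TinyFingers exercise setup
TinyFingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71),
  Sex    = factor(c("female", "female", "female", "male", "male", "male"))
)

# Check how R stores each explanatory variable before fitting
is.numeric(TinyFingers$Height)  # quantitative variable
is.factor(TinyFingers$Sex)      # categorical variable

# Quantitative explanatory variable: lm() estimates an intercept and a slope
regression_fit <- lm(Thumb ~ Height, data = TinyFingers)

# Categorical explanatory variable: lm() estimates the mean of the first
# group and the difference between the two group means
group_fit <- lm(Thumb ~ Sex, data = TinyFingers)

# The coefficient names show which kind of model was fit
names(coef(regression_fit))
names(coef(group_fit))
```

The second coefficient is named after the variable itself (Height) in the regression model, and after one level of the factor (Sexmale) in the group model.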

Modify the code below to fit the regression model using Height as the explanatory variable to predict Thumb length in the TinyFingers data.

```r
# setup: load packages and build the data
require(dplyr)  # provides ntile() and recode()
require(ggformula)
require(mosaic)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# set up tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinyFingers$Height <- c(62, 66, 67, 63, 68, 71)
TinyFingers$Height2Group <- ntile(TinyFingers$Height, 2)
TinyFingers$Height3Group <- ntile(TinyFingers$Height, 3)
TinyFingers$Height2Group <- recode(TinyFingers$Height2Group, '1' = "1-Short", '2' = "2-Tall")
TinyFingers$Height3Group <- recode(TinyFingers$Height3Group, '1' = "1-Short", '2' = "2-Medium", '3' = "3-Tall")
```

```r
# exercise: modify this to fit the model
TinyHeight.model <- lm()

# this prints the best fitting estimates
TinyHeight.model
```

Hint: Model Thumb length as a function of Height.

```r
# solution
TinyHeight.model <- lm(Thumb ~ Height, data = TinyFingers)

# this prints the best fitting estimates
TinyHeight.model
```

### Fitting a Regression Model By Accident When You Don’t Want One

Although R is pretty smart about knowing which model to fit, it won’t always do the right thing. If you code the grouping variable with the character strings “short” and “tall,” R will make the right decision because it knows the variable must be categorical. But if you code a grouping variable as 1 and 2, and you forget to make it a factor, R may get confused and fit the model as though the explanatory variable is quantitative.

For example, we’ve added a new variable to our TinyFingers data called GroupNum. Here is what the data look like.
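
A minimal sketch for printing those rows, using base R only. The values are copied from the exercise setup. As an assumption made to avoid extra packages, `rank()` and `ifelse()` stand in for the `ntile()` and `recode()` calls used in the setup; with six distinct heights they produce the same two groups.

```r
# Values copied from the exercise setup
TinyFingers <- data.frame(
  Sex    = factor(c("female", "female", "female", "male", "male", "male")),
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)

# Assumption: rank() replaces ntile(); the three shortest students get
# code 1 and the three tallest get code 2
TinyFingers$GroupNum <- ifelse(rank(TinyFingers$Height) <= 3, 1, 2)

# The same grouping stored as text labels instead of numbers
TinyFingers$Height2Group <- ifelse(TinyFingers$GroupNum == 1, "1-Short", "2-Tall")

# Print the six rows
TinyFingers

# Show how each version of the grouping variable is stored
str(TinyFingers[, c("Height2Group", "GroupNum")])
```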

If you take a look at the variables Height2Group and GroupNum, they carry the same information. Students 1, 2, and 4 are in one group and students 3, 5, and 6 are in another group. If we fit a model with Height2Group (and called it the Height2Group.model) or GroupNum (and called it the GroupNum.model), we would expect the same estimates. Let’s try it.

```r
# setup: load packages and build the data
require(dplyr)  # provides ntile() and recode()
require(mosaic)
require(ggformula)

# set up tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinyFingers$Height <- c(62, 66, 67, 63, 68, 71)
TinyFingers$Height2Group <- ntile(TinyFingers$Height, 2)
TinyFingers$Height3Group <- ntile(TinyFingers$Height, 3)
TinyFingers$Height2Group <- recode(TinyFingers$Height2Group, '1' = "1-Short", '2' = "2-Tall")
TinyFingers$Height3Group <- recode(TinyFingers$Height3Group, '1' = "1-Short", '2' = "2-Medium", '3' = "3-Tall")
TinyFingers$GroupNum <- ntile(TinyFingers$Height, 2)
```

```r
# exercise
# fit a model of Thumb length based on Height2Group
Height2Group.model <- lm()

# fit a model of Thumb length based on GroupNum
GroupNum.model <- lm()

# this prints the parameter estimates from the two models
Height2Group.model
GroupNum.model
```

```r
# solution
# fit a model of Thumb length based on Height2Group
Height2Group.model <- lm(Thumb ~ Height2Group, data = TinyFingers)

# fit a model of Thumb length based on GroupNum
GroupNum.model <- lm(Thumb ~ GroupNum, data = TinyFingers)

# this prints the parameter estimates from the two models
Height2Group.model
GroupNum.model
```

Because Height2Group holds text labels, lm() treats it as a categorical variable (a factor) and fits a group model. But for GroupNum, lm() treats the 1 and 2 codes as values of a quantitative variable because we did not tell R that it was a factor. So it fits a regression line instead of a group model. When that happens, the estimates will not mean what you expect from the group model.
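
One way to avoid the accident is to turn the numeric codes into a factor before fitting. The sketch below assumes base R only; `GroupFactor.model` is a hypothetical name introduced for this illustration, and `rank()` stands in for the `ntile()` call used in the exercise setup.

```r
# Values copied from the TinyFingers exercise setup
TinyFingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)

# Assumption: rank() replaces ntile(); shortest three coded 1, tallest three coded 2
TinyFingers$GroupNum <- ifelse(rank(TinyFingers$Height) <= 3, 1, 2)

# Numeric codes: lm() fits a regression line (the accidental regression model)
GroupNum.model <- lm(Thumb ~ GroupNum, data = TinyFingers)

# Same codes wrapped in factor(): lm() fits a group model
GroupFactor.model <- lm(Thumb ~ factor(GroupNum), data = TinyFingers)

# Compare the two sets of estimates
coef(GroupNum.model)
coef(GroupFactor.model)
```

In the factor version, the first estimate is the mean thumb length of the group coded 1, and the second is the difference between the two group means.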

The slope will still be meaningful: because the two codes are one unit apart, it equals the difference in mean thumb length between people coded as 2 and those coded as 1. But the $$b_{0}$$ estimate will be the y-intercept, i.e., the predicted thumb length when $$X_{i}$$ equals 0. No one in the data is coded 0, so this estimate does not describe either group. This is an accidental regression model.
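
The TinyFingers numbers make this concrete. Using the group assignments described earlier (students 1, 2, and 4 coded 1; students 3, 5, and 6 coded 2), the two group means are:

$$\bar{Y}_{1} = \frac{56 + 60 + 63}{3} \approx 59.67 \qquad \bar{Y}_{2} = \frac{61 + 64 + 68}{3} \approx 64.33$$

With only two distinct values of $$X_{i}$$, the best-fitting line passes through both group means, so the slope is their difference divided by the one-unit gap between the codes:

$$b_{1} = \frac{\bar{Y}_{2} - \bar{Y}_{1}}{2 - 1} \approx 4.67$$

The intercept is then found by stepping back one unit from the group coded 1:

$$b_{0} = \bar{Y}_{1} - b_{1} \cdot 1 = \frac{179}{3} - \frac{14}{3} = 55$$

So the accidental regression model reports 55 as its first estimate, a prediction for a nonexistent group coded 0, whereas the group model reports about 59.67, the mean of the shorter group.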

Try it here by recoding GroupNum as 0 and 1. See if the results fit your expectations.

```r
# setup: load packages and build the data
require(dplyr)  # provides ntile() and recode()
require(mosaic)
require(ggformula)

# set up tiny data set
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")
TinyFingers <- data.frame(Sex, Thumb)
TinyFingers$Sex <- as.factor(TinyFingers$Sex)
TinyFingers$Height <- c(62, 66, 67, 63, 68, 71)
TinyFingers$Height2Group <- ntile(TinyFingers$Height, 2)
TinyFingers$Height3Group <- ntile(TinyFingers$Height, 3)
TinyFingers$Height2Group <- recode(TinyFingers$Height2Group, '1' = "1-Short", '2' = "2-Tall")
TinyFingers$Height3Group <- recode(TinyFingers$Height3Group, '1' = "1-Short", '2' = "2-Medium", '3' = "3-Tall")
TinyFingers$GroupNum <- ntile(TinyFingers$Height, 2)
TinyFingers$Group01 <- ntile(TinyFingers$Height, 2)
Height2Group.model <- lm(Thumb ~ Height2Group, data = TinyFingers)
```

```r
# exercise
# recode GroupNum from 1 and 2 to 0 and 1
TinyFingers$GroupNum <- recode()

# This will fit an accidental regression model
GroupNum.model <- lm(Thumb ~ GroupNum, data = TinyFingers)
GroupNum.model
```

```r
# solution
# recode GroupNum from 1 and 2 to 0 and 1
TinyFingers$GroupNum <- recode(TinyFingers$GroupNum, "1" = 0, "2" = 1)

# This will fit an accidental regression model
GroupNum.model <- lm(Thumb ~ GroupNum, data = TinyFingers)
GroupNum.model
```
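
To check the result against expectations, the sketch below compares the estimates from a 0 and 1 coding with the group means computed directly. It uses base R only. `Code01`, `Code01.model`, `short_mean`, and `tall_mean` are hypothetical names for this illustration, and `rank()` with a subtraction stands in for the `ntile()` and `recode()` calls in the exercise.

```r
# Values copied from the TinyFingers exercise setup
TinyFingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)

# Assumption: rank() replaces ntile(); subtracting 1 replaces recode()
TinyFingers$GroupNum <- ifelse(rank(TinyFingers$Height) <= 3, 1, 2)
TinyFingers$Code01 <- TinyFingers$GroupNum - 1

# Still a regression model, because Code01 is numeric
Code01.model <- lm(Thumb ~ Code01, data = TinyFingers)
coef(Code01.model)

# Group means computed directly, for comparison
short_mean <- mean(TinyFingers$Thumb[TinyFingers$Code01 == 0])
tall_mean  <- mean(TinyFingers$Thumb[TinyFingers$Code01 == 1])
c(short_mean, tall_mean - short_mean)
```

With the 0 and 1 coding, the intercept is the prediction at 0, which is now the mean of the shorter group, so the two printed pairs of numbers agree.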

### Fitting the Height Model to the Full Fingers Data Set

Now that you have looked in detail at the tiny set of data, fit the height model to the full Fingers data frame, and save the model in an R object called Height.model.

```r
# setup: load packages and clean the Fingers data
require(dplyr)  # provides filter()
require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# clean up str: Sex is categorical, the remaining variables are numeric
Fingers$Sex <- as.factor(Fingers$Sex)
numeric_vars <- c("RaceEthnic", "SSLast", "Year", "Job", "MathAnxious",
                  "Interest", "GradePredict", "Thumb", "Index", "Middle",
                  "Ring", "Pinkie", "Height", "Weight")
Fingers[numeric_vars] <- lapply(Fingers[numeric_vars], as.numeric)

# keep only rows with Thumb between 33 and 100
Fingers <- filter(Fingers, Thumb >= 33 & Thumb <= 100)
set.seed(2)
```

```r
# exercise: modify this to fit the Height model of Thumb for the Fingers data
Height.model <-

# this prints best estimates
Height.model
```

Hint: Model Thumb as a function of Height.

```r
# solution
Height.model <- lm(Thumb ~ Height, data = Fingers)

# this prints best estimates
Height.model
```
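
Once a model is saved in an R object, its estimates can be pulled out and used to make a prediction. The sketch below uses the six TinyFingers rows so that it runs without downloading anything; the same base R calls, `coef()` and `predict()`, should apply to Height.model as well. The height of 65 and the names `new_student`, `by_hand`, and `by_predict` are hypothetical, chosen only for illustration.

```r
# Values copied from the TinyFingers exercise setup
TinyFingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)
TinyHeight.model <- lm(Thumb ~ Height, data = TinyFingers)

# Extract the two estimates: b0 (intercept) and b1 (slope)
b0 <- coef(TinyHeight.model)[["(Intercept)"]]
b1 <- coef(TinyHeight.model)[["Height"]]

# Predicted thumb length for a hypothetical student with Height 65,
# computed by hand and with predict()
new_student <- data.frame(Height = 65)
by_hand <- b0 + b1 * 65
by_predict <- unname(predict(TinyHeight.model, newdata = new_student))
c(by_hand = by_hand, by_predict = by_predict)
```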

Here is the code to make a scatter plot to show the relationship between Height (on the x-axis) and Thumb (on the y-axis) for TinyFingers. Note that the code also overlays the best-fitting regression line on the scatter plot. Edit the code to make this scatter plot for the full Fingers data set.

```r
# setup: load packages and clean the Fingers data
require(dplyr)  # provides filter() and %>%
require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# clean up str: Sex is categorical, the remaining variables are numeric
Fingers$Sex <- as.factor(Fingers$Sex)
numeric_vars <- c("RaceEthnic", "SSLast", "Year", "Job", "MathAnxious",
                  "Interest", "GradePredict", "Thumb", "Index", "Middle",
                  "Ring", "Pinkie", "Height", "Weight")
Fingers[numeric_vars] <- lapply(Fingers[numeric_vars], as.numeric)

# keep only rows with Thumb between 33 and 100
Fingers <- filter(Fingers, Thumb >= 33 & Thumb <= 100)
set.seed(2)
```

```r
# exercise: edit this code to create a scatter plot for the full Fingers data
gf_point(Thumb ~ Height, data = TinyFingers, size = 4) %>%
  gf_lm(color = "orange")
```

```r
# solution
gf_point(Thumb ~ Height, data = Fingers) %>%
  gf_lm(Thumb ~ Height, data = Fingers, color = "orange")
```
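
For comparison, here is a rough sketch of the same kind of picture drawn with base R graphics, where the line is taken directly from a model fit by lm(). It uses the six TinyFingers rows (values copied from the exercise setup) so that it runs without the course packages. It is an alternative illustration, not the course's plotting code.

```r
# Values copied from the TinyFingers exercise setup
TinyFingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)
TinyHeight.model <- lm(Thumb ~ Height, data = TinyFingers)

# Scatter plot with Height on the x-axis and Thumb on the y-axis
plot(Thumb ~ Height, data = TinyFingers, pch = 19)

# abline() reads the intercept and slope from the fitted model
# and draws that line over the points
abline(TinyHeight.model, col = "orange")
```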