Introduction to Statistics: A Modeling Approach
2.3 Measurement
Measurement is the process of turning variation in the world into data. When we measure, we assign numbers or category labels to some sample of cases in order to represent some attribute or dimension along which the cases vary.
Let’s make this more concrete by looking at some more measurements, in a data set called Fingers. A sample of college students filled in an online survey in which they were asked a variety of basic demographic questions. They also were asked to measure the length of each finger on their right hand.
require(supernova)
require(mosaic)
require(tidyverse)
# Setup: sort by Sex, set two example values in the first row,
# and convert the categorical columns to numeric codes
Fingers <- arrange(Fingers, desc(Sex))
Fingers$FamilyMembers[1] <- 2
Fingers$Height[1] <- 62
Fingers$Sex <- as.numeric(Fingers$Sex)
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$Job <- as.numeric(Fingers$Job)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
# One way to look at a data frame in R is just to type its name. Try that for the data frame called Fingers.
Fingers
You’ll notice that trying to look at the whole data frame can be very cumbersome, especially for larger data sets.
# Remember the head() command? Use it here to look at just the first six rows of the data frame.
head(Fingers)
The head() command shows you the first six rows of a data frame, but if you want to look at a different number of rows, you can add that number as a second argument, like this.
# Try it and see what happens
head(Fingers, 3)
# The same call with the arguments named explicitly
head(x = Fingers, n = 3)
L_Ch2_Measurement_1
Notice that to answer this question, you need to know something about how these numbers were measured. You need to know: Was Height measured with inches? What number represents which Sex? Does FamilyMembers include the person answering the question?
We will be talking a lot about what measurements mean throughout the class. But before we go on, let’s learn one more way to take a quick look at a data frame.
# Try using tail() to look at the last 6 rows of the Fingers data frame.
tail(Fingers)
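Both head() and tail() accept a second argument giving the number of rows to show. Here is a small sketch using a made-up data frame (the name small_df and its values are invented for illustration and are not part of Fingers):

```r
# A hypothetical 10-row data frame, used only to illustrate head() and tail()
small_df <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

# First three rows
first_three <- head(small_df, 3)

# Last two rows, with the argument named explicitly
last_two <- tail(small_df, n = 2)

first_three
last_two
```

The same pattern should work on Fingers, for example tail(Fingers, 3).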
Levels of Measurement: Quantitative and Categorical Variables
Measures can be divided into two types, often referred to as “levels of measurement”: quantitative and categorical.
FamilyMembers and Height (which in this case was measured in inches) are examples of quantitative variables. The values assigned to quantitative variables represent some quantity (e.g., inches for height). And we can know that someone with a higher number (say, 62) is taller than someone with a lower number (say, 60). Moreover, the difference between the numbers actually tells us exactly how much taller one person is than another.
Categorical variables are quite different. Sex in this data set is a categorical variable. Students categorized themselves as male, female, or other. For purposes of analysis we might code each person in the following way: 1 if they are female; 2 if male; or 3 if other. The specific numbers we assign are arbitrary; we could have said other is 1, female is 2, and male is 3. The numbers don’t tell us anything about quantity; the numbers simply tell us which category the object belongs to.
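To see why this distinction matters, here is a minimal sketch with made-up values (the vectors below are hypothetical, not taken from Fingers). Arithmetic on a quantitative variable is meaningful, while arithmetic on category codes changes whenever we pick a different, equally valid coding:

```r
# Hypothetical heights in inches: differences are meaningful
height <- c(62, 60, 65)
height[1] - height[2]   # how much taller the first person is than the second

# The same four hypothetical people, coded two different ways
sex_coding_a <- c(1, 2, 1, 3)   # 1 = female, 2 = male, 3 = other
sex_coding_b <- c(2, 3, 2, 1)   # 1 = other, 2 = female, 3 = male

# The "average" depends entirely on the arbitrary coding, so it tells us nothing
mean(sex_coding_a)
mean(sex_coding_b)

# Counting how many cases fall in each category does not depend on the coding
sort(as.vector(table(sex_coding_a)))
sort(as.vector(table(sex_coding_b)))
```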
Statisticians find it necessary to distinguish at least two types of variables, each of which is measured in a different way: categorical and quantitative. Don’t get caught up on names. While we use the terms quantitative and categorical, other writers use other terms. They all mean roughly the same thing. Here are a few synonyms for quantitative variable and categorical variable that you may run across:
L_Ch2_Measurement_5
Quantitative and Categorical Variables in R
Quantitative variables are always represented as numeric (or num) variables in R. Categorical variables could be either numeric or character (chr) variables in R, depending on what values they hold. If we were to code the variable Sex, for example, as 1 or 2 (for male and female) we could put the values in a numeric variable in R. If, on the other hand, we wanted to enter the values “male” or “female” into the variable Sex, R would represent it as a character variable. No matter what kind of variable we use in R, from the researcher’s point of view, the variable itself is still categorical.
R won’t necessarily know whether a variable is quantitative or categorical. A number could be used by a researcher to code a categorical variable (e.g., 1 for males and 2 for females), or it could represent units of some real quantitative measurement (1 sibling or 2 siblings). R will usually try to guess what kind of variable it is, but it may guess wrong!
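As a rough illustration (the vectors here are invented for this example), the class() function shows how R stores a variable. Notice that R gives the same answer for a coded categorical variable and a genuinely quantitative one:

```r
# Hypothetical variables, not from the Fingers data frame
sex_code  <- c(1, 2, 2, 1)                          # categorical, coded as numbers
siblings  <- c(1, 2, 0, 1)                          # quantitative
sex_label <- c("male", "female", "female", "male")  # categorical, stored as text

class(sex_code)    # numeric
class(siblings)    # numeric, indistinguishable from sex_code as far as R can tell
class(sex_label)   # character
```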
For that reason, R has a way to let you specify whether a variable is categorical using the as.factor() command. A factor variable, in R, is always categorical. In the Fingers data frame, Sex is coded as 1 or 2. In order for R to know that it is categorical, we can tell it by using the command as.factor(Fingers$Sex). Remember, we also have to save the result of the command back into the Fingers data frame if we want R to remember it. We use the following code to turn Sex into a factor, and then replace the old version of the variable, which was numeric, with the new version, a factor:
Fingers$Sex <- as.factor(Fingers$Sex)
We can also turn a factor back into a numeric variable by using the as.numeric() function.
If the 1s and 2s in the Sex column were numbers, we could add them up using the code sum(Fingers$Sex). But if we tell R that Sex is a factor, it will assume the 1s and 2s refer to categories, and so it won’t be willing to add them up.
Run this code to see the error. To allow R to sum up all the 1s and 2s, you can use as.numeric() to turn the 1s and 2s back into numbers.
require(mosaic)
require(tidyverse)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
# this turns Sex into a factor
Fingers$Sex <- as.factor(Fingers$Sex)
# sum up all the values of Sex (all the 1s and 2s)
# run this code and observe the error
sum(Fingers$Sex)
If you’d like to make this code run, use as.numeric() to correct the code and run it again. Now you should see the sum.
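One caveat is worth knowing when you convert a factor back to numbers: as.numeric() applied to a factor returns R's internal category codes (1, 2, 3, ...), not necessarily the original values. When the categories were coded 1 and 2, as with Sex, the two happen to match. The sketch below uses a made-up vector of ratings (not from Fingers) to show a case where they differ, along with one common way to recover the original values:

```r
# Hypothetical ratings that include 0 and a negative value
ratings <- c(2, 0, -1, 1, 2)
ratings_factor <- as.factor(ratings)

# Returns the position of each value among the sorted categories (-1, 0, 1, 2)
codes <- as.numeric(ratings_factor)
codes

# Converting to character first recovers the original numbers
recovered <- as.numeric(as.character(ratings_factor))
recovered
```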
Depending on your goals, you may decide to treat a variable with numbers as both a quantitative and a categorical variable. If this is the case, it’s a good idea to make two copies of the variable, one numeric and one factor.
For example, Likert scales (those questions that ask you to rate something on a 5- or 7-point scale) could be treated as quantitative variables in some situations, and categorical in other situations. In the Fingers data frame we have a variable called Interest, a rating of how interested the student is in statistics. It is coded on a 4-point scale from 2 (very interested in statistics) to -1 (dread the course).
If you want to ask what the average rating is, you would need the variable to be numeric in R. But if you want to compare people who gave a 0 rating with those who gave 2, you would need to make it a factor.
require(mosaic)
require(tidyverse)
Fingers <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/introstatsmodeling/master/datasets/fingers.csv", header=TRUE, sep=",")
# Here we will create a numeric version of Interest in Fingers. We'll call it Interest.num
Interest.num <- as.numeric(Fingers$Interest)
# Also create a factor version of Interest in Fingers. Call it Interest.factor
Interest.factor <- as.factor(Fingers$Interest)
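Here is a small sketch of why you might want both versions. The numbers are invented for illustration (interest and thumb are hypothetical vectors, not columns read from Fingers): the numeric version lets us compute an average rating, while the factor version lets us count and compare groups.

```r
# Hypothetical ratings and a hypothetical measurement for six students
interest <- c(2, 0, 1, -1, 2, 0)
thumb    <- c(60, 55, 62, 58, 64, 57)

# Numeric version: an average rating makes sense
average_rating <- mean(interest)
average_rating

# Factor version: each rating is treated as a separate group
interest_factor <- as.factor(interest)
table(interest_factor)                # how many students gave each rating
tapply(thumb, interest_factor, mean)  # mean of thumb within each rating group
```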
The str() command will tell you the type of each variable in a data frame, provided it was specified by the researcher.
str(Fingers)
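To get a feel for what str() reports, here is a minimal sketch on a tiny made-up data frame (the name survey and its columns are hypothetical). In the output, num marks a numeric variable, Factor marks a factor, and chr marks a character variable:

```r
# A hypothetical three-row data frame with one variable of each type
survey <- data.frame(
  Height = c(62, 60, 65),            # numeric
  Sex = as.factor(c(1, 2, 1)),       # factor (categorical)
  Name = c("a", "b", "c"),           # character
  stringsAsFactors = FALSE           # keep Name as character in older versions of R
)

str(survey)
```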