2.4 Measurement (Continued)
Understanding the Values of a Variable
It is important to distinguish between the variable (e.g., Height or Sex or Family Members) and the value or number we assign to each object in the sample (e.g., 62 for a height in inches, or 2 to represent male). One variable can take on many different values.
The fewest unique values a variable can take on would be two: the presence (1) or absence (0) of some characteristic. For example, a variable may be coded 1 if someone is a college graduate, or 0 if they are not. If a variable could only take on one possible value there would be no variation and hence it’s not really a variable. It’s possible, however, for quantitative variables to take on an infinite number of possible values.
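As a small illustration of counting how many distinct values a variable takes on, here is a sketch using made-up data. The vector names and values are hypothetical and do not come from Fingers.

```r
# Hypothetical data: 1 = college graduate, 0 = not a college graduate
college_grad <- c(1, 0, 0, 1, 1, 0)

# A column where everyone has the same value (no variation)
all_same <- c(1, 1, 1, 1, 1, 1)

# unique() lists the distinct values; length() counts them
length(unique(college_grad))  # two distinct values: 0 and 1
length(unique(all_same))      # one distinct value, so nothing varies
```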
When we code the values of a variable using numbers, it is always important to keep in mind what the numbers mean. The value 2 has a very different meaning if it represents the sex of a person (e.g., male) than if it represents their height in inches (very short!). When we use statistical software to analyze data, the software processes the numbers. But the software doesn’t know what the numbers actually mean. Only you know that.
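To see what it means for software to process numbers without knowing their meaning, consider this sketch. Both vectors are invented for illustration (they are not the Fingers data), and the coding 1 = female, 2 = male is assumed.

```r
# Hypothetical codes for six people: 1 = female, 2 = male
sex_codes <- c(1, 2, 2, 1, 2, 1)

# Hypothetical heights in inches for the same six people
height_in <- c(62, 70, 68, 64, 71, 63)

# R will average either one without complaint
mean(sex_codes)   # 1.5, which is not the sex of any person
mean(height_in)   # an interpretable average height in inches
```

R returns an answer in both cases. Only the person doing the analysis can tell that the first answer is not meaningful.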
Let’s take a look at just the variable Sex in the Fingers data frame. In R, to access just one variable we first specify the data frame it comes from (Fingers), and then use the $ symbol before specifying the variable name (Sex).
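Here is a minimal sketch of the $ notation. It uses a tiny made-up data frame called toy_fingers (a hypothetical stand-in, with invented values) so that it runs without any packages installed.

```r
# A tiny made-up data frame standing in for Fingers (values are hypothetical)
toy_fingers <- data.frame(
  Sex   = c(1, 2, 2, 1),
  Thumb = c(58, 64, 61, 55)
)

# data_frame$variable pulls out one variable as a vector
toy_fingers$Sex
```

With the real data loaded, the equivalent command is Fingers$Sex.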
We can see that Sex is coded with a bunch of 1s and 2s. The documentation that accompanies the Fingers data tells us that 1 represents female and 2 represents male. But this output looks different from before.
When R is asked to print out multiple variables, it uses the rows and columns format, where rows are cases and columns are variables. But when asked to print out a single variable (such as Sex), R prints out each person’s value on the variable all in a row. When it gets to the end of one row it begins again on the next row.
When we use numbers to indicate levels of a categorical variable (e.g., 1 and 2 are used to represent female and male, the two levels of the categorical variable Sex), it is sometimes hard to remember which number signifies which category or level. With the factor() function, R lets us assign labels to the different levels of a categorical variable.
Only categorical variables should have labels to refer to the numbers or levels because in a categorical variable the numbers are a stand-in for a name. For example, the number 1 just stands for “female”. But in a quantitative variable, the numbers actually stand for some number. So 60 actually stands for 60 mm.
As we use it here, the factor() function takes three arguments: the variable name, the levels, and the labels. You should save the labeled version of the variable in place of the old version.
Fingers$Sex <- factor(Fingers$Sex, levels = c(1, 2), labels = c("female", "male"))
Now when we print out Fingers$Sex, we can see the labels.
That’s cool; now you don’t have to remember which number means what.
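The labeling step can be tried on a small example. The sketch below uses invented codes (not the Fingers data) and assumes the same coding, 1 = female and 2 = male.

```r
# Hypothetical codes: 1 = female, 2 = male
sex_codes <- c(1, 2, 2, 1, 2)

# Labels are matched to levels by position: 1 -> "female", 2 -> "male"
sex_labeled <- factor(sex_codes, levels = c(1, 2), labels = c("female", "male"))

sex_labeled          # prints the labels instead of the numbers
levels(sex_labeled)  # "female" "male"
table(sex_labeled)   # counts per level: 2 female, 3 male
```

Because labels are matched to levels by position, the order of the labels argument has to follow the order of the levels argument.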
The Fingers data documentation is a summary of the contents of this data set. All the data sets in R have documentation. If you look at the documentation for Fingers, it says that the variable RaceEthnic represents racial or ethnic background and is coded like this: 1=White, 2=African American, 3=Asian, 4=Latino, 5=Other. The variable Job represents current employment status and is coded: 1=not working, 2=part-time job, 3=full-time job.
Use factor() to label the levels of RaceEthnic and Job with words that make sense.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers$MathAnxious <- as.numeric(Fingers$MathAnxious)
Fingers$Interest <- as.numeric(Fingers$Interest)
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$Job <- as.numeric(Fingers$Job)
Fingers <- arrange(Fingers, desc(Sex))
# This code resets the window each time you Run. Please don't remove.
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$Job <- as.numeric(Fingers$Job)

# Edit this code to label the levels of RaceEthnic with words that match the labels in the Fingers Data documentation
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels=c(1,2,3,4,5), labels=c("label1","label2","label3","label4","label5"))

# Write code to label the levels of Job with words that match the labels in the Fingers Data documentation.
Fingers$Job <-

# This prints out a few lines of Fingers
head(Fingers)
# A completed version of the exercise
Fingers$RaceEthnic <- as.numeric(Fingers$RaceEthnic)
Fingers$Job <- as.numeric(Fingers$Job)
Fingers$RaceEthnic <- factor(Fingers$RaceEthnic, levels=c(1,2,3,4,5), labels=c("White", "African American", "Asian", "Latino", "Other"))
Fingers$Job <- factor(Fingers$Job, levels=c(1,2,3), labels=c("not working", "part-time job", "full-time job"))
head(Fingers)
Just a quick note: sometimes your output won’t look EXACTLY like what we have here. We’ve formatted the output shown here slightly differently so that it appears bigger and more readable. What matters is that the content is basically the same!
head() will show you a few rows of your data frame but will show you all the variables for those few observations. Sometimes, you might just want to look at a few variables. For instance, in this case, we just want to check out how RaceEthnic and Job were labeled.
We can use the select() function to look at just a few specific variables. When using select(), we first need to tell R which data frame, then which variables to select from that data frame.
select(Fingers, RaceEthnic, Job)
select() will print out all the values of the specified variables. If you just want to look at a few rows of a few variables, you can combine head() and select():
head(select(Fingers, RaceEthnic, Job))
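The combination of head() and select() can be tried on a small made-up data frame. In this sketch, toy_fingers and all of its values are hypothetical. It loads dplyr, the tidyverse package that provides select(), and it passes 3 as the second argument to head() to ask for three rows.

```r
library(dplyr)

# Made-up stand-in for Fingers (all values are hypothetical)
toy_fingers <- data.frame(
  Thumb      = c(58, 64, 61, 55, 60, 66, 59, 62),
  RaceEthnic = factor(c(1, 3, 2, 1, 5, 4, 1, 3), levels = 1:5,
                      labels = c("White", "African American", "Asian", "Latino", "Other")),
  Job        = factor(c(1, 2, 2, 3, 1, 2, 1, 3), levels = 1:3,
                      labels = c("not working", "part-time job", "full-time job"))
)

# select() keeps only the named columns; head() then keeps the first rows
first_rows <- head(select(toy_fingers, RaceEthnic, Job), 3)
first_rows
```

The inner function runs first: select() narrows the columns, and head() then trims the result to its first few rows.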
Before we introduce the next topic, let’s take a look at the first 10 students in the Fingers data frame using head(), but let’s only select their thumb length. The code provided shows the RaceEthnic variable for the first three students in Fingers. Modify that code.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
custom_seed(2)
# Modify the code to show the Thumb variable for the first 10 students in the Fingers dataframe.
head(select(Fingers, RaceEthnic), 3)
head(select(Fingers, Thumb), n=10)
Finally, it is important to note that measurements usually include error. Measurement error is not the same thing as a mistake. For example, the students were told to measure their fingers using mm but maybe they used cm instead. This would be a mistake. But measurement error is different. This is error caused by the natural fluctuation in most real-world measurements.
If you ask 10 different people to measure the same thumb to the nearest millimeter, you will probably get a variety of slightly different results. This can happen for many reasons, but here is one: some people might include the width of the crease between the thumb and palm in their measurement (see Figure below) whereas other people might not.
Measurement is sometimes hard. You might think height or thumb length are fairly easy to measure, but what if you want to measure depression, intelligence, health, and so on? We often want to know about these very important attributes and we have many ways of measuring them. These things are hard to measure, which means these measures often have more error.
Even though a measurement might contain error, this does not necessarily mean it is biased. Error just means that the measure varies when we can assume the thing being measured does not; if 10 people get different measurements of the same thumb, we assume it’s the measurements that vary, not the length of the thumb. A measure is unbiased if its errors are just as likely to be too high as too low, so that they balance out around the true value.
But measurement can also be biased. A biased measure is systematically too high or too low. The error does not vary randomly around the true value, but pulls the measurement one way or the other. For example, if the 10 people who measured the same thumb all rounded up to the next mm, this would bias all the measurements to be slightly bigger than the actual length of the thumb. Contrast this with unbiased error, where some people round down, some round up, and some don’t round at all. Even though these measurements would also have error, they would have unbiased error. This is something to keep in mind later as you analyze the data that is produced by the measures.
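The difference between unbiased and biased error can be sketched with a small simulation. Everything here is assumed for illustration: the true thumb length of 60.4 mm, the size of the random error, and the use of 1000 simulated measurements (rather than 10) so that the averages are stable.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

true_length <- 60.4                        # hypothetical true thumb length in mm
noise <- rnorm(1000, mean = 0, sd = 1)     # random error centered on zero

unbiased <- true_length + noise            # errors fall on both sides of the truth
biased   <- ceiling(true_length + noise)   # everyone rounds up to the next mm

mean(unbiased - true_length)  # close to 0: the errors balance out
mean(biased - true_length)    # clearly above 0: the errors pull upward
```

Both sets of measurements contain error. Only the second set is biased, because always rounding up pushes every measurement in the same direction.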