Manipulating Data
Once data are in a tidy format, it is easy to use simple R commands to manipulate the data. Here are a few things you might want to know how to do, even before you start analyzing your data:
Identifying Missing Data
Filtering Data
Creating Summary Variables
Recoding Variables
Creating Categorical Variables by Cutting Quantitative Variables
Aggregating Data
Identifying Missing Data
Sometimes (in fact, usually) we end up with some missing data in our data set. R represents missing data with the value NA (not available), and then also lets you decide how to handle missing data in subsequent analyses. If your data set represents missing data in some other way (e.g., some people put the value -999), you should recode the values as NA when working in R.
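As a minimal sketch of that recoding step, the snippet below uses a made-up vector (the name responses and its values are hypothetical, not taken from the Fingers data) in which -999 stands for a missing answer, and replaces each -999 with NA.

```r
# Hypothetical responses; -999 was used to mark a missing answer
responses <- c(4, -999, 7, 2, -999)

# Replace every -999 with NA so that R treats it as missing
responses[responses == -999] <- NA

# Count how many values are now missing
n_missing <- sum(is.na(responses))
print(responses)
print(n_missing)
```

Once the values are recoded, functions such as mean() can be told to skip them with the argument na.rm = TRUE.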
Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in descending order by SSLast (remember to save it). Then print out just the variable SSLast from the Fingers data frame (remember to use $).
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
# Arrange the rows in descending order by SSLast and save the result
Fingers <- arrange(Fingers, desc(SSLast))
# Print out just the variable SSLast from the Fingers data frame
Fingers$SSLast
In R, blanks are automatically given the label ‘NA’ for not available. You can choose to remove observations with missing data from an individual analysis, or you can remove them from the data set entirely.
For example, if you wanted to create a new data frame without the missing data (but keep the mistakes, that is, the students who have entered in the last four digits of the SSN), you could use a comparison operator (such as >, <, ==, !=) to check whether the data is missing or not.
This is a situation where it is more useful to think about what SSLast does not equal, because there are a lot of numbers that students could have entered. The phrase SSLast != “NA” means that SSLast does not equal “NA”. This statement is true for anyone who actually entered some numbers. For anyone who did not answer this question, the statement does not come out true: R returns NA, because a comparison involving a missing value is itself missing.
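To make the behavior of this comparison concrete, here is a small sketch on a made-up vector (ss_last and its values are hypothetical). It also shows is.na(), the base R function that tests for missing values directly, as an alternative way to write the same check.

```r
# Hypothetical SSLast-style values: one student left the question blank
ss_last <- c(3, NA, 1234, 7)

# Comparing with "NA": the missing entry yields NA rather than FALSE
not_equal_na <- ss_last != "NA"
print(not_equal_na)

# is.na() tests for missing values directly and never returns NA itself
is_missing <- is.na(ss_last)
print(is_missing)

# Negate it to flag the students who did answer
answered <- !is.na(ss_last)
print(answered)
```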
Filtering Data
We can filter the data so that we only see the observations that do not include missing data using the function filter(), like this:
filter(Fingers, SSLast != "NA")
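Note that filter() filters in, not out: it keeps only the rows where the condition is true. So, in the previous example, we used filter() to only see the observations where SSLast was not equal to “NA”. Rows where the comparison comes out as NA are dropped along with rows where it is false.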
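The same filtering idea can be sketched on a made-up data frame (toy, ID, and the values below are hypothetical). This version uses is.na() rather than a comparison with “NA”, and does the row selection with base R indexing. The filter() call at the end assumes the dplyr package (part of tidyverse) is installed and is skipped otherwise.

```r
# Hypothetical data frame with one missing SSLast value
toy <- data.frame(ID = 1:4, SSLast = c(3, NA, 1234, 7))

# Logical vector: TRUE for the rows we want to keep
keep <- !is.na(toy$SSLast)

# Base R: keep only the rows flagged TRUE, and save the result
toy_subset <- toy[keep, ]
print(toy_subset)

# Same selection with filter(), if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::filter(toy, !is.na(SSLast)))
}
```

As in the text above, the filtered rows persist only because the result is assigned to a new object, here toy_subset.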
As with anything in R, your filtered work is only temporary unless you save it to an R object. So save the data with no missing SSLast values in a new data frame called Fingers.subset.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers <- arrange(Fingers, desc(SSLast))
# Keep only the students who do not have missing data for SSLast
Fingers.subset <- filter(Fingers, SSLast != "NA")
# Print out the variable SSLast from Fingers.subset
Fingers.subset$SSLast
Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.
Creating Summary Variables
Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a measure of just one course. However, if you do have multiple measures, you probably will want to aggregate them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.
It is quite common to create new variables that summarize values from other variables. For example, in Fingers, we have a measurement for the length of each person’s fingers (Thumb, Index, Middle, Ring, Pinkie). By now, you should be able to picture this as a data frame in which each person is a row and the length of each finger is in a column.
Although for some purposes you may want to examine these finger lengths separately, you also might want to create a new variable based on these finger lengths. For example, in most people the Index finger (the second digit) is shorter than the Ring finger (the fourth digit). We can create a new summary variable called RingLonger that tells us whether someone’s Ring finger is longer than their Index finger. We can add this new variable to our Fingers data frame as a new column.
Fingers$RingLonger <- Fingers$Ring > Fingers$Index
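To see what a comparison like this produces, here is a minimal sketch on a made-up three-row data frame (toy and its finger lengths are hypothetical, not values from Fingers). The new column holds TRUE or FALSE for each row; table() and sum() are base R ways to count the results.

```r
# Hypothetical finger lengths for three people
toy <- data.frame(Index = c(70, 65, 72), Ring = c(72, 64, 75))

# TRUE when a person's Ring finger is longer than their Index finger
toy$RingLonger <- toy$Ring > toy$Index
print(toy)

# Count the TRUE and FALSE values
print(table(toy$RingLonger))

# TRUE counts as 1 when summed, so this is the number of TRUE rows
n_ring_longer <- sum(toy$RingLonger)
print(n_ring_longer)
```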
Tally up how many people have longer Ring fingers (relative to their own Index finger).
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$RingLonger <- Fingers$Ring > Fingers$Index
# Tally up RingLonger in Fingers
tally(Fingers$RingLonger)
You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
# Create the summary variable: Index divided by Ring
Fingers$IndexRingRatio <- Fingers$Index / Fingers$Ring
Whenever you make new variables, or even do anything else in R, it’s a good idea to check to make sure R did what you intended it to do. You can use the head() function for this. Go ahead and print out the first six rows of Fingers. Use select() to look at Ring, Index, and RingLonger. By looking at the ring and index fingers of a few students, you can see whether the RingLonger variable ended up meaning what you thought it did.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$RingLonger <- Fingers$Ring > Fingers$Index
# Use head() and select() together to look at the first six rows of Ring, Index, and RingLonger
head(select(Fingers, Ring, Index, RingLonger))
It might be helpful to get an average finger length by adding up all the values of Thumb, Index, Middle, Ring, and Pinkie and dividing by 5. Write code for adding the variable AvgFinger to Fingers that does this. Write code to look at the first few lines of the Fingers data frame as well so you can check that your calculations look correct.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
# Create the summary variable: the average of the five finger lengths
Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Index + Fingers$Middle + Fingers$Ring + Fingers$Pinkie)/5
# Look at the first few lines of Fingers to check the calculation
head(Fingers)
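As an alternative sketch of the same averaging step, the base R function rowMeans() computes the mean across columns for each row. The data frame below is made up (toy and its values are hypothetical); the snippet checks that rowMeans() agrees with adding the five columns and dividing by 5.

```r
# Hypothetical finger lengths for two people
toy <- data.frame(
  Thumb  = c(60, 62),
  Index  = c(70, 68),
  Middle = c(80, 78),
  Ring   = c(72, 70),
  Pinkie = c(58, 60)
)

# Average by hand: add the five columns and divide by 5
manual_avg <- (toy$Thumb + toy$Index + toy$Middle + toy$Ring + toy$Pinkie) / 5

# Average with rowMeans() over the same five columns
finger_cols <- c("Thumb", "Index", "Middle", "Ring", "Pinkie")
toy$AvgFinger <- rowMeans(toy[, finger_cols])

# Check that the two approaches give the same values
same_result <- isTRUE(all.equal(manual_avg, toy$AvgFinger))
print(head(toy))
print(same_result)
```

One design difference to be aware of: rowMeans() accepts na.rm = TRUE to average over only the non-missing fingers, whereas the hand-written sum returns NA for any row with a missing value.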