Introduction to Statistics: A Modeling Approach

Manipulating Data

Once data are in a tidy format, it is easy to manipulate them with simple R commands. Here are a few things you might want to know how to do, even before you start analyzing your data:

  • Identifying Missing Data

  • Filtering Data

  • Creating Summary Variables

  • Recoding Variables

  • Creating Categorical Variables by Cutting Quantitative Variables

  • Aggregating Data

Identifying Missing Data

Sometimes (in fact, usually) we end up with some missing data in our data set. R represents missing data with the value NA (not available), and then also lets you decide how to handle missing data in subsequent analyses. If your data set represents missing data in some other way (e.g., some people put the value -999), you should recode the values as NA when working in R.
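Here is a minimal sketch of that recoding step. It uses a small made-up data frame (the name survey and its columns are hypothetical, not part of the Fingers data) in which someone typed -999 for missing scores.

```r
# Hypothetical toy data in which -999 was used as a missing-data code
survey <- data.frame(
  id    = 1:5,
  score = c(12, -999, 7, 15, -999)
)

# Replace every -999 in score with NA
survey$score[survey$score == -999] <- NA

# Count the missing values, then compute a mean that skips them
sum(is.na(survey$score))
mean(survey$score, na.rm = TRUE)
```

If the -999 values were left in place, R would treat them as real scores and the mean would be badly distorted.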

Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in descending order by SSLast (remember to save it). Then print out just the variable SSLast from the Fingers data frame (remember to use $).

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers

# Arrange SSLast in descending order
Fingers <- arrange(Fingers, desc(SSLast))

# Print out just the variable SSLast from the Fingers data frame
Fingers$SSLast
Did you use arrange(Fingers, desc(SSLast)) to arrange SSLast in descending order? You can print SSLast by typing Fingers$SSLast
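To see where missing values end up after sorting, here is a small sketch with a made-up data frame (toy is hypothetical; it only mimics the SSLast column). It assumes the dplyr package is installed. With arrange(), missing values are placed at the end, even when you sort in descending order.

```r
library(dplyr)

# Hypothetical toy data: two valid digits, one four-digit mistake, one missing value
toy <- data.frame(SSLast = c(3, NA, 1234, 9))

# Sort from largest to smallest; the NA is placed last
sorted <- arrange(toy, desc(SSLast))
sorted$SSLast
```

Sorting this way puts unusually large entries at the top and the missing values at the bottom, so both are easy to spot.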


In R, blanks are automatically given the label ‘NA’ for not available. You can choose to remove observations with missing data from an individual analysis, or you can remove them from the data set entirely.

For example, if you wanted to create a new data frame without the missing data (but keep the mistakes, that is, the students who entered the last four digits of their SSN), you could use a comparison operator (such as >, <, ==, !=) to check whether each value is missing or not.


This is a situation where it is more useful to think about what SSLast does not equal, because there are a lot of numbers that students could have entered. The expression SSLast != "NA" asks whether SSLast does not equal "NA". It is TRUE for anyone who actually entered some numbers. For anyone who did not answer this question, the comparison does not come out TRUE (R returns NA, because it cannot compare a missing value), so those students are not counted as having answered.
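You can check how R evaluates this comparison with a small sketch. The vector below is made up (it only stands in for SSLast). The sketch also shows is.na(), a base R function that asks directly whether each value is missing. It is offered here as an alternative worth knowing, not as the approach this course requires.

```r
# Hypothetical toy vector standing in for SSLast
SSLast <- c(3, NA, 7, 1234, NA)

# Comparing with "NA": missing entries give NA rather than TRUE or FALSE
SSLast != "NA"

# is.na() is TRUE exactly where a value is missing
is.na(SSLast)

# !is.na() is TRUE exactly where a value was entered
!is.na(SSLast)
```

Because the last expression contains only TRUE and FALSE, it is often the safer way to express "not missing".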

Filtering Data

We can filter the data so that we only see the observations that do not include missing data using the function filter(), like this:

filter(Fingers, SSLast != "NA")

Note that filter() filters in, not out. So, in the previous example, we used filter() to see only the observations where SSLast was not equal to "NA".
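The "filters in" idea can be sketched with a made-up data frame (toy and its columns are hypothetical, and the sketch assumes the dplyr package is installed). filter() keeps the rows where the condition is TRUE; rows where the condition is FALSE or NA are left out.

```r
library(dplyr)

# Hypothetical toy data frame; the real Fingers data has many more columns
toy <- data.frame(
  Name   = c("A", "B", "C", "D"),
  SSLast = c(3, NA, 1234, 9)
)

# Keep the rows where a value was entered
entered <- filter(toy, !is.na(SSLast))
entered

# Same idea, different condition: keep only single-digit answers
single_digit <- filter(toy, SSLast < 10)
single_digit
```

In the second call, the row with the missing value is dropped as well, because its condition evaluates to NA rather than TRUE.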

As with anything in R, your filtered work is only temporary unless you save it to an R object. So save the data with no missing SSLast values in a new data frame called Fingers.subset.

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers <- arrange(Fingers, desc(SSLast))

# Filter out the students who have missing data for SSLast
Fingers.subset <- filter(Fingers, SSLast != "NA")

# Print out the variable SSLast from Fingers.subset
Fingers.subset$SSLast
Make sure to assign filter(Fingers, SSLast != "NA") to Fingers.subset

Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.


Creating Summary Variables

Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a measure of just one course. However, if you do have multiple measures, you probably will want to aggregate them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.

It is quite common to create new variables that summarize values from other variables. For example, in Fingers, we have a measurement for the length of each person’s fingers (Thumb, Index, Middle, Ring, Pinkie). By now, you should be able to picture this as a data frame in which each person is a row and each finger length is a column.

Although for some purposes you may want to examine these finger lengths separately, you also might want to create a new variable based on these finger lengths. For example, in most people the Index finger (the second digit) is shorter than the Ring finger (the fourth digit). We can create a new summary variable called RingLonger that tells us whether someone’s Ring finger is longer than their Index finger. We can add this new variable to our Fingers data frame as a new column.

Fingers$RingLonger <- Fingers$Ring > Fingers$Index
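To see what this kind of comparison produces, here is a small sketch with made-up measurements (toy is a hypothetical data frame, not the real Fingers data). It uses only base R.

```r
# Hypothetical toy measurements (in mm)
toy <- data.frame(
  Index = c(70, 68, 75, 72),
  Ring  = c(72, 68, 71, 78)
)

# The comparison runs row by row and gives TRUE or FALSE for each person
toy$RingLonger <- toy$Ring > toy$Index
toy

# sum() treats TRUE as 1 and FALSE as 0, so this counts the TRUE rows
sum(toy$RingLonger)
```

Notice that the second person, whose two fingers are the same length, gets FALSE: the > operator is TRUE only when Ring is strictly longer.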


Tally up how many people have longer Ring fingers (relative to their own Index finger).

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$RingLonger <- Fingers$Ring > Fingers$Index

# Write code to tally up RingLonger in Fingers.
tally(Fingers$RingLonger)
Use the tally() function on Fingers$RingLonger


You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers

# Write code to create this summary variable
Fingers$IndexRingRatio <- Fingers$Index / Fingers$Ring
Divide Fingers$Index by Fingers$Ring using the / operator
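A quick sketch with made-up numbers (toy is hypothetical) may help with reading the ratio. Division, like comparison, is carried out row by row.

```r
# Hypothetical toy measurements (in mm)
toy <- data.frame(
  Index = c(70, 68, 75),
  Ring  = c(70, 80, 60)
)

# Each person's Index length divided by their own Ring length
toy$IndexRingRatio <- toy$Index / toy$Ring
toy
```

A ratio below 1 means the Ring finger is longer, a ratio above 1 means the Index finger is longer, and a ratio of exactly 1 means the two are the same length.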


Whenever you make new variables, or do anything else in R, it’s a good idea to check that R did what you intended. You can use the head() function for this. Go ahead and print out the first six rows of Fingers. Use select() to look at Ring, Index, and RingLonger. By looking at the ring and index fingers of a few students, you can see whether the RingLonger variable ended up meaning what you thought it did.

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$RingLonger <- Fingers$Ring > Fingers$Index

# Use head() and select() together to look at the first six rows of Ring, Index, and RingLonger
head(select(Fingers, Ring, Index, RingLonger))
Have you used head and select?

It might be helpful to get an average finger length by adding up all the values of Thumb, Index, Middle, Ring, and Pinkie and dividing by 5. Write code for adding the variable AvgFinger to Fingers that does this. Write code to look at the first few lines of the Fingers data frame as well so you can check that your calculations look correct.

require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers

# Write code to create this summary variable
Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Index + Fingers$Middle + Fingers$Ring + Fingers$Pinkie) / 5

# Write code to look at a few lines of Fingers
head(Fingers)
Add up Fingers$Thumb, Fingers$Index, Fingers$Middle, Fingers$Ring, Fingers$Pinkie and divide by 5
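One way to double-check an average like this is to compute it a second way and compare. The sketch below uses a made-up two-row data frame (toy is hypothetical) and the base R function rowMeans(), which averages across the columns you select. Treat it as an optional cross-check rather than the method this exercise asks for.

```r
# Hypothetical toy measurements (in mm) for two people
toy <- data.frame(
  Thumb  = c(60, 65),
  Index  = c(70, 75),
  Middle = c(80, 85),
  Ring   = c(72, 78),
  Pinkie = c(58, 62)
)

# Add the five columns and divide by 5, as in the exercise
toy$AvgFinger <- (toy$Thumb + toy$Index + toy$Middle + toy$Ring + toy$Pinkie) / 5

# rowMeans() computes the same per-row average from the selected columns
finger_cols <- c("Thumb", "Index", "Middle", "Ring", "Pinkie")
toy$AvgFinger2 <- unname(rowMeans(toy[, finger_cols]))

# Look at the result to confirm the two columns agree
head(toy)
```

If the two columns ever disagree, a likely culprit is a misplaced parenthesis: without the parentheses around the sum, R would divide only Pinkie by 5.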