Introduction to Statistics: A Modeling Approach

Manipulating Data (Continued)

Recoding Variables

There are some instances where you want to add a label onto a number. There are other cases where you want to change the numbers themselves. For instance, the variable Job is coded 1 for no job, 2 for part-time job, and 3 for full-time job. Perhaps you want to recode full-time job as 100 instead of 3. And maybe you’ll also recode part-time as 50 instead of 2. We’ll recode no job as 0 instead of 1.

recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)

Note that in the recode() function, you need to put the old value in quotes; the new value can be in quotes (if it is a character value) or not (if it is numerical).
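To make the quoting rule concrete, here is a minimal sketch on a small made-up vector rather than on Fingers. The vector `job` and the labels "none", "part-time", and "full-time" are assumptions for illustration only; recode() here is the dplyr function, which the tidyverse loads.

```r
library(dplyr)  # recode() comes from dplyr

# Hypothetical job codes: 1 = no job, 2 = part-time job, 3 = full-time job
job <- c(1, 3, 2, 1, 3)

# Old values in quotes; new numerical values without quotes
job_numeric <- recode(job, "1" = 0, "2" = 50, "3" = 100)

# Old values in quotes; new character values also in quotes
job_label <- recode(job, "1" = "none", "2" = "part-time", "3" = "full-time")

print(job_numeric)
print(job_label)
```

Either form leaves the original vector untouched, so the result still has to be saved somewhere to be used later.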

As always, whenever we do anything, we might want to save it. Try saving the recoded version of Job as Job.recode, a new variable in Fingers. Print a few observations of Job and Job.recode to check that your recode worked.

require(mosaic)
require(tidyverse)
require(supernova)

Fingers <- supernova::Fingers
Fingers$Job <- as.numeric(Fingers$Job)

# Save the recoded version of `Job` to `Job.recode`
Fingers$Job.recode <- recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)

# Print a few observations of `Job` and `Job.recode`
head(select(Fingers, Job, Job.recode))

Creating Categorical Variables by Cutting Quantitative Variables

Sometimes it might be helpful to look at a quantitative variable in a categorical way. For example, if you want to split up the students in the Fingers data set into 2 groups by their Height (a short group and a tall group), you could use the function ntile().

N-tile comes from the idea that if you sort any quantitative variable, you could then divide the observations into groups of equal sizes. So, you could have tertiles (three equal-sized groups), quartiles (four groups), quintiles (five groups), deciles (10 groups), and so on. So ntile just means some number, n, of -tiles.

Running the code below will divide the students into two equal groups: those taller than the middle student, and those shorter. Students who belong to the shorter group will get a 1 and those in the taller group will get a 2.

ntile(Fingers$Height, 2)
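As a small sketch of how the group numbers are assigned, the example below runs ntile() on six made-up heights. The vector `height` is hypothetical and not taken from Fingers; ntile() is the dplyr function loaded by the tidyverse.

```r
library(dplyr)  # ntile() comes from dplyr

# Hypothetical heights in inches (invented for illustration)
height <- c(60, 72, 65, 70, 68, 63)

# Two equal-sized groups: 1 = shorter half, 2 = taller half
two_groups <- ntile(height, 2)
print(two_groups)

# Three equal-sized groups (tertiles); table() counts the members of each
three_groups <- ntile(height, 3)
print(table(three_groups))
```

The group numbers come back in the original order of the observations, so they line up row by row with the variable that was cut.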

Like everything else in R, if you don’t save it to a data frame, this work will go to waste. Use ntile() to create the shorter and taller group. Save this in Fingers as a new variable called Height2Group. Assign the resulting levels the labels ‘short’ and ‘tall’.

require(mosaic)
require(tidyverse)
require(supernova)

Fingers <- supernova::Fingers
Fingers$Job <- as.numeric(Fingers$Job)

# Use `ntile()` to cut the data into groups
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# Label the levels "short" and "tall"
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1, 2), labels = c("short", "tall"))

# This prints out a few observations of Height and Height2Group
head(select(Fingers, Height, Height2Group))

Aggregating Rows

Finally, it is sometimes desirable to change what counts as a unit or row in your data set. Because each row in a tidy data set must refer to the same type of unit (e.g., person, family, school, country), changing what counts as a row will result in the creation of a new data frame.

For example, in the HappyPlanetIndex data frame, each country is represented as a row, and its Happiness score comes from a question in the Gallup World Poll called the ‘Ladder of Life’. Respondents were asked to imagine a ladder, where 0 represents the worst possible life and 10 the best possible life, and to report the step of the ladder they are currently standing on. Here are the first six rows of that data frame.



Take Albania as an example. The Gallup poll asked many respondents and tried to get a representative sample of Albanians. Albania's single Happiness score, 5.5, therefore summarizes the responses of many Albanians. It might be something like an average or a median score.


Now suppose you want the average Happiness of each region of the world instead of each country. To save it, you would need to use your data frame with 143 rows (representing countries) to create a new data frame with seven rows (representing regions). In this new data frame, the Happiness score might be the average (or mean) happiness of the countries in each region.

The command aggregate() can create new data frames with different summary values based on different groupings.

aggregate(Happiness ~ Region, data = HappyPlanetIndex, FUN = mean)
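To see how the unit of a row changes, here is a minimal sketch on a tiny made-up data frame. The data frame `countries`, its region names, and its Happiness values are all invented for illustration and do not come from HappyPlanetIndex.

```r
# Hypothetical country-level data: one row per country
countries <- data.frame(
  Country   = c("A", "B", "C", "D", "E"),
  Region    = c("North", "North", "South", "South", "South"),
  Happiness = c(6.0, 8.0, 4.0, 5.0, 6.0)
)

# One row per region, holding the mean Happiness of its countries
regions <- aggregate(Happiness ~ Region, data = countries, FUN = mean)

print(nrow(countries))  # five rows, one per country
print(regions)          # two rows, one per region
```

The formula reads as "Happiness grouped by Region": the variable on the left is summarized, and the variable on the right defines the groups.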


As always, if we don’t save our work, we can’t refer to it again. Try saving the result of the aggregate() function into a new data frame called HappyRegions.

require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)

HappyPlanetIndex <- Lock5withR::HappyPlanetIndex
HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region,
  '1' = "Latin America",
  '2' = "Western Nations",
  '3' = "Middle East and North Africa",
  '4' = "Sub-Saharan Africa",
  '5' = "South Asia",
  '6' = "East Asia",
  '7' = "Former Communist Countries")

# Use the aggregate() function and save the mean Happiness scores of regions into a new data frame called HappyRegions
HappyRegions <- aggregate(Happiness ~ Region, data = HappyPlanetIndex, FUN = mean)

# Print out HappyRegions
HappyRegions

The command aggregate() is quite useful because you can ask for different summary functions (FUN) other than mean. You can ask for max (for maximum value in the group), min (for minimum value), sum (the total), or median (the middle number).
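As a brief sketch of swapping the summary function, the example below applies max and median to a small made-up data frame. The data frame `scores` and its values are hypothetical; only the FUN argument changes between the two calls.

```r
# Hypothetical scores: three countries in each of two regions
scores <- data.frame(
  Region    = c("North", "North", "North", "South", "South", "South"),
  Happiness = c(5, 6, 9, 2, 4, 7)
)

# Highest Happiness score within each region
region_max <- aggregate(Happiness ~ Region, data = scores, FUN = max)

# Middle Happiness score within each region
region_median <- aggregate(Happiness ~ Region, data = scores, FUN = median)

print(region_max)
print(region_median)
```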