Manipulating Data (Continued)
Recoding Variables
There are some instances where you want to add a label onto a number. There are other cases where you want to change the numbers themselves. For instance, the variable Job is coded 1 for no job, 2 for part-time job, and 3 for full-time job. Perhaps you want to recode full-time job as 100 instead of 3. And maybe you’ll also recode part-time as 50 instead of 2. We’ll recode no job as 0 instead of 1.
recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
Note that in the recode() function, you need to put the old value in quotes; the new value can be in quotes (if it is a character value) or not (if it is numeric).
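To make the quoting rule concrete, here is a minimal sketch on a made-up vector. The name job_codes and its values are hypothetical (they are not taken from Fingers), and the sketch assumes recode() is the dplyr version that is loaded with tidyverse.

```r
library(dplyr)

# Hypothetical job codes: 1 = no job, 2 = part-time, 3 = full-time
job_codes <- c(1, 3, 2, 1)

# Old values go in quotes; numeric new values do not
job_numeric <- recode(job_codes, "1" = 0, "2" = 50, "3" = 100)

# Character new values go in quotes as well
job_labels <- recode(job_codes, "1" = "none", "2" = "part-time", "3" = "full-time")

print(job_numeric)
print(job_labels)
```

Every old value that appears in the vector is given a replacement here, so nothing is left unrecoded.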
As always, whenever we do anything, we might want to save it. Try saving the recoded version of Job as Job.recode, a new variable in Fingers. Print a few observations of Job and Job.recode to check that your recode worked.
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$Job <- as.numeric(Fingers$Job)
# Save the recoded version of `Job` to `Job.recode`
Fingers$Job.recode <- recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
# Print a few observations of `Job` and `Job.recode`
head(select(Fingers, Job, Job.recode))
Creating Categorical Variables by Cutting Quantitative Variables
Sometimes it is helpful to look at a quantitative variable in a categorical way. For example, if you want to split the students in the Fingers data set into two groups by their Height (a short group and a tall group), you could use the function ntile().
N-tile comes from the idea that if you sort any quantitative variable, you could then divide the observations into groups of equal sizes. So, you could have tertiles (three equal-sized groups), quartiles (four groups), quintiles (five groups), deciles (10 groups), and so on. So ntile just means some number, n, of -tiles.
Running the code below will divide the students into two equal groups: those taller than the middle student, and those shorter. Students who belong to the shorter group will get a 1 and those in the taller group will get a 2.
ntile(Fingers$Height, 2)
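As a rough illustration of how ntile() assigns group numbers, the sketch below uses eight made-up heights (the vector heights is hypothetical, not the Fingers data) and assumes ntile() comes from dplyr.

```r
library(dplyr)

# Hypothetical heights in inches (eight made-up values)
heights <- c(61, 70, 64.5, 68, 59, 72, 66, 63)

# Two equal-sized groups: 1 = shorter half, 2 = taller half
two_groups <- ntile(heights, 2)

# Four equal-sized groups (quartiles)
four_groups <- ntile(heights, 4)

# Group numbers are returned in the original order of the data
print(two_groups)

# Count how many observations landed in each quartile
print(table(four_groups))
```

With eight values, each of the two groups holds four observations and each quartile holds two.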
Like everything else in R, if you don’t save the result, this work will go to waste. Use ntile() to create the shorter and taller groups. Save the result in Fingers as a new variable called Height2Group. Then assign the resulting levels the labels ‘short’ and ‘tall’.
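Attaching labels to numeric group codes can be sketched with base R's factor(). The vector group_codes below is hypothetical and simply stands in for the kind of output ntile() produces.

```r
# Hypothetical group codes, such as ntile() might return
group_codes <- c(1, 2, 2, 1)

# Map code 1 to "short" and code 2 to "tall"
group_labels <- factor(group_codes, levels = c(1, 2), labels = c("short", "tall"))

print(group_labels)
print(levels(group_labels))
```

The order of labels must match the order of levels, so that the first level gets the first label.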
require(mosaic)
require(tidyverse)
require(supernova)
Fingers <- supernova::Fingers
Fingers$Job <- as.numeric(Fingers$Job)
# Use `ntile()` to cut the data into two groups
Fingers$Height2Group <- ntile(Fingers$Height, 2)
# Label the levels "short" and "tall"
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1, 2), labels = c("short", "tall"))
# Print a few observations of Height and Height2Group
head(select(Fingers, Height, Height2Group))
Aggregating Rows
Finally, it is sometimes desirable to change what counts as a unit or row in your data set. Because each row in a tidy data set must refer to the same type of unit (e.g., person, family, school, country), changing what counts as a row will result in the creation of a new data frame.
For example, in the HappyPlanetIndex data frame, each country is represented as a row, and its Happiness score comes from a question called the ‘Ladder of Life’ in the Gallup World Poll. Respondents were asked to imagine a ladder, where 0 represents the worst possible life and 10 the best possible life, and to report the step of the ladder they are currently standing on. Here are the first six rows of that data frame.
head(HappyPlanetIndex)
The Gallup poll surveyed many respondents and tried to get a representative sample of Albanians. The single Happiness score for Albania, 5.5, therefore probably summarizes many Albanians’ responses, as something like a mean or a median.
To save the average Happiness of each region of the world, you would need to take your data frame with 143 rows (representing countries) and use it to create a data frame with seven rows (representing regions). In this new data frame, the Happiness score might be the average (or mean) happiness of the countries grouped by region.
The command aggregate() can create new data frames with different summary values based on different groupings.
aggregate(Happiness ~ Region, data = HappyPlanetIndex, FUN = mean)
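To see how the unit of a row changes, here is a small sketch on made-up data. The data frame countries, its region names, and its scores are all hypothetical; only the aggregate() call mirrors the one above.

```r
# Hypothetical country-level data: six made-up countries in two made-up regions
countries <- data.frame(
  Country = c("A", "B", "C", "D", "E", "F"),
  Region = c("North", "North", "North", "South", "South", "South"),
  Happiness = c(6, 7, 8, 4, 5, 3)
)

# One row per region, holding the mean Happiness of its countries
regions <- aggregate(Happiness ~ Region, data = countries, FUN = mean)

# The original data frame has one row per country
print(nrow(countries))

# The aggregated data frame has one row per region
print(regions)
```

The formula reads as "Happiness summarized by Region": the variable on the left is summarized, and the variable on the right defines the groups.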
As always, if we don’t save our work, we can’t refer to it again. Try saving the result of the aggregate() function into a new data frame called HappyRegions.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
HappyPlanetIndex <- Lock5withR::HappyPlanetIndex
HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region, '1'="Latin America", '2'="Western Nations", '3'="Middle East and North Africa", '4'="Sub-Saharan Africa", '5'="South Asia", '6'="East Asia", '7'="Former Communist Countries")
# Use the aggregate() function and save the mean Happiness scores of regions into a new data frame called HappyRegions
HappyRegions <- aggregate(Happiness ~ Region, data = HappyPlanetIndex, FUN = mean)
# Print out HappyRegions
HappyRegions
The command aggregate() is quite useful because you can ask for summary functions (FUN) other than mean. You can ask for max (the maximum value in the group), min (the minimum value), sum (the total), or median (the middle number).
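Swapping the summary function can be sketched as follows. The data frame scores and its column names are hypothetical, chosen only so that the results are easy to check by hand.

```r
# Hypothetical scores for two made-up groups
scores <- data.frame(
  Group = c("a", "a", "a", "b", "b", "b"),
  Score = c(2, 9, 4, 10, 1, 7)
)

# Same formula and data each time; only FUN changes
group_max <- aggregate(Score ~ Group, data = scores, FUN = max)
group_median <- aggregate(Score ~ Group, data = scores, FUN = median)
group_sum <- aggregate(Score ~ Group, data = scores, FUN = sum)

print(group_max)
print(group_median)
print(group_sum)
```

Each call returns a data frame with one row per group, so the results can be saved and inspected like any other data frame.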