Introduction to Statistics: A Modeling Approach

A Data Frame Example: MindsetMatters

Let’s look at a data frame called MindsetMatters. These data come from a study investigating the health of 75 female housekeepers from different hotels. You can read more about how these data were collected and organized in the MindsetMatters R documentation.

A data frame is a kind of object in R, and as with any object, you can just type the name of it to see the whole thing.

Type the name of the data frame MindsetMatters and then Run.

require(mosaic)
require(tidyverse)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")
MindsetMatters$Cond <- factor(MindsetMatters$Cond, levels = c(1,0), labels=c("Informed","Uninformed"))

# Try typing MindsetMatters to see what is in the data frame.
MindsetMatters

Hint: Just type MindsetMatters, then click Run.

Wow, that’s a lot to take in. This is usually the case when working with real data—there are a whole lot of things in a data set including a lot of variables and values. And usually we don’t just sample one case (e.g., one housekeeper)—we have a bunch of values for a bunch of variables for a bunch of housekeepers. So things get pretty complicated, pretty fast.

It’s always useful to take a quick peek at your data frame. But looking at the whole thing can be overwhelming. A helpful command is head(), which shows you just the first few rows of a data frame.

Press the Run button to see what happens when you run the command head(MindsetMatters).

require(mosaic)
require(tidyverse)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")
MindsetMatters$Cond <- factor(MindsetMatters$Cond, levels = c(1,0), labels=c("Informed","Uninformed"))

# Try running this code
head(MindsetMatters)

Hint: Just click Run.

The head() function just prints out the first six rows of the data frame as rows and columns.

Depending on your browser, your output might not look exactly like the one pictured above. However, you should notice in any case that there are so many variables that they do not fit in one row. When the row numbers start again, the columns that didn’t fit are shown. In our output (pictured above), Condition didn’t fit on the screen. Let’s try to read the data for the first housekeeper (in row 1). Here are a few of her measurements: her age was 43, her starting weight (Wt) was 137 pounds, her starting BMI was 25.1, and her BMI at the end of the study (BMI2) was 25.1. She was also in the uninformed condition.
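To see how head() behaves on something small enough to check by eye, here is a minimal sketch that uses a made-up data frame. The name toy and all of its values are hypothetical and are not part of MindsetMatters. The n argument of head() sets how many rows come back.

```r
# A hypothetical data frame with 8 rows (values invented for illustration)
toy <- data.frame(
  Age = c(43, 42, 41, 40, 33, 24, 46, 21),
  Wt  = c(137, 150, 124, 173, 163, 90, 190, 161)
)

# By default, head() returns the first six rows
first_six <- head(toy)
nrow(first_six)

# The n argument asks for a different number of rows
first_three <- head(toy, n = 3)
first_three
```

Because toy has only two variables, every column fits on one line, so the output does not wrap the way the MindsetMatters output does.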

Sometimes, it’s more useful to take a look at an overview of what’s in the data frame. The function str() shows us the overall structure of the data frame, including number of observations, number of variables, names of variables and so on.

Run str() on MindsetMatters and look at the results.

require(mosaic)
require(tidyverse)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")

# Try running this code
str(MindsetMatters)

Hint: Just click Run.

Note that there is a $ in front of each variable name. In R, $ is often used to indicate that what follows is a variable name. If you want to specify the Age variable in the MindsetMatters data frame, for example, you would write MindsetMatters$Age. (R has its own way of categorizing variables, such as int, num, and Factor. You will learn more about these later.)
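The sketch below shows the same two ideas on a tiny made-up data frame: str() lists each variable after a $, and $ pulls a single variable out as a vector. The data frame toy, its variable Group, and all values are hypothetical and chosen only for illustration.

```r
# A hypothetical data frame (values invented for illustration)
toy <- data.frame(
  Age   = c(43L, 42L, 41L),
  Wt    = c(137.5, 150.0, 124.2),
  Group = factor(c("Uninformed", "Informed", "Informed"))
)

# str() lists each variable after a $, with its type and first few values
str(toy)

# $ pulls one variable out of the data frame as a vector
ages <- toy$Age
ages

# The extracted vector has one value per row of the data frame
length(ages) == nrow(toy)
```

Here Age is stored as int, Wt as num, and Group as a Factor, which are the same labels that appear in the str() output.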

Let’s try one of our statistical techniques on the MindsetMatters data frame. We’ll make a frequency table with tally(). For instance, we can tally how many housekeepers there were of each age.

tally(MindsetMatters$Age)

But we can also specify the variable and data frame separately like this:

tally(~ Age, data = MindsetMatters)

The rows that start with 19, 37, and 54 represent the ages of the housekeepers and the numbers underneath them represent how many of each age are in this data frame. For example, there is one housekeeper who is 19 years old. There are two housekeepers who are 54 years old. There are three housekeepers who are 45 years old.
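As a cross-check on how to read a frequency table, the sketch below counts a short made-up vector of ages. It swaps in base R’s table() function for tally() so that it runs without the mosaic package; for a single variable it should give the same kind of counts. The vector ages and its values are hypothetical, not the study data.

```r
# Hypothetical ages (invented for illustration, not the study data)
ages <- c(19, 45, 54, 45, 37, 54, 45)

# table() counts how many times each distinct value occurs
age_counts <- table(ages)
age_counts

# Look up the count for a single age by its label
age_counts[["45"]]

# The counts add up to the number of values that were tallied
sum(age_counts) == length(ages)
```

Each distinct age appears once in the top row of the printed table, with its count directly underneath.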

Try using the tally function to make a frequency table of housekeepers by Condition.

require(mosaic)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")
MindsetMatters$Cond <- factor(MindsetMatters$Cond, levels = c(1,0), labels=c("Informed","Uninformed"))

# Use the tally() function on the MindsetMatters data frame to create a frequency table of housekeepers by Condition

# One solution
tally(~Condition, data=MindsetMatters)

# Another solution
tally(MindsetMatters$Condition)

Hint: Use tally(), ~Cond, and data=MindsetMatters

Let’s focus on two variables in the MindsetMatters data frame: Age (the age of the housekeepers, in years, at the start of the study) and Wt (their weight, in pounds, at the start of the study).

We might want to sort the whole data frame MindsetMatters by Age. But now we can’t use the sort() function, because it only works with vectors, not with data frames. So, to sort a whole data frame, we will use a different function, arrange().

The arrange() function works similarly to sort(), except now you have to specify both the name of the data frame and the name of the variable on which you want to sort.

arrange(MindsetMatters, Age)

And, importantly, when you sort on one variable (e.g., Age), the order of the rows (which in this case are housekeepers) will change for every variable.
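The difference between sort() and arrange() is easier to see on a tiny example. This is a minimal sketch that assumes the dplyr package is installed (it is loaded as part of the tidyverse); the data frame toy and its values are hypothetical.

```r
library(dplyr)  # provides arrange()

# Hypothetical data frame (values invented for illustration)
toy <- data.frame(
  Age = c(43, 21, 33),
  Wt  = c(137, 161, 163)
)

# sort() works on a single vector and returns only the sorted ages
sort(toy$Age)

# arrange() reorders whole rows, so each Wt stays with its Age
by_age <- arrange(toy, Age)
by_age
```

In by_age, the 21-year-old is now in the first row and her weight of 161 moved with her, which is what it means for the row order to change for every variable.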

The printout of MindsetMatters won’t stay arranged by age because we didn’t save our work! In order to save it, we need to assign the arranged version to an R object. We don’t need to create a new object, so let’s just assign it back to MindsetMatters using the assignment operator (<-). See if you can edit the code below to save the version of MindsetMatters arranged by Age back to MindsetMatters. Then print out the first 6 rows of MindsetMatters using head().

require(mosaic)
require(tidyverse)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")

# save MindsetMatters, arranged by Age, back to MindsetMatters
MindsetMatters <- arrange(MindsetMatters, Age)

# print out the first 6 rows of MindsetMatters
head(MindsetMatters)

Hint: Save the arranged data frame back to MindsetMatters.

The function arrange() can also be used to arrange values in descending order by adding desc() around our variable name.

arrange(MindsetMatters, desc(Age))
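Here is a minimal sketch of desc() on a made-up data frame, which also shows that the new order is kept only when the result is assigned. It assumes the dplyr package is installed; the data frame toy and its values are hypothetical.

```r
library(dplyr)  # provides arrange() and desc()

# Hypothetical data frame (values invented for illustration)
toy <- data.frame(
  Age = c(43, 21, 33),
  Wt  = c(137, 161, 163)
)

# arrange() returns a new data frame; toy itself is unchanged here
arrange(toy, desc(Age))
toy$Age

# Assigning the result back is what saves the new order
toy <- arrange(toy, desc(Age))
toy$Age
```

After the assignment, the oldest person is in the first row and the weights have moved along with their rows.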

Try arranging MindsetMatters by Wt in descending order. Save this to MindsetMatters. Print a few rows of MindsetMatters to check out what happened.

require(mosaic)
require(tidyverse)
MindsetMatters <- read.csv(file="https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/mindset-matters.csv", header=TRUE, sep=",")

# arrange MindsetMatters by Wt in descending order
MindsetMatters <- arrange(MindsetMatters, desc(Wt))

# print out a few rows of MindsetMatters
head(MindsetMatters)

Hint: Did you arrange by desc(Wt)?
