Introduction to Statistics: A Modeling Approach

Fooled by Chance: The Problem of Type I Error

Let’s take a break from this praise-fest for the random assignment experiment, and talk about why we might still be fooled, even with the most powerful of all research designs.

Let’s say we did an ideal experiment: everything was done very carefully and by the book. We randomly assigned restaurant servers to the two groups, and those in the experimental group drew a smiley face just as they were instructed. Other than that, servers in both groups just went about their business normally. At the end of the experiment we measured the tips, and the servers in the smiley face group did, indeed, end up with a bit more money.


If this was a perfectly done experiment, there are two possible reasons why the smiley face group made more tips. The first and most interesting possible reason is that there is a causal relationship between drawing a smiley face and tips! That would be cool if a little drawing really does help. But there is a second reason as well: random variation. It’s true that we randomly assigned servers to one of the groups. But even if we did not intervene, and no one drew smiley faces, we would still expect some difference in tips across the two groups just by chance.

In a random assignment experiment, we know that any difference in an outcome variable between two groups prior to an intervention would be the result of random chance. But this does not mean that the difference between the two groups would be 0, i.e., that the two groups would have the exact same distribution on the outcome variable.
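To make this concrete, here is a minimal simulation of the idea. It uses made-up tip amounts generated with rnorm(), not the course’s data, so the numbers are purely hypothetical:

# 44 hypothetical servers with randomly generated tip totals
set.seed(1)
tips <- round(rnorm(44, mean = 30, sd = 10), 2)

# Randomly assign the servers to two groups of 22, with no intervention at all
group <- sample(rep(1:2, each = 22))

# The difference in mean tips is almost never exactly 0
mean(tips[group == 1]) - mean(tips[group == 2])

Run this a few times without the set.seed() line and you will see the difference in means bounce around zero, even though nothing distinguishes the two groups.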

Let’s explore this idea further in a data frame called Servers. There are 44 restaurant servers in the data frame and two variables: ServerID (just a number from 1 to 44 to identify each server) and Tip (in dollars, earned for one day of work). Here are a few rows of this data frame.

head(Servers)


We went ahead and randomly assigned each server to Group 1 or Group 2. We put their randomly assigned group number into a variable called RandomGroups1. Here are a few rows of Servers showing this new variable.

head(Servers)

Make histograms in a facet grid of Tip by the RandomGroups1 variable from the data frame Servers.

# Setup: load packages and construct the Servers data frame
require(mosaic)
require(ggformula)
require(haven)

Tip_Study_Data <- read.csv("https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/tip_study_data.csv", header = TRUE, sep = ",")

set.seed(100)
ServerID <- c(1:44)
Tip <- sample(Tip_Study_Data$tips, 44)
Servers <- data.frame(ServerID, Tip)

set.seed(8)
Servers <- sample(Servers, orig.id = FALSE)
Servers$RandomGroups1 <- append(rep(1, 22), rep(2, 22))
Servers$RandomGroups1 <- as.factor(Servers$RandomGroups1)
Servers <- arrange(Servers, ServerID)

# Create density histograms in a facet grid of Tip by the RandomGroups1 variable
gf_histogram(..density.. ~ Tip, data = Servers) %>%
  gf_facet_grid(RandomGroups1 ~ .)
Hint: use `..density..` to create a density histogram.
DataCamp: ch4-17


We took these 44 servers and did the whole thing again, randomly assigning them to Group 1 or 2. We put the results in a new variable called RandomGroups2. We used a function called shuffle() like this:

Servers$RandomGroups2 <- shuffle(Servers$RandomGroups1)
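As a quick side note (our own check, not part of the exercise): shuffle() only permutes the labels that are already there, so each group still ends up with exactly 22 servers. You can confirm this with tally():

tally(~ RandomGroups2, data = Servers)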

Write the code to randomly assign these servers one more time and put the results in yet another variable called RandomGroups3. Then print a few rows of the data frame Servers with these new random group assignments.

# Setup: packages, the Servers data frame, and the first two random assignments
require(mosaic)
require(ggformula)
require(haven)

Tip_Study_Data <- read.csv("https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/tip_study_data.csv", header = TRUE, sep = ",")

set.seed(100)
ServerID <- c(1:44)
Tip <- sample(Tip_Study_Data$tips, 44)
Servers <- data.frame(ServerID, Tip)

set.seed(8)
Servers <- sample(Servers, orig.id = FALSE)
Servers$RandomGroups1 <- append(rep(1, 22), rep(2, 22))
Servers$RandomGroups1 <- as.factor(Servers$RandomGroups1)
Servers <- arrange(Servers, ServerID)

set.seed(17)
Servers$RandomGroups2 <- shuffle(Servers$RandomGroups1)
set.seed(20)

# shuffle the group assignments again
Servers$RandomGroups3 <- shuffle(Servers$RandomGroups1)

# print a few lines of Servers
head(Servers)
Hint: use the shuffle() function.
DataCamp: ch4-18


Go ahead and make faceted histograms of Tip, first paneled by RandomGroups2 and then by RandomGroups3.

# Setup: packages, the Servers data frame, and the random group assignments
require(mosaic)
require(ggformula)
require(haven)

Tip_Study_Data <- read.csv("https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/tip_study_data.csv", header = TRUE, sep = ",")

set.seed(100)
ServerID <- c(1:44)
Tip <- sample(Tip_Study_Data$tips, 44)
Servers <- data.frame(ServerID, Tip)

set.seed(8)
Servers <- sample(Servers, orig.id = FALSE)
Servers$RandomGroups1 <- append(rep(1, 22), rep(2, 22))
Servers$RandomGroups1 <- as.factor(Servers$RandomGroups1)

set.seed(17)
Servers$RandomGroups2 <- shuffle(Servers$RandomGroups1)
set.seed(20)
Servers$RandomGroups3 <- shuffle(Servers$RandomGroups1)

# create histograms in a facet grid for RandomGroups2
gf_histogram(..density.. ~ Tip, data = Servers) %>%
  gf_facet_grid(RandomGroups2 ~ .)

# create histograms in a facet grid for RandomGroups3
gf_histogram(..density.. ~ Tip, data = Servers) %>%
  gf_facet_grid(RandomGroups3 ~ .)
DataCamp: ch4-19

Here are the histograms for the three times we randomly assigned the servers to a group.

[Figure: density histograms of Tip in facet grids, one for each of RandomGroups1, RandomGroups2, and RandomGroups3]

In this case, remember, these groups are different because of random chance! We randomly assigned them to these groups. There is nothing special about any of these groups. But even when there is nothing special about the groups, they look different from one another.

Sometimes the difference—which must, in this case, be due to chance—will appear large, as in the third set of histograms. In this case, we know that the difference is not due to the effect of some variable such as drawing a smiley face, because we haven’t done an experimental intervention at all. It must be due to chance.
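If you are curious just how large these chance differences can get, one way to find out is to shuffle the group labels many times and record the difference in mean tips each time. This sketch uses the mosaic functions do() and diffmean(), which go a bit beyond what we have covered so far, and the name chance_diffs is just ours:

# Re-randomize the group labels 1000 times; diffmean() records the
# difference in mean Tip between the two groups for each shuffle
chance_diffs <- do(1000) * diffmean(Tip ~ shuffle(RandomGroups1), data = Servers)

# The histogram shows how big a difference chance alone can produce
gf_histogram(~ diffmean, data = chance_diffs)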

Back to the Real Tipping Experiment

Now let’s go back to the actual tipping experiment. The data from the study are in a data frame called TipExperiment with three variables: ServerID, Tip, and Condition (coded Smiley Face or Control). Here are the first six rows of this data frame.

head(TipExperiment)

In the histogram below, we plotted the distribution of the outcome variable (Tip) for both conditions in TipExperiment.

We observe some difference in the distributions across the groups. But keep in mind that we observed differences even when there had been no intervention! How do we know whether this difference between the experimental and control groups is a real effect of drawing a smiley face and not just due to chance? How big would the difference have to be to convince us that the smiley faces had an effect?

This question is at the root of what statisticians call Type I error. A Type I error occurs when we conclude that some variable we manipulated (the smiley face, in this case) had an effect when in fact the observed difference was simply due to random sampling variation.
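Although the formal machinery comes later, the logic can be sketched now: shuffle the Condition labels many times to see what differences chance alone produces, then ask how often chance matches or beats the difference we actually observed. (This is our own illustrative sketch using mosaic functions, not a step taken in the study itself.)

# Observed difference in mean tips between the two conditions
obs_diff <- diffmean(Tip ~ Condition, data = TipExperiment)

# Differences generated by chance alone: shuffle the condition labels
chance_diffs <- do(1000) * diffmean(Tip ~ shuffle(Condition), data = TipExperiment)

# Proportion of shuffles producing a difference at least as large as the
# observed one; if it is small, chance alone is an unlikely explanation
tally(~ abs(diffmean) >= abs(obs_diff), data = chance_diffs, format = "proportion")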


Later you will learn how to reduce the probability of making a Type I error. For now, just be aware of it. But spoiler alert: you can never reduce the chance of Type I error to zero; there is always some chance you might be fooled.
