## Course Outline

• Getting Started (Don't Skip This Part)
• Introduction to Statistics: A Modeling Approach
• PART I: EXPLORING VARIATION
• Chapter 1 - Welcome to Statistics: A Modeling Approach
• Chapter 2 - Understanding Data
• Chapter 3 - Examining Distributions
• Chapter 4 - Explaining Variation
• PART II: MODELING VARIATION
• Chapter 5 - A Simple Model
• Chapter 6 - Quantifying Error
• Chapter 7 - Adding an Explanatory Variable to the Model
• Chapter 8 - Models with a Quantitative Explanatory Variable
• PART III: EVALUATING MODELS
• Chapter 9 - Distributions of Estimates
• Chapter 10 - Confidence Intervals and Their Uses
• Chapter 11 - Model Comparison with the F Ratio
• Chapter 12 - What You Have Learned
• Resources

## Research Design

We now have a sense of what it means to explain variation in an outcome variable with an explanatory variable. And, we’ve visited briefly with the idea that if we add more explanatory variables we can reduce the amount of variation that is unexplained. But let’s now dig a little deeper into the meaning of the word “explained.”


### The Problem of Inferring Causation

Thus far, we have used the word “explain” to mean that variation in one variable (the explanatory variable) can account for variation in another variable (the outcome variable). Put another way, the unexplained variation in the outcome variable is reduced when the explanatory variable is included in the model.

The problem with this definition of “explain” is that it sometimes does not result in an explanation that makes sense. In our attempt to make sense of the world, we usually are looking for a causal explanation for variation in an outcome variable. A variable may be related to another variable, but that doesn’t necessarily mean that the relationship is causal.

These examples illustrate two specific problems we can run into as we try to understand the relationships among variables. First, the direction of the causal relationship may not be what we assume. Suppose Researcher B erroneously concluded that skimpy clothes cause temperatures to rise, based on a relationship between the two variables.

In one sense he is correct: People do tend to wear skimpier clothes in hotter weather. But he is undoubtedly wrong in his causal inference: it is not the wearing of skimpy clothing that causes hot weather, but instead hot weather that causes people to wear skimpy clothing. The fact that two variables are related does not in and of itself tell us what the direction of causality might be. We’ll call this the directionality problem.

A second problem is the problem of confounding. Suppose Researcher A has found a pattern: variation in shoe size is related to variation in academic achievement. But it is possible that neither of these two variables causes the other. Instead, there may be a confound (sometimes called a “lurking variable” or a third variable) that, though not measured in our study, nevertheless causes both shoe size and academic achievement.
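A confound like this is easy to simulate. In the sketch below (all numbers invented for illustration), a child's age drives both shoe size and test score, so the two measured variables end up strongly correlated even though neither causes the other:

```python
import random

random.seed(3)  # for reproducibility

# Hypothetical data: for each child, age (the unmeasured confound)
# drives both shoe size and test score.
children = []
for _ in range(200):
    age = random.uniform(6, 12)                 # years
    shoe_size = 0.9 * age + random.gauss(0, 1)  # feet grow with age
    score = 8.0 * age + random.gauss(0, 10)     # scores rise with age
    children.append((shoe_size, score))

# Pearson correlation between shoe size and score
def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    sx = (sum((x - mx) ** 2 for x, _ in pairs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for _, y in pairs) / n) ** 0.5
    return cov / (sx * sy)

print(f"correlation(shoe size, test score) = {correlation(children):.2f}")
```

The printed correlation is substantial, yet by construction shoe size has no causal effect on test scores at all; age, the lurking variable, produces the pattern.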

Sometimes we may not care about causality. For example, if our goal is merely to use an explanatory variable to predict an outcome in a future sample, it doesn’t really matter if the relationship is causal; just based on the relationship itself, we can bet on the same pattern emerging in a future study.

But more often than not, we do want to know what the causal relationships are. If we want to understand why things turn out the way they do, we usually won’t be satisfied unless we have identified the causes of the outcome we are trying to understand.

We also will need to identify the variables that cause an outcome if our goal is to change the outcome (e.g., help children score higher or impact climate change). No matter how good shoe size might be at predicting test scores, giving kids bigger shoes won’t help them score higher.

Indeed, being able to change one thing and see an effect on something else is our commonsense notion of what causality means. It’s one of the main ways we know when we truly understand a system.

### Research Designs and Causality

The example cases portrayed above are intentionally silly. But in reality, we often do not know when there is a directionality problem (i.e., we have two related variables but don’t know which is the cause and which is the effect) or a confounding problem. It’s something that researchers always need to worry about. Our best tool for differentiating real causal relationships from spurious relationships (i.e., relationships that are not causal but that instead result from some unmeasured third variable) is research design.

The simplest research design is to take a random sample from a population and then measure some variables. This kind of design is referred to as a correlational study, or an observational study. We don’t control any of the variables in this type of study; we just measure them. Sometimes it’s the best we can do, but it can make the results of our analyses difficult to interpret, as we have seen.

If we really want to be sure that a relationship is causal, we generally will need to make a change in one variable, then observe the outcome in another. In fact, this is the way we judge causality in our everyday life. However, this is not as straightforward as it sounds, as pointed out long ago by R.A. Fisher, the father of experimental design.

For example, consider a server at a restaurant who thinks that servers in general should get better tips. She has a hunch that if a server draws a smiley face on the back of the check at the end of the meal, customers will tip more. So she decides to give it a try. She draws a smiley face on a check and gets a huge tip! Has she proven a causal relationship?

Unfortunately, no. The particular guest she chose for her experiment may have just been a generous tipper, or the food might have been exceptionally good that day. The server might have gotten a larger tip anyway, even without drawing a smiley face on the check. And also: maybe it wasn’t the smiley face that caused the bigger tip but just something about this particular server, excited to be running her first experiment!

To figure this out, we need to use an experimental design. Because we want our findings to generalize to lots of servers, we decide to study a sample of servers. Once we have a sample, we randomly assign each server to either the experimental group or the comparison group. Servers in the experimental group are told to draw a smiley face on the back of each check, while those in the comparison group are told to draw nothing.

Having launched our experiment, we simply measure the average tip per customer for servers in the two groups. Because we manipulated the explanatory variable (drawing a smiley face or not), and because we randomly assigned servers to the two groups, we assume that if we do see a difference in tips between the two groups it must be due to the explanatory variable.
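The two key steps of this design, random assignment followed by a comparison of group means, can be sketched in a few lines of Python. Everything here is hypothetical: the server labels, group sizes, and tip amounts are invented, and we simulate the tips with a small built-in smiley-face effect just so there is something to compare:

```python
import random
import statistics

random.seed(1)  # for reproducibility

# Hypothetical sample of 10 servers
servers = [f"server_{i}" for i in range(10)]

# Random assignment: shuffle the sample, then split it in half
random.shuffle(servers)
smiley_group = servers[:5]
comparison_group = servers[5:]

# Hypothetical observed average tip (percent of check) per server;
# we simulate a modest effect of the smiley face on the tips
tips = {s: random.gauss(21, 3) for s in smiley_group}
tips.update({s: random.gauss(18, 3) for s in comparison_group})

# Compare the mean tip across the two groups
smiley_mean = statistics.mean(tips[s] for s in smiley_group)
comparison_mean = statistics.mean(tips[s] for s in comparison_group)
print(f"smiley-face group mean tip: {smiley_mean:.1f}%")
print(f"comparison group mean tip:  {comparison_mean:.1f}%")
```

Note that `random.shuffle` does the real work of the design: because the split into groups happens after shuffling, no characteristic of a server can influence which group they land in.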

### The Beauty of Random Assignment

This might be a good place just to pause and extol the beauty of random assignment! First of all, random assignment is not the same as random sampling. Both are random, but random assignment can accomplish so much more than simple random sampling.

The beauty of random assignment lies in the fact that by randomly assigning subjects in a study to one group or the other, we are, in essence, constructing two groups that, although each may be quite diverse internally, are nevertheless comparable to each other. Any difference between the two groups is due entirely to chance.

Because the two groups are assumed to be equivalent, except for differences due to random chance, we can infer that if some intervention (e.g., drawing a smiley face) leads to a difference in distributions across the groups (e.g., in the tips), the difference must be due to the smiley face, not to other factors. (Later we will learn how to take random variation across the groups into account when making this inference.)

Provided we assign servers randomly to groups, fast servers are no more likely to be assigned to one group than to the other, and the same would be true for slow servers. Random assignment helps us rule out confounding variables by ensuring that any variables that affect our outcome variable, whether positively or negatively, will balance each other out across groups.

This does not mean that the level of a confounding variable will be exactly the same in two groups, even if they are randomly assigned. But it does mean that, on average, the various confounding factors should balance each other out across the groups.
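We can see both halves of this claim in a small simulation (the "server speed" confound and all numbers below are hypothetical). Any single random assignment leaves a small chance difference in mean speed between the two groups, but averaged over many random assignments that difference sits near zero:

```python
import random
import statistics

random.seed(2)  # for reproducibility

# Hypothetical confounding variable: a speed score for each of 100 servers
speeds = [random.gauss(50, 10) for _ in range(100)]

# One random assignment: split the servers into two groups of 50
idx = list(range(100))
random.shuffle(idx)
group_a = [speeds[i] for i in idx[:50]]
group_b = [speeds[i] for i in idx[50:]]

# The group means will not be exactly equal for this one assignment...
diff_once = statistics.mean(group_a) - statistics.mean(group_b)

# ...but over many random assignments, the differences center on zero
diffs = []
for _ in range(1000):
    random.shuffle(idx)
    a = [speeds[i] for i in idx[:50]]
    b = [speeds[i] for i in idx[50:]]
    diffs.append(statistics.mean(a) - statistics.mean(b))

print(f"one assignment: difference in mean speed = {diff_once:.2f}")
print(f"average difference over 1000 assignments  = {statistics.mean(diffs):.3f}")
```

The same logic applies to every confound at once, measured or not, which is what makes random assignment such a powerful design tool.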

There will inevitably be variation in the outcome variable within each group. After all, different servers will get different tips depending on many factors. Later we will learn more about ways to model this within-group variation as a random process.