## Course Outline

• Getting Started (Don't Skip This Part)
• Introduction to Statistics: A Modeling Approach
• PART I: EXPLORING VARIATION
• Chapter 1 - Welcome to Statistics: A Modeling Approach
• Chapter 2 - Understanding Data
• Chapter 3 - Examining Distributions
• Chapter 4 - Explaining Variation
• PART II: MODELING VARIATION
• Chapter 5 - A Simple Model
• Chapter 6 - Quantifying Error
• Chapter 7 - Adding an Explanatory Variable to the Model
• Chapter 8 - Models with a Quantitative Explanatory Variable
• PART III: EVALUATING MODELS
• Chapter 9 - Distributions of Estimates
• Chapter 10 - Confidence Intervals and Their Uses
• Chapter 11 - Model Comparison with the F Ratio
• Chapter 12 - What You Have Learned
• Resources

## The Data Generating Process

We can learn a lot by examining distributions of data. But our interest usually goes beyond the data, to the Data Generating Process (or DGP). We are generally looking at data because we want to find out something about the way the world works—something that is hard to see because there is so much variation in the world.

Most statistics textbooks distinguish between the sample and the population. In fact, these are the first two types of distributions in what we refer to as the Distribution Triad (we will introduce the third distribution much later in this course). Our data are a sample: the units we actually selected and on which we collected our measures. Our interest, however, is generally not in the sample but in the population from which it was drawn. We study a sample because we want to generalize to a population.

In this course we dig a little deeper into the population. Not only do we want to generalize from our data to the population, but our real interest is in understanding the processes that produced the variation in the population itself and then in the data—this is what we refer to as the Data Generating Process (DGP).

If our answer to the question, “Why does our distribution look the way it does?” is just “Because that’s the way the population distribution looks,” it’s not very satisfying. What we really want to know is: Why does the population distribution look like that? The answer to this question gets at the DGP. We want you to develop a mental habit of always asking yourself: what might the process be that could have generated a distribution of data that looks like this?

Whether we are examining the distribution of a single variable (as we are in this chapter) or the relationships among variables (as in the next chapter), we always want to dig deeper and try to understand what could have produced the variation we see in our data.

Here’s a simple example. The histogram below shows the distribution of 60,000 waiting times at a bus stop on the corner of Fifth Avenue and 97th Street in New York City (source).

Explaining why the distribution has this shape requires going far beyond the information in the histogram. You need to imagine yourself waiting at a bus stop, and think about why you got there when you did. You need to bring to bear your knowledge about bus systems and how they work. What causes a bus to arrive when it does?

From the histogram you can see that most people wait just a short time for the bus, while some people end up waiting longer times. This makes sense. Buses have schedules, and because many of the passengers are regulars, they roughly know when the bus will come and try to get to the bus stop just before it comes.

Consider again the passengers who know the bus schedule well. If they just miss the bus, arriving right after it leaves, they will end up waiting the longest, until the next bus comes.
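One way to test this kind of reasoning is to write the imagined process down and see what distribution it produces. The sketch below is only an illustration under made-up assumptions, not a model fitted to the New York data: the 10-minute gap between buses (`HEADWAY`), the target of arriving about 2 minutes early, and the 1.5-minute spread are all hypothetical numbers chosen for the example.

```python
import random

HEADWAY = 10.0  # assumed minutes between buses (hypothetical)

def simulate_wait(rng):
    """Return one passenger's waiting time in minutes under the assumed process."""
    # A regular aims to arrive about 2 minutes before the bus, give or take.
    # Time 0 is the scheduled departure; negative means "before the bus leaves".
    arrival = rng.gauss(-2.0, 1.5)
    # Early arrivals wait until time 0. Late arrivals just missed the bus
    # and wait for the next one, which the modulo handles.
    return (-arrival) % HEADWAY

def simulate_waits(n, seed=0):
    rng = random.Random(seed)
    return [simulate_wait(rng) for _ in range(n)]

waits = simulate_waits(60_000)
short = sum(w < 5 for w in waits) / len(waits)   # share waiting under 5 minutes
long_ = sum(w >= 5 for w in waits) / len(waits)  # share waiting 5 minutes or more
print(f"short waits: {short:.1%}, long waits: {long_:.1%}")
```

Under these assumptions most simulated passengers wait only a short time, and the small group who just miss the bus wait nearly a full headway. Whether the real histogram has the same shape is exactly the kind of question the DGP habit asks you to check.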

### Population, the Result of the DGP Over a Long Period of Time

The term population has some limitations. If you are taking a sample of likely voters in order to predict an election result, you can imagine the complete population being “out there,” just waiting to be sampled (or not). But for people waiting at a bus stop, the population is constantly shifting.

In cases like this, it makes a lot more sense to think of the population as the result of a process or of many processes—what we refer to in aggregate as the Data Generating Process (DGP). You could think of the DGP as a lot of causal factors, each with some attached probability of occurrence, that produce the population distribution as they play out over time.
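The idea of causal factors with attached probabilities can be sketched as a small simulation. Everything in it is assumed for illustration: the factor names, their probabilities, the minutes they add, and the 0 to 3 minute base wait are invented, and real causal factors would rarely combine this simply.

```python
import random
from collections import Counter

# Hypothetical causal factors: (name, probability of occurring, minutes added).
FACTORS = [
    ("heavy traffic", 0.20, 4.0),
    ("passenger runs late and misses the bus", 0.10, 10.0),
    ("bus breakdown", 0.01, 25.0),
]

def one_wait(rng):
    """Generate one waiting time: a small base wait plus whichever factors occur."""
    wait = rng.uniform(0.0, 3.0)  # assumed base wait for a well-timed arrival
    for _name, prob, extra in FACTORS:
        if rng.random() < prob:
            wait += extra
    return wait

def distribution(n, seed=1):
    """Share of waits in each 5-minute bin after the process has run n times."""
    rng = random.Random(seed)
    counts = Counter(int(one_wait(rng) // 5) * 5 for _ in range(n))
    return {start: counts[start] / n for start in sorted(counts)}

# The longer the process plays out, the more settled the distribution becomes.
for n in (50, 5_000, 100_000):
    print(n, distribution(n))
```

Each dictionary key is the lower edge of a 5-minute bin. With only 50 runs the shares are rough; after many runs they settle toward the distribution this assumed process would produce in the long run.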

Data are concrete, in hand. The DGP, on the other hand, is unknown; we can’t see it directly. We can get clues from data as to what the population produced by the DGP might eventually look like, but we can never get a perfect understanding of the DGP.
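A short simulation can make this distinction concrete. In the sketch below the function `dgp` stands in for a process we would normally never get to see; the exponential shape and the 4-minute long-run mean are arbitrary assumptions chosen only so there is something to sample from.

```python
import random
import statistics

def dgp(rng):
    """A stand-in DGP (assumed); in practice this function is hidden from us."""
    return rng.expovariate(1 / 4.0)  # waiting times with a long-run mean of 4 minutes

rng = random.Random(42)

# Five small data sets from the same process: each gives a slightly different clue.
sample_means = [statistics.mean(dgp(rng) for _ in range(30)) for _ in range(5)]

# A very long run approximates the population the process would eventually produce.
long_run_mean = statistics.mean(dgp(rng) for _ in range(200_000))

print("means of five samples of 30:", [round(m, 2) for m in sample_means])
print("mean of a very long run:    ", round(long_run_mean, 2))
```

The sample means scatter around the long-run value without matching it exactly, which is the sense in which data offer clues about the DGP rather than a perfect view of it.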