Introduction to Statistics: A Modeling Approach

Modeling a Distribution as a Single Number

Building on this concept of a model, let’s now develop what we mean by a statistical model. Whereas in the previous section we were building a model to help us estimate the area of the state of California, we now want to build a model that will enable us to characterize a distribution. We want to use the model to make predictions about what the next observation added to a sample distribution might be. For now we will consider only a single outcome variable. Later we will extend our concept of a statistical model to explain variation in an outcome variable using variation in explanatory variables.

At its most basic level, a statistical model can be thought of as a single number. (I know; that sounds even simpler than rectangles and triangles.) The question to ask is this: if you had to pick one number to represent an entire distribution, what would it be? Or, thought of in a different way: if you wanted to predict what the next randomly chosen observation would be, what would be your prediction?

Different kinds of distributions call for different approaches to choosing one number as a model. If a distribution is roughly symmetrical and bell shaped, a number right in the middle might be the best-fitting model. (Remember, we aren’t saying that such a simple model is a good model, just that it is better than nothing!) If a distribution is skewed left or right, the best model might be a number near the middle of the bulk of the data, ignoring the long tail on one side or the other. If a distribution is for a categorical variable, the best model is generally the most frequent category.
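As a minimal sketch of these three cases, the Python snippet below picks one number (or category) for each kind of distribution. The data are invented for illustration, and using the median for the skewed case is an assumption on our part; the text only says the number should sit near the middle once the long tail is ignored.

```python
import statistics

# Hypothetical samples, invented for illustration only.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]        # roughly symmetrical, bell shaped
right_skewed = [1, 2, 2, 3, 3, 4, 5, 9, 25]    # long tail to the right
categorical = ["red", "blue", "red", "green", "red", "blue"]

# Symmetrical distribution: a number right in the middle (here, the mean).
mean_symmetric = statistics.mean(symmetric)

# Skewed distribution: the median (our assumed choice) is not pulled
# toward the long tail the way the mean is.
median_skewed = statistics.median(right_skewed)
mean_skewed = statistics.mean(right_skewed)

# Categorical variable: the most frequent category.
mode_categorical = statistics.mode(categorical)

print("symmetric model:", mean_symmetric)
print("skewed model (median):", median_skewed, "vs. mean:", mean_skewed)
print("categorical model:", mode_categorical)
```

In the skewed sample, the single value of 25 drags the mean well above the median, which is why a number that ignores the tail may represent the bulk of the observations better.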

Let’s zero in on just distributions of quantitative variables. Take a look at the two distributions below, for variable 1 and variable 2.

[Figure: distributions of variable 1 and variable 2]

If a single number is used to model the distribution of a quantitative variable, error from the model can be seen as deviations of the observed scores from that number in the same way that error in our model of California can be seen as deviations from the geometric shapes. As we just saw, a model for a distribution with less spread seems to have a better fit than a model for a distribution with more spread. The reason for this is that the error around the model is greater for the distribution with more spread.
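To make the link between spread and error concrete, here is a small Python sketch. Both samples are hypothetical and share the same middle value; the helper `total_abs_error` is our own illustrative name, and summing absolute deviations is just one simple way to total up error, not necessarily the measure used later in the course.

```python
import statistics

# Two hypothetical samples with the same center but different spread.
narrow = [9, 10, 10, 10, 11]
wide = [2, 6, 10, 14, 18]

def total_abs_error(data, model):
    """Sum of absolute deviations of each observation from the model."""
    return sum(abs(x - model) for x in data)

# Use the mean of each sample as its single-number model.
model_narrow = statistics.mean(narrow)
model_wide = statistics.mean(wide)

error_narrow = total_abs_error(narrow, model_narrow)
error_wide = total_abs_error(wide, model_wide)

print("models:", model_narrow, model_wide)
print("total error (narrow):", error_narrow)
print("total error (wide):", error_wide)
```

The same number models both samples, yet the observations in the wider sample sit much farther from it, so the total error is larger.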

The idea of modeling a distribution as a single number gives us a more concrete and detailed way of thinking about our models. Whereas we thought about the California example like this:

AREA OF CALIFORNIA = AREA OF GEOMETRIC FIGURES + OTHER STUFF

We can think of a statistical model like this:

DATA = MODEL + ERROR

Each data point in a distribution can be decomposed into two parts: the model (i.e., the number we are using to represent the whole distribution), and the data point’s deviation from the model (the error).
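The decomposition can be sketched in a few lines of Python. The data values are made up, and using the mean as the model is an assumption for this illustration; any single number could play the role of MODEL, and the errors would change accordingly.

```python
import statistics

# Hypothetical observations, invented for illustration.
data = [3, 5, 6, 8, 8]

# MODEL: one number standing in for the whole distribution (here, the mean).
model = statistics.mean(data)

# ERROR: each data point's deviation from the model.
errors = [x - model for x in data]

# DATA = MODEL + ERROR: adding each error back to the model
# recovers the original observations exactly.
reconstructed = [model + e for e in errors]

for x, e in zip(data, errors):
    print(f"{x} = {model} + ({e})")
```

Every observation shares the same MODEL part; only the ERROR part differs from one data point to the next.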
