Introduction to Statistics: A Modeling Approach

The Mean as a Model

Having developed the idea that a single number can serve as a statistical model for a distribution, we now ask: which single number should we choose? We have been talking informally about choosing a number in the middle of a symmetric, normal-shaped distribution. But now we want to get more specific.

Median and Mean

If we were trying to pick a number to model the distribution of a categorical variable, we should pick the mode; really, there isn’t much choice here. If you are going to predict the value of a new observation on a categorical variable, the prediction will have to be one of the categories and you will be wrong least often if you pick the most frequently observed category.

For a quantitative variable, statisticians typically choose one of two numbers: the median or the mean. The median is just the middle number of a distribution. Take the following distribution of five numbers:

5, 5, 5, 10, 20

The median is 5, meaning that if you sort all the numbers in order, the number in the middle is 5. You can see that the median is not affected by outliers. So, if you changed the 20 in this distribution to 20,000, the median would still be 5.

To calculate the mean of this distribution, we simply add up all the numbers in the sample, and then divide by the sample size, which is 5. So, the mean of this distribution is 9. Both mean and median are indicators of where the middle of the distribution is, but they define “middle” in different ways: 5 and 9 represent very different points in this distribution.
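You can check this arithmetic directly in R; mean() and median() are base R functions, so no packages are needed:

```r
outcome <- c(5, 5, 5, 10, 20)

# The mean: add up all the numbers, then divide by the sample size
sum(outcome) / length(outcome)   # 45 / 5 = 9
mean(outcome)                    # same thing: 9

# The median: the middle number when the values are sorted
median(outcome)                  # 5
```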

In R, these and other statistics are very easy to find with the function favstats(). Create a variable called outcome and put in these numbers: 5, 5, 5, 10, 20. Then put that variable into a data frame called tinydata. Finally, run the favstats() function on the variable outcome.

require(mosaic)
require(ggformula)
require(supernova)

# Create the outcome variable
outcome <- c(5, 5, 5, 10, 20)

# Put outcome into the tinydata data frame
tinydata <- data.frame(outcome)

# This will give you the favstats for outcome
favstats(~ outcome, data = tinydata)
Have you created a data frame out of outcome?
DataCamp: ch5-1

If you want, you can save the output of favstats() into an R object. Although you can name this object anything, a helpful rule of thumb is to use the variable name with ".stats" appended.

outcome.stats <- favstats( ~ outcome, data = tinydata)
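One benefit of saving the output: favstats() returns its results as a one-row data frame, so you can pull out individual statistics by name.

```r
# The saved object is a one-row data frame; each statistic is a column
outcome.stats$mean     # 9
outcome.stats$median   # 5
outcome.stats$n        # 5, the sample size
```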

If our goal is just to find the single number that best characterizes a distribution, sometimes the median is better, and sometimes the mean.


If you are trying to choose one number that would best predict what the next randomly sampled value might be, the median might well be better than the mean for this distribution. With only five numbers, the fact that three of them are 5 leads us to believe that the next one might be 5 as well.

On the other hand, we know nothing about the Data Generating Process (DGP) for these numbers. The fact that there are only five of them means that they likely don't represent very well what the underlying population distribution looks like. The population could be normal or uniform, in which case the mean would be a better model than the median. The point is, we just don't know.

Realizing this limitation, let's look at some distributions of quantitative variables, and see which number we think better summarizes each distribution as a whole: the median or the mean.

For each of these variables, make histograms and get the favstats(). For each distribution, which do you think is a better one-number model? The median or the mean?

require(mosaic)
require(ggformula)

Fingers <- read.csv(file = "https://raw.githubusercontent.com/UCLATALL/intro-stats-modeling/master/datasets/fingers.csv", header = TRUE, sep = ",")
Fingers <- data.frame(Fingers)

# Make a histogram of GradePredict
gf_histogram(~ GradePredict, data = Fingers)

# Save the favstats for GradePredict
GradePredict.stats <- favstats(~ GradePredict, data = Fingers)

# Print out the favstats
GradePredict.stats
Call gf_histogram on GradePredict in the Fingers data frame
DataCamp: ch5-2

require(mosaic)
require(ggformula)
require(supernova)

# Make a histogram of Thumb (Fingers is loaded as in the previous exercise)
gf_histogram(~ Thumb, data = Fingers)

# Save the favstats for Thumb
Thumb.stats <- favstats(~ Thumb, data = Fingers)

# Print out the favstats
Thumb.stats
Call gf_histogram on Thumb in the Fingers data frame. Don't forget to print Thumb.stats
DataCamp: ch5-3

require(mosaic)
require(ggformula)
require(Lock5withR)
require(Lock5Data)

# Make a histogram of Age in the MindsetMatters data frame
gf_histogram(~ Age, data = MindsetMatters)

# Save the favstats for Age in a variable called Age.stats
Age.stats <- favstats(~ Age, data = MindsetMatters)

# Print out the favstats
Age.stats
Call gf_histogram on Age in the MindsetMatters data frame. Don't forget to print Age.stats
DataCamp: ch5-4


In general, the median may be a more meaningful summary of a distribution of data than the mean when the distribution is skewed one way or the other. In essence, using the median discounts the importance of the tail of the distribution, focusing more on the part of the distribution where most people score. The mean is a good summary when the distribution is more symmetrical.
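To see the difference, here is a small right-skewed distribution (the numbers are made up for illustration, not taken from our datasets):

```r
# One extreme value on the right pulls the mean up, but not the median
skewed <- c(1, 2, 2, 3, 50)

mean(skewed)    # 11.6 -- dragged toward the long right tail
median(skewed)  # 2   -- stays where most of the scores are
```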

But, if our goal is to create a statistical model of the population distribution, we almost always—especially in this course—will use the mean. We shall dig in a little to see why. But first, a brief detour to see how we can add the median and mean to a histogram.

Adding Median and Mean to Histograms

You already know the R code to make a histogram.

gf_histogram( ~ outcome, data = tinydata)

Let’s add a vertical line to show where the mean is. We know from favstats() that the mean is 9, so we can just add a vertical line that crosses the x-axis at 9. Let’s color it blue.

gf_histogram(~ outcome, data = tinydata) %>%
gf_vline(xintercept = 9, color = "blue")

Alternatively, if we don’t want to have to remember the mean, we can just take the mean from the output of favstats() we saved earlier.

gf_histogram(~ outcome, data = tinydata) %>%
gf_vline(xintercept = ~mean,  color = "blue", data = outcome.stats)

Try modifying this code to draw a green line for the median using the favstats you’ve saved in outcome.stats.

require(mosaic)
require(ggformula)

outcome <- c(5, 5, 5, 10, 20)
tinydata <- data.frame(outcome)
outcome.stats <- favstats(~ outcome, data = tinydata)

# Draw a vline representing the median in green
gf_histogram(~ outcome, data = tinydata) %>%
gf_vline(xintercept = ~median, color = "green", data = outcome.stats)
Have you replaced mean with median and blue with green?
DataCamp: ch5-5

You can string these commands together (using %>%) to put both the mean and median lines onto a histogram.

gf_histogram(~ outcome, data = tinydata) %>%
gf_vline(xintercept = ~mean,  color = "blue", data = outcome.stats) %>%
gf_vline(xintercept = ~median,  color = "green", data = outcome.stats)

Exploring the Mean

It’s pretty easy to understand how the median is the middle of a distribution, but in what sense is the mean the middle? One way to think of the mean is as the balancing point of the distribution. But what does it balance? What is even on both sides of the mean?

You can either watch the video explanation (with Dr. Ji) or read about it in the section below.

You might think that the values below the mean balance with the values above the mean. Let's try that. Does 5+5+5 = 10+20? No, 15 does not equal 30. A bunch of smaller values (the ones below the mean) is not going to balance a bunch of larger values (the ones above the mean). So what does the mean balance?

Here it helps to think about each score as a deviation, which is the distance above or below the mean. In our example, the three 5s are 4 units below the mean of 9, which we will represent as -4. If you think of it this way, the sum of deviations below the mean (-12) balances out the sum of deviations above the mean (+1 and +11, or +12).
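You can verify this balancing act in R:

```r
outcome <- c(5, 5, 5, 10, 20)

# Each score's deviation: its distance above or below the mean
deviations <- outcome - mean(outcome)
deviations        # -4 -4 -4  1 11

sum(deviations)   # 0: the deviations below the mean balance those above

# Any number other than the mean leaves an imbalance
sum(outcome - 8)  # 5, not 0
```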

It turns out that no number other than the mean (not 8, not 8.5, not 9.1!) will perfectly balance the deviations above the mean with those below the mean. Whereas the magnitude of a score—especially an outlier—won’t necessarily affect the median, it will affect the mean because that outlier point has to be balanced with the other points in the data. Every value gets taken into account when calculating the mean.

Remember we talked about finding some simple shapes that “fit” the more detailed shape of California the best? We wanted to find shapes that were not too big and not too small, shapes that would minimize the error around the model, defined as the parts of California that were not covered by the model, and the parts of the model that covered stuff not in California.

The mean is a model that is not too big and not too small. The mean is pulled in both directions (larger and smaller) at once and settles right in the middle. The mean is the number that balances the amount of deviation above and below it, yielding the same amount of error above it as below it. It’s kind of amazing that this procedure of adding up all the numbers and dividing by the number of numbers results in this balancing point.

Thinking about the mean in this way also helps us think about DATA = MODEL + ERROR in a more specific way. If the mean is the model, each data point can now be thought of as the sum of the model (9 in our TinyData set) plus its deviation from the model. So 20 can be decomposed into the model part (9) and the error from the model (+11). And 5 can be decomposed into 9 (model) and -4 (error).
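In R, this decomposition takes just a line or two:

```r
outcome <- c(5, 5, 5, 10, 20)

model <- mean(outcome)     # 9, the same for every data point
error <- outcome - model   # -4 -4 -4  1 11, one deviation per data point

model + error              # 5 5 5 10 20: DATA = MODEL + ERROR
```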

