3.6 The Five-Number Summary
So far we have used histograms as our main tool for examining distributions. But histograms aren’t the only tool we have available. In this section we will introduce a few more tools for examining distributions of quantitative variables. In the following section we will introduce some tools for examining the distributions of categorical variables.
Sorting Revisited, and the Min/Max/Median
In the previous chapter we introduced the simple idea of sorting a quantitative variable in order. Before sorting the numbers, it was hard to see a pattern. You could read the numbers, but it was hard to draw any conclusions about the distribution itself.
As soon as we sort the numbers in order, we can see things that are true of the distribution. For example, when we sort even a long list of numbers we can see what the smallest number is and what the largest number is. This shows us something about the distribution that we couldn’t have seen just by looking at a jumbled list of numbers.
We can demonstrate this by looking at the Wt variable in the MindsetMatters data frame. Write some code to sort the housekeepers by weight from lowest to highest, and then see what the minimum and maximum weights are.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
# Write code to sort Wt from lowest to highest
# Solution 1
arrange(MindsetMatters, Wt)
# Solution 2
sort(MindsetMatters$Wt)
Note that you can also use the function arrange() to sort, but that will sort the entire data frame. We just wanted to see the sorted weights in a row, so we used sort() on the vector (MindsetMatters$Wt).
Now that we have sorted the weights, we can see that the minimum weight is 90 pounds and the maximum weight is 196 pounds. In addition to knowing the minimum and maximum weight, it would be helpful to know the number right in the middle of this distribution. If there are 75 housekeepers, we are looking for the 38th housekeeper’s weight in the sorted list, because 37 weights come before it and 37 come after it. This middle number is called the median.
Numbers such as the minimum (often abbreviated as min), median, and maximum (abbreviated as max) are helpful for understanding a distribution. These three numbers can be thought of as a three-number summary of the distribution. (We’ll build up to the five-number summary in a bit.)
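To make the "middle position" idea concrete, here is a minimal sketch in base R. The `weights` vector is made up for illustration (it is not the MindsetMatters data), and the variable names are our own. With an odd number of values, the middle position is (n + 1) / 2, which is the same logic that gives position 38 when there are 75 values.

```r
# Hypothetical weights in pounds (illustration only; not the MindsetMatters data)
weights <- c(150, 98, 172, 120, 135, 110, 160)

sorted_weights <- sort(weights)      # lowest to highest
n <- length(sorted_weights)          # 7 values, an odd count
middle_position <- (n + 1) / 2       # the middle position in the sorted list

# Build the three-number summary by picking values out of the sorted vector
three_number_summary <- c(
  min    = sorted_weights[1],
  median = sorted_weights[middle_position],
  max    = sorted_weights[n]
)
three_number_summary
```

The built-in functions min(), median(), and max() should give the same three values without any manual sorting.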
There is a function called favstats() (for favorite statistics!) that will quickly summarize these values for us. Here is how to get the favstats for Wt from MindsetMatters.
favstats(~ Wt, data = MindsetMatters)
The favstats() function generates a lot of other numbers, but let’s take a look at Min, Median, and Max for now. By looking at the median weight in relation to the minimum and maximum weight, you can tell a little bit about the shape of the distribution.
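One rough way to read shape from these three numbers is to compare how far the median sits from each end. The sketch below uses a small made-up vector (`values` is hypothetical, not a variable from either data frame) and base R only. Treat the comparison as a hint about shape, not a formal test.

```r
# Hypothetical values with one very large observation (illustration only)
values <- c(1, 3, 4, 5, 8, 12, 20, 31, 60, 1300)

lower_gap <- median(values) - min(values)   # distance from min up to the median
upper_gap <- max(values) - median(values)   # distance from the median up to max

# A much larger upper gap suggests a long tail on the right side
c(lower_gap = lower_gap, upper_gap = upper_gap)
```

If the two gaps were about the same size, the median would sit near the center of the range, which is what we would expect from a roughly symmetric distribution.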
Try writing code to get the favstats() for the variable Population of countries in the data frame HappyPlanetIndex.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region, '1'="Latin America", '2'="Western Nations", '3'="Middle East and North Africa", '4'="Sub-Saharan Africa", '5'="South Asia", '6'="East Asia", '7'="Former Communist Countries")
# Modify the code to get favstats for Population of countries in HappyPlanetIndex
favstats(~ Population, data = HappyPlanetIndex)
Create a histogram of Population to see if your intuition about the shape of this distribution from looking at the min/median/max is correct.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region, '1'="Latin America", '2'="Western Nations", '3'="Middle East and North Africa", '4'="Sub-Saharan Africa", '5'="South Asia", '6'="East Asia", '7'="Former Communist Countries")
# Make a histogram of Population from HappyPlanetIndex using gf_histogram
#Simple solution
gf_histogram(~ Population, data = HappyPlanetIndex)
Quartiles and the Five-Number Summary
Another way to think about what we’ve been doing is this. Imagine all the values of a variable are sorted and lined up along the thick blue line below.
The min is the lowest value, the median is the middle value, and the max is the highest value. We have divided the distribution into two equal parts in order to find the median. The two parts can be thought of as halves.
We can divide each half into two parts again to divide the distribution into four parts called quartiles. The quartiles are four equal groups of values. It is as if the long vector of values has been cut into four equal-sized pieces.
If we divide the lower half of the distribution into two parts, the bottom part (the lowest .25 of values) is called the first quartile, and the upper part (the next lowest .25 of values) is called the second quartile. If we divide the higher half of the distribution into two parts in the same way, the lower part is called the third quartile. Every value above that point is in the top quartile (fourth quartile) of the distribution.
It is important to note that what is equal about the four quartiles is the number of data points included in each. Each quartile contains one-fourth of the observations, regardless of what their exact scores are on the variable.
In order to demarcate where a quartile begins and ends, statisticians have given these cut points (the orange lines) boring names: Q0, Q1, Q2, Q3, Q4.
When statisticians refer to the “five-number summary” they are referring to these five numbers: the minimum, Q1, the median, Q3, and the maximum. So let’s look again at the favstats() for Wt.
favstats(~ Wt, data = MindsetMatters)
Now you can see that the favstats() function actually gives you the five-number summary (min, Q1, median, Q3, max), then the mean, standard deviation, n (number of values), and how many cases (in this case, housekeepers) are missing a value for weight. We will delve into the mean and standard deviation in later chapters.
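If you want to see the cut points without favstats(), base R's quantile() function can produce them. The following is a sketch under stated assumptions: `scores` is a hypothetical vector, the Q0 to Q4 labels are added by us, and quantile() is used with its default settings. Other software may interpolate quartiles slightly differently, so small discrepancies between tools are not necessarily errors.

```r
# Hypothetical scores (illustration only)
scores <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 21)

# Ask for the five cut points: 0%, 25%, 50%, 75%, and 100% of the way through
cuts <- quantile(scores, probs = c(0, 0.25, 0.5, 0.75, 1))

# Relabel them with the names used in the text
names(cuts) <- c("Q0", "Q1", "Q2", "Q3", "Q4")
cuts
```

Q0, Q2, and Q4 here are simply the min, median, and max under different names, which is why the five-number summary is a natural extension of the three-number summary.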
Range and Interquartile Range
The distance between the max and min gives us the range, a quick measure of how spread out the values are in a distribution. Based on the numbers from the favstats() results above, use R as a calculator to find the range of Wt.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
# Produce favstats for Wt in MindsetMatters
favstats(~ Wt, data = MindsetMatters)
# Based on the numbers from the favstats results, use R as a calculator to find the range of Wt
196 - 90
In distributions like the Population of countries, the range can be very deceptive.
favstats(~ Population, data = HappyPlanetIndex)
The range looks like it is about 1,304.2 million. But we saw in the histogram that this is due to one or two very populous countries! There was a lot of empty space in that distribution. In cases like this, it might be useful to get the range for just the middle .50 of values. This is called the interquartile range (IQR).
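To see why the IQR can be a steadier measure of spread than the range, consider this small sketch. The `pops` vector is made up (it loosely imitates populations in millions, with one very large value) and is not the HappyPlanetIndex data. The code uses base R's quantile() and IQR() with their default settings.

```r
# Hypothetical populations in millions, with one extreme value (illustration only)
pops <- c(1, 3, 4, 5, 8, 12, 20, 31, 60, 1300)

# Range: distance from min to max, driven almost entirely by the extreme value
full_range <- max(pops) - min(pops)

# IQR by hand: distance from Q1 to Q3, covering the middle .50 of values
q <- quantile(pops, probs = c(0.25, 0.75))
iqr_by_hand <- q[["75%"]] - q[["25%"]]

# Base R also has a built-in IQR() function that should agree
c(range = full_range, iqr_by_hand = iqr_by_hand, iqr_builtin = IQR(pops))
```

Removing the largest value from `pops` would shrink the range dramatically, while the IQR would change far less.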
Use the five-number summary of Population to find the IQR. You can use R as a calculator.
require(mosaic)
require(tidyverse)
require(Lock5withR)
require(supernova)
HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region, '1'="Latin America", '2'="Western Nations", '3'="Middle East and North Africa", '4'="Sub-Saharan Africa", '5'="South Asia", '6'="East Asia", '7'="Former Communist Countries")
# Use R as a calculator to find the IQR of Population from the HappyPlanetIndex data set
31.225 - 4.455
Interquartile range ends up being a handy ruler for figuring out whether a data point should be considered an outlier. Outliers present the researcher with a hard decision: should the score be excluded from analysis because it will have such a large effect on the conclusion, or should it be included because, after all, it’s a real data point?
For example, China is a very populous country and is the very extreme outlier in the HappyPlanetIndex, with a population of more than 1,300 million people (another way of saying that is 1.3 billion). If it weren’t there, we would have a very different view of the distribution of population across countries. Should we exclude it as an outlier?
Well, it depends on what we are trying to do. If we wanted to understand the total population of this planet, it would be foolish to exclude China because that’s a lot of people who live on earth! But if we are trying to get a sense of how many people live in a typical country, then perhaps it would make more sense to exclude China.
But then, what about the second most populous country, India? Should we exclude it too? What about the third most populous country, the US? Or the fourth, Indonesia? How do we decide what an outlier is? That process seems fraught with subjectivity.
There is no one right way to do it. After all, deciding what an “outlier” is really depends on what you are trying to do with your data. However, the statistics community has agreed on a rule of thumb to help people figure out what an outlier might be. Any data point bigger than \(Q3 + 1.5*IQR\) is considered a large outlier. Anything smaller than \(Q1 - 1.5*IQR\) is considered a small outlier.
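The rule of thumb can be sketched in a few lines of base R. As before, `pops` is a made-up vector rather than the HappyPlanetIndex data, and the names `upper_fence` and `lower_fence` are our own labels for the two cutoffs. The quartiles come from quantile() with its default settings, so other tools may place the cutoffs slightly differently.

```r
# Hypothetical populations in millions, with one extreme value (illustration only)
pops <- c(1, 3, 4, 5, 8, 12, 20, 31, 60, 1300)

q1  <- quantile(pops, 0.25, names = FALSE)
q3  <- quantile(pops, 0.75, names = FALSE)
iqr <- q3 - q1

# Cutoffs from the rule of thumb
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# Keep only the values that fall outside the cutoffs
outliers <- pops[pops > upper_fence | pops < lower_fence]
outliers
```

In this made-up example only the largest value is flagged. Whether to exclude a flagged value is still a judgment call that depends on what you are trying to do with the data.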