## Boxplots and the Five-Number Summary

Boxplots are a handy tool for visualizing the five-number summary of a distribution: the minimum, Q1, the median, Q3, and the maximum. Boxplots made with the function gf_boxplot() also clearly show you the IQR and any outliers.
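
To make the five-number summary concrete, here is a minimal base-R sketch. The vector wt is made up for illustration (it is not the Wt variable from MindsetMatters), and the quartiles come from quantile() with its default settings, which may differ slightly from other quartile conventions.

```r
# Hypothetical weights (in pounds): made-up values for illustration only,
# NOT taken from MindsetMatters.
wt <- c(105, 118, 123, 130, 134, 141, 145, 152, 160, 172, 189)

# quantile() at these probabilities returns min, Q1, median, Q3, max
five_num <- quantile(wt, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# The IQR is the distance between Q1 and Q3 (the height of the box)
iqr <- five_num[["75%"]] - five_num[["25%"]]
print(iqr)
```

The box in a boxplot spans Q1 to Q3, so iqr here corresponds to the height of the box.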

Unlike histograms, where the values of the variable went on the x-axis, the boxplots made with gf_boxplot() put the values of the variable on the y-axis. Boxplots do not have to be made this way; this is just the way it is done by gf_boxplot().

Here is the code for making a boxplot of Wt from MindsetMatters with gf_boxplot().

    gf_boxplot(Wt ~ 1, data = MindsetMatters)

The 1 just means that there is only going to be one boxplot here. Later we will replace that as we explore methods of making multiple boxplots that appear next to each other.

The boxplot is made up of a few parts. There is a big white box split into two parts, an upper and a lower part. There are lines, called whiskers, above and below the box. Another name for a boxplot is a box-and-whisker plot.

This is a case where there are no outliers (defined as more than 1.5 IQRs above Q3 or below Q1). So the whiskers will simply end at the max and min values for Wt.

Modify this code to create a boxplot for Population from the HappyPlanetIndex data frame.

    require(mosaic)
    require(tidyverse)
    require(Lock5withR)
    require(supernova)

    HappyPlanetIndex$Region <- recode(HappyPlanetIndex$Region,
      '1'="Latin America", '2'="Western Nations",
      '3'="Middle East and North Africa", '4'="Sub-Saharan Africa",
      '5'="South Asia", '6'="East Asia", '7'="Former Communist Countries")

    # Modify this code to create a boxplot of Population from HappyPlanetIndex
    gf_boxplot(Wt ~ 1, data = MindsetMatters)

Hint: Don't forget to change both arguments.

    # Solution: a boxplot of Population from HappyPlanetIndex
    gf_boxplot(Population ~ 1, data = HappyPlanetIndex)

Wow, this is a strange-looking boxplot. You can hardly see the box; it’s squished down at the bottom. And there are all these points, even though it’s supposed to be a box-and-whisker plot.

The points that appear on a boxplot are the outliers. Points above the top whisker are outliers because their values are greater than $$Q3 + 1.5 \times IQR$$. Points below the bottom whisker are outliers because their values are smaller than $$Q1 - 1.5 \times IQR$$. When there are outliers, the end of the whisker depicts the max or min value that is not considered an outlier.
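
The outlier rule can be checked by hand. The sketch below uses a made-up vector pop (not the HappyPlanetIndex data) and assumes quartiles computed by quantile() with default settings. It finds the two boundaries, the outliers, and the values where the whiskers would end.

```r
# Hypothetical populations (in millions): made-up values, NOT HappyPlanetIndex.
pop <- c(0.3, 2, 4, 5, 8, 10, 12, 20, 30, 45, 300, 1300)

q1 <- quantile(pop, 0.25, names = FALSE)
q3 <- quantile(pop, 0.75, names = FALSE)
iqr <- q3 - q1

# Boundaries beyond which a value counts as an outlier
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# Values beyond either boundary are drawn as individual points
outliers <- pop[pop > upper_fence | pop < lower_fence]

# Each whisker ends at the most extreme value that is not an outlier
upper_whisker <- max(pop[pop <= upper_fence])
lower_whisker <- min(pop[pop >= lower_fence])

print(outliers)
print(c(lower_whisker, upper_whisker))
```

In this made-up example only the two largest values fall beyond a boundary, so the bottom whisker still ends at the minimum while the top whisker stops short of the maximum.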

There are a lot of large outlier countries. No wonder the histogram we looked at before put so many countries into the same bin! It looks as though most countries have populations close to 0 million. If only we could “zoom in” on the countries with smaller populations.

In the code below, use filter() to keep just the countries with populations smaller than the upper boundary for outliers, $$Q3 + 1.5 \times IQR$$. Save these countries in a data frame called SmallerCountries. Run the code to see a histogram of those Population data.

    require(mosaic)
    require(tidyverse)
    require(Lock5withR)
    require(supernova)

    # this calculates the Q3 + 1.5*IQR
    UpperBoundary <- 31.225 + 1.5*(31.225-4.455)

    # modify this code to filter in only countries with population sizes less than the UpperBoundary
    SmallerCountries <-

    # this makes a histogram of the smaller countries' populations
    gf_histogram(~ Population, data = SmallerCountries, fill = "slateblue4") %>%
      gf_labs(x = "Population (in millions)", title = "Population of Countries (Excludes Outliers)")

Hint: You can use Population < UpperBoundary to select populations less than 31.225 + 1.5*(31.225-4.455).

    # Solution
    # this calculates the Q3 + 1.5*IQR
    UpperBoundary <- 31.225 + 1.5*(31.225-4.455)

    # keep only the countries with Population below the UpperBoundary
    SmallerCountries <- filter(HappyPlanetIndex, Population < UpperBoundary)

    # this makes a histogram of the smaller countries' populations
    gf_histogram(~ Population, data = SmallerCountries, fill = "slateblue4") %>%
      gf_labs(x = "Population (in millions)", title = "Population of Countries (Excludes Outliers)")

Ah, this is a very different histogram than the one that included outliers. Here we get a sense of how the countries that previously got lumped together in one bin actually vary in their population size.
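
To see what the boundary works out to and how the filtering step behaves, here is a small base-R sketch. The quartiles 4.455 and 31.225 are the ones used in the exercise; the data frame countries and its rows are made up for illustration, and subset() stands in for filter() so the sketch runs without any packages.

```r
# Quartiles as hard-coded in the exercise
q1 <- 4.455
q3 <- 31.225

# Q3 + 1.5*IQR, the upper boundary for outliers
UpperBoundary <- q3 + 1.5 * (q3 - q1)
print(UpperBoundary)

# Hypothetical stand-in for HappyPlanetIndex: made-up rows for illustration
countries <- data.frame(
  Country    = c("A", "B", "C", "D", "E"),
  Population = c(3.2, 10.5, 60.0, 82.5, 1304.5)
)

# Base-R counterpart of filter(countries, Population < UpperBoundary)
smaller <- subset(countries, Population < UpperBoundary)
print(smaller)
```

Only the rows whose Population falls below the boundary are kept, which is why the histogram of the filtered data spreads the remaining countries across many bins.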

Let’s re-run the boxplot for just these countries in the data frame SmallerCountries to see what that looks like. Just press the Run button.

    require(mosaic)
    require(tidyverse)
    require(Lock5withR)
    require(supernova)

    # Compute the upper boundary from the quartiles and keep the countries below it
    Pop.stats <- favstats(~ Population, data = HappyPlanetIndex)
    SmallerCountries <- filter(HappyPlanetIndex, Population < (Pop.stats$Q3 + 1.5*(Pop.stats$Q3 - Pop.stats$Q1)))

    # Make a boxplot of Population from the SmallerCountries
    gf_boxplot(Population ~ 1, data = SmallerCountries)