Introduction to Statistics: A Modeling Approach
Chapter 12 - What You Have Learned
12.0 What You Have Learned
You may not think you have learned a lot, but we think you have! Think back to the very beginning of the class when we said that statistics is the study of variation. Look how far you have come since then! You now have a real sense of what it means to study variation. Just like a practicing statistician, you now have skills to help you explore variation, model variation, and evaluate models.
What we have done in this class might seem clean and simple compared to the real-world problems that thoughtful people such as yourself will need to solve. Even so, the basic concepts and skills you have learned give you the foundation you need for future learning as you deepen your knowledge of statistics and data analysis.
Just in case you don’t think you have learned a lot, let’s take you for a little tour of what you have learned, and give you a chance to show yourself what you can do.
What You Have Learned About Exploring Variation
We will start with a data set called hate_crimes from FiveThirtyEight. The data set contains data on hate crimes reported to both the Southern Poverty Law Center (SPLC) and the FBI in all 50 states and the District of Columbia. Go ahead and use R to look at the first six lines of this data frame. (See? You didn’t know how to do this before, did you?)
# load packages
require(ggformula)
require(mosaic)
require(supernova)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)
require(MindsetMatters)
require(HappyPlanetIndex)
library(fivethirtyeight)
# take a look at the hate_crimes data frame
head(hate_crimes)
Whatever your theories about hate crimes might be, you can explore variation in a number of ways! Let’s use avg_hatecrimes_per_100k_fbi as the outcome variable for now. hate_crimes_per_100k_splc would also be a good choice, and you are free to explore it too!
Take a look at the variation in the hate crimes reported to the FBI in some kind of plot.
# make a plot to help us explore the variation in avg_hatecrimes_per_100k_fbi
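A histogram is one reasonable way to look at the variation in a single quantitative variable. The sketch below is only one possible answer to the exercise; the number of bins and the colors are our own choices, not anything the data set requires.

```r
library(ggformula)        # gf_histogram()
library(fivethirtyeight)  # provides the hate_crimes data frame

# Histogram of the FBI hate crime rate (one row per state, plus DC).
# Rows with a missing rate are dropped from the plot with a warning.
fbi_hist <- gf_histogram(~ avg_hatecrimes_per_100k_fbi,
                         data = hate_crimes,
                         bins = 15,        # assumption: chosen by eye
                         fill = "navy",
                         color = "white")
fbi_hist
```

Swapping in hate_crimes_per_100k_splc gives the same kind of plot for the SPLC rate.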
Look at that! You are able to make graphs to help you visualize the variation you see in the rate of hate crimes. We might be interested in knowing which state is such an outlier in terms of hate crimes, an ignoble distinction. How would we find out which state that is? Could we arrange the data frame in some way to see the states sorted in descending order by avg_hatecrimes_per_100k_fbi? Could we print just the six states with the highest hate crime rates?
# Can you arrange the data frame to show you the places with the highest hate crime rates?
# Can you just print the 6 states with the highest crime rates?
head(arrange(hate_crimes, desc(avg_hatecrimes_per_100k_fbi)))
We could also save the arranged data frame and then print out just the state name and outcome variable of interest.
hate_crimes <- arrange(hate_crimes, desc(avg_hatecrimes_per_100k_fbi))
head(select(hate_crimes, state, avg_hatecrimes_per_100k_fbi))
Let’s go back now and take a closer look at the distribution of hate crime rates. As statisticians, what we really want to do is explain the variation we see here. Why do some places have higher rates of hate crimes (per 100,000 people) than others?
Well, guess what? You not only know how to think about explanatory variables, but you also know how to explore the relationships between outcome and explanatory variables in the data.
Let’s do it. Pick a few explanatory variables and make some plots to explore their relationship with rate of hate crimes (avg_hatecrimes_per_100k_fbi). You might want to see if unemployment (share_unemp_seas) is related to hate crimes. Or if wealth (median_house_inc) might have something to do with hate crimes.
In the DataCamp window below, make a visualization that would help you explore the relationship between unemployment and hate crimes. You can also put the best-fitting regression line on it if you want. Also make a visualization looking at whether hate crimes can be explained by income. (Feel free to explore other explanatory variables as well!)
# make a visualization of hate crimes explained by unemployment
gf_point(avg_hatecrimes_per_100k_fbi ~ share_unemp_seas, data = hate_crimes, color = "navy", size = 3) %>%
  gf_lm(color = "orange")

# make a visualization of hate crimes explained by household income
gf_point(avg_hatecrimes_per_100k_fbi ~ median_house_inc, data = hate_crimes, color = "darkgreen", size = 3) %>%
  gf_lm(color = "orange")
You also know, even with just a scatter plot, how to think about what it means for one variable to “explain” variation in another. You know that to explain means that knowing a score on one variable helps you make a better guess as to that same unit’s score on another. It looks like there may be a relationship in the scatter plot on the right, at least more so than in the plot on the left.
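To put a number on "a better guess," here is a minimal sketch that compares each one-variable model with the empty model using proportional reduction in error (PRE): the share of the empty model's squared error that goes away once we add the explanatory variable. The helper name pre_for and the step that filters out rows with missing values are our own assumptions for illustration. We do not report the resulting values here, so run the code to see them.

```r
library(fivethirtyeight)  # provides the hate_crimes data frame

outcome <- "avg_hatecrimes_per_100k_fbi"
explanatory_vars <- c("share_unemp_seas", "median_house_inc")

# Keep only rows with no missing values on the variables we use,
# so that every model is fit to the same set of places.
keep <- complete.cases(hate_crimes[, c(outcome, explanatory_vars)])
complete_rows <- hate_crimes[keep, ]

# pre_for() is a hypothetical helper, not part of any package.
# It returns the proportional reduction in error for one explanatory variable.
pre_for <- function(explanatory) {
  y <- complete_rows[[outcome]]
  # Error left over by the empty model, which predicts the mean for everyone
  ss_total <- sum((y - mean(y))^2)
  # Error left over after adding the explanatory variable
  model <- lm(reformulate(explanatory, response = outcome), data = complete_rows)
  ss_error <- sum(resid(model)^2)
  (ss_total - ss_error) / ss_total
}

pre_unemp  <- pre_for("share_unemp_seas")
pre_income <- pre_for("median_house_inc")

# Print both values side by side
round(c(unemployment = pre_unemp, income = pre_income), 3)
```

A larger PRE means that knowing the explanatory variable improves our guesses more. Calling supernova() on a model fit with lm() should report the same sums of squares and PRE in a single table.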