## Variance

Sum of Squares is a good measure of total variation if we are using the mean as a model. But it does have one important disadvantage.

Although you can see that the spread of the data points does not look different between the two distributions, the one on the bottom (#2) has a much larger SS.


Sum of Squares worked fine as a way to quantify error around the mean, and compare error across two distributions when both distributions had the same sample size. But SS isn’t as easily interpreted when sample sizes vary.

The reason for this is that each time you add another data point to the sample distribution, you are adding another squared deviation from the mean to the total SS. So even if two distributions appear to be equally well modeled by their respective means, they may have very different SS. SS grows as the number of data points in the distribution gets larger (unless a new point falls exactly on the mean), irrespective of the degree of spread.
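A small sketch in R illustrates the point. The vectors `small` and `large` are made-up values, not data from the course, and `sum_of_squares()` is a helper defined here only for illustration. Because `large` simply repeats `small` twice, the spread around the mean is the same, yet the SS doubles.

```r
# Hypothetical data for illustration only (not from the Fingers data frame)
small <- c(2, 4, 6, 8)
large <- rep(small, times = 2)   # same values, twice as many data points

# Helper (an assumption for this sketch): sum of squared deviations from the mean
sum_of_squares <- function(y) sum((y - mean(y))^2)

ss_small <- sum_of_squares(small)   # 20
ss_large <- sum_of_squares(large)   # 40

ss_small
ss_large
```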


This problem is solved by adding two new statistics to our toolbox: variance and standard deviation. To calculate variance, we start with SS, or total error, and then divide by (roughly) the sample size to end up with a measure of average error around the mean: approximately the average of the squared deviations.

Because it is an average, variance does not automatically grow with sample size, so it can be used to compare the amount of error across two samples of different sizes.

The formula for variance, usually represented as $$s^2$$, is:

$s^2 = \frac{\sum_{i=1}^n (Y_i-\bar{Y})^2}{n-1}$


You can see that the numerator is the sum of squares. To get an actual average of squared deviations you would divide by n, but we instead divide by n-1. We do this because dividing by n-1 gives us a better (unbiased) estimate of the actual population variance, a result that simulation studies also bear out.

The reason for this is that when you take a small sample, the most extreme values in a population are unlikely to show up. So, especially in smaller samples, dividing by n would slightly underestimate the true population variance. Dividing by n-1 corrects this bias, making the variance estimate a bit larger. And, as the sample gets larger, the difference between n and n-1 matters less and less. If you want to know more, this adjustment is known as Bessel's correction.
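The bias can be seen in a small simulation. This is only a sketch under assumed settings: a normal population with mean 60 and standard deviation 2 (so a true variance of 4), samples of size 5, and an arbitrary seed. None of these values come from the course data.

```r
set.seed(1)        # arbitrary seed so the run is repeatable

pop_sd <- 2        # assumed population SD, so the true variance is 4
n      <- 5        # a small sample size
n_sims <- 10000    # number of simulated samples

# For each simulated sample, compute the sum of squares around the sample mean
ss <- replicate(n_sims, {
  y <- rnorm(n, mean = 60, sd = pop_sd)
  sum((y - mean(y))^2)
})

mean_div_n  <- mean(ss / n)         # tends to land below 4 (around 3.2)
mean_div_n1 <- mean(ss / (n - 1))   # tends to land close to 4

mean_div_n
mean_div_n1
```

Averaged over many samples, the estimates that divide by n should sit noticeably below the true variance, while those that divide by n-1 should sit close to it.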

The main thing to know is that taking the SS and dividing by n-1 results in something that approximates an average squared deviation. (Also note: the n-1 you see in the denominator is sometimes called the degrees of freedom, or df. This will be more important later.)


So how do we calculate variance in R? We use var(). Here is how to calculate the variance of our Thumb data from TinyFingers.

    var(TinyFingers$Thumb)

Try calculating the variance of Thumb from the larger Fingers data frame. (Hint: you can use Fingers$Thumb to select just the Thumb column.)

    library(mosaic)
    library(ggformula)
    library(supernova)

    # calculate the variance of Thumb from the Fingers data frame
    var(Fingers$Thumb)
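To connect var() back to the formula, the following sketch computes the variance by hand and compares it with R's built-in result. The vector `y` holds made-up values for illustration, not Thumb lengths from the course data.

```r
y <- c(2, 4, 6, 8)             # hypothetical values for illustration

ss <- sum((y - mean(y))^2)     # sum of squares: 20
df <- length(y) - 1            # degrees of freedom: n - 1 = 3
variance_by_hand <- ss / df    # 20 / 3, about 6.67

# var() also divides by n - 1, so the two results should agree
all.equal(variance_by_hand, var(y))
```

The same steps would apply to a real column such as Fingers$Thumb, provided it contains no missing values.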