10.4 A Confession, and the t Distribution

So here is a tiny confession. We’ve been kind of (but not really) lying to you. We’ve been saying that a 95% confidence interval is roughly plus or minus two standard errors from the estimate. But as you may have guessed, it’s not exactly two standard errors.
It’s actually 1.96 standard errors. In our view, 2 is close enough to 1.96 and 2 is much easier to multiply by in your head. But when you ask R to calculate the confidence interval it will use 1.96 standard errors above and below the mean and you will get a more precise estimate of the confidence interval.
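To see how little the rounding matters, here is a small sketch in base R. The estimate and standard error are made-up numbers chosen for illustration; they are not values from the Fingers data.

```r
# Hypothetical estimate and standard error (made up for illustration)
estimate <- 60
se <- 0.7

# Critical z score that leaves 2.5% in each tail of the normal distribution
z_crit <- qnorm(0.975)
z_crit

# Interval using the "plus or minus two" rule of thumb
rough_ci <- c(lower = estimate - 2 * se, upper = estimate + 2 * se)

# Interval using the more precise critical z score
precise_ci <- c(lower = estimate - z_crit * se, upper = estimate + z_crit * se)

rough_ci
precise_ci
```

The two intervals differ only by about 0.04 standard errors at each end, which is why the rule of thumb is usually good enough for mental arithmetic.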
There is one more thing, though. You might also be wondering: why don’t they just shorten the name of that confidence interval function to confint() rather than confint.default()?
That is a good question that brings us to another tiny technicality. As you saw in the previous chapter, the normal distribution isn’t always going to be the best mathematical model for a sampling distribution. When your sample size is fairly large, you can assume the sampling distribution is normal. But if the sample size is small or if the \(\sigma\) of the DGP is unknown (which it generally is), you’ll have more variation in your sampling distribution than is modeled by the normal distribution.
What do we do then? We use the t distribution, which is very similar to the normal distribution (also called the z distribution) but slightly more variable. The t distribution has a slightly different shape depending on the degrees of freedom used to estimate \(\sigma\). And for very large samples, the t distribution looks almost exactly like the normal distribution. Bottom line: we usually use the t distribution because it can act like the normal distribution when appropriate.
The t distribution, like the standard normal distribution, is a mathematical probability function that is a good model for sampling distributions of the mean. It’s just a little more variable, that’s all.
We now know that if our sampling distribution was assumed to be normal, then the critical distance is 1.96 standard errors away from the estimate. We call this the critical z score.
Let’s try to think about what the critical distance would be if we used the t distribution instead of the z distribution. In other words, what would the critical t score be? Would it be bigger or smaller than 1.96?
A function called xqt() will take in a proportion (e.g., .975, which leaves .025 in the upper tail) and the degrees of freedom (which, for now, will be n - 1) and tell you the length of the critical distance in units of standard error.
For a very large sample (like 1,000 data points), this code will return something very close to 1.96 (check the R Console as well as the Plot).
xqt(.975, df = 999)
Let’s try that. Run the code below.
#load packages
require(ggformula)
require(mosaic)
require(supernova)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)
# Run this code
xqt(.975,df=999)
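xqt() comes from the mosaic package and draws a plot along with the number. If only the number is needed, base R's qt() accepts the same proportion and df. The sketch below is one way to check why .975, rather than .025, gives the positive critical value.

```r
# Upper critical t: 97.5% of the distribution lies below this value
upper_t <- qt(0.975, df = 999)

# Lower critical t: only 2.5% of the distribution lies below this value
lower_t <- qt(0.025, df = 999)

upper_t
lower_t

# The t distribution is symmetric, so the two values differ only in sign
all.equal(upper_t, -lower_t)
```

Either value gives the same critical distance; only the direction (above or below the estimate) changes.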
Let’s take a look at the critical t score for different sample sizes.
#load packages
require(ggformula)
require(mosaic)
require(supernova)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)
# Find the critical t in these different situations (df = n - 1)
# When the sample size is 500
xqt(.975, df = 499)
# When the sample size is 157 (like in our Fingers data)
xqt(.975, df = 156)
# When the sample size is 50
xqt(.975, df = 49)
# When the sample size is 20
xqt(.975, df = 19)
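The pattern across sample sizes is easier to see side by side. The sketch below uses base R's qt() and qnorm() instead of xqt() so that no plots are drawn. The sample size of 5 is an extra case added here for contrast; it is not one of the cases in the exercise.

```r
# Sample sizes from the exercise, plus a very small one (n = 5) for contrast
n <- c(5, 20, 50, 157, 500, 1000)

# Critical t for each sample size, using df = n - 1
t_crit <- qt(0.975, df = n - 1)

# The critical z score does not depend on sample size
z_crit <- qnorm(0.975)

# One row per sample size
data.frame(n = n, df = n - 1,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The critical t is always larger than the critical z, and the gap shrinks steadily as the sample size grows.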
The function confint() uses the t distribution to estimate the critical distance, so it adjusts for the size of your sample automatically. The function we tried out before, confint.default(), uses the z distribution.
Try out both of these functions in the DataCamp window below. Is the confidence interval based on the t distribution (confint()) slightly bigger than the one based on the normal distribution?
#load packages
require(ggformula)
require(mosaic)
require(supernova)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)
# set up exercise
Thumb.stats <- favstats(~ Thumb, data = Fingers)
SE <- Thumb.stats$sd/sqrt(Thumb.stats$n)
Empty.model <- lm(Thumb ~ NULL, data = Fingers)
# This will calculate CI based on the normal distribution.
confint.default(Empty.model)
# Write code to calculate CI based on the t distribution.
Thumb.stats$mean + xqt(.975, df = Thumb.stats$n - 1)*SE
Thumb.stats$mean - xqt(.975, df = Thumb.stats$n - 1)*SE
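The difference between the two functions shows up more clearly in a small sample. The sketch below uses simulated data as a stand-in (20 made-up measurements, not the Fingers data) and writes the empty model as y ~ 1, the usual base R spelling of an intercept-only model. It also rebuilds the t-based interval by hand to show what confint() is doing.

```r
set.seed(10)

# Simulated measurements standing in for a small sample (not the Fingers data)
y <- rnorm(20, mean = 60, sd = 9)

# Empty model: the only parameter estimated is the mean
empty_model <- lm(y ~ 1)

ci_t <- confint(empty_model)          # based on the t distribution
ci_z <- confint.default(empty_model)  # based on the normal (z) distribution

# The same t-based interval built by hand from the standard error
se <- sd(y) / sqrt(length(y))
by_hand <- mean(y) + c(-1, 1) * qt(0.975, df = length(y) - 1) * se

ci_t
ci_z
by_hand
```

With only 20 data points the t-based interval is visibly wider than the z-based one, and the hand-built interval matches the output of confint().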
Notice that at the end of the day, for samples that are fairly large, none of these technical details make a huge difference. That’s why, even after going through all these details, we still think of the 95% confidence interval as being about plus or minus two standard errors from our estimate.
Bottom line, all you need to use is confint(). It will give you the best estimate of the confidence interval regardless of sample size.
One last thing: sometimes you may just want the lower bound of a confidence interval, or the upper bound. If we tell you that, for our empty model, the function confint() returns two numbers (the first being the lower bound and the second, the upper bound), you could probably figure out how to just get one or the other. Run the following code to see what we mean.
#load packages
require(ggformula)
require(mosaic)
require(supernova)
require(Lock5Data)
require(Lock5withR)
require(okcupiddata)
# set up exercise
Thumb.stats <- favstats(~ Thumb, data = Fingers)
SE <- Thumb.stats$sd/sqrt(Thumb.stats$n)
Empty.model <- lm(Thumb ~ NULL, data = Fingers)
# Try running this code
confint(Empty.model)[1]
confint(Empty.model)[2]
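Strictly speaking, confint() hands back a small matrix with one row per parameter, so the [1] and [2] shortcut works as shown only when the model has a single parameter. A minimal sketch on simulated data (made up for illustration, not the Fingers data) that picks out the bounds by row and column instead:

```r
set.seed(10)
y <- rnorm(20, mean = 60, sd = 9)   # simulated data, for illustration only
empty_model <- lm(y ~ 1)            # intercept-only (empty) model

ci <- confint(empty_model)

# One row per parameter, one column per bound
dim(ci)
colnames(ci)

lower <- ci[1, 1]   # row 1 (the intercept), column 1 (lower bound)
upper <- ci[1, 2]   # row 1 (the intercept), column 2 (upper bound)

# For a one-parameter model, ci[1] and ci[2] pick out the same two numbers
c(lower = lower, upper = upper)
```

Indexing by row and column keeps working when more parameters are added to the model, which is the main reason to prefer it over a single index.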