
Introduction to Statistics: A Modeling Approach

A Confession, and the t Distribution

So here is a tiny confession. We’ve been kind of (but not really) lying to you. We’ve been saying that a 95% confidence interval is roughly plus or minus two standard errors from the estimate. But as you may have guessed, it’s not exactly two standard errors.

It’s actually 1.96 standard errors. In our view, 2 is close enough to 1.96 and 2 is much easier to multiply by in your head. But when you ask R to calculate the confidence interval it will use 1.96 standard errors above and below the mean and you will get a more precise estimate of the confidence interval.
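To see where 1.96 comes from, here is a minimal sketch using base R's qnorm(), which returns the cutoff of the standard normal distribution below which a given proportion falls. The estimate and standard error are made-up numbers used only for illustration; they are not from any data set in this book.

```r
# Critical z score for a 95% interval: the cutoff with 2.5% in the upper tail
z_crit <- qnorm(0.975)
z_crit

# Hypothetical estimate and standard error (illustration only)
estimate <- 60
se <- 0.7

# Rule-of-thumb interval (2 SEs) versus the interval based on the critical z
ci_rule_of_thumb <- estimate + c(-1, 1) * 2 * se
ci_exact_z <- estimate + c(-1, 1) * z_crit * se

ci_rule_of_thumb
ci_exact_z
```

The two intervals differ only slightly, which is why multiplying by 2 works well as a mental shortcut.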

There is one more thing, though. You might also be wondering—why don’t they just shorten the name of that confidence interval function to confint() rather than confint.default()?

That is a good question that brings us to another tiny technicality. As you saw in the previous chapter, the normal distribution isn’t always going to be the best mathematical model for a sampling distribution. When your sample size is fairly large, you can assume the sampling distribution is normal. But if the sample size is small or if the \(\sigma\) of the DGP is unknown (which it generally is), you’ll have more variation in your sampling distribution than is modeled by the normal distribution.

What do we do then? We use the t distribution, which is very similar to the normal distribution (also called the z distribution) but slightly more variable. The t distribution has a slightly different shape depending on the degrees of freedom used to estimate \(\sigma\). For very large samples, the t distribution looks almost exactly like the normal distribution. Bottom line: we usually use the t distribution because it acts like the normal distribution when appropriate.


The t distribution, like the standard normal distribution, is a mathematical probability function that is a good model for sampling distributions of the mean. It’s just a little more variable, that’s all.
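One way to see what "a little more variable" means is to compare how much probability each distribution puts far from the center. The sketch below uses base R's pnorm() and pt(); the degrees of freedom (5, 30, 1000) are arbitrary choices for illustration.

```r
# Probability of landing more than 2 standard errors from the center (both tails)
tail_normal <- 2 * pnorm(-2)
tail_t_df5 <- 2 * pt(-2, df = 5)
tail_t_df30 <- 2 * pt(-2, df = 30)
tail_t_df1000 <- 2 * pt(-2, df = 1000)

# The t distribution has heavier tails, especially with few degrees of freedom
c(normal = tail_normal,
  t_df5 = tail_t_df5,
  t_df30 = tail_t_df30,
  t_df1000 = tail_t_df1000)
```

As the degrees of freedom grow, the tail probability for t shrinks toward the normal value.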

We now know that if the sampling distribution is assumed to be normal, the critical distance is 1.96 standard errors from the estimate. We call this the critical z score.

Let’s try to think about what the critical distance would be if we used the t distribution instead of the z distribution. In other words, what would the critical t score be? Would it be bigger or smaller than 1.96?


A function called xqt() takes a cumulative proportion (e.g., .975, which leaves .025 in the upper tail) and the degrees of freedom (which, for now, will be n-1) and tells you the length of the critical distance in units of standard error. If you enter .025 instead, you get the lower-tail cutoff, which is the same distance with a negative sign.

For a very large sample (like 1,000 data points), this code will return something very close to 1.96 (check the R Console as well as the Plot).

`xqt(.975, df = 999)`
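A quantile function reports the cutoff for a cumulative (lower-tail) proportion, so the sign of the result depends on which tail you ask about. The sketch below uses base R's qt() as a stand-in for xqt(), on the assumption that the two return the same number and that xqt() additionally draws a plot.

```r
df <- 999

lower_cut <- qt(0.025, df = df)  # cutoff with 2.5% of the distribution below it
upper_cut <- qt(0.975, df = df)  # cutoff with 2.5% of the distribution above it

c(lower = lower_cut, upper = upper_cut)

# The critical distance is the absolute value of either cutoff
critical_t <- abs(lower_cut)
critical_t
```

Because the t distribution is symmetric around zero, the two cutoffs are mirror images, and either one gives the critical distance.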


Let’s try that. Run the code below.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # Run this code
    xqt(.975, df = 999)

Let’s take a look at the critical t score for different sample sizes.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # Try finding the critical t in these different situations

    # When the sample size is 500
    xqt(.975, df = 499)

    # When the sample size is 157 (like in our Fingers data)
    xqt(.975, df = 156)

    # When the sample size is 50
    xqt(.975, df = 49)

    # When the sample size is 20
    xqt(.975, df = 19)
Hint: In this case, find the degrees of freedom by subtracting 1 from the sample size.


The function confint() uses the t distribution to estimate the critical distance, so it adjusts for the size of your sample automatically. The function we tried out before, confint.default(), uses the z distribution.
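To make the difference concrete, the sketch below fits an empty model to simulated data and checks both functions against a by-hand calculation. The data are randomly generated stand-ins (20 values, an arbitrary choice), and lm(y ~ 1) is used here as the base R way to write an empty model; none of this comes from the Fingers data.

```r
set.seed(10)

# Simulated measurements standing in for a real variable (illustration only)
y <- rnorm(20, mean = 60, sd = 9)
empty_model <- lm(y ~ 1)  # empty model: estimates only the mean

n <- length(y)
se <- sd(y) / sqrt(n)

ci_t <- confint(empty_model)          # critical distance from the t distribution
ci_z <- confint.default(empty_model)  # critical distance from the z distribution

# By-hand versions of the same two intervals
by_hand_t <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * se
by_hand_z <- mean(y) + c(-1, 1) * qnorm(0.975) * se

ci_t
by_hand_t
ci_z
by_hand_z
```

With a sample this small, the t-based interval is noticeably wider than the z-based one; with larger samples the gap shrinks.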


Try out both approaches in the code below. Is the confidence interval based on the t distribution (the one confint() uses) slightly wider than the one based on the normal distribution?

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    Thumb.stats <- favstats(~ Thumb, data = Fingers)
    SE <- Thumb.stats$sd / sqrt(Thumb.stats$n)
    Empty.model <- lm(Thumb ~ NULL, data = Fingers)

    # This will calculate CI based on the normal distribution.
    confint.default(Empty.model)

    # Write code to calculate CI based on the t distribution.
    Thumb.stats$mean + xqt(.975, df = Thumb.stats$n - 1) * SE
    Thumb.stats$mean - xqt(.975, df = Thumb.stats$n - 1) * SE
Hint: Use xqt() for t distributions.

Notice that at the end of the day, for samples that are fairly large, none of these technical details make a huge difference. That’s why, even after going through all these details, we still think of the 95% confidence interval as being about plus or minus two standard errors from our estimate.
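A rough way to quantify "not a huge difference" is to ask how much wider the t-based interval is than the z-based one at various sample sizes. The sketch below uses base R's qt() and qnorm(); the sample sizes are chosen for illustration.

```r
sample_sizes <- c(20, 50, 157, 500, 1000)

critical_t <- qt(0.975, df = sample_sizes - 1)
critical_z <- qnorm(0.975)

# How much wider (in percent) the t-based interval is than the z-based one
percent_wider <- 100 * (critical_t / critical_z - 1)

data.frame(n = sample_sizes,
           critical_t = round(critical_t, 3),
           percent_wider = round(percent_wider, 1))
```

The extra width is largest for the smallest sample and becomes negligible once the sample reaches the hundreds.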


Bottom line, all you need to use is confint(). It will give you the best estimate of the confidence interval regardless of sample size.

One last thing: sometimes you may just want the lower bound of a confidence interval, or the upper bound. If we tell you that, for the empty model, confint() returns two numbers (the first being the lower bound and the second, the upper bound), you could probably figure out how to get just one or the other. Run the following code to see what we mean.

    # load packages
    require(ggformula)
    require(mosaic)
    require(supernova)
    require(Lock5Data)
    require(Lock5withR)
    require(okcupiddata)

    # set up exercise
    Thumb.stats <- favstats(~ Thumb, data = Fingers)
    SE <- Thumb.stats$sd / sqrt(Thumb.stats$n)
    Empty.model <- lm(Thumb ~ NULL, data = Fingers)

    # Try running this code
    confint(Empty.model)[1]
    confint(Empty.model)[2]
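For the curious, here is a closer look at what the indexing is doing. In base R, confint() on an lm object returns a small matrix with one row per parameter and one column per bound; with an empty model there is only one row, so [1] and [2] pick out the lower and upper bounds. The sketch below uses simulated data (illustration only) and shows that the bounds can also be pulled out by name.

```r
set.seed(3)

# Simulated measurements standing in for a real variable (illustration only)
y <- rnorm(30, mean = 60, sd = 9)
empty_model <- lm(y ~ 1)

ci <- confint(empty_model)
dim(ci)       # one row (the intercept), two columns (the bounds)
colnames(ci)  # column labels for the lower and upper bounds

# Extract by position, as in the code above
lower <- ci[1]
upper <- ci[2]

# Extract by row and column name, which is more explicit
lower_by_name <- ci["(Intercept)", "2.5 %"]
upper_by_name <- ci["(Intercept)", "97.5 %"]

c(lower = lower, upper = upper)
```

Indexing by name takes more typing but keeps working when a model has more than one parameter, where position alone becomes ambiguous.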
