Introduction to Statistics: A Modeling Approach

Using Confidence Intervals to Evaluate a Regression Model

As you can see by now, we can construct a sampling distribution, and thus a confidence interval, for any parameter we can estimate from data. We started with our estimate of the mean (\(b_{0}\)), and then moved on to our estimate of the difference in means between two groups (\(b_{1}\)).

In our two-group example, we introduced an approach to using confidence intervals to evaluate models. We started by specifying a complex model (for example, a model with one more parameter than the empty model). We then fit the complex model (i.e., calculated the best parameter estimates).

Because the particular estimates we calculated were just one possible set of estimates that we could have gotten, depending on sampling variation, we constructed a confidence interval around our estimates. The confidence interval shows us the possible range of worlds (\(\beta\)s) where our sample estimate would be considered likely. We constructed a sampling distribution around the additional parameter, the one that differentiated the complex model from the empty model (in this case, \(\beta_{1}\), or the increment from Short to Tall).

Finally, we checked to see if the confidence interval around the additional parameter estimate included 0. If it did include 0, we would just stick with the empty model. It is, after all, simpler! Even though your estimate might be greater than 0, it could have resulted from random variation in sampling. Thus, there is no strong evidence from our data that would cause us to rule out the simpler model in favor of the more complex model.

If it did not include 0, however, which in our case it did not, then we could reject the empty model and adopt the more complex one. To use the language of statistical significance testing, we would say that the complex model was significantly better than the empty model. Or, similarly, we could say that the parameter representing the group difference was significantly different from 0.

Application to the Regression Model

Let’s see if we now can apply this same approach to evaluating a regression model. We used regression, you may recall, to model the relationship between a quantitative explanatory variable and a quantitative outcome. We would use a regression model, for example, to represent the relationship between height in inches (the explanatory variable) and thumb length in millimeters.

What we really want to know is this: in the population, if we know someone’s height in inches, does it help us make a better prediction about their thumb length? Or would we do just as well to go with the same guess (the Grand Mean) for everyone?


We can specify the Height model like this:

\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]

Our model specification looks the same as the two-group model, but by now you know that the interpretation of the different model components will be different.


In this model, the two parameters define a best-fitting line. \(b_{0}\) represents the y intercept, or the value of Y when X equals 0. \(b_{1}\) represents the slope of the line, or the incremental value added to Y for each unit increase in X. In this case, the slope would represent the increase in thumb length in millimeters for each one inch increase in height.
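To make the roles of the two parameters concrete, here is a minimal sketch in R. The slope of 0.96 is the sample estimate discussed in this section; the intercept value and the function name `predict_thumb` are made up for illustration and are not estimates from the data.

```r
# Illustrative parameter values (the intercept is hypothetical)
b0 <- -3.33   # y intercept: predicted Thumb (mm) when Height is 0
b1 <- 0.96    # slope: change in predicted Thumb (mm) per inch of Height

# The regression model's prediction for a given height in inches
predict_thumb <- function(height) b0 + b1 * height

# Predictions for two people who differ by one inch in height
predict_thumb(65)
predict_thumb(66)

# The difference between the two predictions is the slope
predict_thumb(66) - predict_thumb(65)

# The prediction at a height of 0 is the intercept
predict_thumb(0)
```

Notice that the intercept describes a height of 0 inches, which is far outside the range of real heights, so the slope is usually the parameter with the more useful interpretation.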

Go ahead and fit the Height model in R using lm().

    require(supernova)
    require(Lock5Data)
    require(mosaic)
    require(Lock5withR)

    # fit the Height model for Thumb
    Height.model <- lm(Thumb ~ Height, data = Fingers)

    # print out the best fitting parameters
    Height.model

From our best-fitting estimates we see that Height helps to predict Thumb in our sample. But what we really want to know is whether Height could help us predict Thumb length in the population. If \(\beta_{1}\) is actually 0, then we would not need to include Height in the model since it would be multiplied by 0 anyway. But if \(\beta_{1}\) is not equal to 0, we can reject the empty model and adopt the more complex regression model.

Since we can’t directly calculate \(\beta_{1}\), we will use \(b_{1}\) as an estimate. But estimates from samples have a problem. They vary from sample to sample. This is why we turn to sampling distributions to give us a sense of how much these estimates vary. Even though our estimate of the increment from the sample is .96 (adding on .96 mm to Thumb length for every inch of Height), in the population \(\beta_{1}\) could be less, or it could be more. How much could it vary? Could it be 0? These are the questions we can answer with the confidence interval.


The simple model to which we would compare the more complex regression model is this one:

\[Y_{i}=b_{0}+e_{i}\]

Using this model, we would predict each person’s thumb length to be the mean of all thumb lengths in the sample regardless of their height. If we adopted the regression model, we would be saying that the prediction is significantly better if, in addition to the mean, you know the slope of the best fitting regression line.

The difference between these two models is the slope, or \(b_{1}\). If \(b_{1}\) were equal to 0, then the complex model would be reduced to the empty model. The slope, then, is the key parameter we are interested in.
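The relationship between the two models can be sketched in R. This example uses simulated data as a stand-in for the Fingers data (the data frame `sim` and the numbers used to generate it are assumptions for illustration), and it writes the empty model as `Thumb ~ 1`, which is base R's way of specifying an intercept-only model.

```r
# Simulated stand-in for the course data (all values are made up)
set.seed(10)
n <- 157
Height <- rnorm(n, mean = 66, sd = 4)                      # inches
Thumb  <- -3 + 0.96 * Height + rnorm(n, mean = 0, sd = 8)  # millimeters
sim <- data.frame(Height, Thumb)

# Empty model: a single parameter
empty_model <- lm(Thumb ~ 1, data = sim)

# Regression model: intercept plus slope
height_model <- lm(Thumb ~ Height, data = sim)

# The empty model's only estimate is the mean of Thumb
coef(empty_model)
mean(sim$Thumb)

# The regression model estimates b0 and b1
coef(height_model)

# The two models differ by exactly one parameter: the slope
length(coef(height_model)) - length(coef(empty_model))
```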

Constructing a Sampling Distribution Around the Slope

We’ve already constructed a sampling distribution around \(b_{1}\) for the two-group model (i.e., Height2Group). Using the same approach as before, let’s use resampling to construct a sampling distribution for the slope of the regression line. Starting with our sample, we will:

  • Resample with replacement to generate a new, bootstrapped sample;

  • Fit the regression model to find the slope of the best fitting regression line (i.e., calculate a value for \(b_{1}\));

  • Repeat 10,000 times;

  • Record the resampled estimates in a new data frame.
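The four steps above can be sketched in base R, as an alternative to the course's `do()` and `resample()` helpers. The data here are simulated (the data frame `sim`, the function `boot_slope`, and the numbers used to generate the data are assumptions for illustration), so the resulting slopes will not match the ones from the Fingers data.

```r
# Simulated stand-in for the course data (all values are made up)
set.seed(100)
n <- 157
Height <- rnorm(n, mean = 66, sd = 4)                      # inches
Thumb  <- -3 + 0.96 * Height + rnorm(n, mean = 0, sd = 8)  # millimeters
sim <- data.frame(Height, Thumb)

# Steps 1 and 2: resample rows with replacement, then fit the
# regression model and return its slope
boot_slope <- function(d) {
  rows <- sample(nrow(d), replace = TRUE)
  coef(lm(Thumb ~ Height, data = d[rows, ]))[["Height"]]
}

# Step 3: repeat 10,000 times
boot_b1 <- replicate(10000, boot_slope(sim))

# Step 4: record the resampled slopes in a data frame
boot_sd <- data.frame(b1 = boot_b1)
head(boot_sd)
```

Each row of `boot_sd` holds the slope from one bootstrapped sample, so the spread of the `b1` column approximates how much the slope would vary from sample to sample.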


Try putting all that together in the DataCamp window below. Save your 10,000 resampled slopes as bootSDob1. Print the first six lines of bootSDob1.

    require(supernova)
    require(Lock5Data)
    require(mosaic)
    require(Lock5withR)

    # set the random seed (custom_seed() is the exercise environment's helper)
    custom_seed(100)

    # create a bootstrapped sampling distribution of b1s
    bootSDob1 <- do(10000) * b1(Thumb ~ Height, data = resample(Fingers, 157))

    # print a few lines of bootSDob1
    head(bootSDob1)

Use the DataCamp window to create a histogram and run favstats on b1 from bootSDob1 (you can assume bootSDob1 has already been created).

    require(supernova)
    require(Lock5Data)
    require(mosaic)
    require(Lock5withR)

    # set the random seed (custom_seed() is the exercise environment's helper)
    custom_seed(100)
    bootSDob1 <- do(10000) * b1(Thumb ~ Height, data = resample(Fingers, 157))

    # make a histogram
    gf_histogram(~ b1, data = bootSDob1, fill = "gold")

    # run favstats
    favstats(~ b1, data = bootSDob1)

Now that we have a sampling distribution, we can construct the 95% confidence interval using one of the methods we have developed.

The first approach is simply to find the cutoff points for the confidence interval directly from the bootstrapped sampling distribution. In this DataCamp exercise, arrange b1 in bootSDob1 in order and examine the 250th and 9,750th values.

    require(supernova)
    require(Lock5Data)
    require(mosaic)
    require(Lock5withR)

    # set the random seed (custom_seed() is the exercise environment's helper)
    custom_seed(100)
    bootSDob1 <- do(10000) * b1(Thumb ~ Height, data = resample(Fingers, 157))

    # arrange b1s in descending order
    bootSDob1 <- arrange(bootSDob1, desc(b1))

    # print the 250th b1 (the upper cutoff, because the order is descending)
    bootSDob1$b1[250]

    # print the 9750th b1 (the lower cutoff)
    bootSDob1$b1[9750]

These cutoff points define the 95% confidence interval for the slope of the regression line. The lower bound for the slope is .58 and the upper bound is 1.36.
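The same cutoff idea can be sketched in base R with simulated data (the data frame `sim`, the helper `slope`, and the simulated values are assumptions for illustration, so the bounds will differ from .58 and 1.36). To keep the resampling fast, this sketch computes each slope as the covariance of the two variables divided by the variance of the explanatory variable, which gives the same slope as fitting the regression with `lm()`. It then compares the 250th and 9,750th sorted values with the 2.5th and 97.5th percentiles from `quantile()`.

```r
# Simulated stand-in for the course data (all values are made up)
set.seed(100)
n <- 157
Height <- rnorm(n, mean = 66, sd = 4)                      # inches
Thumb  <- -3 + 0.96 * Height + rnorm(n, mean = 0, sd = 8)  # millimeters
sim <- data.frame(Height, Thumb)

# Slope of the best fitting line: cov(X, Y) / var(X)
slope <- function(d) cov(d$Height, d$Thumb) / var(d$Height)

# 10,000 bootstrapped slopes
boot_b1 <- replicate(10000, slope(sim[sample(nrow(sim), replace = TRUE), ]))

# Sort in ascending order and read off the cutoffs
sorted_b1 <- sort(boot_b1)
lower <- sorted_b1[250]    # cuts off the lowest 2.5%
upper <- sorted_b1[9750]   # cuts off the highest 2.5%
c(lower = lower, upper = upper)

# quantile() gives nearly the same cutoffs without sorting by hand
quantile(boot_b1, probs = c(0.025, 0.975))
```

Sorting in ascending order puts the lower bound at position 250 and the upper bound at position 9,750; sorting in descending order simply swaps the two positions.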


Using a Mathematical Probability Distribution to Calculate the Confidence Interval for Slope

Instead of using bootstrapping to get the confidence interval around our estimate, we could just use the function confint(). This method will: 1) assume that the sampling distribution of the slope is shaped like a t distribution, 2) use the t distribution to figure out how far away the “unlikely” zone is in units of standard error (the critical distance will, once again, be about 2 standard errors), 3) then estimate standard error to figure out how far away the “unlikely” zone is in millimeters.
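The three steps can be reproduced by hand and checked against `confint()`. This is a minimal sketch using simulated data rather than the Fingers data (the data frame `sim` and the simulated values are assumptions for illustration), so the numbers will differ from the ones in the text.

```r
# Simulated stand-in for the course data (all values are made up)
set.seed(100)
n <- 157
Height <- rnorm(n, mean = 66, sd = 4)                      # inches
Thumb  <- -3 + 0.96 * Height + rnorm(n, mean = 0, sd = 8)  # millimeters
sim <- data.frame(Height, Thumb)

fit <- lm(Thumb ~ Height, data = sim)

# Slope estimate and its estimated standard error
est <- coef(summary(fit))["Height", "Estimate"]
se  <- coef(summary(fit))["Height", "Std. Error"]

# Critical distance in standard errors, from the t distribution
t_crit <- qt(0.975, df = df.residual(fit))
t_crit

# Convert the critical distance to millimeters per inch and
# place it on either side of the estimate
by_hand <- c(est - t_crit * se, est + t_crit * se)
by_hand

# confint() carries out the same calculation
confint(fit, "Height", level = 0.95)
```

With 157 observations and two estimated parameters, the t distribution has 155 degrees of freedom, which is why the critical distance comes out close to 2 standard errors.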

In the DataCamp window below, fit the complex model using Height to explain variation in Thumb length. Then run confint() to find the confidence intervals around the estimates.

    require(supernova)
    require(Lock5Data)
    require(mosaic)
    require(Lock5withR)

    # fit the complex model and save it as Height.model
    Height.model <- lm(Thumb ~ Height, data = Fingers)

    # get the confidence interval around these best fitting estimates
    confint(Height.model)

Notice that the confidence interval for slope produced by confint() using the t distribution (.60 to 1.32) is very close to the one we constructed based on our bootstrapped samples (.58 to 1.36).


Interpreting the Confidence Interval for Slope

Based on our sample, we initially estimated the slope of the regression line to be .96, meaning that the increment in thumb length for every inch in height is .96 mm.


We are interested in whether or not the true value of \(\beta_{1}\) could be 0 because it will help us compare, and choose between, two models for thumb length.

The complex model represents thumb length as a regression line. It is a two-parameter model: one for the y intercept and the other for slope. We represent the model like this:

\[Y_{i}=\beta_{0}+\beta_{1}X_{i}+\epsilon_{i}\]

If the confidence interval for slope included 0, we could decide to use the simpler, empty model. As you can see, if the slope is 0, we would be left with just one parameter, the mean:

\[Y_{i}=\beta_{0}+\cancel{\beta_{1}X_{i}}+\epsilon_{i}\]

Which then becomes:

\[Y_{i}=\beta_{0}+\epsilon_{i}\]

Note, however, that just because the confidence interval includes zero, it doesn’t mean that the true slope is zero. The confidence interval would include a lot of numbers around zero as well.

In the case of height and thumb length, the confidence interval around the slope, .60 to 1.32, does not include 0. This means we are pretty confident that the true slope in the population is not 0. Because the confidence interval around the slope did not include 0, we can, with 95% confidence, reject the simple model and adopt the complex one.
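The decision rule can be written as a small check in R. The first interval is the one reported in the text; the second interval and the helper name `includes_value` are hypothetical, added only to show the other outcome.

```r
# Does a confidence interval include a given value (0 by default)?
includes_value <- function(ci, value = 0) {
  ci[["lower"]] <= value && value <= ci[["upper"]]
}

# Interval for the slope of Height reported in the text
slope_ci <- c(lower = 0.60, upper = 1.32)
includes_value(slope_ci)    # FALSE: reject the empty model

# A hypothetical interval that straddles 0
other_ci <- c(lower = -0.20, upper = 0.50)
includes_value(other_ci)    # TRUE: stick with the empty model
```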

Even with the best statistical tools, we still are left with only a fuzzy idea of what the true population parameters are. Our estimates of \(\beta_{0}\) and \(\beta_{1}\) are the “best fitting estimates,” and if we had to choose a single estimate, these (\(b_{0}\) and \(b_{1}\)) are the numbers we would use.

But it is highly unlikely that these estimates exactly match the true population parameters. Confidence intervals help us to keep that in mind. By coming up with a range of possible parameters given our data, we simultaneously make use of our data while acknowledging that our data are filled with noise, random and otherwise.

Although it might be comforting to just get one estimate instead of the range offered by a confidence interval, it actually increases our likelihood of being wrong. By calculating a confidence interval, we are acknowledging the uncertainty in our estimate, and drawing boundaries around that uncertainty. Even if the interval is large, we can at least be 95% confident that the true parameter lies somewhere in that interval.
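One way to see what "95% confident" means is a simulation in which we know the true slope, something we never know with real data. In this sketch the population values (`true_b0`, `true_b1`, and the spread of the simulated heights and errors) are assumptions chosen for illustration. We draw many samples, build a 95% confidence interval from each with `confint()`, and count how often the interval contains the true slope.

```r
set.seed(2024)

# A made-up population in which the true slope is known
true_b0 <- -3
true_b1 <- 0.96
n <- 157        # sample size
n_sims <- 2000  # number of simulated samples

# For each simulated sample, does the 95% interval contain the true slope?
covers <- replicate(n_sims, {
  Height <- rnorm(n, mean = 66, sd = 4)
  Thumb  <- true_b0 + true_b1 * Height + rnorm(n, mean = 0, sd = 8)
  ci <- confint(lm(Thumb ~ Height))["Height", ]
  ci[1] <= true_b1 && true_b1 <= ci[2]
})

# Proportion of intervals that captured the true slope
mean(covers)
```

The proportion should come out close to .95: most intervals built this way capture the true parameter, but a small share miss it, and with real data we cannot tell which kind we have.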
