PSY350, Module 9
Bootstrapping and Confidence Intervals
Module road map
Learning objectives
- Describe confidence intervals
- Describe why calculating the precision of parameter estimates is important
- Describe how the level of confidence for a confidence interval affects the width of the interval
- Distinguish between bootstrap confidence intervals and theory-based confidence intervals
- Calculate confidence intervals by hand and with R
Readings
- Read Chapter 8 of your Textbook
Overview
In this module, we will learn how to estimate confidence intervals, which provide us with an understanding of the precision of our statistical estimates. Please watch the video below from Crash Course: Statistics on Confidence Intervals.
Introduction to the data
Data scenario: We will continue to use the weight loss study introduced in Module 8.
In the dataframe called wtloss.Rds, the following variables are recorded:
pid is the participant ID
condition is the condition the participant was randomly assigned to receive (control, CBT, mobile groceries, CBT + mobile groceries)
bmi is the participant’s body mass index at the start of the study
caldef is the sum of caloric deficit (in kilocalories) across all three months of the intervention. Caloric deficit is the difference between calories consumed and calories expended (positive number = more calories burned than consumed, negative number = more calories consumed than burned).
lbs_lost is the total pounds lost during the program (positive number = weight loss, negative number = weight gain)
fv5 is a binary indicator that is coded “yes” if the individual ate at least 5 fruits and vegetables per day on at least four days during the last week of the intervention, and is coded “no” if the individual did not meet this criterion.
In Module 8 we explored sampling variation by imagining that we could draw many, many random samples from the population, conduct our study in each random sample, observe how weight changed for each adult in each sample, calculate the mean pounds lost in each sample, and then compare how the pounds lost differed across samples. This gave us a sense of how the effect of interest (pounds lost under the CBT + mobile groceries condition) differed as a result of sampling variation. It should be no surprise that logistically, there is simply no way for us to draw thousands of samples and conduct our study within each sample in order to estimate sampling variability. We need another approach.
In this module we will explore alternative approaches for estimating sampling variability. We’ll start with an approach called bootstrapping. It’s actually quite similar to the techniques that we used in Module 8 – but, rather than working from the premise that we have access to many random samples drawn from the population, we will work from the premise that we can draw many random samples from our observed sample.
Let’s read in the data and subset to the 100 adults in the CBT + mobile groceries condition.
library(skimr)
library(moderndive)
library(infer)
library(here)
library(tidyverse)
wtloss <- readRDS(here("data", "wtloss.Rds"))
CBT_plus <- wtloss %>%
  filter(condition == "CBT + mobile groceries") %>%
  select(pid, lbs_lost)
Introduction to bootstrap resampling
Let’s begin by drawing one bootstrap resample of size 100 from our observed sample, randomly drawing participants with replacement. Drawing with replacement means that each time we draw a participant, we record them as a member of the resample, but we throw them back into the observed sample to be chosen again. This means that any one participant can be selected more than once. And, some participants in the observed sample won’t be drawn at all.
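As a toy illustration of what drawing "with replacement" means, here is base R's sample() function applied to five hypothetical participant IDs (the IDs below are made up for this example):

```r
ids <- c(301, 302, 303, 304, 305)  # five hypothetical participant IDs

set.seed(99)
# draw a resample of the same size, with replacement
sample(ids, size = 5, replace = TRUE)
# a typical draw repeats some IDs and omits others entirely
```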
resample1 <- CBT_plus %>%
  select(pid, lbs_lost) %>%
  rep_sample_n(size = 100, replace = TRUE) %>%
  arrange(pid)
The table below presents the first 20 cases in the resample just drawn. Scan down the list and notice that many people (indexed by pid) were drawn several times in this resample.
replicate | pid | lbs_lost |
---|---|---|
1 | 302 | 31.356768 |
1 | 303 | 0.237590 |
1 | 304 | 8.530642 |
1 | 304 | 8.530642 |
1 | 308 | 20.192784 |
1 | 309 | 12.225309 |
1 | 309 | 12.225309 |
1 | 309 | 12.225309 |
1 | 310 | 7.225871 |
1 | 310 | 7.225871 |
1 | 310 | 7.225871 |
1 | 311 | 3.032363 |
1 | 314 | 8.501624 |
1 | 314 | 8.501624 |
1 | 315 | 17.730369 |
1 | 315 | 17.730369 |
1 | 316 | 14.415086 |
1 | 316 | 14.415086 |
1 | 319 | 10.423608 |
1 | 322 | 14.102500 |
Let’s compare the distribution of lbs_lost in the full observed sample and our drawn resample.
# full observed sample
CBT_plus %>%
  ggplot(mapping = aes(x = lbs_lost)) +
  theme_bw() +
  labs(title = "What is the distribution of weight loss in the full observed sample?",
       x = "pounds lost during the weight loss program") +
  geom_histogram(binwidth = 1)
# 1 bootstrap resample
resample1 %>%
  ggplot(mapping = aes(x = lbs_lost)) +
  theme_bw() +
  labs(title = "What is the distribution of weight loss in the resample?",
       x = "pounds lost during the weight loss program") +
  geom_histogram(binwidth = 1)
The original sample and the resample look quite similar. This is an important property of bootstrap resampling – the resample typically looks very similar to the full observed sample.
Now, instead of drawing one resample, let’s draw 30 resamples.
resamples_30 <- CBT_plus %>%
  select(pid, lbs_lost) %>%
  rep_sample_n(size = 100, replace = TRUE, reps = 30) %>%
  arrange(pid)
resamples_30 %>%
  ggplot(aes(x = lbs_lost)) +
  geom_histogram(binwidth = 1) +
  theme_bw() +
  facet_wrap(~replicate) +
  labs(title = "Distribution of pounds lost across 30 bootstrap resamples",
       x = "pounds lost")
The plot shows the distribution of scores on pounds lost across each of the 30 resamples. Although each sample is different, and the distributions clearly differ to some degree, they are all quite similar in terms of range and shape.
Now, let’s draw 5000 resamples. Given the large number, we won’t plot each individual resample, but rather, we’ll calculate the mean in each resample, and create a histogram of these resampled means.
set.seed(12345)
resamples_5000means <- CBT_plus %>%
  select(pid, lbs_lost) %>%
  rep_sample_n(size = 100, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarize(mean.lbs_lost = mean(lbs_lost))
resamples_5000means %>%
  ggplot(aes(x = mean.lbs_lost)) +
  geom_histogram(binwidth = .1) +
  theme_bw() +
  labs(title = "Distribution of average pounds lost across 5000 bootstrap resamples",
       x = "mean pounds lost in the resample")
This distribution of the mean pounds lost across our 5000 bootstrap resamples mimics the sampling distribution that we studied in Module 8. But, we created it without having to pretend that we have access to the entire population of scores. We created it simply with our single observed sample.
We can use this mimic of our sampling distribution to gain a sense of how much sampling variability is likely to exist for our statistic of interest. For example, the histogram of the mean pounds lost across our 5000 bootstrap resamples varies from about 7 pounds lost up to about 15 pounds lost. The largest concentration of means is right around 11 pounds lost. And, it looks like about 95% of the resamples produce a mean that is between about 9 pounds and 13 pounds lost – this gives us a range of plausible values for where the true population mean is likely to exist. When analysts construct a range of plausible values for a statistic of interest, it is referred to as a confidence interval (CI). You can think of the CI for a statistic as a range of values that is likely to contain the true population parameter, with a specified degree of confidence.
Please play the video below for a recap and summary of the steps we just took to draw our resamples.
To calculate the CI of a statistic we first need to choose a desired level of confidence. For example, a 95% CI is common in psychological research. This implies that if we were to draw many, many random samples from the population, estimate our parameter of interest in each, and then construct a CI around each parameter estimate, 95% of the random samples would produce a CI that includes the true population parameter. This means that 5% of the time our CI won’t include the true population parameter.
The two ends of the CI are called limits or bounds (i.e., the lower bound and the upper bound of the CI). If we aren’t comfortable knowing that the true population parameter will be included in the CI only 95% of the time, and instead we prefer it to be included 99% of the time, we could construct a 99% CI instead.
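This coverage interpretation can be checked with a small simulation. The sketch below assumes a hypothetical normal population with a true mean of 11 pounds and standard deviation of 10 (values chosen only for illustration) and uses the familiar sd/sqrt(n) standard error of the mean:

```r
set.seed(101)

# for each of 1000 random samples, does the 95% CI contain the true mean of 11?
covers <- replicate(1000, {
  samp <- rnorm(100, mean = 11, sd = 10)  # one random sample of n = 100
  se   <- sd(samp) / sqrt(100)            # standard error of the mean
  lo   <- mean(samp) - 1.96 * se
  hi   <- mean(samp) + 1.96 * se
  lo <= 11 & 11 <= hi
})

# proportion of the 1000 CIs that contain the true mean -- close to .95
mean(covers)
```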
Let’s now study two methods for calculating a confidence interval from a bootstrap sampling distribution.
Construct a bootstrap confidence interval
Method 1: percentile
One common method to define the confidence interval is to use the middle 95% of values from the distribution of means across our bootstrap resamples. To do this, we simply find the 2.5th and 97.5th percentiles of our distribution – that will allow 2.5% of means to fall below our CI and 2.5% to fall above our CI. This is quite similar to the activity we did in Unit 4 to determine a range of plausible values of height that would encompass 95% of female adults.
Calculate a 95% confidence interval
We can use the quantile function to calculate this range for us – it takes the corresponding proportion/probability rather than percentile, so we need to divide our percentiles by 100 (2.5/100 = .025 and 97.5/100 = .975). We then plug these values into the code below.
resamples_5000means %>%
  summarize(lower = quantile(mean.lbs_lost, probs = .025),
            upper = quantile(mean.lbs_lost, probs = .975))
lower | upper |
---|---|
9.378793 | 13.24606 |
The two values in the table above, 9.38 and 13.24, give us the lower and upper bound of our confidence interval. Let’s plot our bootstrap sampling distribution, and overlay this 95% CI. In this way, rather than eyeballing the bootstrap resampling distribution to figure out the range of pounds lost that encompasses the middle 95% of means, we can calculate the precise estimates using the quantile function.
resamples_5000means %>%
  ggplot(aes(x = mean.lbs_lost)) +
  geom_histogram(binwidth = .1) +
  geom_vline(xintercept = 9.38, color = "skyblue", size = 1) +
  geom_vline(xintercept = 13.24, color = "skyblue", size = 1) +
  labs(title = "Distribution of average pounds lost across 5000 bootstrap resamples",
       subtitle = "Blue lines represent lower and upper bounds of the 95% CI",
       x = "pounds lost")
Calculate a 90% confidence interval
Finding the middle 95% of scores as we just did corresponds to a 95% confidence interval. This is a very common interval, but we certainly aren’t tied to this. We could choose the middle 90% of scores, for example. For the middle 90%, we’d use the same method, but plug in .05 (for the 5th percentile) and .95 (for the 95th percentile) – allowing 90% of the distribution to be in the middle and 10% in the tails (with 5% in the lower tail and 5% in the upper tail of the distribution).
In this example, we want 10% in the tails – 10%/2 equals 5% in each tail. This equates to the 5th and 95th percentile. Divide by 100 to get the correct value for probs below: 5/100 = .05 for lower quantile and 95/100 = .95 for upper quantile.
resamples_5000means %>%
  summarize(lower = quantile(mean.lbs_lost, probs = .05),
            upper = quantile(mean.lbs_lost, probs = .95))
lower | upper |
---|---|
9.702529 | 12.9352 |
Calculate a 99% confidence interval
Or, we can find a 99% CI, which will be the widest of all that we’ve constructed so far since it is set to contain all but 1% of the full distribution of mean scores.
In this example, we want 1% in the tails – 1%/2 equals .5% in each tail. This equates to the .5th and 99.5th percentile. Divide by 100 to get the correct value for probs below: .5/100 = .005 for lower quantile and 99.5/100 = .995 for upper quantile.
resamples_5000means %>%
  summarize(lower = quantile(mean.lbs_lost, probs = .005),
            upper = quantile(mean.lbs_lost, probs = .995))
lower | upper |
---|---|
8.720616 | 13.83601 |
Method 2: standard error
The Central Limit Theorem states that if you take sufficiently large random samples from a population, then the distribution of a parameter estimate (such as the mean) across those samples will be approximately normally distributed. A variable with a normal distribution (i.e., a bell-shaped curve) adheres to the empirical rule – for example, approximately 95% of values fall within 1.96 standard deviations of the mean. Our bootstrap distribution of the means across the 5000 resamples is roughly normal; therefore, we can use this property to construct confidence intervals using the standard error method.
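You can verify the 95%-within-1.96-standard-deviations figure directly with R's pnorm() function, which returns the cumulative probability of a standard normal distribution:

```r
# area under the standard normal curve between -1.96 and +1.96
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
```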
To use this method, we simply calculate the mean and standard deviation of our means across the 5000 bootstrap resamples.
resamples_5000means %>%
  select(mean.lbs_lost) %>%
  skim()
Name | Piped data |
Number of rows | 5000 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mean.lbs_lost | 0 | 1 | 11.32 | 0.99 | 7.4 | 10.66 | 11.32 | 12.01 | 14.74 | ▁▂▇▅▁ |
Using skim, we find the mean is 11.32 and the standard deviation is .99. Recall from Module 8 that the standard deviation of a sampling distribution (i.e., our distribution of bootstrap resamples) is called a standard error. Therefore, we can expect 95% of scores to fall within ~1.96 standard errors of the mean. That is \(11.32 \pm1.96\times.99\). This corresponds to a range of pounds lost between 9.38 (lower bound) and 13.26 (upper bound). This estimate is very close to our 95% CI that we calculated using the percentile method.
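As a quick sketch, the arithmetic above can be carried out directly in R, plugging in the mean of 11.32 and standard error of .99 reported by skim:

```r
m  <- 11.32  # mean of the 5000 bootstrap means
se <- 0.99   # sd of the bootstrap means, i.e., the standard error

c(lower = m - 1.96 * se, upper = m + 1.96 * se)
# lower = 9.3796, upper = 13.2604
```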
For a 99% confidence interval, we substitute 1.96 with 2.58 as 99% of values of a normal distribution fall within ~2.58 standard errors from the mean. Calculate the 99% CI yourself and compare it to the 99% CI calculated using the percentile method. Do the same with a 90% CI.
Tip – for any desired confidence level, you can find the critical value by which to multiply the standard error using the qnorm() function in R.
# for 95% CI - take 1 - .95, and then divide by 2 to get middle 95%
qnorm(p = (1-.95)/2, lower.tail = FALSE)
## [1] 1.959964
# for 99% CI - take 1 - .99, and then divide by 2 to get middle 99%
qnorm(p = (1-.99)/2, lower.tail = FALSE)
## [1] 2.575829
# for 80% CI - take 1 - .80, and then divide by 2 to get middle 80%
qnorm(p = (1-.80)/2, lower.tail = FALSE)
## [1] 1.281552
Theory-based confidence intervals
In the first part of this module, we learned about bootstrapping to mimic the process of sampling many random samples from the population to obtain the sampling distribution. This gave us insight into the expected variability of our parameter estimate across many random samples and allowed us to calculate a 95% CI that gave us a range of plausible values for the true parameter in the population. We used two methods to calculate the bootstrap CI – a percentile method and a standard error method.
If one feels certain that the sampling distribution of the parameter being estimated is normal (i.e., bell-shaped, Gaussian), then there is another method that can be used to calculate the confidence interval. It is a theory-based approach and does not involve resampling; we simply work with the single sample that we drew. In fact, the interval can be calculated by hand using the sample summary statistics. This theory-based approach relies on the Central Limit Theorem (CLT) and on the empirical rule for a normal distribution. The CLT tells us that for reasonably large sample sizes, the sampling distribution of a parameter estimate will be approximately normal.
We can apply this to the sampling distribution for a parameter of interest (e.g., mean pounds lost in the sample). First, we need to calculate an estimate of the standard error. The theory-based formula for the standard error (SE) of a mean is:
\[\text{SE}_{\bar y} \approx \frac{{\sigma }}{\sqrt{n}}\]
The numerator of the formula is the standard deviation of the variable of interest (pounds lost) in the single sample, and the denominator is the square root of the sample size (n). We can calculate the CI using the summarize function as follows; notice the line that starts with se – here we take the standard deviation of lbs_lost in our single observed sample and divide it by the square root of n:
wtloss %>%
  filter(condition == "CBT + mobile groceries") %>%
  summarize(mean = mean(lbs_lost),
            std = sd(lbs_lost),
            sample_size = n(),
            se = std/sqrt(sample_size),
            lower_boundCI = mean - 1.96*se,
            upper_boundCI = mean + 1.96*se)
mean | std | sample_size | se | lower_boundCI | upper_boundCI |
---|---|---|---|---|---|
11.32439 | 9.890134 | 100 | 0.9890134 | 9.385925 | 13.26286 |
Here, we see that the 95% CI calculated using a theory-based approach is 9.38 to 13.26. This theory-based method of calculating the CI is quite close to the 95% CI that we calculated using the bootstrap method.
Interpretation of confidence intervals
Across all of the methods that we used to calculate a 95% CI for the mean of pounds lost in the population we get a range of roughly 9.4 pounds to 13.3 pounds. To close out this module, let’s try to grow our intuition for exactly what this CI indicates. Please click the video below for a discussion on the interpretation of confidence intervals.