PSY350, Module 8

Sampling

Module road map

Learning objectives

  1. Contrast descriptive statistics and inferential statistics
  2. Distinguish between the population and the sample in a study
  3. Distinguish between a population parameter and a parameter estimate
  4. Describe the process of sampling from the population
  5. Contrast accuracy and precision in statistical estimates
  6. Explain how sample size impacts precision
  7. Describe the Central Limit Theorem

Readings

  1. Read Chapter 7 of your Textbook

Overview

Descriptive statistics are used to organize, summarize, and present data — usually about a sample that we have collected from a population. Inferential statistics are used to generalize from a randomly selected sample to the population, to test hypotheses, to make predictions, and to understand the precision of our model estimates.

So far in this course, we’ve focused on descriptive statistics, that is, describing the data in terms of simple statistics (e.g., mean) or relationships (e.g., the slope that relates income growth to vote share in presidential elections, the change in GPA after receiving a mentoring program). In this module we will learn about the process of sampling so that we can make inferences about our defined population based on our sample. Before we dive into analyzing data, please watch the Crash Course: Statistics video on randomness.



Introduction to the data

Let’s imagine that you are a nutrition and exercise scientist focused on promoting healthy eating, exercise, and maintenance of a healthy weight. You are keenly aware that an individual’s behaviors aren’t just driven by personal and intrinsic factors, but rather the context in which one works, lives, and plays drives behavior as well. For example, healthy eating depends on both the availability of, and access to, healthy foods.

The United States Department of Agriculture’s Economic Research Service identified 6,500 food deserts in the United States based on 2000 Census data as well as 2006 data on locations of supermarkets, supercenters, and large grocery stores. A food desert is an area where people have limited access to healthy and affordable food. A paper by Morland, Diez Roux, and Wing found that people who live in a food desert are more likely to have a poor diet and be obese. Of course, food deserts are not evenly distributed across the country. The communities classified as food deserts tend to be home to poorer families and families of color according to a report by the USDA. Thus, food deserts are a factor in producing and maintaining health disparities.

You and your team of social scientists seek to design a weight loss program for overweight individuals living in the North Denver neighborhoods that have been designated as food deserts. The graph below depicts the food deserts in this area.

You believe that an effective weight loss program in these neighborhoods should have two components. First, a cognitive-behavioral therapy (CBT) component in which a dietitian and a therapist work with the individual to teach them about healthy eating, provide them with the skills necessary to prepare healthy foods, and counsel them on adopting new healthy eating behaviors. Second, a mobile grocery store that visits their home once a week that is full of fresh and wholesome foods to purchase at an affordable price.

Your team desires to determine if the combination of the two intervention components produces better results than no intervention at all, or just one of the interventions alone. Therefore, your team designs a randomized controlled trial (RCT) with four arms:

  1. a control group (no intervention now, but will receive intervention after the RCT)
  2. CBT alone
  3. mobile groceries alone
  4. CBT + mobile groceries

You recruit overweight adults (body mass index (BMI) > 25) interested in losing weight and living in one of the North Denver neighborhoods designated as food deserts. Your recruitment efforts are highly successful, and 10,000 people contact you to participate. Of those who agree to participate, you randomly select 400, and then randomly assign these 400 participants to one of the four arms of your study (100 people per arm). Prior to the intervention you measure each participant’s height and weight, and calculate their BMI (to verify that they meet the study criteria).
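
As an aside, the random selection and random assignment steps could be carried out in R along the lines of the sketch below. This is a hypothetical illustration only; the interested dataframe and the seed are invented for the example and are not part of the study data.

# hypothetical sketch of random selection and random assignment (illustration only)
set.seed(350)                                       # for reproducibility
interested <- tibble(pid = 1:10000)                 # the 10,000 adults who contacted the team

enrolled <- interested %>% 
  slice_sample(n = 400) %>%                         # randomly select 400 participants
  mutate(condition = sample(rep(c("control", "CBT", "mobile groceries", 
                                  "CBT + mobile groceries"), each = 100)))  # randomly assign 100 per arm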

The interventions are delivered over the course of three months. During the study, information about calorie intake and calorie expenditure is ascertained so that a total caloric deficit can be computed. After the three months of intervention, you again weigh each participant to calculate the total pounds lost or gained over the intervention period.

In the dataframe called wtloss.Rds, the following variables are recorded:

  • pid is the participant ID

  • condition is the condition the participant was randomly assigned to receive (control, CBT, mobile groceries, CBT + mobile groceries)

  • bmi is the participant’s body mass index at the start of the study

  • caldef is the sum of caloric deficit (in kilocalories) across all three months of the intervention. Caloric deficit is the difference between calories expended and calories consumed (positive number = more calories burned than consumed, negative number = more calories consumed than burned).

  • lbs_lost is the total pounds lost during the program (positive number = weight loss, negative number = weight gain)

  • fv5 is a binary indicator that is coded “yes” if the individual ate at least 5 fruits and vegetables per day on at least four days during the last week of the intervention, and is coded “no” if the individual did not meet this criterion.

Let’s begin by importing the data.

wtloss <- here("data", "wtloss.Rds") %>% 
  readRDS()

The glimpse() function presents the key features of the dataframe. Study these; is each variable what you would expect?

glimpse(wtloss)
## Rows: 400
## Columns: 7
## $ pid       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ condition <fct> control, control, control, control, control, control, contro…
## $ caldef    <dbl> -9965.8, 7218.2, -6172.1, 20293.9, 10654.2, 9872.2, 274.5, 6…
## $ lbs_lost  <dbl> -2.14369251, 1.15335885, 2.36789011, 0.56314651, 10.08834935…
## $ bmi       <dbl> 35.43, 41.92, 38.42, 38.24, 39.42, 36.88, 48.58, 36.45, 40.3…
## $ sex       <fct> male, female, female, female, male, male, female, male, male…
## $ fv5       <fct> yes, yes, no, no, yes, no, yes, no, no, yes, no, no, no, no,…

Let’s obtain some additional descriptive statistics with skim().

wtloss %>% skim()
Data summary
Name Piped data
Number of rows 400
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
condition 0 1 FALSE 4 con: 100, mob: 100, CBT: 100, CBT: 100
sex 0 1 FALSE 2 fem: 203, mal: 197
fv5 0 1 FALSE 2 yes: 220, no: 180

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pid 0 1 200.50 115.61 1.00 100.75 200.50 300.25 400.00 ▇▇▇▇▇
caldef 0 1 17418.56 25779.13 -49516.10 -157.22 12761.50 31364.38 101476.30 ▁▇▆▃▁
lbs_lost 0 1 4.89 9.42 -19.01 -1.58 4.57 10.11 32.43 ▁▆▇▃▁
bmi 0 1 38.93 3.85 26.06 36.18 39.14 41.60 48.58 ▁▃▇▇▂

Study these descriptive statistics a bit. Notice that the average caloric deficit is 17,419 kilocalories and the average pounds lost is just shy of 5 pounds. Note the minimum and maximum values for these two key variables.
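
If you want to pull those minimum and maximum values out directly, rather than reading them off the p0 and p100 columns of the skim output, a quick summarize() call (an optional check, not part of the original analysis) will do it:

wtloss %>% 
  summarize(min_caldef = min(caldef), max_caldef = max(caldef), 
            min_lbs_lost = min(lbs_lost), max_lbs_lost = max(lbs_lost))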

Kudos to your team for the execution of an excellent study!

As diligent scientists, you and your team carefully defined the population (all overweight adults in the identified North Denver neighborhoods interested in losing weight), then drew a random sample from that population. This process of sampling is a critically important aspect of scientific research. We hope that the single random sample of individuals that we draw and then engage in our research study is representative of the population, and that estimating our statistical models with this sample will ultimately inform us about how our variables are related to one another in the population. The figure below contrasts descriptive statistics and inferential statistics, and outlines the research flow that allows us to generalize findings from our sample to the population.

Figure produced by Carnegie Mellon University, Open Learning Initiative


In this module, we will study the process of sampling so that we can maximize our ability to draw inferences about the population based on our sample. To begin, we’ll work with a subset of the data just described – just those individuals in the CBT + mobile groceries condition. We’ll return to the full dataframe in future modules.

Let’s look at the descriptive statistics for pounds lost (lbs_lost) for this subset of individuals.

wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  select(lbs_lost) %>% 
  skim()
Data summary
Name Piped data
Number of rows 100
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost 0 1 11.32 9.89 -14.36 4.51 11.07 19.3 32.43 ▂▅▇▇▂

With this subset of data, we aim to determine the average pounds lost among people who received the CBT + mobile groceries intervention. We’ll use this single sample of 100 individuals to explore our research question, but in actuality it is the population of individuals that we really want to know about. That is, we want to determine if significant weight loss will occur if we deliver our intervention to all of the people in the population.

The true value of interest in the population is called a population parameter. In this example, the population parameter of interest is the average pounds lost under the CBT + mobile groceries weight loss program. This value is generally unknown. Using statistics, we estimate the population parameter from our random sample — the estimate that we garner from our sample is called a parameter estimate (sometimes also referred to as a sample statistic or a point estimate). If we carefully conduct our study, then this sample estimate is likely to be a good estimate of the unknown population parameter.
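
For example, the sample mean computed below is our parameter estimate (point estimate) of the unknown population mean pounds lost under the CBT + mobile groceries program; it simply extracts the same mean (11.32 pounds) reported in the skim output above.

# the point estimate of the population mean pounds lost for this condition
wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  summarize(mean_lbs_lost = mean(lbs_lost))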

Please watch the video below to learn about how studying the sample that you draw will help you to learn about the population of interest.



In this module, the key learning objective is to recognize that uncovering the parameter estimate (e.g., the mean pounds lost) alone isn’t enough. In addition, we must determine the precision of the estimate. Since we only draw one random sample from the population, we need to know how well our sample estimate does in telling us about what we’re really after – that is, the pounds lost in the population. In other words, we need to gain a sense of how much uncertainty exists in using our sample estimate to infer something about our population.


Simulate a population

To gain intuition about sampling, let’s imagine that you had unlimited resources and could deliver the CBT + mobile groceries intervention to all 10,000 adults who expressed interest in your study. In this hypothetical scenario, we would have data on every person in the population.

To study this process, let’s simulate a population. We will generate a population of size 10,000 using the rnorm() function — this generates random values from a normal distribution with a given mean and standard deviation. We will generate data in which the average pounds lost is 11 and the standard deviation is 10.

set.seed(8642) # setting the seed ensures the same result each time the simulation is run
p_lbslost <- rnorm(n = 10000, mean = 11, sd = 10) # generate a population of 10,000 adults

my_pop <- as_tibble(p_lbslost) %>% 
  rename(lbs_lost = value)

head(my_pop)
lbs_lost
2.964542
17.384819
9.582131
32.542073
9.779112
3.667771

Now that we’ve simulated pounds lost for every person in the hypothetical population, we can use a histogram to look at the distribution of pounds lost across all people.

ggplot(my_pop, aes(x = lbs_lost)) +
  geom_histogram(binwidth = 1) +
  theme_bw() +
  labs(title = "Pounds lost following a 3-month weight loss program", 
       subtitle = "simulation of a population", 
       x = "pounds lost by the participant")

Just as we specified, the center of the distribution is about 11 pounds. That is, the typical participant lost around 11 pounds during the three-month program. We see a wide spread of scores for pounds lost, ranging from about 28 pounds gained over the three-month period to a little over 50 pounds lost. Study the range and distribution presented in this histogram for a bit.
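
If you would like to verify those endpoints numerically rather than eyeballing the histogram, a quick summarize() call (an optional check) does it:

# optional check of the most extreme values in the simulated population
my_pop %>% 
  summarize(most_gained = min(lbs_lost),   # largest weight gain (most negative value)
            most_lost = max(lbs_lost))     # largest weight loss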


Simulate sampling from the population

Now that we have a hypothetical population, we can simulate the process of drawing a sample from that population. Let’s begin by drawing a single sample of size 100. The rep_sample_n() function from the moderndive package draws a random sample of the size you specify (e.g., 100 people) from the dataframe.

sample1 <- my_pop %>% 
  rep_sample_n(size = 100)

sample1 %>% skim()
Data summary
Name Piped data
Number of rows 100
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables replicate

Variable type: numeric

skim_variable replicate n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost 1 0 1 11.65 10.28 -11.11 5.4 11.85 17.42 54.37 ▂▇▃▁▁

In this single sample of size 100, the mean pounds lost is 11.6 pounds. What is the range?
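
One way to answer that question is to compute the minimum and maximum directly (an optional check; the p0 and p100 columns of the skim output above contain the same information):

# optional check of the spread in this single sample
sample1 %>% 
  summarize(min_lbs_lost = min(lbs_lost), 
            max_lbs_lost = max(lbs_lost), 
            range = max(lbs_lost) - min(lbs_lost))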

Now, instead of one random sample, let’s see what happens if we draw 25 random samples of size 100 from our population. By adding the argument “reps = 25”, we tell the function to repeat the process of drawing a sample of 100 people, 25 times.

samples_25 <- my_pop %>% 
  rep_sample_n(size = 100, reps = 25)

samples_25 %>% glimpse()
## Rows: 2,500
## Columns: 2
## Groups: replicate [25]
## $ replicate <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lbs_lost  <dbl> 0.44012332, 29.75224970, 2.59261824, 20.87243332, 19.6944384…

In the resulting dataframe (samples_25), there are two variables: replicate, which indicates which sample each individual belongs to, and lbs_lost, our outcome variable of interest. Using this dataframe, we can calculate the average pounds lost in each of the 25 samples.

samples_25 %>% 
  group_by(replicate) %>% 
  skim()
Data summary
Name Piped data
Number of rows 2500
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables replicate

Variable type: numeric

skim_variable replicate n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost 1 0 1 11.38 11.18 -17.94 3.60 11.65 18.82 35.49 ▁▅▇▆▂
lbs_lost 2 0 1 11.37 10.18 -16.00 5.05 10.88 17.93 41.64 ▁▅▇▃▁
lbs_lost 3 0 1 8.91 10.58 -17.94 1.74 7.96 15.95 37.27 ▁▇▇▅▂
lbs_lost 4 0 1 10.99 9.38 -13.42 5.86 10.21 17.45 37.92 ▂▅▇▅▁
lbs_lost 5 0 1 9.96 10.21 -15.96 2.74 10.32 17.77 33.95 ▁▆▇▆▂
lbs_lost 6 0 1 10.16 10.52 -23.81 2.22 10.85 17.72 31.06 ▁▃▇▇▅
lbs_lost 7 0 1 7.91 10.09 -23.77 1.51 7.92 14.81 30.34 ▁▂▇▇▂
lbs_lost 8 0 1 9.07 9.81 -13.73 2.36 8.33 16.53 30.93 ▂▆▇▇▃
lbs_lost 9 0 1 10.34 10.27 -14.99 4.01 10.32 17.17 34.82 ▂▅▇▅▂
lbs_lost 10 0 1 11.96 10.10 -13.42 6.14 13.07 18.41 33.81 ▂▃▇▇▂
lbs_lost 11 0 1 12.86 11.11 -14.43 6.24 12.51 18.15 47.35 ▂▅▇▂▁
lbs_lost 12 0 1 11.44 11.95 -15.58 3.37 10.35 18.72 42.14 ▂▆▇▃▂
lbs_lost 13 0 1 12.69 11.08 -12.26 5.91 12.22 20.03 37.92 ▂▅▇▆▂
lbs_lost 14 0 1 10.53 10.25 -10.25 3.56 10.57 17.39 37.18 ▃▅▇▃▁
lbs_lost 15 0 1 11.53 10.51 -15.59 4.12 11.15 17.63 31.83 ▁▃▇▅▃
lbs_lost 16 0 1 10.99 9.39 -10.72 4.56 11.81 17.73 30.24 ▂▆▆▇▃
lbs_lost 17 0 1 10.75 9.32 -24.07 6.02 11.63 16.30 36.45 ▁▂▇▇▁
lbs_lost 18 0 1 10.73 10.33 -16.90 4.82 10.23 16.15 37.18 ▁▃▇▃▁
lbs_lost 19 0 1 11.00 10.25 -12.79 2.82 11.93 17.38 35.18 ▂▆▇▅▂
lbs_lost 20 0 1 10.07 9.24 -13.50 3.62 9.15 15.72 37.12 ▁▆▇▃▁
lbs_lost 21 0 1 9.77 9.18 -15.81 3.14 11.26 16.02 28.94 ▁▃▅▇▂
lbs_lost 22 0 1 11.71 9.61 -7.00 3.86 11.03 17.80 45.85 ▅▇▇▂▁
lbs_lost 23 0 1 11.35 9.84 -9.47 5.83 11.37 18.39 34.12 ▃▇▇▆▂
lbs_lost 24 0 1 10.96 9.76 -12.28 3.78 11.21 17.54 33.27 ▂▇▇▇▂
lbs_lost 25 0 1 11.88 12.22 -16.90 5.55 10.98 18.18 44.54 ▂▆▇▃▁

The table above includes one row of information for each of the 25 random samples that we just drew. You can see that in each random sample, the average pounds lost (listed under mean in the table above) is a bit different. For example, in replicate 1, the average pounds lost among the 100 participants was 11.4 pounds. In replicate 20, the average pounds lost among the 100 participants was 10.1 pounds.

Now, let’s go really big and perform the multiple sampling again, but this time we’ll draw 1000 random samples of size 100.

samples_1000 <- my_pop %>% 
  rep_sample_n(size = 100, reps = 1000)

samples_1000 %>% glimpse()
## Rows: 100,000
## Columns: 2
## Groups: replicate [1,000]
## $ replicate <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lbs_lost  <dbl> 0.7822595, -14.3592224, 16.6741451, 18.3760395, 7.4836846, 2…

Rather than printing out the mean pounds lost in all 1000 samples, let’s plot them in a histogram.

mean_samples_1000 <- samples_1000 %>% 
  group_by(replicate) %>% 
  summarize(mean.lbs_lost = mean(lbs_lost)) 

mean_samples_1000 %>% 
  ggplot(aes(x = mean.lbs_lost)) +
  geom_histogram(binwidth = .1) +
  theme_bw() +
  labs(title = "Average pounds lost across 1000 random samples", 
       x = "average pounds lost in the sample")

This is a histogram of the mean pounds lost in each of the 1000 random samples. Note that the cases in this dataset aren’t people, but rather our 1000 random samples (i.e., our 1000 replicates). This histogram shows us that in our simulation, the average pounds lost in most of the samples was around 11. But there are a few samples with quite high averages (near 15 pounds lost) and a few with substantially lower averages (close to only 7 pounds lost).

This histogram of the mean pounds lost across 1000 random samples illustrates the sampling distribution for average pounds lost. The sampling distribution is the distribution of a statistic of interest across all possible random samples of a given size. It describes how the statistic differs across the many, many random samples that could have been drawn from the population.

The standard deviation of the sampling distribution is called the standard error of the statistic of interest, in this case the standard error of the mean pounds lost. See the sd for mean.lbs_lost in the skim output below. This standard error (i.e., the standard deviation of the sampling distribution) shows the degree of spread across the random samples and indicates the precision of our estimated statistic. A larger standard error relative to the mean indicates less precision (more variability across the random samples); a smaller standard error relative to the mean indicates more precision (less variability across the random samples).

mean_samples_1000 %>% skim()
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
replicate 0 1 500.50 288.82 1.00 250.75 500.5 750.25 1000.00 ▇▇▇▇▇
mean.lbs_lost 0 1 10.88 1.05 7.38 10.17 10.9 11.56 14.78 ▁▅▇▂▁
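
As a check on the simulation, the standard error of a sample mean can also be computed analytically as the population standard deviation divided by the square root of the sample size. With the population standard deviation of 10 that we used to simulate the data and samples of size 100, that works out to 1, closely matching the sd of 1.05 observed across our 1000 sample means.

# analytic standard error of the mean: population sd / sqrt(sample size)
10 / sqrt(100)
## [1] 1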

Please play the video below for commentary on the sampling distribution.


Sample size and precision

Let’s explore the concept of precision in statistical inference a bit more. The degree of spread across the random samples in our sampling distribution indicates the precision of our estimated statistic. In the last section we defined the standard error as the standard deviation of the sampling distribution. A larger standard error relative to the mean indicates less precision (more variability across samples); a smaller standard error relative to the mean indicates more precision (less variability across samples).

In your weight loss study, you and your colleagues decided to study 100 people in each condition (including the CBT + mobile groceries condition that we are working with now), but we might be interested in knowing how the sampling distribution would differ if we chose a smaller or a larger sample size. Let’s test three different sample sizes (25 people, 50 people, and 500 people) to see how the sampling distribution changes. The code below takes the same process we just studied, but applies it to these three new sample sizes.

# Segment 1: sample size = 25 ------------------------------
# 1.a) Draw 1000 samples
virtual_samples_size25 <- my_pop %>% 
  rep_sample_n(size = 25, reps = 1000)

# 1.b) Compute mean in each of the 1000 samples
means_across_samples_size25 <- virtual_samples_size25 %>% 
  group_by(replicate) %>% 
  summarize(mean.lbs_lost = mean(lbs_lost)) %>% 
  mutate(sample_size = "sample size = 25")


# Segment 2: sample size = 50 ------------------------------
# 2.a) Draw 1000 samples
virtual_samples_size50 <- my_pop %>% 
  rep_sample_n(size = 50, reps = 1000)

# 2.b) Compute mean in each of the 1000 samples
means_across_samples_size50 <- virtual_samples_size50 %>% 
  group_by(replicate) %>% 
  summarize(mean.lbs_lost = mean(lbs_lost)) %>% 
  mutate(sample_size = "sample size = 50")


# Segment 3: sample size = 500 ------------------------------
# 3.a) Draw 1000 samples
virtual_samples_size500 <- my_pop %>% 
  rep_sample_n(size = 500, reps = 1000)

# 3.b) Compute mean in each of the 1000 samples
means_across_samples_size500 <- virtual_samples_size500 %>% 
  group_by(replicate) %>% 
  summarize(mean.lbs_lost = mean(lbs_lost)) %>% 
  mutate(sample_size = "sample size = 500")

Once our samples of varying size are created, we can plot them to build our intuition about the role that sample size plays in determining the precision of our estimates.

First, let’s take a look at the mean and standard deviation (sd) of the sampling distribution for each sample size. We can estimate these with the skim() function, see the output below.

all_means <- bind_rows(means_across_samples_size25, 
                       means_across_samples_size50, 
                       means_across_samples_size500)

all_means %>% group_by(sample_size) %>% select(mean.lbs_lost) %>% skim()
Data summary
Name Piped data
Number of rows 3000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sample_size

Variable type: numeric

skim_variable sample_size n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mean.lbs_lost sample size = 25 0 1 10.83 2.06 4.95 9.44 10.89 12.27 18.89 ▁▇▇▂▁
mean.lbs_lost sample size = 50 0 1 10.93 1.43 7.09 9.92 10.94 11.94 15.63 ▂▆▇▃▁
mean.lbs_lost sample size = 500 0 1 10.93 0.45 9.42 10.63 10.92 11.23 12.62 ▁▅▇▂▁

Notice that the means are very similar (see the column labeled mean in the table above): across all three sample sizes, the average pounds lost is close to 11 pounds. However, the standard deviations differ quite a lot (see the column labeled sd in the table above). As sample size increases, the standard deviation of the sampling distribution decreases. That is, when we have only 25 people in the sample, the standard deviation (which, recall from earlier, is also called the standard error) is quite large, but when we have 500 people in the sample, the standard deviation is quite small. This means that we can estimate our parameters with more precision as the sample size increases.
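
These simulated standard deviations line up closely with the analytic standard error of the mean (population standard deviation divided by the square root of the sample size), shown below for the three sample sizes we tested.

# analytic standard errors for samples of size 25, 50, and 500 (population sd = 10)
10 / sqrt(c(25, 50, 500))
## [1] 2.0000000 1.4142136 0.4472136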

Now, let’s plot the simulated data to visualize the phenomenon.

ggplot(all_means, aes(x = mean.lbs_lost)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = 11, lwd = 1, color = "red") +
  labs(x = "Average pounds lost across 1000 samples", 
       title = "Sampling distribution as a function of sample size",
       subtitle = "red line denotes population mean") +
  facet_wrap(~sample_size, ncol = 1)

Notice that the mean of the sampling distribution is about the same, regardless of sample size. All three produce a mean that is very close to the population mean (11 pounds lost) – and thus, all three are highly accurate in reproducing the population mean.

However, the standard error (i.e., the standard deviation of the sampling distribution) differs a lot as we compare the three different sample sizes. When the sample size is very small (i.e., 25), the mean pounds lost varies a great deal across the many random samples. When the sample size is large (i.e., 500), the mean pounds lost varies much less across samples. This demonstrates that as our sample size increases, we can more precisely estimate the true population mean from a sample.

All of the ideas that we have studied in this module are well illustrated in the following figure from your textbook – Modern Dive. When we draw a random sample from a population and aim to use that sample to make inferences about the population, we must be cognizant of two critical dimensions of inferential statistics – accuracy and precision. Accuracy is achieved by carefully drawing a random sample so that the sample statistic (i.e., the mean pounds lost) is likely to be close to the population mean. Precision is achieved by having an adequately large sample size so as to minimize variability in the sample statistic across random samples.

Artwork from Modern Dive



The Central Limit Theorem

Let’s close out this module with one final concept that will be key to our work in the next couple modules. The Central Limit Theorem (CLT) is a fundamental principle of statistics. It states that if you have a population parameter of interest, and take sufficiently large random samples from the population with replacement, then the distribution of the parameter estimates across the random samples will be approximately normally distributed. This theorem sums up everything we explored in this module. The CLT is the basis for our ability to make inferences about a population parameter based on our sample statistics. We’ll rely on this theorem to begin making inferences in the next few modules.