PSY350, Module 11

Conducting Hypothesis Tests

Module road map

Learning objectives

  1. Conduct and interpret a one sample t-test
  2. Conduct and interpret a t-test for a linear regression coefficient
  3. Conduct and interpret a t-test for comparing two means
  4. Conduct and interpret an F-test for a one-way analysis of variance

Readings

  1. Read Chapter 10 of your Textbook
  2. Read Appendix B of your Textbook

How do we know if a relationship is significant?

Now that we have all of the key ingredients for conducting hypothesis tests, let’s walk through hypothesis testing for several common tests – including:

  1. a one-sample t-test to compare a single mean to a pre-specified value
  2. a t-test for a linear regression coefficient
  3. a t-test for comparing two means
  4. an F-test for an ANOVA to compare multiple means
  5. an F-test for \(R^2\) in an MLR with a grouping variable

Once you have the intuition for conducting hypothesis tests, you will be able to apply the general principles to any other type of test.

Introduction to the data

Data scenario: We will continue to use the weight loss study introduced in Module 8.

In the dataframe called wtloss.Rds, the following variables are recorded:

  • pid is the participant ID

  • condition is the condition the participant was randomly assigned to receive (control, CBT, mobile groceries, CBT + mobile groceries)

  • bmi is the participant’s body mass index at the start of the study

  • caldef is the sum of caloric deficit (in kilocalories) across all three months of the intervention. Caloric deficit is the difference between calories consumed and calories expended (positive number = more calories burned than consumed, negative number = more calories consumed than burned).

  • lbs_lost is the total pounds lost during the program (positive number = weight loss, negative number = weight gain)

  • fv5 is a binary indicator that is coded “yes” if the individual ate at least 5 fruits and vegetables per day on at least four days during the last week of the intervention, and is coded “no” if the individual did not meet this criterion.

Let’s load the needed packages for this module, and import the data.

library(skimr)
library(moderndive)
library(infer)
library(janitor)
library(performance)
library(see)
library(here)
library(tidyverse)
library(magrittr)  # provides the %$% operator used later in this module
wtloss <- readRDS(here("data", "wtloss.Rds")) 

Conduct a one sample t-test

As our first hypothesis test, let’s determine if the CBT + mobile groceries intervention resulted in significant weight change. Specifically, we will determine if the average weight loss as a result of the program is significantly different from 0 (no significant weight change after participating in the program). This is called a one-sample t-test because we are simply comparing the mean in our sample (which is about 11 pounds lost) to a specified value (0 pounds lost) to determine whether, given what we observed in our sample, the intervention would likely lead to significant weight change in the population.

Let’s subset the data to include only people in the CBT + mobile groceries condition.

CBT_plus <- wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  select(pid, lbs_lost) 

To conduct the one sample t-test, let us first define the population parameter of interest. We are interested in the average weight change in the population. A population mean is written as μ (mu), and we want to determine if this average weight change is significantly different from 0.

  • Our null hypothesis is: \(H_0:\small \mu = 0\) (the population mean is equal to 0)

  • Our alternative hypothesis: \(H_a:\small \mu \neq 0\) (the population mean is not equal to 0)

In Module 10 you learned about the gist of hypothesis testing, using this exact example as a case study. Now, let’s dig into the mechanics of conducting the test using our observed sample data. Recall that in Module 8 we discussed a theory-based approach to calculating the confidence interval around a mean. We learned that this theory-based approach is justified if assumptions of the test are met. There are also theory-based approaches to conducting hypothesis tests for all sorts of statistics, and we’ll learn about those for each of the types of models we have studied so far in this course. At the end of this module, we’ll also review ways that we can study whether assumptions are likely met, and what to do if they are not.

Steps needed prior to conducting the test

The first step in conducting the test is to determine under what conditions we will reject the null hypothesis based on our test. We need two pieces of information to complete this task. The first is the alpha for the test. Recall that alpha is the probability of making a Type I error, that is, rejecting the null hypothesis when in actuality it is true. The second is the degrees of freedom (df) for our test.

What are degrees of freedom?
Imagine we have a variable measured for 5 people randomly selected from the population — for example, the weight loss of 5 people in our study. Before estimating anything, each person’s weight loss is free to take on any value. Now, imagine we want to use these five people to estimate the mean weight loss in the population. Estimating the mean imposes a constraint: once the mean is estimated, 4 of the 5 values are free to vary, but one is not. That is, if we know the mean of the five people, and the pounds lost for 4 of the people, then the value of the 5th case is fully determined and, therefore, not free to vary. The degrees of freedom provide a heuristic for quantifying how much unique information we have available for any test we conduct.

With each test that we conduct, there will be a formula for calculating the degrees of freedom. When estimating the population mean, the formula for the df is simply n - 1, where n is the sample size. In our example, n = 100, so if we desire to use our sample to estimate the mean in the population, then we have 99 df left.
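To make this concrete, here is a small illustration with made-up numbers: once the mean of five values is fixed, the fifth value is fully determined by the other four.

# four known values, and a known mean of all five (hypothetical numbers)
x4 <- c(8, 12, 15, 9)
mean_all <- 11

# the fifth value is no longer free to vary
5 * mean_all - sum(x4)
## [1] 11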

Please watch the following video from Crash Course: Statistics on Degrees of Freedom



Once we set alpha and calculate the df, we can obtain the critical value of t and establish the null hypothesis rejection criteria. To do this we will use the Student’s t-distribution. So far, we have been working with a standard normal distribution (also referred to as a z-distribution) as our probability distribution function for calculating a CI. For the standard normal distribution, the percentile scores of .025 and .975 correspond to z-scores of -1.96 and +1.96, respectively. (If this is unclear, please review Module 4.) For example, in Module 9, we multiplied the standard error by -1.96 (for the lower bound) and +1.96 (for the upper bound) to construct a 95% confidence interval for the mean pounds lost in the population. This is appropriate if the sample is large or the population standard deviation is known. When these conditions do not hold, we should use the Student’s t-distribution. The Student’s t-distribution is also symmetric and has a bell-shaped curve, but rather than a single curve, it’s a family of curves based on the corresponding df for the statistic under consideration. As the df increases, the Student’s t-distribution approaches the standard normal distribution (labeled z in the figure below).

When working with the Student’s t-distribution rather than the standard normal distribution (i.e., z-distribution), we use the pt() function (instead of pnorm()) and the qt() function (instead of qnorm()) to obtain probabilities under the curve and critical values of t in R.
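To see this convergence numerically, we can compare critical values of t at increasing df with the corresponding critical value of z:

# critical values of t for a two-tailed test with alpha = .05, at increasing df
qt(p = .975, df = c(5, 30, 100, 1000))
## [1] 2.570582 2.042272 1.983972 1.962339

# the corresponding critical value of z
qnorm(p = .975)
## [1] 1.959964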

The critical value of t is the t-score that is associated with the desired percentile score of the Student’s t-distribution based on alpha. An alpha of .05, non-directional, corresponds to the percentile scores of .025 and .975 of the Student’s t-distribution.

To calculate critical values of t for our test, we use the qt() function. Input the percentile scores for your chosen alpha (e.g., .025 and .975) and the corresponding df. This returns two values that define the edges of the rejection region, -1.984 and +1.984. If our calculated test statistic is less than -1.984 or greater than +1.984, then we will reject the null hypothesis. This rejection range is depicted in the figure below.

qt(p = c(.025, .975), df = 99)
## [1] -1.984217  1.984217

The graph below presents the same information in a slightly different way. The pink areas of this distribution clearly show the rejection region. If our calculated test statistic falls in the pink region (i.e., the rejection region), we will reject the null hypothesis.

Steps to conduct the test

Now, we’re ready to conduct the test, that is, calculate the test statistic. Below is the formula for a one sample t-test. The null hypothesis value is subtracted from the sample mean, and then this quantity is divided by the estimated standard error of the mean. This provides the number of standard error units the sample mean is away from the null hypothesis value – i.e., our test statistic.

\[{t} = \frac{\bar X - 0}{se} = \frac{11.32}{.99} = 11.45\]

Recall that we can calculate the standard error of the mean with the following code:

CBT_plus %>%   
  summarize(mean = mean(lbs_lost),
            sd = sd(lbs_lost),
            sample_size = n(),
            se = sd/sqrt(sample_size)) 
      mean       sd sample_size        se
  11.32439 9.890134         100 0.9890134
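Plugging the values from this summary into the formula above, we can compute the test statistic by hand:

# one sample t-test statistic computed by hand
(11.32439 - 0) / 0.9890134
## [1] 11.45019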

R will conduct the t-test (and calculate the test statistic) for us using the t.test() function. Notice the strange symbol between CBT_plus and t.test in the code chunk below. Some base R functions, including t.test(), don’t have built-in data arguments, and therefore will not work with the pipe operator. The %$% operator (the exposition pipe from the magrittr package) is a substitute for the pipe operator in these cases; it exposes the variables in CBT_plus so that t.test() can find them. The arguments to t.test() are the variable that you’re testing (that’s lbs_lost in our example), and the null hypothesis value (mu = 0, meaning the null hypothesis asserts that the population mean = 0).

CBT_plus %$% t.test(lbs_lost, mu = 0)
## 
##  One Sample t-test
## 
## data:  lbs_lost
## t = 11.45, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.361974 13.286808
## sample estimates:
## mean of x 
##  11.32439

The estimated test statistic for our one sample t-test is 11.45 (see t = 11.45 in the output). This indicates that our sample mean is 11.45 standard errors away from the null hypothesis value (0). This clearly is in the null hypothesis rejection range (i.e., > 1.984 or < -1.984); therefore, we can reject the null hypothesis.

The p-value is essentially 0. This means that the probability of obtaining a mean that is this many standard errors away from 0 (the value of our null hypothesis), or farther, if the null hypothesis is true, is essentially 0. This suggests that the null hypothesis should be rejected. Recall that when the p-value is less than alpha, we reject the null hypothesis. Note that when the test statistic is in the rejection region, the p-value will be less than alpha.

Taken all together, we can feel confident that if the CBT + mobile groceries intervention were delivered to the population, significant weight loss would occur.

If we like, we can use the pt() function (akin to the pnorm() function that we discovered in Module 4) to calculate the p-value ourselves. Input the absolute value of the test statistic and the df of the test. Use lower.tail = FALSE to denote that we’re interested in the upper tail probability. Notice that because our hypothesis is non-directional (i.e., two-sided, since we allow for the possibility that the intervention could cause weight loss or weight gain), we multiply the calculated tail probability by 2 to find the sum of the upper and lower tail probabilities. This is the same method used by the t.test() function; the result differs just slightly due to rounding.

pt(q = abs(11.45), df = 99, lower.tail = FALSE) * 2
## [1] 7.77727e-20

The plot below illustrates our one sample t-test. Recall that in Module 10 we pretended that we had access to everyone in the population and unlimited resources, and could therefore draw 5000 random samples of size 100 to conduct our study. The sampling distribution from that exercise is presented below. The difference is that the x-axis is expressed in standard error units rather than pounds. Only 5% of samples should produce a mean that is more than 1.984 standard errors from the null hypothesis value (0) if the null hypothesis is true. Since our sample produced a mean that is more than 1.984 standard errors from the null hypothesis value, we reject the null hypothesis. The orange line is placed at 11.45 standard errors from the null hypothesis value – the value of t calculated from our t-test.

Let’s also look at an alternative graph that includes the rejection region in pink (see the graph below). We can see here that our calculated test statistic of 11.45 is not even on the graph – that is, it is so large and so far into the rejection region that I have to place it far to the right of the graph.

The t.test output also provides the 95% CI: 9.36, 13.29. This provides a range of plausible values for average weight loss if delivered to the entire population. This is calculated as the estimate for the mean plus the critical values of t times the standard error. We use the same critical values of t used to calculate the rejection region for the hypotheses. For example, the 95% confidence interval is calculated as follows: \(11.32 \pm 1.984\times.99\).
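We can verify this confidence interval ourselves:

# 95% CI for the mean, computed by hand
11.32439 + qt(p = c(.025, .975), df = 99) * 0.9890134
## [1]  9.361974 13.286808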

Please click the video below for a recap of inference for a one-sample t-test.


Conduct a hypothesis test for a linear regression coefficient

We can apply these same principles to conduct hypothesis tests for the parameter estimates in a linear regression model. Before diving into our example, you may want to review the following Crash Course: Statistics video on Regression.


Continuous predictor

We’ll start by examining the relationship between caloric deficit, a continuous predictor, and pounds lost in the control condition. Let’s create a dataframe to use in this part of the module. I’m going to create a new version of the predictor called caldef1000 that divides caldef by 1000. This is because in our regression model, the slope will capture the expected change in pounds lost for a one-unit increase in caloric deficit. A one-unit (i.e., one-calorie) increase is very small; therefore, interpreting the expected change in pounds lost for a 1000-unit increase (rather than a 1-unit increase) in caloric deficit is more useful. Since this is a linear transformation, the model fits with either version of caldef are identical; it’s simply the interpretation that changes. That is, with this new version, a one-unit increase corresponds to a 1000-calorie increase.

cntl <- wtloss %>% 
  filter(condition == "control") %>% 
  mutate(caldef1000 = caldef/1000)

Let’s begin by gaining some intuition about inferential methods for a simple linear regression. In the preceding example for a one sample t-test, we were interested in estimating the population mean using our sample. Likewise, when fitting a linear regression model, we are interested in estimating the population parameters of the linear regression model (i.e., the intercept and slope(s)) using our sample.

Let’s begin by imagining a make-believe world in which we have access to the entire population and we know that the true regression equation for predicting pounds lost from caloric deficit in the control condition is:

\[\hat{y} = 0.0+.33x\]

This equation asserts that if a caloric deficit of 0 was attained (i.e., a net balance in calories consumed and expended), the predicted pounds lost would be 0. And, that for each one-unit increase in caldef1000, lbs_lost increases by .33 pounds.

This graph depicts our simulated population.

From this population, we can simulate the sampling distribution by drawing 5,000 random samples. Let us do this, and examine the results in the next few graphs.

To begin, take a look at the scatterplot of caloric deficit and pounds lost in the first 32 samples. In every sample there is a positive linear relationship, but the individual points, as well as the precise placement of the intercept and the slope of the best fit line, varies a bit from sample-to-sample.


We can fit the SLR within each of these samples to obtain the parameter estimates (intercept and slope) and then plot the best fit line from each sample on a single line plot.

Now, let’s plot the best fit line for all 5,000 samples, and overlay the population equation over the top (see the blue line in the graph below). Notice that the best fit lines produced by the samples vary symmetrically about the line defined by the equation in the population. That is, each of the samples gives a best fit line that is close to, but not exactly the same as, the population equation. We see variability both in the intercept (predicted pounds lost when caloric deficit is 0) and the slope (the increase in pounds lost for a one-unit increase in caloric deficit).

These plots represent the sampling distribution for the intercept and slope respectively. Recall that at the center of the sampling distribution is the true population parameter, and in accordance with the Central Limit Theorem, the estimates across the random samples vary about the true population parameter. The sampling distributions for the intercept and slope of our regression line are normally distributed. Therefore, in accordance with the empirical rule, we can expect that 95% of the random samples will produce an estimate that is within about 1.96 standard errors of the population parameter for both the intercept and the slope.



In the same way that we determined whether the average pounds lost was significantly different from zero using a t-test as the test statistic (i.e., in the prior section of this module), we can use a t-test to test if the intercept and slope are each significantly different from zero.

For the intercept, the null and alternative hypotheses are:

  • \(H_0: \beta_0 = 0\) (population intercept equals 0)

  • \(H_a: \beta_0\ne0\) (population intercept is not equal to 0)

For the slope, the null and alternative hypotheses are:

  • \(H_0: \beta_1 = 0\) (population slope equals 0)

  • \(H_a: \beta_1\ne0\) (population slope is not equal to 0)

With the null hypothesis for the intercept, we inquire whether 0 is a plausible value for the intercept in the population. This is a very similar question to our one sample t-test, but here, rather than a simple mean, we’re testing a conditional mean (the mean when caloric deficit equals 0). In other words, we’re testing whether the expected pounds lost when caloric deficit equals 0 (the intercept of the SLR) differs significantly from 0. The t-test for the intercept is written as follows:

\[t = \frac{b_{0} - 0}{se}\]

With the null hypothesis for the slope, we inquire if a slope of 0 (a flat line through the mean of y) is plausible. A slope of 0 would mean that caloric deficit tells us nothing about pounds lost, in other words, that caloric deficit is unrelated to pounds lost. The t-test for the slope is written as follows:

\[t = \frac{b_{1} - 0}{se}\]

Each of these t-tests is compared to the critical values of t used to define the rejection region. The formula for finding the df for the intercept and slopes in a regression model is n - 1 - # of predictors, where n is the sample size. We lose one df for estimating the intercept and an additional df for each estimated slope. In our example, we have just one slope (the effect of caldef1000); therefore our df are 100 - 1 - 1 = 98. Thus, for a two-tailed test and an alpha of .05, the critical values defining the rejection region for 98 df are -1.984 and 1.984. If our test statistic is outside this range, that is, if our calculated test statistic is in the rejection region, we will reject the null hypothesis.

qt(p = c(.025, .975), df = 98)
## [1] -1.984467  1.984467

Let’s fit our simple linear regression model in the observed control sample.

cntl_lm1 <- cntl %>% 
  lm(lbs_lost ~ caldef1000, data = .)

cntl_lm1 %>% get_regression_table()
term       estimate std_error statistic p_value lower_ci upper_ci
intercept    -0.476     0.555    -0.857   0.393   -1.577    0.625
caldef1000    0.297     0.060     4.957   0.000    0.178    0.416

\[\hat{y} = b_{0}+b_{1}x\]

\[\hat{y} = -.476 + .297x\]

Interpretation of the intercept

The intercept for our regression model represents the predicted pounds lost among people with a caloric deficit of 0. It is estimated to be -.476. If an individual has a net caloric balance (neither a deficit nor a surplus), then we expect them to gain about 1/2 pound during the 3-month study.

Interpretation of the slope

The slope for our regression model represents the predicted change in pounds lost for a one-unit increase in caloric deficit. It is estimated to be .297. For each additional 1000 calories burned, we expect the individual will lose close to 1/3 of a pound.

The test statistic from the t-tests (labeled statistic in the output in the table above) is calculated by dividing the estimate by the standard error (std_error). For the intercept, that is -.476/.555 = -.857. For the slope that is .297/.060 = 4.957. The test statistic is in the rejection region for the slope, but not the intercept. This is illustrated in the graphs below, where the green lines denote the value of the test statistic.
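We can reproduce these test statistics by hand; small differences from the table (e.g., 4.95 vs. 4.957) are due to rounding of the reported estimates and standard errors.

# t-test statistic for the intercept
-0.476 / 0.555
## [1] -0.8576577

# t-test statistic for the slope
0.297 / 0.060
## [1] 4.95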

Rejection region with test statistic for intercept

Rejection region with test statistic for slope

For the intercept, this means that the pounds lost among people who have a 0 caloric deficit is not significantly different from 0 (i.e., 0 is a plausible value). For the slope, this means that the increase in pounds lost (i.e., change in y) for a one-unit increase in the caloric deficit is significantly different from 0 (i.e., 0 is not a plausible value).

The test statistic indicates how many standard errors away the regression coefficient is from 0 and the p-value shows the probability that we would obtain a test statistic of this magnitude or larger if the null hypothesis were true. When the p-value is less than alpha, we reject the null hypothesis. Note that when the test statistic is in the rejection range, the p-value will be less than alpha.

Summarizing this information, we find that:

  • If the test statistic is small, it’s possible that we would observe this result in our sample if the null hypothesis were true. DO NOT REJECT \(H_0\).
  • If the test statistic is large, then it is unlikely that we would observe this result in our sample if the null hypothesis were true. REJECT \(H_0\).

In the same way that we calculated a CI for the mean, we can calculate a CI for each regression parameter. These are calculated as the estimate plus the critical values of t times the standard error (std_error). We use the same critical values of t to calculate the rejection region for the hypotheses for both the regression intercept and slope. For example, for the slope (effect of caldef1000), the confidence interval is calculated as follows: \(.297 \pm 1.984\times.060\). These are displayed in the regression model output under lower_ci and upper_ci for the lower and upper bound of the CI. By default, the CI is a 95% CI.
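For example, we can verify the CI for the slope by hand:

# 95% CI for the slope of caldef1000, computed by hand
0.297 + qt(p = c(.025, .975), df = 98) * 0.060
## [1] 0.177932 0.416068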

Notice that the 95% CI for the slope doesn’t contain 0. This indicates that 0 is not a plausible value for the slope estimate. But the 95% CI for the intercept does contain 0. This indicates that 0 is a plausible value for the intercept. This corresponds with our hypothesis test for these estimates. The t-test for alpha = .05 will always correspond with the 95% CI for the estimate. If the test statistic is in the rejection region, then the 95% CI will not include 0. Otherwise the 95% CI will include 0.

Now, let’s recreate the scatterplot, but request standard errors (se = TRUE in geom_smooth()). The shaded area in the figure below provides an assessment of the precision of our estimated best fit line. Though our single sample provides an intercept and slope for the best fit line in the sample (the blue line), the shaded area provides a plausible region where the best fit line in the population may fall. In this way, our regression model provides us with the best fit line in our sample, but also provides us with an understanding of the precision of our estimates and how much we can expect them to vary from sample-to-sample.

cntl %>% 
  ggplot(aes(x = caldef1000, y = lbs_lost)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x) +
  theme_bw() +
  labs(title = "Scatterplot of Caloric Deficit and Pounds Lost with Best Fit Line",
       subtitle = "Confidence Interval in Gray Shaded Area", 
       x = "Caloric Deficit", 
       y = "Pounds Lost")

As we did earlier, we can compute the p-values ourselves; the code below does this for the test statistics for the intercept and slope. Notice that these are the same p-values outputted in our linear model results.

# intercept
pt(q = abs(-.857), df = 98, lower.tail = FALSE) * 2
## [1] 0.3935357
# slope
pt(q = abs(4.957), df = 98, lower.tail = FALSE) * 2
## [1] 3.000815e-06

Please watch the video recap below for testing the coefficients of a regression model.


Binary predictor

Now, let’s consider a binary predictor. We’ll return to considering just the CBT + mobile groceries condition, and we’ll consider whether pounds lost in the treatment condition differs as a function of sex. To fit this model, we will regress pounds lost (lbs_lost) on a dummy coded indicator to represent sex. We’ll specify females to be the reference group.

CBT_plus <- wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  mutate(male = case_when(sex == "male" ~ 1, sex == "female" ~ 0))

  • Our null hypothesis is: \(H_0:\small \mu_{males} = \small \mu_{females}\) (population means are equal)

  • Our alternative hypothesis is: \(H_a:\small \mu_{males} \neq \small \mu_{females}\) (population means are not equal)

The df for the regression model is 98 as the sample size = 100 and we lose two df for estimating the intercept and the slope for male. Thus, the critical values of t for defining the rejection region are -1.984 and 1.984. If our test statistic is outside the range of -1.984 to +1.984 (and therefore in our rejection region) then we will reject the null hypothesis.

qt(p = c(.025, .975), df = 98)
## [1] -1.984467  1.984467

First, let’s take a look at the average pounds lost for males compared to females. We see that males lost 11.1 pounds on average, while females lost 11.6 pounds on average.

CBT_plus %>% 
  select(sex, lbs_lost) %>% 
  group_by(sex) %>% 
  skim()
Data summary
Name Piped data
Number of rows 100
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sex

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost male 0 1 11.05 9.18 -8.45 3.51 10.51 19.81 28.21 ▂▇▅▆▅
lbs_lost female 0 1 11.58 10.61 -14.36 6.94 11.14 17.77 32.43 ▂▃▇▇▂

Let’s also create a jittered scatter plot that represents the differences in pounds lost by sex.

CBT_plus %>% 
  ggplot(aes(x = sex, y = lbs_lost, color = sex)) + 
  geom_jitter(width = .1, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "line", lwd = 1, colour = "black", aes(group = 1)) +
  stat_summary(fun = mean, size = 5, shape = 4, geom="point", colour = "black", show.legend = FALSE) + 
  theme_bw() +
  labs(title = "Is there a sex difference in pounds lost for the CBT + mobile groceries condition?",
       subtitle = "Mean in each group is marked with an X",
       x = "",
       y = "Pounds lost")

Now that we’ve visualized the data, we’re ready to fit our linear model with the binary predictor (represented as a dummy coded variable to contrast males to females).

lm_cat1 <- CBT_plus %>% 
  lm(lbs_lost ~ male, data = .)

lm_cat1 %>% get_regression_table()
term      estimate std_error statistic p_value lower_ci upper_ci
intercept   11.584     1.391     8.325   0.000    8.822   14.345
male        -0.529     1.988    -0.266   0.791   -4.474    3.415

The intercept in this model represents the predicted pounds lost for females (the reference group) and the slope for male represents the difference in pounds lost for males compared to females. Males tended to lose about 1/2 a pound less than females. However, the standard error is quite large in comparison, and thus when we divide the estimate (-.529) by the standard error (1.988), the test statistic is quite small (-.266) and the p-value is quite large (.791). The test statistic is not in the rejection region – thus, we cannot reject the null hypothesis that there is no difference in pounds lost between males and females in the CBT + mobile groceries condition. Notice that we do not conclude that we will accept the null – rather, we simply conclude that, based on our current sample, we cannot reject the null hypothesis. It is important to note that absence of evidence to reject the null is not evidence that the null hypothesis is actually true.

Rejection region with test statistic for slope

Corresponding to the non-significant t-test, the 95% CI includes 0, indicating that 0 is a plausible value for this slope. Specifically, the lower and upper bounds of the CI for the male slope are -4.474 and 3.415. Note that this is calculated as \(-.529\pm(1.984\times1.988)\). We don’t find evidence that males and females differ in the amount of weight loss achieved as a result of the intervention.

We could equivalently test this model using an independent samples t-test, a test that you likely studied in your first statistics course. Here’s the code for conducting the test. The same df and rejection region are used as for the SLR with a binary predictor. That is, the df for an independent samples t-test is n (the total sample size) minus 2 (we lose 2 df for estimating the mean in each group).

CBT_plus %$% t.test(lbs_lost ~ sex, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  lbs_lost by sex
## t = -0.26625, df = 98, p-value = 0.7906
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  -4.473907  3.415431
## sample estimates:
##   mean in group male mean in group female 
##             11.05448             11.58372

Notice that the results are identical (compare the t in this output to the test statistic for the male slope in the regression model). An independent samples t-test is just a special case of a linear regression model in which a single binary predictor is considered.

Conduct a hypothesis test for an ANOVA

Now, let’s learn about conducting a hypothesis test for an analysis of variance (ANOVA). Recall that you learned about how to fit an ANOVA model in Module 6 – now we’ll revisit that model with a focus on statistical inference. Before we apply an ANOVA to our data, you may want to review this Crash Course: Statistics video on ANOVA.



Now that you’ve built some intuition for ANOVA and hypothesis testing, let’s consider an example. In thinking about the full weight loss study, it is natural that we might want to determine if weight loss differed across the four treatment conditions. We can conduct a test to answer this question using an ANOVA. Before diving into the test, let’s visualize the data.

wtloss %>% 
  select(condition, lbs_lost) %>% 
  group_by(condition) %>% 
  skim()
Data summary
Name Piped data
Number of rows 400
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables condition

Variable type: numeric

skim_variable condition n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost control 0 1 -0.32 6.16 -14.77 -4.69 -1.15 4.26 14.90 ▂▇▇▆▂
lbs_lost mobile groceries 0 1 2.10 7.30 -19.01 -2.44 2.36 7.29 20.21 ▁▃▇▇▂
lbs_lost CBT 0 1 6.45 9.42 -16.49 0.98 6.20 12.80 30.67 ▂▅▇▅▂
lbs_lost CBT + mobile groceries 0 1 11.32 9.89 -14.36 4.51 11.07 19.30 32.43 ▂▅▇▇▂

wtloss %>% 
  ggplot(aes(x = condition, y = lbs_lost, color = condition)) + 
  geom_jitter(width = .1, show.legend = FALSE, alpha = .3) +
  stat_summary(fun = mean, size = 5, shape = 4, geom="point", colour = "black", show.legend = FALSE) + 
  theme_bw() +
  labs(title = "Is there a condition difference in pounds lost by treatment condition?",
       subtitle = "Mean in each group is marked with an X",
       x = "",
       y = "Pounds lost")

Based on the means from skim() and the plot, it appears that the average pounds lost differs across conditions – that is, that there are between group differences in pounds lost. For example, the average weight loss in the CBT + mobile groceries condition was over 11 pounds (mean lbs_lost = 11.32), while on average there was a slight weight gain in the control condition (mean lbs_lost = -.32). To determine if differences in average weight loss across the four conditions are significant, or if the differences are likely due to chance, we’ll fit a one-way analysis of variance (ANOVA).

  • Our null hypothesis is: \(H_0:\small \mu_{control} = \small \mu_{mobile\ groceries} = \small \mu_{CBT} = \small \mu_{CBT + mobile\ groceries}\)

  • Our alternative hypothesis is: \(H_a:\) not all of the population means are equal (at least one \(\small \mu\) differs from the others)

In plain English, the null hypothesis is that the population means of the four conditions are all equal, and the alternative hypothesis is that at least one population mean differs from the others.

The test statistic for this type of hypothesis is an F-test. An F-test is a ratio of two variances – in this context, between group variability relative to within group variability. Recall from Module 6 that the gist of ANOVA is to compare the Sum of Squares Between Groups (\(\mbox{SS}_b\)) and the Sum of Squares Within Groups (\(\mbox{SS}_w\)) to each other. If the between-group variation is large relative to the within-group variation, then we have evidence that the population means differ.

The F-test provides us with an equation to convert our SS values into a test statistic, but before we can calculate it, we need to determine the degrees of freedom (df) for both the between group and within group variability. The df for the between group variability is k - 1, where k is the number of groups. We have four experimental conditions to consider, thus our between group df (referred to as \(df_b\)) is 3. The df for the within group variability is the sample size minus the number of groups (n - k). For our example, that is 400 - 4 = 396 (referred to as \(df_w\)).

Once we have this information, we calculate the mean squares (MS) between group and within group using the following formulas.

\[ \mbox{MS}_b = \frac{\mbox{SS}_b }{ \mbox{df}_b} \]

\[ \mbox{MS}_w = \frac{\mbox{SS}_w }{ \mbox{df}_w} \]

Then, we can calculate the F-test by dividing the Between Groups MS by the Within Groups MS as follows:

\[ F = \frac{\mbox{MS}_b }{ \mbox{MS}_w } \]

As with a t-test, we must first find the critical value of F that will be compared with our F-test test statistic to determine if the null hypothesis should be rejected.

An F-test utilizes an F-distribution. To find the critical value of the F-test in R, you can use the qf() function, as follows: qf(p, df1, df2), where p is 1 minus the desired alpha for the test (we’ll stick with an alpha of .05, so p in our example is 1 - .05 = .95), df1 is the df for the numerator of the F-test (between group variability), and df2 is the df for the denominator of the F-test (within group variability). Unlike a t-test, which can be one- or two-tailed, an F-test is non-directional. The df1 is \(df_b\), and df2 is \(df_w\). Therefore, we can calculate our critical value as follows:

qf(p = .95, df1 = 3, df2 = 396)
## [1] 2.627441

Our critical value of F is 2.63. If our calculated test statistic exceeds this value, then we reject the null hypothesis. The graph below depicts our F-distribution. If the test statistic from our F-test is to the right of the red line, then we will reject the null hypothesis.

Now, we can fit our ANOVA using the aov() function, just as we did in Module 6.

anova_cond <- wtloss %>% aov(lbs_lost ~ condition, data = .)
summary(anova_cond)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## condition     3   7877  2625.5    37.8 <2e-16 ***
## Residuals   396  27503    69.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our test statistic is 37.8, which greatly exceeds our critical value, and thus the p-value (labeled Pr(>F) in the output) is extremely small – indicating that there is a near-zero chance that we’d obtain a test statistic of this magnitude or larger if the null hypothesis were true. Therefore we can reject the null hypothesis; it is highly unlikely that the mean pounds lost in the four conditions are equal in the population.
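We can verify the F-test by hand from the Sum Sq and Df columns of the output; the small difference from the printed F value reflects rounding of the SS values.

# mean squares and F statistic computed by hand
MS_b <- 7877 / 3     # mean square between groups
MS_w <- 27503 / 396  # mean square within groups
MS_b / MS_w
## [1] 37.80548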

The graph below depicts our F-distribution with our test statistic calculated from our F-test overlaid.

Please click the video below for a recap of inference for an ANOVA.


Equivalence of ANOVA and multiple linear regression

We can equivalently fit this model as a multiple linear regression model. In fact, an ANOVA is just a special case of a multiple linear regression model in which the predictor is a grouping variable. Let’s see how this is done.

Before fitting the model, we need to create 3 dummy-coded indicators to represent the 4 treatment conditions; we’ll use the control condition as the reference group.

wtloss <- wtloss %>% 
  mutate(CBT = case_when(condition == "CBT" ~ 1, TRUE ~ 0)) %>% 
  mutate(mobile = case_when(condition == "mobile groceries" ~ 1, TRUE ~ 0)) %>% 
  mutate(CBT_mobile = case_when(condition == "CBT + mobile groceries" ~ 1, TRUE ~ 0))

Now, we’re ready to fit the regression model.

lm_cat2 <- wtloss %>% 
  lm(lbs_lost ~ CBT + mobile + CBT_mobile, data = .)

Let’s first examine the regression model summary. We obtain this via the get_regression_summaries() function.

lm_cat2 %>% get_regression_summaries()
r_squared adj_r_squared      mse    rmse sigma statistic p_value df nobs
    0.223         0.217 68.75859 8.29208 8.334    37.803       0  3  400

Notice that the value under statistic is 37.8, which is precisely the F-test test statistic that we obtained via our ANOVA model, illustrating the equivalence between the two specifications. We evaluate this F-test in the same way too, comparing it to our critical value of F. The critical value of F for any regression model is obtained from two degrees-of-freedom values: one for the sum of squares regression (SSR – the part of the variability in our outcome explained by the predictors in the model) and one for the sum of squares error (SSE – the part of the variability in our outcome not explained by the predictors in the model). The df for SSR is the number of predictors (that’s 3 in our example – three dummy coded indicators), and the df for SSE is n - 1 - the number of predictors (that’s 400 - 1 - 3 = 396 in our example). Notice that these match the between and within group degrees of freedom in the ANOVA. Thus, we arrive at the same critical value of F.

qf(p = .95, df1 = 3, df2 = 396)
## [1] 2.627441

In any regression model, you can evaluate the F-test (37.8 in our example) as a test of whether any of the regression slopes are different from 0. The \(R^2\) = .223, indicating that 22.3% of the variability in pounds lost is explained by condition; that is, condition predicts nearly 1/4 of the variability in pounds lost over the 3 months.

As an added benefit of fitting this model as a linear regression, we also gain a test of whether each group is significantly different from the reference group. That is, whether each treatment condition produces a mean pounds lost that is significantly different than the control condition. Let’s take a look at those estimates now.

lm_cat2 %>% get_regression_table()
term       estimate std_error statistic p_value lower_ci upper_ci
intercept    -0.320     0.833    -0.384   0.701   -1.959    1.318
CBT           6.770     1.179     5.744   0.000    4.453    9.087
mobile        2.420     1.179     2.053   0.041    0.103    4.737
CBT_mobile   11.645     1.179     9.880   0.000    9.328   13.962

As was the case for evaluating the regression slopes in our simple linear regression models presented earlier in this module, each test statistic is a t-test and is labeled statistic in the table above. As usual, it is calculated as the estimate divided by the standard error. The test statistic is compared to the critical value of t. Recall that the formula for finding the df for a t-test in a regression model is n - 1 - # of predictors, where n is the sample size. We lose one df for estimating the intercept and one additional df for each of the three estimated slopes. Thus, for a two-tailed test and an alpha of .05, the critical values of t for each slope are -1.966 and 1.966. If our test statistic for a slope is outside this range, then we will reject the null hypothesis.

qt(p = c(.025, .975), df = 396)
## [1] -1.965973  1.965973

First consider the row labeled CBT; this effect contrasts the CBT-only condition to the control condition. The difference in means between the two groups is 6.77; that is, the CBT group lost, on average, about 6.8 pounds more than the control group. The test statistic is 5.74, which is in the rejection region. Therefore, we reject the null hypothesis. We have evidence that the CBT condition results in more weight loss than the control condition. Notice that the 95% CI is 4.45 to 9.09. It doesn’t contain 0, and it provides us with a range of plausible values for the difference in pounds lost between the CBT condition and the control condition if we were to offer the intervention in the population.

The next two rows in the output, labeled mobile and CBT_mobile, contrast the other two conditions with the control condition. The slopes are interpreted in the same manner as the slope for CBT. Both of these conditions appear to produce significant weight loss compared to the control condition. Therefore, we find that all three treatment conditions produced significantly better results than the control condition.

At this point, you might be wondering, what if I want to compare the different weight loss programs to one another, for example, to compare mobile groceries only to CBT + mobile groceries. As we learned in Module 6, this can easily be accomplished by rotating the reference group and refitting the regression model. However, there is an important danger that we must consider.

We set the Type I error rate when we choose alpha for a test (for example, .05), but as we conduct multiple tests, the Type I error rate for the “family of tests” grows. We need a way of controlling the Type I error rate when we make multiple comparisons for a set of groups. Given k groups, we can make \((k\times(k-1))/2\) comparisons. For our current example, that’s \((4\times3)/2 = 6\) possible comparisons – more comparisons than we have independent between-group df, so the chance of at least one Type I error accumulates. Instead of using alpha = 0.05 for each individual test, one reasonable solution is to apply a Bonferroni correction. Here, we use our desired alpha (e.g., .05) for the entire family of tests when comparing groups in a set. This is called controlling the family-wise error rate. To calculate the new alpha, divide the desired alpha by the number of comparisons that you would like to make (as computed in the code below). For example, if we set alpha to .05, and we want to make all six comparisons, then take .05/6 = .008. We can calculate the critical values of t for alpha = .008 (be sure to divide by 2 to put half in each tail for a two-tailed test). The df remains the same as for the usual MLR model. In assessing the significance of each of the group comparisons, we will use this rejection region rather than the usual rejection region.
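Here is a quick computation of the number of pairwise comparisons and the Bonferroni-adjusted alpha, using the values from this example:

# number of pairwise comparisons among k = 4 groups
k <- 4
k * (k - 1) / 2
## [1] 6

# Bonferroni-adjusted alpha for the family of six comparisons
.05 / 6
## [1] 0.008333333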

Rejection region for family wise error

qt(p = c(.004, .996), df = 396)
## [1] -2.665584  2.665584

Now, let’s make the reference group the CBT + mobile groceries condition. First, we need to create a dummy code for the control condition.

wtloss <- wtloss %>% 
  mutate(control = case_when(condition == "control" ~ 1, TRUE ~ 0)) 

Then we can fit a second regression model using the CBT + mobile groceries condition as the reference group. Note that leaving out the CBT + mobile groceries dummy and adding the control dummy forces the CBT + mobile groceries condition to be the reference group.

lm_cat3 <- wtloss %>% 
  lm(lbs_lost ~ control + CBT + mobile, data = .)

lm_cat3 %>% get_regression_table()
term      estimate std_error statistic p_value lower_ci upper_ci
intercept   11.324     0.833    13.588       0    9.686   12.963
control    -11.645     1.179    -9.880       0  -13.962   -9.328
CBT         -4.875     1.179    -4.136       0   -7.192   -2.558
mobile      -9.225     1.179    -7.827       0  -11.542   -6.908

To get the comparison of CBT and mobile, we’d fit a third model.

lm_cat4 <- wtloss %>% 
  lm(lbs_lost ~ control + CBT + CBT_mobile, data = .)

lm_cat4 %>% get_regression_table()
term       estimate std_error statistic p_value lower_ci upper_ci
intercept     2.100     0.833     2.520   0.012    0.461    3.738
control      -2.420     1.179    -2.053   0.041   -4.737   -0.103
CBT           4.349     1.179     3.690   0.000    2.032    6.667
CBT_mobile    9.225     1.179     7.827   0.000    6.908   11.542

In evaluating these comparisons, as well as the previous comparisons from the model with the control group as the reference group, we need to use the new critical values as the rejection region. Notice that the test statistic for the mobile condition (compared to the control condition) from the first model (lm_cat2) is 2.05. With this more conservative alpha, this test statistic IS NOT in the rejection region. This casts doubt on our certainty that the mobile groceries-only condition would produce significant weight change in the population.

Checking assumptions

So far in this course, we’ve learned about fitting models and testing hypotheses. However, we have one more critical step to take. We can’t feel confident about our model until we check the assumptions underlying the model and evaluate the overall model fit. All models have assumptions — it’s important to understand them, check them, and implement remedial actions when necessary.

It’s important to note that assessing model adequacy is an advanced skill. As you begin your data analysis journey after this course – it’s important to consult with a mentor or supervisor. Soon you will build your intuition for assessing model fit, and be able to conduct these more advanced tasks on your own. For now, I just want you to recognize that models make assumptions, and have some familiarity with the assumptions.

Assumptions for comparisons of means

We covered several statistical tests for comparisons of group means. For example, a one-sample t-test, an independent samples t-test, and analysis of variance (ANOVA).

There are three primary assumptions made by these tests:

  1. The outcome (y) for each group has a normal population distribution.

  2. Building on assumption 1, if there are multiple groups, we assume that the variances across the groups are equal – this is referred to as homogeneity of variance.

  3. The observations/cases are independent.

Minor violations of the first two are typically not a problem, particularly if the sample size is relatively large (~ 30 in each group). You can evaluate these assumptions by plotting a density plot or histogram of y for each group (noting whether the distributions look relatively normal), and by calculating descriptive statistics of y for each group. As a rule of thumb, when comparing the variances across groups, compare the smallest and largest sample standard deviations. If the ratio of these two sample standard deviations falls between 0.5 and 2, then this assumption is typically considered satisfied.

Here’s an example of computing these graphs and estimates for the comparison of lbs_lost by sex.

# create density plots of lbs_lost by sex
wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  ggplot(aes(x = lbs_lost, fill = sex)) +
  geom_density(alpha = .5) +
  theme_bw()

# calculate standard deviation of lbs_lost by sex
wtloss %>% 
  filter(condition == "CBT + mobile groceries") %>% 
  select(sex, lbs_lost) %>% 
  group_by(sex) %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 100
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sex

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lbs_lost male 0 1 11.05 9.18 -8.45 3.51 10.51 19.81 28.21 ▂▇▅▆▅
lbs_lost female 0 1 11.58 10.61 -14.36 6.94 11.14 17.77 32.43 ▂▃▇▇▂

# ratio of sd
9.2/10.6
## [1] 0.8679245

When violations to these assumptions are severe, non-parametric approaches (e.g., permutation tests and bootstrapping) can be taken. These are described in your textbook (Modern Dive) in Chapter 9 and Appendix B.
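For instance, here is a minimal sketch of a permutation test for the sex difference in pounds lost, using the infer package loaded at the top of this module (the 1000 reps and the group order are illustrative choices):

# observed difference in mean pounds lost (males minus females)
obs_diff <- CBT_plus %>% 
  specify(lbs_lost ~ sex) %>% 
  calculate(stat = "diff in means", order = c("male", "female"))

# null distribution generated by permuting the group labels
null_dist <- CBT_plus %>% 
  specify(lbs_lost ~ sex) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("male", "female"))

# two-sided p-value for the observed difference
null_dist %>% get_p_value(obs_stat = obs_diff, direction = "two-sided")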

The third assumption is related to the design of the study. These tests assume that cases are independent of one another – for example, this assumption would be violated if some people were measured multiple times or if you included sibling pairs. If you have repeated measures from the same cases, then dependent-samples t-tests or repeated measures ANOVA can be used, see this resource for examples of these models in R. There are also two Crash Course Statistics videos on these topics: Matched Pair T-Tests, RMA and ANCOVA. Multilevel models can also be used to handle repeated measures or any other type of non-independence (e.g., siblings nested in families, employees nested in companies). There is a DataCamp course on multilevel models (Hierarchical and Mixed Effects Models in R) that you can take if interested. See also, this resource.

Assumptions for linear regression models

Linear regression models make several assumptions, including:

  1. The relationship between the predictor(s) (x) and the outcome (y) is assumed to be linear. That is, a straight line drawn through the scatterplot provides a good representation of the relationship.
  2. The residuals resulting from the fitted model are assumed to be normally distributed.
  3. The residuals resulting from the fitted model are assumed to have a constant variance – also called homoscedasticity.
  4. The residuals are independent of one another.

Checking model assumptions in a MLR and remedying problems is an advanced skill. It’s smart to consult with an adviser or mentor who has expertise in these techniques while you are developing your statistical skills. In this next section, we’ll learn about a few simple techniques that you can use to get started.

Let’s use the data from the weight loss study. We’ll fit a model where pounds lost (lbs_lost) is regressed on caloric deficit (caldef1000) and sex (male). The check_model() function from the performance package produces a multi-panel graph that is useful for examining violations of assumptions in a regression model.

wtloss <- readRDS(here("data", "wtloss.Rds")) 

wtloss <- wtloss %>% 
  mutate(male = case_when(sex == "male" ~ 1, sex == "female" ~ 0)) %>% 
  mutate(caldef1000 = caldef/1000)

mlr <- wtloss %>% 
  lm(lbs_lost ~ caldef1000 + male, data = .)

mlr %>% get_regression_table()
term       estimate std_error statistic p_value lower_ci upper_ci
intercept    -0.165     0.429    -0.384   0.701   -1.009    0.679
caldef1000    0.298     0.011    28.030   0.000    0.277    0.319
male         -0.267     0.547    -0.488   0.626   -1.342    0.808

mlr %>% check_model()

There are 6 panels in the output. In each panel, the most extreme cases (in terms of potential assumption violations) are labeled, based on the row number in the dataframe.

The bottom right panel shows diagnostics for the assumption that the residuals are normally distributed. We want our model residuals (the dots) to follow the shape of the green line. If they do, then this indicates that the residuals are normally distributed. If there are cases (e.g., people in the study) that substantially depart from the line, their scores on the variables in the model can be inspected. Alternatively, non-normal residuals may mean that your model is misspecified (e.g., the relationships aren’t actually linear, or there are interactions between variables that should be considered). For modeling interactions, see the Crash Course Statistics video on Intersectional Groups and the Intermediate Regression with R DataCamp course. Your textbook, Modern Dive, also discusses interactions in Chapter 6. Relatedly, the top left panel compares the density of the observed outcome with the model-predicted outcomes. The blue density curves should resemble the observed data (the green curve).

The middle left panel shows diagnostics for homogeneity of variance. As we scan across the fitted values (i.e., y-hat, or the predicted values), we hope to see that the residuals are evenly distributed above and below the horizontal line – across the whole distribution of the fitted values. For our model, these also look pretty good. A problem would be apparent if the data points formed a funnel or some other distinct shape other than a rectangle. If normality of the residuals or homoscedasticity of the residuals is violated, then we can choose to use the non-parametric methods (i.e., bootstrapping) for obtaining inference presented in Chapter 10 of Modern Dive.

The top right panel can be used to check the linear relationship assumption. A horizontal green line over the dashed horizontal line at 0 is what we’re hoping for if the linearity assumption is met. If the green line is curved, then we may have a problem with our model assumptions. Oftentimes, this is because the relationship between one or more of the predictors and the outcome is curvilinear. Advanced models such as polynomial regression models, or non-linear transformations of the variables (e.g., a log transformation like you studied in the Modeling Data in the tidyverse DataCamp course), can often remedy these issues.

The middle right panel checks for influential observations. On occasion, one or more cases (e.g., people) in your data will have very unusual scores. These cases can have a very large influence (i.e., leverage) on the regression model estimates. We should investigate these sorts of cases to determine if there is a mistake in their scores. The plot considers two measures – leverage and residuals.

  • Leverage is a quantification of how far away a case’s scores on predictor variables are from those of the other cases.

  • Residuals measure the difference between each case’s observed score on the outcome (y) and their predicted score (i.e., y-hat).

Cases that are potentially too influential in the model are labeled by their row number in the dataframe. Cases in the data that have large leverage and a large residual are potentially problematic.
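If you want to inspect individual cases, the get_regression_points() function from the moderndive package returns each case’s observed outcome, fitted value, and residual; sorting by the absolute residual surfaces the most poorly fit cases. A quick sketch of that idea:

# list the cases with the largest absolute residuals
mlr %>% 
  get_regression_points() %>% 
  arrange(desc(abs(residual))) %>% 
  head()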

The bottom left plot presents the variance inflation factor (VIF) for each predictor – this is a measure of potential multicollinearity. If the predictors are too highly correlated, modeling issues can occur. In this plot, we’re hoping to find low (green) or moderate (blue) VIFs for each predictor. In this case, VIF for both predictors is low.

Wrap up

Congratulations for making it through Module 11 – this was a dense module! In our last module (Module 12), we will pull everything together from the semester. We will walk through the analyses conducted for the Blair and colleagues (2004) Psychological Science article that you studied in your last Apply and Practice activity.