PSY350, Module 10
Introduction to Hypothesis Testing
Module road map
Learning objectives
- Describe why hypothesis tests are conducted
- Contrast the null hypothesis and the alternative hypothesis
- Describe Type I and Type II errors
- Explain the gist of null hypothesis testing
- Contrast the concepts of alpha and p-value
Readings
- Read Chapter 9 of your Textbook
Overview
In Module 10 we use the foundations of inferential statistics that we garnered in Modules 8 and 9 to learn about hypothesis testing. In order to use this information to conduct hypothesis tests, we need to learn about the concept of p-values. Even if you haven’t ever taken a statistics course, you probably have heard of p-values. They are at the heart of a frequentist approach to hypothesis testing – yet they are often misunderstood and misused. To get us started, please watch the following video from Crash Course: Statistics on p-values.
Introduction to the data
Data scenario: We will continue to use the weight loss study introduced in Module 8.
In the dataframe called wtloss.Rds, the following variables are recorded:
pid is the participant ID
condition is the condition the participant was randomly assigned to receive (control, CBT, mobile groceries, CBT + mobile groceries)
bmi is the participant’s body mass index at the start of the study
caldef is the sum of caloric deficit (in kilocalories) across all three months of the intervention. Caloric deficit is the difference between calories consumed and calories expended (positive number = more calories burned than consumed, negative number = more calories consumed than burned).
lbs_lost is the total pounds lost during the program (positive number = weight loss, negative number = weight gain)
fv5 is a binary indicator that is coded “yes” if the individual ate at least 5 fruits and vegetables per day on at least four days during the last week of the intervention, and is coded “no” if the individual did not meet this criterion.
Let’s load the needed packages for this module, and import the data.
library(skimr)
library(moderndive)
library(infer)
library(janitor)
library(here)
library(tidyverse)
wtloss <- readRDS(here("data", "wtloss.Rds"))
Introduction to hypothesis testing
In this module we will use the principles of the sampling distribution that we learned about in Modules 8 and 9 to conduct hypothesis tests. A hypothesis is a scientist’s assertion about the value of an unknown population parameter. Recall that the purpose of a scientific study isn’t to learn just about the sample, but rather to try and learn something about the population of interest – that is, the population from which the sample was drawn.
A hypothesis test consists of a test between two competing hypotheses about what might be happening in the population. The first is referred to as the null hypothesis and the second is referred to as the alternative hypothesis.
The null hypothesis is often an assertion that there is no difference in y (i.e., an outcome variable) between groups, or there is no effect of x (a predictor) on y (an outcome). That is, with the null hypothesis we assume that any effect observed in our sample is simply due to random chance. Some examples of null hypotheses include:
- The mean pounds lost in the CBT + mobile groceries condition is not different from zero (i.e., no significant weight change as a result of the intervention).
- There is no difference in pounds lost between the CBT + mobile groceries intervention condition and the control condition.
- There is no relationship between caloric deficit and pounds lost (i.e., a zero slope for the regression line of lbs_lost regressed on caldef).
The alternative hypothesis is typically the assertion that there is a difference between groups or there is an effect of one variable on another. Building on the three null hypotheses just stated, examples of alternative hypotheses include:
- The mean pounds lost in the CBT + mobile groceries condition is different from zero (i.e., significant change in weight as a result of the intervention).
- There is a difference in pounds lost between the CBT + mobile groceries intervention condition and the control condition.
- There is a relationship between caloric deficit and pounds lost (i.e., a non-zero slope for the regression line of lbs_lost regressed on caldef).
Let’s focus in this module on the first null hypothesis and its corresponding alternative hypothesis listed above. Here, we seek to determine if the average weight loss in the CBT + mobile groceries condition is significantly different from 0, where a score of 0 on our measure of pounds lost (lbs_lost) indicates no weight change.
You might ask yourself: “People in the CBT + mobile groceries intervention lost a lot of weight (~11 pounds on average), why do we need to conduct a test to determine if this is different from 0?” Conducting this hypothesis test is necessary because, as we saw with the sample-to-sample variability in our simulations in Modules 8 and 9, what we observe in our single random sample may not be indicative of what is actually happening in the population. In inferential statistics, we seek to understand if the effect/difference in the observed data is real, or due to random chance. We also seek to understand how confidently our sample results can be generalized to the population. In this case, we seek to know if the evidence observed in our sample (i.e., that on average people lost ~11 pounds) means that the weight loss program would result in significant weight change in the population. We surely need to determine this before investing the money, effort, and time to deliver the intervention widely in the community.
In this example:
Our null hypothesis is: \(H_0: \small \mu = 0\) (the population mean is equal to 0)
Our alternative hypothesis is: \(H_a: \small \mu \neq 0\) (the population mean is not equal to 0)
Sometimes alternative hypotheses are one-sided. For example, we could have asserted that the CBT + mobile groceries intervention will result in pounds lost (i.e., pounds lost will be greater than zero). However, in this course we are going to focus exclusively on two-tailed tests. That is, we are not going to specify the direction of the effect. Rather, we will allow for the possibility that the CBT + mobile groceries intervention could actually be iatrogenic (i.e., harmful) and cause weight gain. A two-tailed test in which directionality is not dictated is the norm in social and behavioral sciences. However, you can read more about when it is appropriate to conduct a one-tailed test in your textbook (Modern Dive) and here.
Let’s consider the theoretical sampling distribution under this null hypothesis to grow our intuition about null hypothesis significance testing. In Modules 8 and 9 we learned that the center of a sampling distribution for a statistic of interest is the true population parameter. For our example, that’s the mean pounds lost after receiving the CBT + mobile groceries intervention in the population. Take a look at the two plots below. The plot on the left represents the sampling distribution that we simulated in Module 8, in which the center of the distribution is the true population mean (11 pounds of weight loss) for the CBT + mobile groceries program (see where the orange line meets the x-axis). The plot on the right is the sampling distribution under the null hypothesis, that is, a distribution centered on an average weight loss of 0 pounds (see where the blue line meets the x-axis). Now, we can use our knowledge of the Central Limit Theorem and the empirical rule to determine if this null hypothesis is reasonable.
If the null hypothesis is true, under the empirical rule we expect that 95% of the random samples that we would draw from the population would produce an average pounds lost within about 2 standard errors (1.96 to be more precise) of the mean (which is 0 under the null hypothesis). In Module 9, using 5000 bootstrap resamples, we estimated the standard deviation of the sampling distribution to be .99, and we learned that the standard deviation of the sampling distribution is the standard error of the parameter estimate. Therefore, we would expect 95% of the samples to produce an average pounds lost within 1.96 standard errors of the population expectation (0 under the null hypothesis). Since the standard error is .99, 1.96 times .99 yields 1.94. That is, we would expect 95% of the samples to produce a mean that is between -1.94 pounds (pounds gained) and +1.94 pounds (pounds lost) if the null hypothesis is true. This is depicted in the figure below.
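The arithmetic behind these bounds can be checked in R. The following is a minimal sketch, assuming the sampling distribution under the null hypothesis is normal with a mean of 0 and the bootstrap standard error of .99 from Module 9; the object names (se, null_mean, margin, bounds) are illustrative and not part of the module's own code.

```r
# Standard error estimated from the bootstrap resamples in Module 9
se <- 0.99

# Value of the population mean under the null hypothesis
null_mean <- 0

# qnorm(0.975) returns the more precise multiplier (about 1.96)
margin <- qnorm(0.975) * se

# Range in which we expect 95% of sample means to fall if the null is true
bounds <- c(lower = null_mean - margin, upper = null_mean + margin)
round(bounds, 2)
```

The rounded bounds should match the -1.94 and +1.94 pounds described above.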
Let’s overlay our observed sample mean onto the graph of the sampling distribution for the null hypothesis; the result is presented below. Our observed sample mean (the average pounds lost in the CBT + mobile groceries condition) is ~11 pounds lost and is represented by the orange line on the graph. This observed mean is well outside the threshold for what we’d expect to observe if the null hypothesis is true. Therefore, it is highly unlikely that we would have drawn a sample like ours if the true mean pounds lost in the population were 0. That is, the null hypothesis seems highly unlikely.
Type I and Type II errors
After stating your null and alternative hypotheses, and then evaluating whether the null hypothesis is probable, there are four possible outcomes defined by the box below.
The columns of this box represent two states of the world. The left represents a world in which the null hypothesis is true (the CBT + mobile groceries intervention doesn’t produce weight change) and the right represents a world in which the null hypothesis is false (the CBT + mobile groceries intervention does produce weight change).
The rows of this box represent two outcomes of our experiment/study. The top represents the scenario in which our study rejects the null hypothesis (finds evidence of significant change in weight as a result of participating in the CBT + mobile groceries intervention). The bottom represents the scenario in which our study does not reject the null hypothesis (finds no evidence of significant weight change as a result of participation).
Crossing the real world (i.e., what would happen in the population if we delivered the intervention to everyone) with our study (what we find in our single sample when we conduct the experiment), we see that there are four different possibilities that can occur.
In two of the possibilities, we draw the correct conclusion (the green boxes). In the top green box, our study finds evidence of weight change, and weight change does occur in the real world. In the bottom green box, our study does not find evidence of significant weight change, and weight does not change in the real world.
In two of the possibilities, we draw an incorrect conclusion. The top incorrect conclusion is called a Type I error – we reject the null hypothesis based on our study, but in the real world the null hypothesis is actually true. The bottom incorrect conclusion is called a Type II error – we do not reject the null hypothesis based on our study, but the null hypothesis is actually false in the real world.
Probability of making a Type I error
We call the probability of making a Type I error alpha (\(\alpha\)). Therefore, alpha is the probability of rejecting the null hypothesis when in actuality it is true. Just like when we decided on a confidence level for our confidence intervals in Module 9, we must decide how comfortable we are with making a Type I error. We need to set alpha a priori, that is, before we conduct the test. Often in Psychology, alpha is set to .05, though this is just a choice, and as researchers we should think carefully about setting alpha. If we choose .05, that means that we accept a 5% chance of rejecting the null hypothesis when it is actually true. For example, we would say that we have evidence that the intervention produces weight change when really it does not (i.e., what we observed in our sample was just random chance/random noise).
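To see what alpha means in practice, we can simulate many studies in a world where the null hypothesis is true and count how often we would wrongly reject it. This is a minimal sketch, not part of the module's own analysis: it assumes each study's sample mean is drawn from a normal sampling distribution with a mean of 0 and a standard error of .99, and the seed, the number of simulated studies, and all object names are arbitrary choices made for illustration.

```r
set.seed(350)  # arbitrary seed so the simulation is reproducible

alpha <- 0.05       # Type I error rate chosen before testing
se <- 0.99          # standard error of the sample mean (assumed)
n_studies <- 10000  # number of hypothetical studies to simulate

# Each simulated study yields one sample mean from a world where the null is true
sample_means <- rnorm(n_studies, mean = 0, sd = se)

# Two-tailed rule: reject when the mean is more than ~1.96 standard errors from 0
critical_value <- qnorm(1 - alpha / 2) * se
rejected <- abs(sample_means) > critical_value

# Proportion of studies that reject a true null hypothesis (Type I errors)
type1_rate <- mean(rejected)
type1_rate
```

The proportion should land close to .05, because every rejection in this simulated world is a Type I error.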
The general framework for a hypothesis test
Once we define our null and alternative hypotheses, and choose alpha, we next conduct the hypothesis test. We use a test statistic to do this work. In Module 11 we will study how to compute several types of test statistics. We will use these test statistics to determine if the null hypothesis should be rejected. For example, for the null hypothesis we have been considering (i.e., that the mean pounds lost in the CBT + mobile groceries condition is 0), the test statistic is simply the number of standard errors our sample mean (~11 pounds) is away from the null hypothesis value.
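As a minimal sketch of that last idea, the test statistic for our example can be computed by hand. The values are the ones used in this module (an observed mean of 11.32 pounds lost, a null value of 0, and a standard error of .99); the object names are illustrative only.

```r
sample_mean <- 11.32  # observed mean pounds lost in the CBT + mobile groceries condition
null_value <- 0       # mean pounds lost under the null hypothesis
se <- 0.99            # standard error estimated via bootstrap in Module 9

# Number of standard errors between the observed mean and the null value
test_statistic <- (sample_mean - null_value) / se
round(test_statistic, 2)
```

A value this large means that the observed mean sits more than 11 standard errors above what the null hypothesis predicts.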
These test statistics will have a p-value associated with them. The p-value is the probability that we would obtain a test statistic of the observed magnitude or larger if the null hypothesis were true. When the p-value is less than alpha, we reject the null hypothesis.
To summarize, we state the null and alternative hypotheses, then calculate a test statistic from the data that describes the observed effect. We use this test statistic to calculate a p-value by comparing it to the distribution of the statistic under the null hypothesis. This provides the probability that data at least as extreme as what we observed would emerge if the null hypothesis is true. If this probability (i.e., the p-value) is less than alpha, then we reject the null hypothesis.
We’ll dig into the mechanics of calculating the p-value of a test and comparing it to alpha in Module 11. However, to build on our work in Module 4 and prepare us for Module 11, let’s use the pnorm() function to calculate the probability that a random sample from the population would produce an average pounds lost of 11.32 pounds (our observed sample mean) or more if the null hypothesis were true. Recall that for pnorm(), we need to input the quantile (11.32 pounds, our observed sample mean) and the mean and standard deviation of the normal distribution we are considering, which in this case is the sampling distribution under the null hypothesis. The sampling distribution under our null hypothesis has a mean of 0 and a standard deviation of .99 (which we estimated via simulation).
pnorm(11.32, mean = 0, sd = .99, lower.tail = FALSE)
## [1] 1.408296e-30
The resulting value is the probability of obtaining an average pounds lost of 11.32 pounds or larger, if the null hypothesis were true. It’s very, very tiny. It seems very unlikely that the null hypothesis is true given that our sample produced a mean of 11.32 pounds lost.
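Because this course focuses on two-tailed tests, one way to connect this calculation to the decision rule is to double the upper-tail probability and compare the result to alpha. This is a sketch under the same assumptions as above (a normal sampling distribution with a mean of 0 and a standard deviation of .99, and alpha set to .05); the object names are illustrative, and Module 11 covers the formal mechanics.

```r
alpha <- 0.05         # Type I error rate chosen a priori
sample_mean <- 11.32  # observed mean pounds lost
se <- 0.99            # standard deviation of the sampling distribution under the null

# Probability of a sample mean at least this far above 0 if the null is true
upper_tail <- pnorm(abs(sample_mean), mean = 0, sd = se, lower.tail = FALSE)

# Two-tailed p-value: equally extreme results in either direction count
p_two_sided <- 2 * upper_tail

# TRUE means the p-value is below alpha, so we reject the null hypothesis
p_two_sided < alpha
```

Doubling works here because the normal distribution is symmetric around the null value of 0.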
To close out Module 10 and prepare to conduct hypothesis tests in Module 11, please watch the following videos on the potential problems that we can encounter with p-values and how p-values are related to power.
Please also watch this video on test statistics, which will serve as an introduction to the work we will do in Module 11.