PSY350, Module 7
Multiple Linear Regression
Module road map
Learning objectives
- Describe how prediction can be enhanced by including multiple predictors in a regression model
- Describe how confounding can be handled via multiple linear regression (MLR)
- Write an equation to depict a MLR model
- Interpret the intercept and slope estimates of a MLR model
- Calculate predicted outcomes based on a MLR
- Describe the role of \(R^2\) in a MLR
- Fit a MLR in R
Readings
- Read Chapter 6 of your Textbook
Overview
MLR extends SLR to consider more than one predictor. We’ll continue working with one continuously distributed dependent variable (i.e., outcome, y variable), but now we’ll consider multiple independent variables (i.e., predictors, x variables).
We’ll consider two uses of a MLR.
- In the first, we’ll examine how the inclusion of additional predictors can improve the accuracy with which we can predict the outcome.
- In the second, we’ll examine how adjusting for variables that may confound the relationship between a key predictor and an outcome can help us assess the true effect of the key predictor.
Let’s start with the first use of MLR, that is, to add predictors of our outcome in an effort to improve the model’s predictive accuracy. We’ll build on the Bread and Peace Model from Module 5.
MLR goal #1 – improving prediction
Reintroduction to the data
In this module, we will explore one well-studied and well-respected, but relatively simple, model for forecasting election outcomes. It’s called the Bread and Peace Model, and it was formulated by Douglas A. Hibbs. He describes his model in detail here; the gist of his model is described below:
Postwar US presidential elections can for the most part be interpreted as a sequence of referendums on the incumbent party’s record during its four-year mandate period. In fact aggregate two-party vote shares going to candidates of the party holding the presidency during the postwar era are well explained by just two fundamental determinants: (1) Positively by weighted-average growth of per capita real disposable personal income over the term. (2) Negatively by cumulative US military fatalities (scaled to population) owing to unprovoked, hostile deployments of American armed forces in foreign wars.
In other words, in a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.
The following variables are in the dataframe:
- year is the presidential election year
- vote is the percentage share of the two-party vote received by the incumbent party’s candidate
- growth is the quarter-on-quarter percentage rate of growth of per capita real disposable personal income expressed at annual rates
- fatalities denotes the cumulative number of American military fatalities per millions of US population
- wars lists the wars of the term if fatalities > 0
- inc_party_candidate is the name of the incumbent party candidate
- other_party_candidate is the name of the other party candidate
- inc_party is an indicator of the incumbent party (D = Democrat, R = Republican)
In Module 5, we fit a simple linear regression (i.e., one predictor) to determine how well growth in income of US residents during the preceding presidential term predicts the share of the vote that the incumbent party receives. The growth predictor constitutes the bread component of the Bread and Peace Model. In this module, we will build on the SLR from Module 5 and add fatalities as an additional predictor. The fatalities predictor constitutes the peace component of the Bread and Peace Model.
Let’s begin by importing the data.
bp <- here("data", "bread_peace.Rds") %>%
  read_rds()
Bread and Peace Model 1 – growth in income as only predictor
Let’s begin by fitting the SLR that we fit with these data in Module 5. In our initial model, we considered a single predictor – growth in income during the four year period leading up to the election.
bp_mod1 <- bp %>%
  lm(vote ~ growth, data = .)

bp_mod1 %>% get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 45.021 | 2.075 | 21.702 | 0.000 | 40.600 | 49.443 |
growth | 2.975 | 0.782 | 3.803 | 0.002 | 1.307 | 4.642 |
bp_mod1 %>% get_regression_summaries()
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
---|---|---|---|---|---|---|---|---|
0.491 | 0.457 | 14.18538 | 3.766348 | 4.01 | 14.463 | 0.002 | 1 | 17 |
The model with only growth in income explains about 49% of the variability in the share of the vote garnered by the incumbent party. That’s a sizable proportion of variance, but we might be able to boost that variance explained, and thus our ability to predict the outcome, by considering the second element of the Bread and Peace model. The second element considers the number of war-related fatalities that took place during the prior term.
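As a check on the output above, the \(R^2\) can also be computed directly from the model residuals in base R. This is a quick sketch, not part of the original analysis:

1 - sum(residuals(bp_mod1)^2) / sum((bp$vote - mean(bp$vote))^2)  # 1 - SS_residual/SS_total, approximately 0.491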
Take a look at the plot below. I’ve changed a couple of things. First, I added an additional aesthetic by mapping the number of war-related fatalities (per million US population) to the size of the points. The larger the point, the more fatalities. For years in which fatalities was greater than 0, I also labeled the point with the corresponding war(s).
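The figure isn’t reproduced here, but a minimal sketch of how such a plot can be built follows; the axis labels, legend title, and label placement are my own choices, not necessarily those of the original figure:

bp %>%
  ggplot(aes(x = growth, y = vote)) +
  geom_smooth(method = "lm", se = FALSE) +
  # map fatalities to point size, per the description above
  geom_point(aes(size = fatalities)) +
  # label only the years with war-related fatalities
  geom_text(data = filter(bp, fatalities > 0),
            aes(label = wars), vjust = -1, size = 3) +
  theme_bw() +
  labs(x = "Growth in per capita income",
       y = "Incumbent party vote share (%)",
       size = "Fatalities per million")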
Let’s evaluate the plot alongside the residuals from our first regression model (the model with growth as the only predictor). I sorted these by the value of the residual. Recall that the residual is the difference between the case’s observed score and the case’s predicted score. If the model perfectly predicts the case’s score, the predicted and observed values are identical, and the residual is zero. The further the residual is from zero (in either the positive or negative direction), the larger the discrepancy between what our model predicts the score will be and the actual observed score.
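A sketch of one way the sorted table below can be assembled — get_regression_points() is from the moderndive package, as used elsewhere in this module, and the join pulls fatalities and wars back in from the original dataframe:

bp_mod1 %>%
  get_regression_points(ID = "year") %>%
  left_join(select(bp, year, fatalities, wars), by = "year") %>%
  select(year, vote, growth, fatalities, wars, residual) %>%
  arrange(residual)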
year | vote | growth | fatalities | wars | residual |
---|---|---|---|---|---|
1952 | 44.55 | 3.033 | 206 | Korean | -9.492 |
1968 | 49.59 | 3.264 | 174 | Vietnam | -5.142 |
2000 | 50.27 | 3.335 | 0 | none | -4.671 |
1992 | 46.55 | 1.376 | 0 | none | -2.566 |
1980 | 44.69 | 0.750 | 0 | none | -2.561 |
1976 | 48.95 | 1.432 | 1 | Vietnam | -0.332 |
2004 | 51.24 | 2.186 | 4 | Iraq | -0.284 |
1988 | 53.90 | 2.891 | 0 | none | 0.280 |
2016 | 51.11 | 1.895 | 1 | Afghanistan + Iraq | 0.452 |
2008 | 46.31 | 0.169 | 9 | Iraq | 0.786 |
2012 | 51.96 | 1.774 | 5 | Afghanistan | 1.661 |
1984 | 59.17 | 4.120 | 0 | none | 1.895 |
1960 | 49.92 | 0.577 | 0 | Vietnam | 3.183 |
1964 | 61.35 | 4.390 | 2 | Vietnam | 3.269 |
1996 | 54.74 | 1.880 | 0 | none | 4.127 |
1972 | 61.79 | 4.174 | 0 | none | 4.351 |
1956 | 57.75 | 2.584 | 0 | none | 5.044 |
Notice that the two largest residuals (which are for 1952 and 1968) are both negative – indicating that the model based on growth in income alone over-predicts the vote share that the incumbent party garnered in these years. I colored these points in blue in the figure above to highlight them. Viewing the figure and the table above, we see that these two elections had something in common: during the preceding presidential term, there was a large loss of American military troops to war – the Korean War for 1952 and the Vietnam War for 1968.
The Bread and Peace Model asserts that this large loss of American troops will drive down the public’s affinity for the incumbent party during the next election. Thus, both growth in income and the loss of American troops must be considered.
We’re going to build on this model to create a more complex model next. In order to compare the predicted and residual values across models, I am going to request these values from the first model, save them in a new dataframe, and rename them so that they have names specific to model 1. Specifically, I will give them a mod1 extension and save the result as a dataframe called bp_mod1_pred to be used later.
bp_mod1_pred <- bp_mod1 %>%
  get_regression_points(ID = "year") %>%
  rename(yhat_mod1 = vote_hat, resid_mod1 = residual) %>%
  select(year, yhat_mod1, resid_mod1)
Now, let’s fit a second model that adds fatalities to the regression model as an additional predictor.
Bread and Peace Model 2 – growth in income + fatalities
Adding fatalities to our regression model changes our equation. The equations for model 1 (without fatalities), and model 2 (with fatalities) are presented below, where x1 is growth and x2 is fatalities.
Model 1: \(\hat{y} = b_{0}+b_{1}x_1\)

Model 2: \(\hat{y} = b_{0}+b_{1}x_1 + b_{2}x_2\)
Notice that in model 2 we now have two predictors x1 and x2 – representing growth and fatalities respectively. Therefore, model 2 has three regression estimates:
- \(b_{0}\) is the intercept, defined as the predicted value of y when all x variables equal 0
- \(b_{1}\) is the slope for the first predictor, growth
- \(b_{2}\) is the slope for the second predictor, fatalities
In order to add an additional predictor to our lm() model in R, we just list the new predictor after the first with a + sign in between.
bp_mod2 <- bp %>%
  lm(vote ~ growth + fatalities, data = .)
To begin, let’s examine the \(R^2\), printed in the get_regression_summaries() output. Recall that the \(R^2\) indicates the proportion of the variability in the outcome that is explained by the predictors in the model.
bp_mod2 %>% get_regression_summaries()
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
---|---|---|---|---|---|---|---|---|
0.771 | 0.738 | 6.37871 | 2.525611 | 2.783 | 23.577 | 0 | 2 | 17 |
The R-squared or \(R^2\) increases substantially with the addition of fatalities, going from about 49% in the first model to about 77% in the second model. That is, the model with fatalities explains quite a bit more of the variability in the vote scores than growth alone.
Now, let’s take a look at the estimates of the intercept and slope for model 2.
bp_mod2 %>% get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 44.951 | 1.440 | 31.214 | 0.000 | 41.862 | 48.040 |
growth | 3.478 | 0.556 | 6.252 | 0.000 | 2.285 | 4.671 |
fatalities | -0.047 | 0.011 | -4.139 | 0.001 | -0.071 | -0.023 |
We will focus on the columns labeled term and estimate for now. With one predictor, our simple linear regression model yields a regression line which is described by an intercept and a slope. With two predictors, our multiple linear regression model yields a regression plane. Each case has a score on both predictor variables, and these are used together to predict the outcome.
\[\hat{y} = b_{0} + b_{1}x_1 + b_{2}x_2\] \[\hat{y} = 44.95 + 3.48x_1 + (-.047)x_2\]
- The intercept (44.95) represents the predicted score of vote when all predictors are zero. Thus, if there was 0 growth in income during the preceding term and 0 war-related fatalities (i.e., growth = 0 and fatalities = 0), then our model predicts that the incumbent party will win 44.95% of the votes.
With more than one predictor in the regression model, the interpretation of each regression slope is different. Each slope is interpreted as the effect of the variable, holding constant all other predictors in the model.
Consider the slope estimate for the effect of growth. Holding constant fatalities (i.e., comparing years when fatalities are the same), each 1 unit increase in growth is associated with a 3.48 unit increase in vote. In model 1, our model without fatalities, the slope for growth was 2.98. Once fatalities is taken into account, we anticipate more of a boost in votes for each 1 unit increase in income growth.
Now consider the slope estimate for the effect of fatalities. Holding constant growth (i.e., comparing years when growth is the same), each 1 unit increase in fatalities is associated with a .047 unit decrease in vote. In other words, with growth in income being equal, war-related fatalities tend to have a negative impact on the votes garnered by the incumbent party – more deaths lead to fewer votes.
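To see the “holding constant” interpretation in action, we can ask the fitted model for predictions on hypothetical cases that differ on only one predictor. predict() is base R; the values below are made up for illustration:

# Two hypothetical elections: same fatalities, growth differs by 1 unit
new_cases <- tibble(growth = c(2, 3), fatalities = c(50, 50))
predict(bp_mod2, newdata = new_cases)
# The two predictions differ by exactly the slope for growth, about 3.48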
We can also examine the predicted values and residuals to see how they’ve shifted with the addition of fatalities. As with model 1, I will save the predicted and residual values, and rename them to be specific to model 2 (i.e., specify a mod2 extension).
bp_mod2_pred <- bp_mod2 %>%
  get_regression_points(ID = "year") %>%
  rename(yhat_mod2 = vote_hat, resid_mod2 = residual) %>%
  select(year, yhat_mod2, resid_mod2)
year | yhat_mod2 | resid_mod2 |
---|---|---|
1952 | 45.835 | -1.285 |
1956 | 53.937 | 3.813 |
1960 | 46.957 | 2.963 |
1964 | 60.127 | 1.223 |
1968 | 48.142 | 1.448 |
1972 | 59.470 | 2.320 |
1976 | 49.886 | -0.936 |
1980 | 47.558 | -2.868 |
1984 | 59.279 | -0.109 |
1988 | 55.005 | -1.105 |
1992 | 49.738 | -3.188 |
1996 | 51.489 | 3.251 |
2000 | 56.550 | -6.280 |
2004 | 52.366 | -1.126 |
2008 | 45.117 | 1.193 |
2012 | 50.888 | 1.072 |
2016 | 51.495 | -0.385 |
Note that the calculation of the predicted score (i.e., yhat_mod2) requires plugging the case’s score on both predictor variables into our equation. For example, to obtain the predicted score for 1952 (a year in which growth = 3.03 and fatalities = 206), we use:
\[\hat{y} = 44.95 + (3.48\times3.03) + (-.047\times206) = 45.84\]
As with a SLR, the residual is simply the difference between the observed and predicted score. For 1952, that is 44.55 - 45.84 = -1.29.
To make sure you understand where these numbers come from, practice calculating the predicted score and residual for a few of the other years.
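One way to check your work is to pull the estimates straight from the fitted model with coef() — a sketch in base R:

b <- coef(bp_mod2)
yhat_1952  <- b["(Intercept)"] + b["growth"] * 3.0325 + b["fatalities"] * 206
resid_1952 <- 44.55 - yhat_1952   # observed minus predicted, about -1.29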
Now that the predicted and residual scores are saved for each model, we can merge them together. The first 5 columns represent the observed variables in the dataframe; the last four represent the predicted and residual scores for models 1 and 2 (those ending in mod1 are from model 1, and those ending in mod2 are from model 2).
compare_bp <- bp %>%
  left_join(bp_mod1_pred, by = "year") %>%
  left_join(bp_mod2_pred, by = "year") %>%
  select(year, vote, growth, wars, fatalities, yhat_mod1, yhat_mod2, resid_mod1, resid_mod2)
year | vote | growth | wars | fatalities | yhat_mod1 | yhat_mod2 | resid_mod1 | resid_mod2 |
---|---|---|---|---|---|---|---|---|
1952 | 44.55 | 3.0325 | Korean | 206 | 54.042 | 45.835 | -9.492 | -1.285 |
1956 | 57.75 | 2.5836 | none | 0 | 52.706 | 53.937 | 5.044 | 3.813 |
1960 | 49.92 | 0.5768 | Vietnam | 0 | 46.737 | 46.957 | 3.183 | 2.963 |
1964 | 61.35 | 4.3904 | Vietnam | 2 | 58.081 | 60.127 | 3.269 | 1.223 |
1968 | 49.59 | 3.2644 | Vietnam | 174 | 54.732 | 48.142 | -5.142 | 1.448 |
1972 | 61.79 | 4.1745 | none | 0 | 57.439 | 59.470 | 4.351 | 2.320 |
1976 | 48.95 | 1.4324 | Vietnam | 1 | 49.282 | 49.886 | -0.332 | -0.936 |
1980 | 44.69 | 0.7495 | none | 0 | 47.251 | 47.558 | -2.561 | -2.868 |
1984 | 59.17 | 4.1195 | none | 0 | 57.275 | 59.279 | 1.895 | -0.109 |
1988 | 53.90 | 2.8907 | none | 0 | 53.620 | 55.005 | 0.280 | -1.105 |
1992 | 46.55 | 1.3764 | none | 0 | 49.116 | 49.738 | -2.566 | -3.188 |
1996 | 54.74 | 1.8797 | none | 0 | 50.613 | 51.489 | 4.127 | 3.251 |
2000 | 50.27 | 3.3349 | none | 0 | 54.941 | 56.550 | -4.671 | -6.280 |
2004 | 51.24 | 2.1860 | Iraq | 4 | 51.524 | 52.366 | -0.284 | -1.126 |
2008 | 46.31 | 0.1691 | Iraq | 9 | 45.524 | 45.117 | 0.786 | 1.193 |
2012 | 51.96 | 1.7744 | Afghanistan | 5 | 50.299 | 50.888 | 1.661 | 1.072 |
2016 | 51.11 | 1.8951 | Afghanistan + Iraq | 1 | 50.658 | 51.495 | 0.452 | -0.385 |
Take a look at 1952 and 1968, the two years with the largest residuals in model 1. In both cases, the predicted value from model 2 (labeled yhat_mod2) is much closer to the observed value (vote), and thus the residuals are drastically reduced compared to model 1. That is, the residuals move closer to zero – from -9.49 to -1.29 for 1952, and from -5.14 to 1.45 for 1968. This indicates that once we add fatalities to the model, we can more accurately predict the outcome. Comparison of the two models clearly indicates that both growth in income and fatalities of American troops should be included in the regression model to predict the incumbent party’s vote share in presidential elections.
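Since both sets of residuals now live in one dataframe, we can also summarize each model’s overall accuracy. As a sketch, the root mean squared errors computed below should match the rmse column reported earlier by get_regression_summaries():

compare_bp %>%
  summarize(rmse_mod1 = sqrt(mean(resid_mod1^2)),   # about 3.77
            rmse_mod2 = sqrt(mean(resid_mod2^2)))   # about 2.53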
Returning to the premise of the Bread and Peace Model, we see that our fitted model is an excellent match to the theory: In a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.
For a video recap of the Bread and Peace regression model example, please watch the video below.
To finish up the first part of this module, please watch the following two videos from Crash Course: Statistics on Prediction.
MLR goal #2 – adjusting for confounders
An important adage in statistical modeling is “correlation does not equal causation.” Just because variable x is associated with variable y does not mean that x causes y. In this second part of the module, we will consider how MLR can be used to better understand causality.
Introduction to the data
Imagine that you want to study the effect of a mentoring program for first generation college students on their GPA at the end of their junior year. Students are matched with leading scientists in their field of study and mentored for a period of 3 months during the summer between their sophomore and junior years.
You believe that students who participate in the mentoring program will have a higher GPA at the end of their junior year than students who do not participate.
You compile academic records for all of the first generation college students in their sophomore year (N = 1000); 75 of these students go on to participate in the mentoring program. We’ll refer to the 75 students who received the mentoring program as the treatment group and the 925 students who did not receive the program as the control group. The mentoring program is delivered. Then, at the end of the junior year, you compile the academic records for all 1000 students again to measure the effectiveness of the intervention. The GPA in the sophomore year is the pre-intervention measure of academic performance, and the GPA in the junior year is the post-intervention measure.
Let’s imagine two different designs for determining who receives the treatment in this study.
Design 1: You put the names of all 1000 eligible students in a hat, shake them thoroughly, and then draw 75 names. These students receive the mentoring program; the remainder serve as the control group.
Design 2: You advertise the mentoring program to all 1000 eligible students. The first 75 students to respond to your announcement receive the mentoring program; the remainder serve as the control group.
Assessment of treatment effects under the two designs
Design 1 – random assignment
In design 1, the treatment is randomly assigned. Thus, the probability of receiving the treatment is unrelated to any other variables measured prior to the start of the program (i.e., background variables or pre-treatment selection variables). For example, GPA in the sophomore year (prior to the start of the program) will be equal between those who were randomly selected to receive the program (treatment group) and those who were not (control group), unless the randomization strategy failed. In this example, we will assume that the randomization strategy was successful. The benefit of a randomized controlled trial is that all background variables should be equal in the treatment and control groups because of the random assignment.
Let’s begin by reading in the dataframe for design 1.
d1 <- here("data", "mentor_design1.Rds") %>%
  read_rds()
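Before comparing outcomes, we can verify that randomization balanced the groups at baseline. This is a sketch that assumes the column names condition and gpa_pre used throughout this example:

d1 %>%
  group_by(condition) %>%
  summarize(mean_gpa_pre = mean(gpa_pre),  # should be nearly equal across groups
            n = n())                       # 75 treatment, 925 control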
The change in GPA from pre to post intervention will demonstrate whether the treatment is effective for design 1 – the randomized controlled trial. If we find that GPA at the end of the junior year is greater among students who receive the treatment, as compared to students in the control group, then we have evidence that the mentoring program works.
To evaluate this, let’s calculate the average GPA at pre and post intervention for the treatment and control conditions, and then plot the resulting information using a point and line plot.
d1 %>%
  pivot_longer(cols = contains("gpa"), names_to = "time", values_to = "gpa") %>%
  group_by(condition, time) %>%
  summarize(avg_gpa = mean(gpa), .groups = "keep") %>%
  ungroup() %>%
  mutate(year = factor(time, levels = c("gpa_pre", "gpa_post"), labels = c("pre-intervention", "post-intervention"))) %>%
  ggplot(aes(x = year, y = avg_gpa, group = condition, color = condition)) +
  geom_point(aes(shape = condition)) +
  geom_line(aes(group = condition)) +
  theme_bw() +
  scale_y_continuous(limits = c(0, 4.0)) +
  labs(title = "Change in average GPA by treatment condition",
       x = "",
       y = "Average GPA")
This graph shows that at pre-intervention, the treatment and control groups did not differ in average GPA. This is expected due to the random assignment of condition. However, at post-intervention, there is a clear difference in average GPA between the treatment and control groups. The average GPA for students in the treatment condition is higher at post-intervention compared to pre-intervention. However, the average GPA for students in the control condition is lower at post-intervention as compared to pre-intervention. The difference in average GPA at post-intervention between the treatment and control groups is indicative of a beneficial treatment effect of the mentoring program.
Design 2 – self-selection
In design 2, individuals select themselves into the program. Thus, background variables may differ between the treatment and control groups. For example, students with a higher GPA in their sophomore year (prior to the start of the program) may be more likely to sign up for the mentoring program than students with a lower GPA in their sophomore year.
Let’s read in the dataframe for design 2.
d2 <- here("data", "mentor_design2.Rds") %>%
  read_rds()
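The same baseline check as in design 1 now tells a different story: with self-selection, we expect the treatment group to start with a higher average GPA. Again, a sketch:

d2 %>%
  group_by(condition) %>%
  summarize(mean_gpa_pre = mean(gpa_pre),  # expected to be higher in the treatment group
            n = n())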
Let’s calculate the average GPA at pre and post intervention for the treatment and control conditions, and then plot the resulting information using a point and line plot.
d2 %>%
  pivot_longer(cols = contains("gpa"), names_to = "time", values_to = "gpa") %>%
  group_by(condition, time) %>%
  summarize(avg_gpa = mean(gpa), .groups = "keep") %>%
  ungroup() %>%
  mutate(year = factor(time,
                       levels = c("gpa_pre", "gpa_post"),
                       labels = c("pre-intervention", "post-intervention"))) %>%
  ggplot(aes(x = year, y = avg_gpa, group = condition, color = condition)) +
  geom_point(aes(shape = condition)) +
  geom_line(aes(group = condition)) +
  theme_bw() +
  scale_y_continuous(limits = c(0, 4.0)) +
  labs(title = "Change in average GPA by treatment condition",
       x = "",
       y = "Average GPA")
This graph shows that at pre-intervention, the treatment and control groups differed in average GPA. Specifically, we see that students in the treatment group had a higher GPA before the intervention began. This might suggest that students with a higher GPA in their sophomore year were more likely to sign up for the mentoring program. The average GPA for students in the treatment condition is higher at post-intervention compared to pre-intervention. However, the average GPA for students in the control condition is lower at post-intervention as compared to pre-intervention.
At post-intervention, there is a clear difference in average GPA between the treatment and control groups. However, because the groups were not equivalent at pre-intervention, based on this graph, we cannot tell if the difference at post-intervention is due simply to pre-existing differences that carried forward to post-intervention or if the difference at post-intervention is attributable to participation in the mentoring program.
Contrasting design 1 and design 2
In design 1, treatment was randomly assigned, and thus no meaningful pre-existing differences between the treatment and control group exist. Therefore, when GPA is assessed after the program is over, we can be confident that the difference in GPA between the treatment and control group is due to the mentoring program.
In design 2, when GPA is assessed after the program is over, we cannot be confident that the difference in GPA between the treatment and control group is due to the mentoring program because the groups were not equivalent at the start. We don’t know if the difference in GPA at the end of the junior year is due to the mentoring program or due to pre-existing differences between the groups.
In design 2, treatment condition and pre-intervention GPA are confounded. In the English language, the word confounded means to throw into confusion or perplexity. In terms of our model, the treatment condition and pre-treatment variables are confused; as a result, we can’t separate the effect of the treatment from the effects of the pre-treatment variables.
Thus, in design 2, we refer to background variables that may lead to intervention selection (e.g., GPA in the sophomore year) as confounding variables. Because the treatment and control groups likely differ on these background variables prior to the start of the program, our study suffers from selection bias – and the comparison of GPA at the end of the junior year between the treatment and control groups will likely be biased and inaccurate.
Regression adjustment
While randomizing the treatment provides the best opportunity to assess treatment effects, in cases where the intervention cannot be randomly assigned, a MLR can help us to better uncover the true treatment effect. If we aim to identify a potential cause-and-effect relationship between an exposure (x, e.g., the mentoring program) and an outcome (y, e.g., GPA at the end of the junior year), we must carefully identify potential confounders. Once identified, we must measure them and then adjust for them.
A confounder may be adjusted for through a variety of statistical techniques, one of which is regression adjustment. In this part of the module, we’ll work through an example in which a confounder of an exposure-outcome effect is identified and included as an additional covariate in a MLR model. We’ll use our example data from design 2 of the mentoring study.
The figure below depicts our model.
We see in this figure that the effect of the treatment on GPA at the end of the program (junior year) is confounded by GPA prior to the program (sophomore year). The confounder causes both selection into the program (the key predictor) and GPA at the end of the program (the outcome).
Let’s first fit a simple linear regression model that ignores the confounder. First, we need to create a new variable to represent our predictor – treatment condition. We’ll call it treated. For people in the treatment group, treated will equal 1. For people in the control group, treated will equal 0.
d2_mod1 <- d2 %>%
  mutate(treated = case_when(condition == "treatment" ~ 1,
                             condition == "control" ~ 0)) %>%
  lm(gpa_post ~ treated, data = .)

d2_mod1 %>% get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 1.998 | 0.015 | 136.583 | 0 | 1.969 | 2.026 |
treated | 1.178 | 0.053 | 22.062 | 0 | 1.073 | 1.283 |
The intercept in this model represents the average GPA at the end of the junior year for the control group (treated = 0).
The slope for treated represents the average difference in GPA at the end of the junior year for the treatment group as compared to the control group. That is, the slope is the expected difference in y for a one unit increase in x, where a one unit increase in treated means going from a score of 0 (being in the control group) to a score of 1 (being in the treatment group). Those students who participated in the mentoring program had a GPA that was nearly 1.2 units higher than students who didn’t receive the mentoring program.
To calculate the average GPA at the end of the junior year for the treatment group, we add the slope estimate to the intercept (1.998 + 1.178 = 3.176). Thus, we predict the average GPA for students in the control condition to be 2.0 and the average GPA for students in the treatment condition to be 3.2.
- Predicted score for control group: \(\hat{y} = b_{0}+b_{1}x_1 = 1.998 + 1.178\times 0 = 2.00\)
- Predicted score for treatment group: \(\hat{y} = b_{0}+b_{1}x_1 = 1.998 + 1.178\times 1 = 3.18\)
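Because the model’s only predictor is the group indicator, these predicted values are simply the observed group means of gpa_post, which we can verify directly — a sketch:

d2 %>%
  group_by(condition) %>%
  summarize(mean_gpa_post = mean(gpa_post))  # about 2.00 (control) and 3.18 (treatment)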
Because of the pre-intervention differences between the intervention and control groups, we don’t know how much of the difference in GPA at post-intervention is due to the mentoring program and how much is due to pre-existing differences.
To help remedy this situation, let’s control (i.e., adjust) for the confounder by adding it as an additional predictor to the model.
d2_mod2 <- d2 %>%
  mutate(treated = case_when(condition == "treatment" ~ 1,
                             condition == "control" ~ 0)) %>%
  lm(gpa_post ~ treated + gpa_pre, data = .)

d2_mod2 %>% get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -0.394 | 0.051 | -7.786 | 0 | -0.494 | -0.295 |
treated | 0.687 | 0.031 | 22.043 | 0 | 0.626 | 0.748 |
gpa_pre | 0.952 | 0.020 | 47.842 | 0 | 0.913 | 0.991 |
The resultant regression equation is written as follows, where x1 equals the treatment condition (treated) and x2 equals the pre-intervention GPA (gpa_pre):
\[\hat{y} = b_{0}+b_{1}x_1+b_{2}x_2 = -.394 + .687x_1 + .952x_2\]
When more than one predictor is included in a regression model, each regression slope (i.e., for each predictor) represents the effect of that variable (i.e., the expected change in the outcome for a one unit increase in the predictor), holding constant all other predictors in the model.
Let’s see how this translates to our fitted MLR.
The intercept in the model represents the average GPA at the end of the junior year for the control group (treated = 0) among students with a pre-intervention GPA of 0 (i.e., the intercept is always the predicted value of y when all x variables equal 0). This isn’t terribly useful, as we do not have any students with a GPA near 0.
The slope for treated represents the average difference in GPA at the end of the junior year for the treatment group as compared to the control group, holding constant students’ GPA at pre-intervention. That is, comparing two students with the same GPA at pre-intervention, the student who participated in the mentoring program is predicted to have a GPA about .69 units higher than the student who didn’t. Compared to the previous model, we see a reduced treatment effect (1.18 in the model without pre-intervention GPA, and .69 in the model with pre-intervention GPA). Here, we’ve made an attempt to remove the impact of pre-intervention GPA and address the fact that students with higher GPAs in their sophomore year were more likely to choose to participate in the mentoring program.
The slope for gpa_pre represents the expected change in gpa_post for a one unit increase in gpa_pre, holding constant intervention condition – i.e., comparing people in the treatment condition to one another, or comparing people in the control condition to one another.
We can use the regression model estimates to calculate the predicted average GPA for students in the treatment condition compared to students in the control condition, holding constant GPA. Let’s hold constant pre-intervention GPA at 3.0 as an example.
- Predicted score for control group: \(\hat{y} = b_{0}+b_{1}x_1+b_{2}x_2 = -.394 + .687\times 0 + .952\times 3.0 = 2.46\)
- Predicted score for treatment group: \(\hat{y} = b_{0}+b_{1}x_1+b_{2}x_2 = -.394 + .687\times 1 + .952\times 3.0 = 3.15\)
The difference between the two predicted scores (3.15 - 2.46) equals .69. Notice that this is the estimate for the slope for treated. The regression slope for treated provides us the expected difference in GPA at post-intervention between the intervention and control group, holding constant GPA at pre-intervention.
You can plug in any value for GPA at pre-intervention, and you will get the same difference between the two predicted scores. For example, hold GPA at pre-intervention at 2.0 and compare the two predicted scores.
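A small sketch makes this concrete: compute the predicted treatment–control difference at any pre-intervention GPA and confirm it always equals the treated slope. The helper function diff_at() is my own, for illustration:

b <- coef(d2_mod2)
diff_at <- function(gpa_pre) {
  treat   <- b["(Intercept)"] + b["treated"] * 1 + b["gpa_pre"] * gpa_pre
  control <- b["(Intercept)"] + b["treated"] * 0 + b["gpa_pre"] * gpa_pre
  unname(treat - control)
}
diff_at(2.0)   # about 0.687
diff_at(3.0)   # identical: the slope for treated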
Regression adjustment allows us to compare apples to apples – that is, we can see what the effect of the intervention is for two students who had the same pre-intervention GPA.
To visualize the results, let’s create a scatter plot that relates GPA at pre-intervention to GPA at post-intervention by treatment condition. This is easily accomplished using the geom_parallel_slopes() function from the moderndive package.
d2 %>%
  ggplot(aes(x = gpa_pre, y = gpa_post, color = condition)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  theme_bw() +
  labs(title = "Change in average GPA by treatment condition",
       x = "GPA in Sophomore Year",
       y = "GPA in Junior Year")
The graphs below illustrate the two regression slopes – each indicating the question being answered when evaluating its coefficient.
For a video recap of regression adjustment for a confounder, please watch the video below.
In the social sciences, we often cannot conduct randomized controlled trials to evaluate the effects of interest. In these cases, we must carefully understand potential confounders, measure them, and include them as control variables in our models. In this way, we can estimate the effects of interest more accurately and with less bias.
Please watch the following two videos from Crash Course Statistics that describe the benefits and drawbacks of experimental designs and observational studies.
This concludes Module 7. With the information, techniques and skills that you learned in Modules 5 through 7, you now have the ability to estimate basic statistical models. In the remainder of this course we will study the methods needed to evaluate the precision of our statistical models, and the level of confidence that we can have that the results from our study samples translate to the population.