PSY350, Module 5
Simple Linear Regression
Module road map
Learning objectives
- Describe the utility of a straight line for relating two variables to one another
- Explain the least squares criterion for finding the best fit line for a simple linear regression (SLR)
- Write an equation to describe the best fit line with an intercept and slope
- Interpret the intercept and slope estimates of a best fit line
- Explain what a residual is and its role in a SLR
- Calculate predicted outcomes based on the best fit line
- Fit a SLR in R
- Identify and interpret the intercept, slope, and \(R^2\) in the output
- Plot the results of a SLR
- Define the correlation coefficient
- Describe the relationship between a SLR slope and a correlation coefficient
Readings
- Read Chapter 5 of your Textbook (excluding section 5.2)
Overview
In the social sciences, we often want to know whether two variables are associated with one another. By associated we mean that knowing the value of one variable tells us something about the value of the other variable. In this module, we will focus on modeling the relationship between two continuous variables. We will use a technique called simple linear regression to study this type of relationship.
Recall this example from Module 2. The Centers for Disease Control and Prevention compiled these data in order to determine if countries that adopted more COVID-19 mitigation policies had fewer deaths.
Simple linear regression (or SLR) helps us to understand the extent to which two variables are associated by estimating the straight line that relates them to one another. SLR is a statistical technique that allows the analyst to predict scores for some continuous outcome (y) using a single predictor (x).
In the plot below, there appears to be a negative relationship: more mitigation strategies were associated with fewer deaths.
To build your intuition about regression, please watch the following Crash Course: Statistics video on Regression. Please note that the video includes an overview of fitting regression models, as well as inference testing (with topics such as degrees of freedom, test statistics, and test distributions). You don’t need to worry about the inference parts for now – we’ll revisit these techniques in Module 11.
In this module, we will use linear models to describe the relationship between two variables. Though you may not have studied or used linear models for prediction before, we often use models to predict outcomes in everyday life. In particular we often use mathematical models to solve problems or make conversions. Mathematical models are deterministic. Once the “rule” is known, the mathematical model can be used to perfectly fit the data. That is, we can perfectly predict the outcome.
Examples:
1. Perimeter of a square: \(4 \times (length\,of\,side)\)
2. Area of a square: \((length\,of\,side)^2\)
3. Convert Fahrenheit to Celsius: \(C = (F - 32)\cdot\frac{5}{9}\) (see the R sketch below)
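For instance, the third conversion can be written as a short R function. This is just an illustrative sketch (the function name f_to_c is made up for this example); because the rule is deterministic, the output is exact every time.

# A deterministic (mathematical) model: the rule gives the outcome exactly
f_to_c <- function(f) {
  (f - 32) * 5 / 9
}

f_to_c(32)    # 0 degrees Celsius
f_to_c(212)   # 100 degrees Celsius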
While mathematical models are extremely useful, in this course, we will focus on statistical models. Unlike mathematical models, statistical models are not deterministic. They take into account that we usually don’t have all important predictors, that we rarely perfectly measure the variables that we do have, and that people (or organizations, schools, animals, etc.) act differently.
Statistical models allow for:
1. Excluded variables
2. Measurement error
3. Individual variation
The formula for a statistical model has a residual to account for variation from these sources:
Outcome = Systematic Component + Residual
Statistical techniques allow us to explain variation in our outcome (the systematic component) in the context of what remains unexplained (the residual).
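To make this formula concrete before we turn to real data, here is a minimal simulation sketch in R (the numbers are made up for illustration) in which an outcome is built from a systematic component plus a residual.

set.seed(350)                             # for reproducibility

n <- 100
x <- rnorm(n, mean = 2, sd = 1)           # a predictor
systematic <- 45 + 3 * x                  # systematic component: intercept + slope * x
residual <- rnorm(n, mean = 0, sd = 4)    # unexplained variation
outcome <- systematic + residual          # Outcome = Systematic Component + Residual

Unlike the mathematical models above, knowing x no longer pins down the outcome exactly; the residual captures excluded variables, measurement error, and individual variation.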
Please watch the video below for an overview of simple linear regression.
Introduction to the data
In this module we will use a linear model to predict election outcomes. The people who we elect to public office have a large influence on the health and prosperity of our population. Our elected officials and their administrations have great power to either build up or tear down programs, practices and policies that promote health and well-being for all.
As one example, we have seen great differences in how state and local elected officials have handled the COVID-19 pandemic, in terms of mandates for public health protections, aid to individuals, and the roll out of the vaccine.
Given the critical significance of government and policy on public health, understanding the motivations of voters in electing public officials is an important public health matter. In this module, we will explore one well-studied and well-respected, but relatively simple, model for forecasting election outcomes. It’s called the Bread and Peace Model, and was formulated by Douglas A. Hibbs. He describes his model in detail here; the gist of his model is described below:
Postwar US presidential elections can for the most part be interpreted as a sequence of referendums on the incumbent party’s record during its four-year mandate period. In fact aggregate two-party vote shares going to candidates of the party holding the presidency during the postwar era are well explained by just two fundamental determinants: (1) Positively by weighted-average growth of per capita real disposable personal income over the term. (2) Negatively by cumulative US military fatalities (scaled to population) owing to unprovoked, hostile deployments of American armed forces in foreign wars.
In other words, in a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.
We’ll start simple in this module. We will consider a single outcome (election results of US presidential elections) and a single predictor (growth in personal income over the preceding presidential term). To estimate the Bread and Peace model, we will use data compiled by Drew Thomas (2020).
The following variables are in the dataframe:
- year is the presidential election year
- vote is the percentage share of the two-party vote received by the incumbent party’s candidate
- growth is the quarter-on-quarter percentage rate of growth of per capita real disposable personal income expressed at annual rates
- fatalities denotes the cumulative number of American military fatalities per millions of US population
- wars lists the wars of the term if fatalities > 0
- inc_party_candidate is the name of the incumbent party candidate
- other_party_candidate is the name of the other party candidate
- inc_party is an indicator of the incumbent party (D = Democrat, R = Republican)
We will estimate how well growth in income of US residents during the preceding presidential term predicts the share of the vote that the incumbent party receives. That is, we will determine if growth is predictive of vote. In the equations and descriptions below, I will refer to the predictor (growth) as the x variable, and the outcome (vote) as the y variable.
Let’s begin by importing the data.
bp <- read_rds(here("data", "bread_peace.Rds"))
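Note that the code in this module assumes a few packages have already been loaded. If you are following along in R, a typical setup (based on the functions used here) looks like this:

library(tidyverse)    # read_rds(), the pipe, dplyr, and ggplot2
library(here)         # here(), for locating the data file
library(skimr)        # skim()
library(ggrepel)      # geom_label_repel()
library(moderndive)   # get_regression_table() and related helpers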
The glimpse() function presents the key features of the data set. Each row of the data represents a different presidential election – starting in 1952 and ending in 2016. There are 17 elections in total to consider.
bp %>% glimpse()
## Rows: 17
## Columns: 8
## $ year <dbl> 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, …
## $ vote <dbl> 44.55, 57.75, 49.92, 61.35, 49.59, 61.79, 48.95,…
## $ growth <dbl> 3.0325, 2.5836, 0.5768, 4.3904, 3.2644, 4.1745, …
## $ fatalities <dbl> 206, 0, 0, 2, 174, 0, 1, 0, 0, 0, 0, 0, 0, 4, 9,…
## $ wars <chr> "Korean", "none", "Vietnam", "Vietnam", "Vietnam…
## $ inc_party_candidate <chr> "Stevenson", "Eisenhower", "Nixon", "Johnson", "…
## $ other_party_candidate <chr> "Eisenhower", "Stevenson", "Kennedy", "Goldwater…
## $ inc_party <chr> "D", "R", "R", "D", "D", "R", "R", "D", "R", "R"…
Let’s obtain some additional descriptive statistics with skim().
bp %>% skim()
Name | Piped data |
Number of rows | 17 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
wars | 0 | 1 | 4 | 18 | 0 | 6 | 0 |
inc_party_candidate | 0 | 1 | 4 | 11 | 0 | 15 | 0 |
other_party_candidate | 0 | 1 | 4 | 11 | 0 | 17 | 0 |
inc_party | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1984.00 | 20.20 | 1952.00 | 1968.00 | 1984.00 | 2000.00 | 2016.00 | ▇▆▆▆▇ |
vote | 0 | 1 | 51.99 | 5.44 | 44.55 | 48.95 | 51.11 | 54.74 | 61.79 | ▅▇▃▁▃ |
growth | 0 | 1 | 2.34 | 1.28 | 0.17 | 1.43 | 2.19 | 3.26 | 4.39 | ▆▆▇▇▆ |
fatalities | 0 | 1 | 23.65 | 62.92 | 0.00 | 0.00 | 0.00 | 4.00 | 206.00 | ▇▁▁▁▁ |
Since this is quite a small dataframe, we can easily view the whole dataframe at once.
year | vote | growth | fatalities | wars | inc_party_candidate | other_party_candidate | inc_party |
---|---|---|---|---|---|---|---|
1952 | 44.55 | 3.0325 | 206 | Korean | Stevenson | Eisenhower | D |
1956 | 57.75 | 2.5836 | 0 | none | Eisenhower | Stevenson | R |
1960 | 49.92 | 0.5768 | 0 | Vietnam | Nixon | Kennedy | R |
1964 | 61.35 | 4.3904 | 2 | Vietnam | Johnson | Goldwater | D |
1968 | 49.59 | 3.2644 | 174 | Vietnam | Humphrey | Nixon | D |
1972 | 61.79 | 4.1745 | 0 | none | Nixon | McGovern | R |
1976 | 48.95 | 1.4324 | 1 | Vietnam | Ford | Carter | R |
1980 | 44.69 | 0.7495 | 0 | none | Carter | Reagan | D |
1984 | 59.17 | 4.1195 | 0 | none | Reagan | Mondale | R |
1988 | 53.90 | 2.8907 | 0 | none | Bush, Sr. | Dukakis | R |
1992 | 46.55 | 1.3764 | 0 | none | Bush, Sr. | Clinton, B. | R |
1996 | 54.74 | 1.8797 | 0 | none | Clinton, B. | Dole | D |
2000 | 50.27 | 3.3349 | 0 | none | Gore | Bush, Jr. | D |
2004 | 51.24 | 2.1860 | 4 | Iraq | Bush, Jr. | Kerry | R |
2008 | 46.31 | 0.1691 | 9 | Iraq | McCain | Obama | R |
2012 | 51.96 | 1.7744 | 5 | Afghanistan | Obama | Romney | D |
2016 | 51.11 | 1.8951 | 1 | Afghanistan + Iraq | Clinton, H. | Trump | D |
Let’s familiarize ourselves a bit more with the two variables we’ll consider here. First, the predictor (growth) ranges from 0.17% (in 2008, just after the financial crisis) to 4.4% (in 1964, one of the most prosperous times in US history, particularly for the wealthy). Second, the outcome (vote) ranges from 44.6% (in 1952, when the US was in the middle of the Korean War) to 61.8% (in 1972, when, riding a strong economy and moving toward the end of the Vietnam War, Richard Nixon won by a landslide).
Explore a linear relationship
Let’s take a look at the relationship between growth (our primary predictor) and vote (our primary outcome) using a scatterplot. We will map growth (the predictor) to the x-axis, and vote (the outcome) to the y-axis. I will label the datapoints with the election year.
bp %>%
  ggplot(aes(x = growth, y = vote)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = year),
    color = "grey35", fill = "white", size = 2, box.padding = 0.4,
    label.padding = 0.1) +
  theme_bw() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016",
       x = "Annualized per capita real income growth over the term (%)",
       y = "Incumbent party share of two-party vote (%)")
We always start an analysis by examining graphs of our variables. When two continuous variables are considered, a scatterplot is an excellent tool for visualization. It’s important to determine if the relationship appears linear (as opposed to curvilinear) and to note any possible outliers or other strange occurrences in the data.
If we want to summarize the relationship between x (growth) and y (vote), one option is to draw a straight line through the data points. This is accomplished in the plot below.
This straight line does a reasonable job of describing the relationship. Many relationships can be summarized with a straight line, and estimating that line is the primary purpose of a linear regression model.
When we draw a straight line through a set of data points, the line can be defined by an intercept and a slope.
The intercept is the value of y on the line when x = 0. Looking at our graph, there were actually no years during the observation period in which growth was 0; the closest is 2008, which had about 0.2% growth. Extrapolating back a bit, it looks like this straight line relating x and y (i.e., the linear model) would predict a score of about 45 (the incumbent party receives 45% of the vote share) if percent growth were 0.
The slope is the rate of change of the line.
Specifically, the slope is defined as the “rise over the run.” That is, the rise (i.e., change) in vote share for a one unit increase (i.e., run) in growth. Since it’s a straight line, we can pick any one-unit increase along the x-axis.
Let’s choose a “run” from 2 to 3 – that is, going from a 2 to a 3 on the x-axis.
When growth (the value on the x-axis) is 2, the value of vote looks to be about 51.
When growth is 3, the value of vote looks to be about 54.
That means as we go from growth = 2 to growth = 3 (a run of 1), vote share tends to increase by about 3 percentage points (54 minus 51 or a rise of 3). So the rise over the run = 3/1, corresponding to a slope of about 3. That means that for each one unit increase in growth, we predict the percent vote share to increase by about 3.
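You can check this rise-over-run reasoning with a couple of lines of R, using the intercept and slope of the best fit line that we will estimate shortly (45.021 and 2.975). This is just a quick arithmetic sketch:

yhat_2 <- 45.021 + 2.975 * 2   # predicted vote share when growth = 2 (about 51)
yhat_3 <- 45.021 + 2.975 * 3   # predicted vote share when growth = 3 (about 54)

yhat_3 - yhat_2                # the rise for a run of 1, which equals the slope (2.975)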
You might be wondering – “How was the location of the line on this graph determined?” If I asked you to manually draw the best fit line through these points, each student in the class would likely draw a line that was close to, but not exactly the same as, everyone else’s, and not exactly the same as the line that ggplot drew onto the graph for us. It is possible to fit an infinite number of straight lines to the data points in the scatterplot. However, we want to find the best fitting line.
The method that we use to find the best fit line is called the least squares criterion. We will study this method in the next part of this module.
Please view the video below to learn more about using a straight line to summarize the relationship between two continuous variables.
Estimate the best fit line
We can imagine many reasonable lines that could be drawn through the data points to relate the predictor to the outcome. For example, each of the lines on the graph below seems reasonable.
How do we determine the best fit line for the data? For any given x and y variable that are linearly related, there is indeed one best fitting line. The best fitting line is the line that allows x (growth) to predict y (vote) with the highest amount of accuracy. Let’s take a closer look at how we can determine the accuracy of any possible line drawn through the data.
The geom_smooth() function in R, specifying a linear model (method = “lm”) for y regressed on x (y ~ x), will overlay the best fit line. This is depicted in the graph below.
bp %>%
  ggplot(aes(x = growth, y = vote)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  ggrepel::geom_label_repel(aes(label = year),
    color = "grey35", fill = "white", size = 2, box.padding = 0.4,
    label.padding = 0.1) +
  theme_bw() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016",
       x = "Annualized per capita real income growth over the term (%)",
       y = "Incumbent party share of two-party vote (%)")
An equation for the best fit line
The best fitting line is defined by an equation consisting of an intercept (the predicted value of y when x = 0) and a slope (the predicted change in y for each one unit increase in x). This equation is depicted below. For this depiction, the y and x values are subscripted with an i to denote that each case (i.e., election year in our example) has a score for y and a score for x.
The best fit line for our data is represented by the following equation. The intercept, that is, the predicted score for y when x = 0, is 45.021. The slope, that is, the predicted change in y for each one unit increase in x, is 2.975.
\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975x_i\]
Note that you will sometimes also see the equation of a line written as y = mx + b, where b is the intercept (i.e., \({b_0}\)) and m is the slope (i.e., \({b_1}\)).
Predicted scores from the equation of the best fit line
Let’s use this equation to obtain a predicted score. We’ll use the 1988 election as an example. In this election, George Bush Sr. ran against Michael Dukakis. Coming off of Ronald Reagan’s second term, Bush was the incumbent party candidate. The economy was in pretty good shape; income growth was 2.89%. Using our equation, we’d predict Bush to garner 53.6% of the vote share.
\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975\times2.89=53.6\]

Take a look at the graph below: we see that this prediction (marked with an orange dot) is very close to what Bush actually garnered (53.9%). Note that I changed the best fit line to a dashed gray line to help you see the orange dot.
We can calculate the predicted score (also referred to as the fitted value or y-hat) for every case in the dataframe. Doing so yields the table below. The column labeled vote_hat provides the predicted score for every election year.
To practice and solidify this concept, please use the equation above to calculate y-hat by hand for a few of the years.
year | vote | growth | vote_hat |
---|---|---|---|
1952 | 44.55 | 3.033 | 54.042 |
1956 | 57.75 | 2.584 | 52.706 |
1960 | 49.92 | 0.577 | 46.737 |
1964 | 61.35 | 4.390 | 58.081 |
1968 | 49.59 | 3.264 | 54.732 |
1972 | 61.79 | 4.174 | 57.439 |
1976 | 48.95 | 1.432 | 49.282 |
1980 | 44.69 | 0.750 | 47.251 |
1984 | 59.17 | 4.120 | 57.275 |
1988 | 53.90 | 2.891 | 53.620 |
1992 | 46.55 | 1.376 | 49.116 |
1996 | 54.74 | 1.880 | 50.613 |
2000 | 50.27 | 3.335 | 54.941 |
2004 | 51.24 | 2.186 | 51.524 |
2008 | 46.31 | 0.169 | 45.524 |
2012 | 51.96 | 1.774 | 50.299 |
2016 | 51.11 | 1.895 | 50.658 |
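If you would like to check your hand calculations with code, a small sketch that applies the equation to every year is shown below (because the intercept and slope are rounded, the results may differ from the table by a hair).

bp %>%
  select(year, vote, growth) %>%
  mutate(vote_hat = 45.021 + 2.975 * growth)   # predicted vote share for each year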
Let’s consider another example – the 1968 election between Hubert Humphrey (Democrat) and Richard Nixon (Republican). Humphrey was the incumbent party candidate, and the US was in the midst of the Vietnam War. During the preceding term, economic growth was 3.3%, yet Humphrey secured only 49.6% of the vote and ultimately lost to Nixon. Based on our best fit line, we would have predicted Humphrey to receive 54.7% of the vote.
\[\hat{y_i} = 45.021+2.975\times3.26=54.7\]
This predicted score is marked by the orange dot in the graph below and labeled \(\hat{y_i}\). Humphrey’s actual vote share was much lower than our best fit line would have predicted – that is, given the relatively good economy, he did not perform nearly as well as we would have expected based on our linear model. To illustrate this point, I connected the predicted and observed vote with a dotted black line in the graph below.
Residual scores from the equation for the best fit line
Using the observed score and the predicted score for each case, we can calculate each case’s residual. The residual is simply the difference between a case’s observed score and their predicted score. Specifically, we subtract each case’s predicted score from their observed score. The residual is denoted by \({e_i}\) in the figure below. For the 1968 election, the residual is -5.14 (calculated as 49.59 - 54.732).
In the same way that we calculated the predicted score (y-hat) for each case, we can also calculate the residual for each case. These are all displayed in the table below. Please calculate a few yourself and map them onto the graph above to ensure you understand the concept.
year | vote | growth | vote_hat | residual |
---|---|---|---|---|
1952 | 44.55 | 3.033 | 54.042 | -9.492 |
1956 | 57.75 | 2.584 | 52.706 | 5.044 |
1960 | 49.92 | 0.577 | 46.737 | 3.183 |
1964 | 61.35 | 4.390 | 58.081 | 3.269 |
1968 | 49.59 | 3.264 | 54.732 | -5.142 |
1972 | 61.79 | 4.174 | 57.439 | 4.351 |
1976 | 48.95 | 1.432 | 49.282 | -0.332 |
1980 | 44.69 | 0.750 | 47.251 | -2.561 |
1984 | 59.17 | 4.120 | 57.275 | 1.895 |
1988 | 53.90 | 2.891 | 53.620 | 0.280 |
1992 | 46.55 | 1.376 | 49.116 | -2.566 |
1996 | 54.74 | 1.880 | 50.613 | 4.127 |
2000 | 50.27 | 3.335 | 54.941 | -4.671 |
2004 | 51.24 | 2.186 | 51.524 | -0.284 |
2008 | 46.31 | 0.169 | 45.524 | 0.786 |
2012 | 51.96 | 1.774 | 50.299 | 1.661 |
2016 | 51.11 | 1.895 | 50.658 | 0.452 |
We see then that each case has a residual – this represents the difference between what our model predicts the score will be and the actual observed score. With the residual, we can write an alternative version of our regression model that incorporates the residual:
\[{y_i} = b_0 + b_1x_i + e_i\]
Here, we replace \(\hat{y_i}\) with the actual observed y score (i.e., \({y_i}\)), and we add the case’s residual (represented as \({e_i}\)). In this way, the actual observed score for y for each case can be calculated using the intercept and slope for the best fit line (to obtain \(\hat{y_i}\)) and adding the case’s residual score. For example, to reproduce the observed score for vote (\({y_i}\)) for 1952, we use the following equation: \[{y_i} = 45.021 + 2.975\times3.033 + (-9.492) = 44.55\]
Use the same technique to recover vote for a couple of additional years to be sure you understand where the numbers come from.
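One way to check this identity for every year at once is the sketch below, which rebuilds the observed vote from the systematic component plus the residual (again using the rounded intercept and slope).

bp %>%
  select(year, vote, growth) %>%
  mutate(vote_hat = 45.021 + 2.975 * growth,   # systematic component
         residual = vote - vote_hat,           # each year's residual
         vote_rebuilt = vote_hat + residual)   # recovers the observed vote exactly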
The least squares criterion
To find the best fit line, we use the least squares criterion. The least squares criterion dictates that the best fitting line is the line that results in the smallest sum of squared residuals (i.e., each case’s residual is squared and then all squared residuals are summed across cases). For example, if we squared each year’s residual in the table above, then summed them across all 17 years, we would obtain the sum of squared residuals for our line.
We could draw every possible line through the data, calculate the sum of squared residuals for each line, and then choose the line that produces the smallest sum of squared residuals. Fortunately, the lm() function in R does this tedious work for us.
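To see the criterion in action before letting lm() do the work, here is a sketch (the helper function sum_sq_resid is hypothetical, written just for this illustration) that computes the sum of squared residuals for any candidate intercept and slope.

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
sum_sq_resid <- function(b0, b1, data = bp) {
  resid <- data$vote - (b0 + b1 * data$growth)
  sum(resid^2)
}

sum_sq_resid(b0 = 50, b1 = 1)          # one candidate line
sum_sq_resid(b0 = 40, b1 = 5)          # another candidate line
sum_sq_resid(b0 = 45.021, b1 = 2.975)  # the least squares line yields the smallest value

No other intercept and slope pair will produce a smaller sum of squared residuals than the least squares estimates.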
Fit a linear model in R
We use the lm() function in R to fit a linear model. In the code below I first define the name of the R object that will store our linear model results (bp_mod1). Then, I take the data frame (bp) and feed it into the lm() function.
We want to regress the outcome on the predictor, that is, we want to regress vote on growth to determine if growth is a predictor of vote. This is coded in the lm() function by writing vote ~ growth. Note that the lm() function in R, which is short for linear model, requires a data argument. Since we are piping the data frame into lm(), we set data = . to tell the function to use the data that was piped in.
bp_mod1 <- bp %>%
  lm(vote ~ growth, data = .)
Note that an alternative way to write this code and fit a linear model without the pipe operator is as follows:
bp_mod1 <- lm(vote ~ growth, data = bp)
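Either way, the intercept and slope estimates are stored in bp_mod1. If you just want a quick look at them, base R’s coef() function returns the same two numbers that get_regression_table() will display below.

coef(bp_mod1)   # returns the intercept (about 45.021) and the slope (about 2.975)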
Obtain the regression parameter estimates
The get_regression_table() function from the moderndive package can be used to view the regression parameter estimates (i.e., the intercept and the slope). We just need to feed in the R object that we designated to store our linear model results (bp_mod1).
bp_mod1 %>%
  get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 45.021 | 2.075 | 21.702 | 0.000 | 40.600 | 49.443 |
growth | 2.975 | 0.782 | 3.803 | 0.002 | 1.307 | 4.642 |
For now, we are only going to be concerned about the first two columns of the output above – the columns labeled term and estimate. Let’s map these estimates onto our equation. I’ll use all of the visible decimal places to ease identification. The term labeled intercept provides the intercept of the regression line and the term labeled growth provides the slope of the regression line.
\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975x_i\]
As defined earlier, the intercept is the predicted value of y when x = 0, therefore, it is the predicted vote percentage for the incumbent party when growth in income equals 0.
The slope is the predicted change in y for each one unit increase in x. That is, for each one percentage point increase in growth (e.g., going from 1% to 2%, or 2% to 3%), we expect the percent of vote share garnered by the incumbent party to increase by 2.975 points.
Obtain the predicted values and residuals for each case
We can obtain the predicted values and residuals using the get_regression_points() function from moderndive.
bp_mod1 %>%
  get_regression_points(ID = "year")
year | vote | growth | vote_hat | residual |
---|---|---|---|---|
1952 | 44.55 | 3.033 | 54.042 | -9.492 |
1956 | 57.75 | 2.584 | 52.706 | 5.044 |
1960 | 49.92 | 0.577 | 46.737 | 3.183 |
1964 | 61.35 | 4.390 | 58.081 | 3.269 |
1968 | 49.59 | 3.264 | 54.732 | -5.142 |
1972 | 61.79 | 4.174 | 57.439 | 4.351 |
1976 | 48.95 | 1.432 | 49.282 | -0.332 |
1980 | 44.69 | 0.750 | 47.251 | -2.561 |
1984 | 59.17 | 4.120 | 57.275 | 1.895 |
1988 | 53.90 | 2.891 | 53.620 | 0.280 |
1992 | 46.55 | 1.376 | 49.116 | -2.566 |
1996 | 54.74 | 1.880 | 50.613 | 4.127 |
2000 | 50.27 | 3.335 | 54.941 | -4.671 |
2004 | 51.24 | 2.186 | 51.524 | -0.284 |
2008 | 46.31 | 0.169 | 45.524 | 0.786 |
2012 | 51.96 | 1.774 | 50.299 | 1.661 |
2016 | 51.11 | 1.895 | 50.658 | 0.452 |
Obtain the overall model summary
We can also obtain additional summary information about the model using the get_regression_summaries() function from moderndive.
bp_mod1 %>%
  get_regression_summaries()
r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
---|---|---|---|---|---|---|---|---|
0.491 | 0.457 | 14.18538 | 3.766348 | 4.01 | 14.463 | 0.002 | 1 | 17 |
For now, we’ll only be concerned about the r_squared column. This value is .491. It represents a commonly considered term in regression modeling called R-squared or \(R^2\). \(R^2\) is the proportion of the variability in the outcome (vote in our example) that is explained by the predictor(s) (growth in our example).
It is calculated by considering the amount of variability explained by our model (the systematic part of the model) and the amount not explained by our model (the error, also called the residual).

The former is often called the Sum of Squares Regression (SSR). It is calculated by taking the difference between the predicted y for each case (\(\hat{y_i}\)) and the mean of y (\(\bar{y}\)), squaring it, and then summing across all cases in the sample. In equation form this is \(SSR = \sum(\hat{y_i} - \bar{y})^2\). Note that \(\sum\) (the Greek capital letter sigma) means to sum.

The latter is often called the Sum of Squares Error (SSE). It is calculated by taking the difference between the observed y for each case (\(y_i\)) and the predicted y for each case (\(\hat{y_i}\)), which is the residual, squaring it, and then summing across all cases in the sample. In equation form this is \(SSE = \sum({y_i} - \hat{y_i})^2\).
Let’s calculate these quantities with some code.
First, let’s get the needed quantities for each case – that is, the squared difference between the predicted y and the mean of y (for SSR) and the squared residual (for SSE).
get_rsq <- bp_mod1 %>%
  get_regression_points(ID = "year") %>%
  mutate(for_SSR = vote_hat - mean(vote),  # calculate the difference between each predicted score and the mean of y
         for_SSR2 = for_SSR^2) %>%         # square the difference
  mutate(for_SSE2 = residual^2)            # square the residual

get_rsq
year | vote | growth | vote_hat | residual | for_SSR | for_SSR2 | for_SSE2 |
---|---|---|---|---|---|---|---|
1952 | 44.55 | 3.033 | 54.042 | -9.492 | 2.0514118 | 4.2082902 | 90.098064 |
1956 | 57.75 | 2.584 | 52.706 | 5.044 | 0.7154118 | 0.5118140 | 25.441936 |
1960 | 49.92 | 0.577 | 46.737 | 3.183 | -5.2535882 | 27.6001893 | 10.131489 |
1964 | 61.35 | 4.390 | 58.081 | 3.269 | 6.0904118 | 37.0931155 | 10.686361 |
1968 | 49.59 | 3.264 | 54.732 | -5.142 | 2.7414118 | 7.5153385 | 26.440164 |
1972 | 61.79 | 4.174 | 57.439 | 4.351 | 5.4484118 | 29.6851908 | 18.931201 |
1976 | 48.95 | 1.432 | 49.282 | -0.332 | -2.7085882 | 7.3364502 | 0.110224 |
1980 | 44.69 | 0.750 | 47.251 | -2.561 | -4.7395882 | 22.4636966 | 6.558721 |
1984 | 59.17 | 4.120 | 57.275 | 1.895 | 5.2844118 | 27.9250077 | 3.591025 |
1988 | 53.90 | 2.891 | 53.620 | 0.280 | 1.6294118 | 2.6549827 | 0.078400 |
1992 | 46.55 | 1.376 | 49.116 | -2.566 | -2.8745882 | 8.2632575 | 6.584356 |
1996 | 54.74 | 1.880 | 50.613 | 4.127 | -1.3775882 | 1.8977493 | 17.032129 |
2000 | 50.27 | 3.335 | 54.941 | -4.671 | 2.9504118 | 8.7049296 | 21.818241 |
2004 | 51.24 | 2.186 | 51.524 | -0.284 | -0.4665882 | 0.2177046 | 0.080656 |
2008 | 46.31 | 0.169 | 45.524 | 0.786 | -6.4665882 | 41.8167634 | 0.617796 |
2012 | 51.96 | 1.774 | 50.299 | 1.661 | -1.6915882 | 2.8614708 | 2.758921 |
2016 | 51.11 | 1.895 | 50.658 | 0.452 | -1.3325882 | 1.7757914 | 0.204304 |
Now, we can sum the squared quantities that we just calculated across all cases (all 17 years worth of data).
SSR_SSE <- get_rsq %>%
  summarize(SSR = sum(for_SSR2),
            SSE = sum(for_SSE2))

SSR_SSE
SSR | SSE |
---|---|
232.5317 | 241.164 |
Finally, to calculate the \(R^2\), we take SSR divided by the sum of SSR and SSE. This denominator is also called Sum of Squares Total (SST) as it represents the total variability in the outcome.
SSR_SSE %>%
  mutate(r_squared = SSR/(SSR + SSE))
SSR | SSE | r_squared |
---|---|---|
232.5317 | 241.164 | 0.4908884 |
Of course, this is the same value of r_squared that get_regression_summaries() calculated for us. You can multiply the proportion by 100 to express it as a percentage, which indicates that about 49% of the variability in the share of the vote received by the incumbent party is explained by the income growth that US residents achieved during the prior term. The remaining 51% might be just random error in the model, or this remaining variability might be accounted for by other variables. For example, in Module 7 we will determine if additional variability in vote can be accounted for by adding fatalities to the regression model. Hibbs’s Bread and Peace Model asserts that it should.
Recall from the beginning of the module, we learned that some outcome can be modeled as:
Outcome = Systematic Component + Residual
And, that statistical techniques allow us to explain variation in our outcome (the systematic component) in the context of what remains unexplained (the residual). The \(R^2\) gives us the quantity of the systematic component – the proportion of the variance in vote that can be predicted by growth. The pie chart below depicts this for our example.
Use the equation to forecast
We can use our equation to make out-of-sample forecasts. That is, we can use the fitted model to predict what might happen in years not included in the dataframe. For example, 2020 data was not included in Thomas’s dataframe that we analyzed.
Let’s consider the 2020 election between Donald Trump (Republican - the incumbent) and Joe Biden (Democrat). Growth in income during Trump’s term is still being studied and calculated by economists, and it is surely complicated by the COVID-19 pandemic. But one reasonable estimate put forth by Thomas in the discussion of his paper is a growth rate of 2.52%. We can plug this number into the equation that we obtained using years 1952 to 2016 to get a predicted score for vote.
\[\hat{y_i} = 45.021+2.975\times2.52 = 52.5\]
This forecast means that, based on income growth alone, we would have expected Donald Trump to garner about 52.5% of the vote share. There were 81,285,571 votes cast for Biden and 74,225,038 votes cast for Trump in the 2020 election. Thus, the actual vote score for the 2020 election (i.e., the percentage of votes garnered by the incumbent candidate) was 47.7%, producing a residual of about -4.8.
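Rather than plugging the numbers in by hand, you could also let R compute the forecast with predict(), supplying the assumed 2.52% growth figure as new data. A quick sketch:

new_election <- tibble(growth = 2.52)      # the assumed growth rate over the preceding term
predict(bp_mod1, newdata = new_election)   # about 52.5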
Correlation
For two numeric variables (e.g., vote and growth in the Bread and Peace dataframe) we can also calculate the correlation coefficient. A correlation coefficient, often abbreviated as r, quantifies the strength of the linear relationship between two numerical variables using a standardized metric. A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative relationship (as one value increases, the other decreases), and +1 indicates a perfect positive relationship (as one value increases, the other increases). A correlation coefficient of 0 indicates no relationship between the two variables.
The get_correlation() function in the moderndive package calculates the correlation coefficient.
bp %>%
  get_correlation(vote ~ growth)
cor |
---|
0.7006353 |
The correlation between vote and growth is .7, denoting a strong, positive association. The square of the correlation coefficient equals the \(R^2\) in a SLR (i.e., \(R^2\) = .49).
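You can verify both facts with base R’s cor() function:

r <- cor(bp$vote, bp$growth)
r     # about 0.70, matching get_correlation()
r^2   # about 0.49, matching the R-squared from our SLR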
While the regression slope is in the metric of the y variable, and is dependent on the measurement scale of both the x and y variables (i.e., a one unit increase in x is associated with a \({b_1}\) unit change in y), a correlation coefficient is in the metric of standardized units.
An interesting property of linear models is that if you create z-scores of your x and y variables, and then fit a simple linear regression, the slope for the regression line will equal the correlation. Take a look at the code below, and then see the estimate for growth_z in the output. It matches the correlation coefficient. Therefore, we can also interpret the correlation coefficient as “the expected change in the z-score of y for a one standard deviation increase in x.”
bp_z <- bp %>%
  select(vote, growth) %>%
  mutate(vote_z = (vote - mean(vote))/sd(vote)) %>%
  mutate(growth_z = (growth - mean(growth))/sd(growth))

bp_mod_z <- bp_z %>%
  lm(vote_z ~ growth_z, data = .)

bp_mod_z %>%
  get_regression_table()
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 0.000 | 0.179 | 0.000 | 1.000 | -0.381 | 0.381 |
growth_z | 0.701 | 0.184 | 3.803 | 0.002 | 0.308 | 1.093 |
Correlation is a statistical measure that quantifies the size and direction of a relationship between two variables. As we learned in this module, estimating the relationship between two variables can be extremely useful. However, it is critical to realize that finding a correlation between two variables does not automatically mean that one variable causes the other variable. That is, correlation does not equal causation. Causation indicates that one event is the result of the occurrence of another event – in other words, that there is a causal relationship between the two events. We’ll learn more about correlation and causation in psychological studies in Module 7. For now, please watch the following Crash Course: Statistics video that introduces these issues.
Special note for DataCamp exercises for this module
In the DataCamp course for this module, Modeling with Data in the tidyverse, the instructor represents the regression equation in a different format than the one presented in your textbook and in this module. To represent the relationship between some variable called y and some variable called x, in this module, we represented the regression equation as:
\[{y_i} = b_0 + b_1x_i + e_i\]
In the DataCamp course, the instructor writes the equation as:
\[y=f(\vec{x})+ε\]
These two formulas are equivalent. Rather than writing the systematic part of the regression line (i.e., the intercept plus the slope times x), Modeling with Data in the tidyverse represents this quantity as \(f(\vec{x})\). Likewise, the \(e_i\) in the first equation is the same as the \(ε\) in the second equation – that is, the residual. In DataCamp, they also do not index their variables with an i subscript (to denote that each case has a score), though this is implied.
This concludes Module 5 on simple linear regression. In the next two modules, we will learn how to add categorical variables as predictors in our models (Module 6) and how to build models with multiple predictors (Module 7).