PSY350, Module 5

Simple Linear Regression

Module road map

Learning objectives

  1. Describe the utility of a straight line for relating two variables to one another
  2. Explain the least squares criteria for finding the best fit line for a simple linear regression (SLR)
  3. Write an equation to describe the best fit line with an intercept and slope
  4. Interpret the intercept and slope estimates of a best fit line
  5. Explain what a residual is and its role in a SLR
  6. Calculate predicted outcomes based on the best fit line
  7. Fit a SLR in R
  8. Identify and interpret the intercept, slope, and \(R^2\) in the output
  9. Plot the results of a SLR
  10. Define the correlation coefficient
  11. Describe the relationship between a SLR slope and a correlation coefficient

Readings

  1. Read Chapter 5 of your Textbook (excluding section 5.2)

Overview

In the social sciences, we often want to know whether two variables are associated with one another. By associated we mean that knowing the value of one variable tells us something about the value of the other variable. In this module, we will focus on modeling the relationship between two continuous variables. We will use a technique called simple linear regression to study this type of relationship.

Recall this example from Module 2. The Centers for Disease Control and Prevention compiled these data in order to determine if countries that adopted more COVID-19 mitigation policies had fewer deaths.


Simple linear regression (or SLR) helps us to understand the extent to which two variables are associated by estimating the straight line that relates them to one another. SLR is a statistical technique that allows the analyst to predict scores for some continuous outcome (y) using a single predictor (x).

In the plot below, there appears to be a negative relationship: more mitigation strategies were associated with fewer deaths.



To build your intuition about regression, please watch the following Crash Course: Statistics video on Regression. Please note that the video includes an overview of fitting regression models, as well as inference testing (with topics such as degrees of freedom, test statistics, and test distributions). You don’t need to worry about the inference parts for now; we’ll revisit these techniques in Module 11.



In this module, we will use linear models to describe the relationship between two variables. Though you may not have studied or used linear models for prediction before, we often use models to predict outcomes in everyday life. In particular, we often use mathematical models to solve problems or make conversions. Mathematical models are deterministic: once the “rule” is known, the mathematical model fits the data perfectly. That is, we can perfectly predict the outcome.

Examples:

1. Perimeter of a square: \(4 \times (length\,of\,side)\)
2. Area of a square: \((length\,of\,side)^2\)
3. Convert Fahrenheit to Celsius: \(C = (F - 32)\cdot\frac{5}{9}\)
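
To make the distinction concrete, here is a minimal R sketch of the Fahrenheit-to-Celsius conversion above; because the model is deterministic, the output is reproduced exactly, with nothing left over.

f_to_c <- function(f) {
  (f - 32) * 5 / 9   # the "rule" relating Fahrenheit to Celsius
}

f_to_c(c(32, 68, 212))   # returns exactly 0, 20, and 100: no residual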

While mathematical models are extremely useful, in this course, we will focus on statistical models. Unlike mathematical models, statistical models are not deterministic. They take into account that we usually don’t have all important predictors, that we rarely perfectly measure the variables that we do have, and that people (or organizations, schools, animals, etc.) act differently.

Statistical models allow for:

1. Excluded variables
2. Measurement error
3. Individual variation

The formula for a statistical model has a residual to account for variation from these sources:

Outcome = Systematic Component + Residual

Statistical techniques allow us to explain variation in our outcome (the systematic component) in the context of what remains unexplained (the residual).

Please watch the video below for an overview of simple linear regression.



Introduction to the data

In this module we will use a linear model to predict election outcomes. The people who we elect to public office have a large influence on the health and prosperity of our population. Our elected officials and their administrations have great power to either build up or tear down programs, practices and policies that promote health and well-being for all.

As one example, we have seen great differences in how state and local elected officials have handled the COVID-19 pandemic, in terms of mandates for public health protections, aid to individuals, and the roll out of the vaccine.

Given the critical significance of government and policy on public health, understanding the motivations of voters in electing public officials is an important public health matter. In this module, we will explore one well-studied and well-respected, but relatively simple, model for forecasting election outcomes. It’s called the Bread and Peace Model, and it was formulated by Douglas A. Hibbs. He describes his model in detail here; the gist of his model is described below:

Postwar US presidential elections can for the most part be interpreted as a sequence of referendums on the incumbent party’s record during its four-year mandate period. In fact aggregate two-party vote shares going to candidates of the party holding the presidency during the postwar era are well explained by just two fundamental determinants: (1) Positively by weighted-average growth of per capita real disposable personal income over the term. (2) Negatively by cumulative US military fatalities (scaled to population) owing to unprovoked, hostile deployments of American armed forces in foreign wars.

In other words, in a US presidential election, the likelihood that the incumbent party maintains power is dependent on the economic growth experienced during the prior term and the loss of military troops due to war. The former increases favor for the incumbent party, while the latter decreases favor.

We’ll start simple in this module. We will consider a single outcome (election results of US presidential elections) and a single predictor (growth in personal income over the preceding presidential term). To estimate the Bread and Peace model, we will use data compiled by Drew Thomas (2020).

The following variables are in the dataframe:

  • year is the presidential election year
  • vote is the percentage share of the two-party vote received by the incumbent party’s candidate
  • growth is the quarter-on-quarter percentage rate of growth of per capita real disposable personal income expressed at annual rates
  • fatalities denotes the cumulative number of American military fatalities per millions of US population
  • wars lists the wars of the term if fatalities > 0
  • inc_party_candidate is the name of the incumbent party candidate
  • other_party_candidate is the name of the other party candidate
  • inc_party is an indicator of the incumbent party (D = Democrat, R = Republican)

We will estimate how well growth in income of US residents during the preceding presidential term predicts the share of the vote that the incumbent party receives. That is, we will determine if growth is predictive of vote. In the equations and descriptions below, I will refer to the predictor (growth) as the x variable, and the outcome (vote) as the y variable.
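
One practical note before we begin: the code chunks in this module assume a handful of packages are available. A minimal setup sketch is below; your own setup chunk may already load these.

library(tidyverse)   # read_rds(), glimpse(), ggplot(), mutate(), summarize(), etc.
library(here)        # here() builds file paths relative to the project root
library(skimr)       # skim() for descriptive summaries
library(moderndive)  # get_regression_table(), get_regression_points(), get_correlation()
library(ggrepel)     # geom_label_repel(), also callable as ggrepel::geom_label_repel()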

Let’s begin by importing the data.

bp <- read_rds(here("data", "bread_peace.Rds"))

The glimpse() function presents the key features of the data set. Each row of the data represents a different presidential election – starting in 1952 and ending in 2016. There are 17 elections in total to consider.

bp %>% glimpse()
## Rows: 17
## Columns: 8
## $ year                  <dbl> 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, …
## $ vote                  <dbl> 44.55, 57.75, 49.92, 61.35, 49.59, 61.79, 48.95,…
## $ growth                <dbl> 3.0325, 2.5836, 0.5768, 4.3904, 3.2644, 4.1745, …
## $ fatalities            <dbl> 206, 0, 0, 2, 174, 0, 1, 0, 0, 0, 0, 0, 0, 4, 9,…
## $ wars                  <chr> "Korean", "none", "Vietnam", "Vietnam", "Vietnam…
## $ inc_party_candidate   <chr> "Stevenson", "Eisenhower", "Nixon", "Johnson", "…
## $ other_party_candidate <chr> "Eisenhower", "Stevenson", "Kennedy", "Goldwater…
## $ inc_party             <chr> "D", "R", "R", "D", "D", "R", "R", "D", "R", "R"…

Let’s obtain some additional descriptive statistics with skim().

bp %>% skim()
Data summary
Name Piped data
Number of rows 17
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
wars 0 1 4 18 0 6 0
inc_party_candidate 0 1 4 11 0 15 0
other_party_candidate 0 1 4 11 0 17 0
inc_party 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1984.00 20.20 1952.00 1968.00 1984.00 2000.00 2016.00 ▇▆▆▆▇
vote 0 1 51.99 5.44 44.55 48.95 51.11 54.74 61.79 ▅▇▃▁▃
growth 0 1 2.34 1.28 0.17 1.43 2.19 3.26 4.39 ▆▆▇▇▆
fatalities 0 1 23.65 62.92 0.00 0.00 0.00 4.00 206.00 ▇▁▁▁▁

Since this is quite a small dataframe, we can easily view the whole dataframe at once.

year vote growth fatalities wars inc_party_candidate other_party_candidate inc_party
1952 44.55 3.0325 206 Korean Stevenson Eisenhower D
1956 57.75 2.5836 0 none Eisenhower Stevenson R
1960 49.92 0.5768 0 Vietnam Nixon Kennedy R
1964 61.35 4.3904 2 Vietnam Johnson Goldwater D
1968 49.59 3.2644 174 Vietnam Humphrey Nixon D
1972 61.79 4.1745 0 none Nixon McGovern R
1976 48.95 1.4324 1 Vietnam Ford Carter R
1980 44.69 0.7495 0 none Carter Reagan D
1984 59.17 4.1195 0 none Reagan Mondale R
1988 53.90 2.8907 0 none Bush, Sr. Dukakis R
1992 46.55 1.3764 0 none Bush, Sr. Clinton, B. R
1996 54.74 1.8797 0 none Clinton, B. Dole D
2000 50.27 3.3349 0 none Gore Bush, Jr. D
2004 51.24 2.1860 4 Iraq Bush, Jr. Kerry R
2008 46.31 0.1691 9 Iraq McCain Obama R
2012 51.96 1.7744 5 Afghanistan Obama Romney D
2016 51.11 1.8951 1 Afghanistan + Iraq Clinton, H. Trump D

Let’s familiarize ourselves a bit more with the two variables we’ll consider here. First, the predictor (growth) ranges from .17% (in 2008, just after the 2008 financial crisis) to 4.4% (in 1964, one of the most prosperous times in US history, particularly for the wealthy). Second, the outcome (vote) ranges from 44.6% (in 1952, when the US was in the middle of the Korean War) to 61.8% (in 1972, when, riding a strong economy and moving toward the end of the Vietnam War, Richard Nixon won by a landslide).


Explore a linear relationship

Let’s take a look at the relationship between growth (our primary predictor) and vote (our primary outcome) using a scatterplot. We will map growth (the predictor) to the x-axis, and vote (the outcome) to the y-axis. I will label the datapoints with the election year.

bp %>% 
  ggplot(aes(x = growth, y = vote)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = year),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_bw() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

We always start an analysis by examining graphs of our variables. When two continuous variables are considered, a scatterplot is an excellent tool for visualization. It’s important to determine if the relationship appears linear (as opposed to curvilinear) and to note any possible outliers or other strange occurrences in the data.

If we want to summarize the relationship between x (growth) and y (vote), one option is to draw a straight line through the data points. This is accomplished in the plot below.

This straight line does a reasonable job of describing the relationship. Many relationships can be summarized with a straight line, and estimating that line is the primary purpose of a linear regression model.

When we draw a straight line through a set of data points, the line can be defined by an intercept and a slope.

The intercept is the value of y on the line when x = 0. In looking at our graph, there were actually no years during the observation period in which there was 0 growth; the closest is 2008, with about 0.2% growth. By extrapolating back a bit, it looks like the straight line relating x and y (i.e., the linear model) would predict a score of about 45 (the incumbent party receives 45% of the vote share) if percent growth were 0.

The slope is the rate of change of the line.

Specifically, the slope is defined as the “rise over the run.” That is, the rise (i.e., change) in vote share for a one unit increase (i.e., run) in growth. Since it’s a straight line, we can pick any one-unit increase along the x-axis.

Let’s choose a “run” from 2 to 3 – that is, going from a 2 to a 3 on the x-axis.

When growth (the value on the x-axis) is 2, the value of vote looks to be about 51.

When growth is 3, the value of vote looks to be about 54.

That means as we go from growth = 2 to growth = 3 (a run of 1), vote share tends to increase by about 3 percentage points (54 minus 51 or a rise of 3). So the rise over the run = 3/1, corresponding to a slope of about 3. That means that for each one unit increase in growth, we predict the percent vote share to increase by about 3.
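
Written as an equation, using the approximate values read off the graph (51 and 54 are eyeballed, not exact):

\[slope = \frac{\text{rise}}{\text{run}} = \frac{54 - 51}{3 - 2} = \frac{3}{1} = 3\]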

You might be wondering, “How was the location of the line on this graph determined?” If I asked you to manually draw the best fit line through these points, each student in the class would likely draw a line that was close, but not exactly the same as one another’s, and not exactly the same as the line that ggplot drew onto the graph for us. It is possible to fit an infinite number of straight lines to the data points in the scatterplot. However, we want to find the best fitting line.

The method that we use to find the best fit line is called the least squares criterion. We will study this method in the next part of this module.

Please view the video below to learn more about using a straight line to summarize the relationship between two continuous variables.



Estimate the best fit line

We can imagine many reasonable lines that could be drawn through the data points to relate the predictor to the outcome. For example, each of the lines on the graph below seems reasonable.

How do we determine the best fit line for the data? For any given x and y variable that are linearly related, there is indeed one best fitting line. The best fitting line is the line that allows x (growth) to predict y (vote) with the greatest accuracy. Let’s take a closer look at how we can determine the accuracy of any possible line drawn through the data.

The geom_smooth() function in R, specifying a linear model (method = “lm”) for y regressed on x (y ~ x), will overlay the best fit line. This is depicted in the graph below.

bp %>% 
  ggplot(aes(x = growth, y = vote)) +
  geom_point() +
  geom_smooth(method = 'lm',formula = y ~ x,  se = FALSE) +
  ggrepel::geom_label_repel(aes(label = year),
                            color = "grey35", fill = "white", size = 2, box.padding =  0.4, 
                            label.padding = 0.1) +
  theme_bw() +
  labs(title = "Bread and Peace Voting in US Presidential Elections 1952 - 2016", 
       "Annualized per capita real income growth over the term (%)", 
       x = "Annualized per capita real income growth over the term (%)", 
       y = "Incumbent party share of two-party vote (%)")

An equation for the best fit line

The best fitting line is defined by an equation consisting of an intercept (the predicted value of y when x = 0) and a slope (the predicted change in y for each one unit increase in x). This equation is depicted below. For this depiction, the y and x values are subscripted with an i to denote that each case (i.e., election year in our example) has a score for y and a score for x.

The best fit line for our data is represented by the following equation. The intercept, that is, the predicted score for y when x = 0, is 45.021. The slope, that is, the predicted change in y for each one unit increase in x, is 2.975.

\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975x_i\]

Note that you will sometimes see the equation of a line written as y = mx + b, where b is the intercept (i.e., \({b_0}\)) and m is the slope (i.e., \({b_1}\)).

Predicted scores from the equation of the best fit line

Let’s use this equation to obtain a predicted score. We’ll use the 1988 election as an example. In this election, George Bush Sr. ran against Michael Dukakis. Coming off of Ronald Reagan’s second term, Bush was the incumbent party candidate. The economy was in pretty good shape: income growth was 2.89%. Using our equation, we’d predict Bush to garner 53.6% of the vote share.

\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975\times2.89=53.6\]

Take a look at the graph below; we see that this prediction (marked with an orange dot) is very close to what Bush actually garnered (53.9%). Note that I changed the best fit line to a dashed gray line to help you see the orange dot.

We can calculate the predicted score (also referred to as the fitted value or y-hat) for every case in the dataframe. Doing so yields the table below. The column labeled vote_hat provides the predicted score for every election year.

To practice and solidify this concept, please use the equation above to calculate y-hat by hand for a few of the years.
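
If you would like to check your hand calculations, here is a small sketch in R that applies the fitted equation to every year; the intercept (45.021) and slope (2.975) are the rounded estimates reported above.

bp %>% 
  mutate(vote_hat = 45.021 + 2.975 * growth) %>%   # apply the fitted equation to each year
  select(year, vote, growth, vote_hat)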

year vote growth vote_hat
1952 44.55 3.033 54.042
1956 57.75 2.584 52.706
1960 49.92 0.577 46.737
1964 61.35 4.390 58.081
1968 49.59 3.264 54.732
1972 61.79 4.174 57.439
1976 48.95 1.432 49.282
1980 44.69 0.750 47.251
1984 59.17 4.120 57.275
1988 53.90 2.891 53.620
1992 46.55 1.376 49.116
1996 54.74 1.880 50.613
2000 50.27 3.335 54.941
2004 51.24 2.186 51.524
2008 46.31 0.169 45.524
2012 51.96 1.774 50.299
2016 51.11 1.895 50.658

Let’s consider another example: the 1968 election between Hubert Humphrey (Democrat) and Richard Nixon (Republican). Humphrey was the incumbent party’s candidate, and the US was in the midst of the Vietnam War. During the preceding term, economic growth was 3.3%, yet Humphrey secured only 49.6% of the vote and ultimately lost to Nixon. Based on our best fit line, we would have predicted Humphrey to receive 54.7% of the vote.

\[\hat{y_i} = 45.021+2.975\times3.26=54.7\]

This predicted score is marked by the orange dot in the graph below and labeled \(\hat{y_i}\). Humphrey’s actual vote share was much lower than our best fit line would have predicted; that is, given the relatively good economy, he did not perform nearly as well as we would have expected based on our linear model. To illustrate this point, I connected the predicted and observed vote with a dotted black line in the graph below.

Residual scores from the equation for the best fit line

Using the observed score and the predicted score for each case, we can calculate each case’s residual. The residual is simply the difference between a case’s observed score and its predicted score. Specifically, we subtract each case’s predicted score from its observed score. The residual is denoted by \({e_i}\) in the figure below. For the 1968 election, the residual is -5.14 (calculated as 49.59 - 54.732).


In the same way that we calculated the predicted score (y-hat) for each case, we can also calculate the residual for each case. These are all displayed in the table below. Please calculate a few yourself and map them onto the graph above to ensure you understand the concept.
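
As a check on your hand calculations, here is a small sketch that computes the residual (observed minus predicted) for every year, again using the rounded intercept and slope from above.

bp %>% 
  mutate(vote_hat = 45.021 + 2.975 * growth,   # predicted score
         residual = vote - vote_hat) %>%        # observed minus predicted
  select(year, vote, vote_hat, residual)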


year vote growth vote_hat residual
1952 44.55 3.033 54.042 -9.492
1956 57.75 2.584 52.706 5.044
1960 49.92 0.577 46.737 3.183
1964 61.35 4.390 58.081 3.269
1968 49.59 3.264 54.732 -5.142
1972 61.79 4.174 57.439 4.351
1976 48.95 1.432 49.282 -0.332
1980 44.69 0.750 47.251 -2.561
1984 59.17 4.120 57.275 1.895
1988 53.90 2.891 53.620 0.280
1992 46.55 1.376 49.116 -2.566
1996 54.74 1.880 50.613 4.127
2000 50.27 3.335 54.941 -4.671
2004 51.24 2.186 51.524 -0.284
2008 46.31 0.169 45.524 0.786
2012 51.96 1.774 50.299 1.661
2016 51.11 1.895 50.658 0.452

We see then that each case has a residual; it represents the difference between the score our model predicts and the actual observed score. Incorporating the residual, we can write an alternative version of our regression model:

\[{y_i} = b_0 + b_1x_i + e_i\]

Here, we replace \(\hat{y_i}\) with the actual observed y score (i.e., \({y_i}\)), and we add the case’s residual (represented as \({e_i}\)). In this way, the actual observed score for y for each case can be calculated using the intercept and slope for the best fit line (to obtain \(\hat{y_i}\)) and adding the case’s residual score. For example, to reproduce the observed score for vote (\({y_i}\)) for 1952, we use the following equation: \[{y_i} = 45.021 + 2.975\times3.033 + (-9.492) = 44.55\]

Use the same technique to recover vote for a couple additional years to be sure you understand where the numbers come from.
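
You can verify this quickly in the console. A sketch for two years, using the rounded values from the tables above:

45.021 + 2.975 * 3.033 + (-9.492)   # 1952: approximately 44.55
45.021 + 2.975 * 2.891 + 0.280      # 1988: approximately 53.90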

The least squares criterion

To find the best fit line, we use the least squares criterion. The least squares criterion dictates that the best fitting line is the line that results in the smallest sum of squared residuals (i.e., each case’s residual is squared and then all squared residuals are summed across cases). For example, if we squared each year’s residual in the table above, then summed them across all 17 years, we would obtain the sum of squared residuals for our line.

We could draw every possible line through the data, calculate the sum of squared residuals for each line, and then choose the line that produces the smallest sum of squared residuals. Fortunately, the lm() function in R does this tedious work for us.
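
To see the least squares criterion in action, the sketch below compares the sum of squared residuals for the best fit line against an arbitrary alternative line; the alternative intercept (47) and slope (2) are made-up values for illustration only.

bp %>% 
  mutate(resid_best = vote - (45.021 + 2.975 * growth),   # residuals from the least squares line
         resid_alt  = vote - (47 + 2 * growth)) %>%        # residuals from a made-up alternative line
  summarize(ss_best = sum(resid_best^2),
            ss_alt  = sum(resid_alt^2))

The least squares line will always produce the smaller sum of squared residuals; any other line you try will do worse.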


Fit a linear model in R

We use the lm() function in R to fit a linear model. In the code below I first define the name of the R object that will store our linear model results (bp_mod1). Then, I take the data frame (bp) and feed it into the lm() function.

We want to regress the outcome on the predictor; that is, we want to regress vote on growth to determine if growth is a predictor of vote. This is coded in the lm() function by writing vote ~ growth. Note that the lm() function, which is short for linear model, requires a data argument. Since we are piping the data frame into lm(), we write data = . to tell the function to use the piped-in data.

bp_mod1 <- bp  %>%  
  lm(vote ~ growth, data = .)

Note that an alternative way to write this code and fit a linear model without the pipe operator is as follows:

bp_mod1 <- lm(vote ~ growth, data = bp)
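
As an aside, base R’s coef() function extracts the same two estimates directly from the fitted model object:

coef(bp_mod1)   # a named vector holding the intercept (about 45.021) and the growth slope (about 2.975)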

Obtain the regression parameter estimates

The get_regression_table() function from the moderndive package can be used to view the regression parameter estimates (i.e., the intercept and the slope). We just need to feed in the R object that we designated to store our linear model results (bp_mod1).

bp_mod1 %>% 
  get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 45.021 2.075 21.702 0.000 40.600 49.443
growth 2.975 0.782 3.803 0.002 1.307 4.642

For now, we are only going to be concerned about the first two columns of the output above – the columns labeled term and estimate. Let’s map these estimates onto our equation. I’ll use all of the visible decimal places to ease identification. The term labeled intercept provides the intercept of the regression line and the term labeled growth provides the slope of the regression line.

\[\hat{y_i} = b_{0}+b_{1}x_i\] \[\hat{y_i} = 45.021+2.975x_i\]

As defined earlier, the intercept is the predicted value of y when x = 0, therefore, it is the predicted vote percentage for the incumbent party when growth in income equals 0.

The slope is the predicted change in y for each one unit increase in x. That is, for each one percentage point increase in growth (e.g., going from 1% to 2%, or 2% to 3%), we expect the percent of vote share garnered by the incumbent party to increase by 2.975 points.

Obtain the predicted values and residuals for each case

We can obtain the predicted values and residuals using the get_regression_points() function from moderndive.

bp_mod1 %>% 
  get_regression_points(ID = "year")
year vote growth vote_hat residual
1952 44.55 3.033 54.042 -9.492
1956 57.75 2.584 52.706 5.044
1960 49.92 0.577 46.737 3.183
1964 61.35 4.390 58.081 3.269
1968 49.59 3.264 54.732 -5.142
1972 61.79 4.174 57.439 4.351
1976 48.95 1.432 49.282 -0.332
1980 44.69 0.750 47.251 -2.561
1984 59.17 4.120 57.275 1.895
1988 53.90 2.891 53.620 0.280
1992 46.55 1.376 49.116 -2.566
1996 54.74 1.880 50.613 4.127
2000 50.27 3.335 54.941 -4.671
2004 51.24 2.186 51.524 -0.284
2008 46.31 0.169 45.524 0.786
2012 51.96 1.774 50.299 1.661
2016 51.11 1.895 50.658 0.452

Obtain the overall model summary

We can also obtain additional summary information about the model using the get_regression_summaries() function from moderndive.

bp_mod1 %>% 
  get_regression_summaries()
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.491 0.457 14.18538 3.766348 4.01 14.463 0.002 1 17

For now, we’ll only be concerned about the r_squared column. This value is .491. It represents a commonly considered term in regression modeling called R-squared or \(R^2\). \(R^2\) is the proportion of the variability in the outcome (vote in our example) that is explained by the predictor(s) (growth in our example).

It is calculated by considering the amount of variability explained by our model (the systematic part of the model) and the amount not explained by our model (the error, also called the residual). The former is often called the Sum of Squares Regression (SSR). It is calculated by taking the difference between the predicted y for each case (\(\hat{y_i}\)) and the mean of y (\(\bar{y}\)), squaring it, and then summing across all cases in the sample. In equation form this is \(SSR = \sum(\hat{y_i} - \bar{y})^2\). Note that \(\sum\), the capital Greek letter sigma, means to sum. The latter is often called the Sum of Squares Error (SSE). It is calculated by taking the difference between the observed y for each case (\(y_i\)) and the predicted y for each case (\(\hat{y_i}\)), which is the residual, squaring it, and then summing across all cases in the sample. In equation form this is \(SSE = \sum({y_i} - \hat{y_i})^2\).

Let’s calculate these quantities with some code.

First, let’s get the needed quantities for each case: the squared difference between the predicted y and the mean of y (for SSR), and the squared residual (for SSE).

get_rsq <- bp_mod1 %>% 
  get_regression_points(ID = "year") %>% 
  mutate(for_SSR = vote_hat - mean(vote), # calculate the difference between each predicted score and the mean of y
         for_SSR2 = for_SSR^2) %>%  # square the difference
  mutate(for_SSE2 = residual^2) # square the residual
  
get_rsq       
year vote growth vote_hat residual for_SSR for_SSR2 for_SSE2
1952 44.55 3.033 54.042 -9.492 2.0514118 4.2082902 90.098064
1956 57.75 2.584 52.706 5.044 0.7154118 0.5118140 25.441936
1960 49.92 0.577 46.737 3.183 -5.2535882 27.6001893 10.131489
1964 61.35 4.390 58.081 3.269 6.0904118 37.0931155 10.686361
1968 49.59 3.264 54.732 -5.142 2.7414118 7.5153385 26.440164
1972 61.79 4.174 57.439 4.351 5.4484118 29.6851908 18.931201
1976 48.95 1.432 49.282 -0.332 -2.7085882 7.3364502 0.110224
1980 44.69 0.750 47.251 -2.561 -4.7395882 22.4636966 6.558721
1984 59.17 4.120 57.275 1.895 5.2844118 27.9250077 3.591025
1988 53.90 2.891 53.620 0.280 1.6294118 2.6549827 0.078400
1992 46.55 1.376 49.116 -2.566 -2.8745882 8.2632575 6.584356
1996 54.74 1.880 50.613 4.127 -1.3775882 1.8977493 17.032129
2000 50.27 3.335 54.941 -4.671 2.9504118 8.7049296 21.818241
2004 51.24 2.186 51.524 -0.284 -0.4665882 0.2177046 0.080656
2008 46.31 0.169 45.524 0.786 -6.4665882 41.8167634 0.617796
2012 51.96 1.774 50.299 1.661 -1.6915882 2.8614708 2.758921
2016 51.11 1.895 50.658 0.452 -1.3325882 1.7757914 0.204304

Now, we can sum the squared quantities that we just calculated across all cases (all 17 years worth of data).

SSR_SSE <- get_rsq %>% 
  summarize(SSR = sum(for_SSR2),
            SSE = sum(for_SSE2))

SSR_SSE
SSR SSE
232.5317 241.164

Finally, to calculate the \(R^2\), we take SSR divided by the sum of SSR and SSE. This denominator is also called Sum of Squares Total (SST) as it represents the total variability in the outcome.

SSR_SSE %>% 
  mutate(r_squared = SSR/(SSR + SSE))
SSR SSE r_squared
232.5317 241.164 0.4908884

Of course, this is the same value of r_squared that get_regression_summaries() calculated for us. You can multiply the proportion by 100 to express it as a percentage, which indicates that about 49% of the variability in the share of the vote received by the incumbent party is explained by the income growth that US residents achieved during the prior term. The remaining 51% might be just random error in the model, or this remaining variability might be accounted for by other variables. For example, in Module 7 we will determine if additional variability in vote can be accounted for by adding fatalities to the regression model. Hibbs’s Bread and Peace Model asserts that it should.

Recall from the beginning of the module that an outcome can be modeled as:

Outcome = Systematic Component + Residual

And, that statistical techniques allow us to explain variation in our outcome (the systematic component) in the context of what remains unexplained (the residual). The \(R^2\) gives us the quantity of the systematic component – the proportion of the variance in vote that can be predicted by growth. The pie chart below depicts this for our example.

Use the equation to forecast

We can use our equation to make out-of-sample forecasts; that is, to predict what might happen in years not included in the dataframe, based on the fitted model. For example, 2020 data were not included in Thomas’s dataframe that we analyzed.

Let’s consider the 2020 election between Donald Trump (Republican - the incumbent) and Joe Biden (Democrat). Growth in income during Trump’s term is still being studied and calculated by economists, and it is surely complicated by the COVID-19 pandemic. But one reasonable estimate put forth by Thomas in the discussion of his paper is a growth rate of 2.52%. We can plug this number into the equation that we obtained using years 1952 to 2016 to get a predicted score for vote.

\[\hat{y_i} = 45.021+2.975\times2.52 = 52.5\]

This forecast means that, based on income growth alone, we would have expected Donald Trump to garner about 52.5% of the vote share. There were 81,285,571 votes cast for Biden and 74,225,038 votes cast for Trump in the 2020 election. Thus, the actual vote score for the 2020 election (i.e., the percentage of votes garnered by the incumbent candidate) was 47.7%, producing a residual of about -4.8.
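
If you prefer to let R do the arithmetic, here is a sketch of the same out-of-sample forecast using the predict() function on the fitted model; the 2.52% growth figure is the estimate discussed above.

new_election <- tibble(growth = 2.52)      # the assumed growth figure for the term preceding the 2020 election
predict(bp_mod1, newdata = new_election)   # approximately 52.5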


Correlation

For two numeric variables (e.g., vote and growth in the Bread and Peace dataframe) we can also calculate the correlation coefficient. A correlation coefficient, often abbreviated as r, quantifies the strength of the linear relationship between two numerical variables using a standardized metric. A correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative relationship (as one value increases, the other decreases), and +1 indicates a perfect positive relationship (as one value increases, the other increases). A correlation coefficient of 0 indicates no relationship between the two variables.

The get_correlation() function in the moderndive package calculates the correlation coefficient.

bp %>% 
  get_correlation(vote ~ growth)
cor
0.7006353

The correlation between vote and growth is .7, denoting a strong, positive association. The square of the correlation coefficient equals the \(R^2\) in a SLR (i.e., \(R^2\) = .49).
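
You can confirm this relationship directly; here is a quick sketch that squares the correlation returned above (stored in the cor column of the output):

bp %>% 
  get_correlation(vote ~ growth) %>% 
  mutate(r_squared = cor^2)   # approximately .49, matching the R-squared from the SLR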

While the regression slope is in the metric of the y variable, and is dependent on the measurement scale of both the x and y variables (i.e., a one unit increase in x is associated with a \({b_1}\) unit change in y), a correlation coefficient is in the metric of standardized units.

An interesting property of linear models is that if you create z-scores of your x and y variables, and then fit a simple linear regression, the slope for the regression line will equal the correlation. Take a look at the code below, and then see the estimate for growth_z in the output. It matches the correlation coefficient. Therefore, we can also interpret the correlation coefficient as “the expected change in the z-score of y for a one standard deviation increase in x.”

bp_z <- bp %>% 
  select(vote, growth) %>% 
  mutate(vote_z = (vote - mean(vote))/sd(vote)) %>% 
  mutate(growth_z = (growth - mean(growth))/sd(growth)) 

bp_mod_z <- bp_z %>% 
  lm(vote_z ~ growth_z, data = .)

bp_mod_z %>% 
  get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 0.000 0.179 0.000 1.000 -0.381 0.381
growth_z 0.701 0.184 3.803 0.002 0.308 1.093

Correlation is a statistical measure that quantifies the size and direction of a relationship between two variables. As we learned in this module, estimating the relationship between two variables can be extremely useful. However, it is critical to realize that finding a correlation between two variables does not automatically mean that one variable causes the other variable. That is, correlation does not equal causation. Causation indicates that one event is the result of the occurrence of another event – in other words, that there is a causal relationship between the two events. We’ll learn more about correlation and causation in psychological studies in Module 7. For now, please watch the following Crash Course: Statistics video that introduces these issues.



Special note for DataCamp exercises for this module

In the DataCamp course for this module, Modeling with Data in the tidyverse, the instructor represents the regression equation in a different format than the one presented in your textbook and in this module. To represent the relationship between some variable called y and some variable called x, in this module we represented the regression equation as:

\[{y_i} = b_0 + b_1x_i + e_i\]

In the DataCamp course, the instructor writes the equation as:

\[y=f(\vec{x})+\epsilon\]

These two formulas are equivalent. Rather than writing the systematic part of the regression line (i.e., the intercept plus the slope times x), Modeling with Data in the tidyverse represents this quantity as \(f(\vec{x})\). Likewise, the \(e_i\) in the first equation is the same as the \(\epsilon\) in the second equation: the residual. In DataCamp, they also do not index their variables with an i subscript (to denote that each case has a score), though this is implied.

This concludes Module 5 on simple linear regression. In the next two modules, we will learn how to add categorical variables as predictors in our models (Module 6) and how to build models with multiple predictors (Module 7).