PSY350, Module 6

Qualitative Predictors

Module road map

Learning objectives

  1. Describe the difference between qualitative and quantitative variables
  2. Contrast within group and between group variability in the context of analysis of variance
  3. Create dummy-coded indicators to represent qualitative variables as predictors in a regression model
  4. Fit a regression model with dummy-coded indicators
  5. Interpret the results of a regression model with dummy-coded indicators

Readings

  1. Read section 5.2 of Chapter 5 of your Textbook

Overview

In Module 5 we studied a regression model with a numeric, continuously distributed predictor. In this module, we will learn how to consider qualitative predictors. Examples of qualitative variables include gender, race/ethnicity, political party affiliation, and marital status.

Our goal will be to understand how a continuous outcome varies across the levels of a qualitative variable. The levels of these qualitative variables are often referred to as groups.

Qualitative variables (also commonly referred to as categorical variables, character variables, grouping variables, or factors) describe qualities rather than quantities. The values denote ordered or unordered categories or groups.

Artwork by @allison_horst


  • Unordered (i.e., nominal) variables have no reasonable order that can be applied. Examples include:

    • gender (e.g., man, woman, non-binary)
    • marital status (e.g., married, divorced, single, widowed)
    • political party affiliation (e.g., Republican, Democrat, Independent)

  • Ordered (i.e., ordinal) variables have a reasonable order, but they are not on a ratio or interval scale. Examples include:

    • level of education (e.g., high school dropout, high school grad, college grad)
    • developmental period (e.g., child, teen, adult)
    • degree of liberalism (conservative, moderate, liberal)

When a variable has two categories, it is dichotomous (also called binary). When a variable has more than two categories, it is polytomous.

In Part 1 of this module we will consider a technique called analysis of variance (ANOVA) to study how a continuous y variable differs as a function of a qualitative grouping variable. In Part 2 of this module we will learn how qualitative grouping variables can be re-coded and included as predictors in a regression model.


Part 1: Analysis of variance (ANOVA)

It is common for scientists to study how some variable of interest differs as a function of a qualitative or grouping variable. In this type of analysis, we are interested in contrasting the variability of some variable across groups – when we conduct this type of examination, we are conducting an analysis of variance (ANOVA). To build your intuition about ANOVA, please watch the following Crash Course: Statistics video on ANOVA. Please note that the video includes an overview of fitting ANOVA models, as well as inference testing (with topics such as degrees of freedom, test statistics, and test distributions). You don’t need to worry about the inference parts for now – we’ll revisit these techniques in Module 11.



Imagine that you and your research team are preparing to study an intervention to increase physical activity among older adults living in urban affordable housing communities.

For your study, you want to collect objective measures of physical activity rather than relying on self-reported physical activity. There are many wearable devices on the market that measure physical activity. You are considering two – we’ll call these Device A and Device B. Device A is considered the gold standard for measuring physical activity, but is very expensive. Device B is new on the market and hasn’t been well studied; it is much less expensive and would be a cost savings to the project if it worked as well as Device A. You seek to measure the accuracy of these devices in measuring minutes spent in moderate physical activity (MPA). You devise the following study:

All participants will spend 60 minutes engaged in easy to brisk walking on a treadmill under the supervision of research assistants, but the amount of time in moderate physical activity will differ. This will be controlled by the treadmill program – with the speed and the grade varying to push the individual’s heart rate into a moderate physical activity range for the designated minutes.

  1. In condition 1, participants will accumulate 15 minutes of MPA over the 60-minute session
  2. In condition 2, participants will accumulate 30 minutes of MPA over the 60-minute session
  3. In condition 3, participants will accumulate 45 minutes of MPA over the 60-minute session

You recruit 45 people from your target demographic to participate in the measurement study, and randomly assign 15 people to each condition. All people wear both devices simultaneously and complete their designated 60-minute activity bout. The resulting data are in the dataframe called activity.Rds. The following variables are included:

  • pid is the participant ID
  • condition is the condition randomly assigned – 15 minutes, 30 minutes, 45 minutes of MPA
  • DevA_MPA is the recorded minutes of MPA from Device A
  • DevB_MPA is the recorded minutes of MPA from Device B

Let’s read in the dataframe.

activity <- here("data", "activity.Rds") %>% 
  read_rds()

And, take a look at the data.

activity %>% glimpse()
## Rows: 45
## Columns: 4
## $ pid       <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ condition <chr> "15 minutes", "15 minutes", "15 minutes", "15 minutes", "15 …
## $ DevA_MPA  <dbl> 15.7, 12.0, 14.7, 14.7, 16.0, 14.1, 14.0, 14.6, 15.4, 15.3, …
## $ DevB_MPA  <dbl> 18.0, 27.4, 25.6, 6.7, 17.5, 16.8, 6.9, 19.1, 25.1, 20.1, 10…

Let’s produce descriptive statistics for DevA_MPA and DevB_MPA by condition.

activity %>% 
  select(-pid) %>% 
  group_by(condition) %>% 
  skim()
Data summary
Name Piped data
Number of rows 45
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables condition

Variable type: numeric

skim_variable condition n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DevA_MPA 15 minutes 0 1 14.68 1.31 11.6 14.35 15.3 15.55 16.0 ▂▁▂▃▇
DevA_MPA 30 minutes 0 1 29.81 0.72 28.3 29.50 30.0 30.35 30.5 ▂▂▁▅▇
DevA_MPA 45 minutes 0 1 44.96 1.04 43.3 44.20 45.0 45.70 46.9 ▃▇▅▅▃
DevB_MPA 15 minutes 0 1 16.39 7.97 1.4 9.80 17.5 22.60 27.4 ▂▇▃▇▇
DevB_MPA 30 minutes 0 1 31.52 8.12 16.0 25.85 32.0 37.05 45.5 ▃▆▇▇▃
DevB_MPA 45 minutes 0 1 46.68 8.00 30.2 41.00 47.9 51.45 61.4 ▂▅▇▆▃

Notice the mean of minutes in MPA measured by each device seems quite close to the values dictated by condition. For example, for Device A the mean measured MPA (DevA_MPA) is 14.7 minutes for the 15-minute condition, 29.8 minutes for the 30-minute condition, and 45.0 minutes for the 45-minute condition.

Let’s visualize the data with a jittered point plot by condition (along the x axis) and facet by measurement device. Notice that in this section I use pivot_longer to pivot the data from a wide to a long dataframe to aid in creating the plot. If you’d like to learn more about this function, please see this documentation. There is also a discussion of the function in your textbook (Chapter 4). I also create a factor in this section, which is R’s data type for categorical variables. If you’d like to study factors in R, this DataCamp course is excellent.

activity_long <- activity %>% 
  pivot_longer(cols = c("DevA_MPA", "DevB_MPA"), names_to = "device",  values_to = "MPA") %>% 
  mutate(device = factor(device, levels = c("DevA_MPA", "DevB_MPA"), labels = c("Device A", "Device B")))

activity_long %>% 
  ggplot(aes(x = condition, y = MPA, color = condition)) + 
  geom_jitter(width = .1, show.legend = FALSE, alpha = .35) +
  geom_hline(yintercept = 45, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = 30, color = "green", linetype = "dashed") +
  geom_hline(yintercept = 15, color = "pink", linetype = "dashed") +
  theme_bw() +
  stat_summary(fun = mean, size = 5, shape = 4, geom = "point", colour = "black", show.legend = FALSE) + 
  labs(title = "Variability in MPA by experimental condition and measurement device",
       subtitle = "Mean within each condition is marked with an X, dashed colored lines correspond to condition specifications",
       x = "Condition",
       y = "Minutes of Moderate or Vigorous Physical Activity") +
  facet_wrap(~device)

This plot represents the data collected in the study. On the left, we see the results for Device A. On the right, we see the results for Device B. Along the x-axis are the experimental conditions, and on the y-axis is the outcome – the minutes of physical activity that the device measured.

I added three dashed horizontal lines to this plot to indicate the minutes of physical activity that the device should have picked up given the experimental condition. For example, the blue line represents 45 minutes of moderate physical activity; this is the number of minutes that the device should have recorded in condition 3 (the 45-minute condition) if the device was accurate.

Notice that the sample means from the two devices (the X marks) are quite similar and are close to what was dictated by the experimental condition (see dashed lines).

Device B seems to consistently overestimate minutes in MPA (the X mark is slightly above what it should be as noted by the dashed line), while Device A produces a mean in each condition that is very close to the minutes dictated by the experimental condition (the X is right on the dashed line).

The variability of the individual data points around the group mean differs substantially between the two devices. The individual measures garnered from Device A are tightly clustered around the mean, while the individual measures garnered from Device B are much more variable. This means that Device A consistently captures the correct minutes for every participant, while Device B is not very consistent.


For now, let’s focus only on the results from Device A.

We can partition the total variability in MPA into two components – within group variation and between group variation, where the groups are conditions in our example (i.e., 15 minutes, 30 minutes, or 45 minutes of MPA).

Within group variability

The within group variation captures the extent to which each individual differs from their group’s mean.

Consider the 15-minute condition. There are 15 people in this condition. To quantify the within group variation, we can compute the difference between each person’s MPA score and the average MPA score for their group, square these differences, and then sum these squared differences across all people in the group. This is called the sum of squares within each group. In the table below, try calculating these values by hand to see if you can obtain the same numbers for pid # 1.

ex15min <- activity %>% 
  filter(condition == "15 minutes") %>% 
  select(pid, DevA_MPA) %>% 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         wg_diff_DevA = DevA_MPA - grp_mean_DevA,
         sq_wg_diff_DevA = wg_diff_DevA^2) 

ex15min
pid DevA_MPA grp_mean_DevA wg_diff_DevA sq_wg_diff_DevA
1 15.7 14.68 1.02 1.0404
2 12.0 14.68 -2.68 7.1824
3 14.7 14.68 0.02 0.0004
4 14.7 14.68 0.02 0.0004
5 16.0 14.68 1.32 1.7424
6 14.1 14.68 -0.58 0.3364
7 14.0 14.68 -0.68 0.4624
8 14.6 14.68 -0.08 0.0064
9 15.4 14.68 0.72 0.5184
10 15.3 14.68 0.62 0.3844
11 15.5 14.68 0.82 0.6724
12 11.6 14.68 -3.08 9.4864
13 15.6 14.68 0.92 0.8464
14 15.3 14.68 0.62 0.3844
15 15.7 14.68 1.02 1.0404
ex15min %>% summarize(SSwithin_grp = sum(sq_wg_diff_DevA)) 
SSwithin_grp
24.104

We can do the same thing for the 30-minute condition and the 45-minute condition. Rather than writing the code three times, we can use the group_by function in dplyr to perform the calculations by condition.

Doing so yields three sums of squares – one for each group.

Once the three sum of squares are calculated, we sum the sum of squared differences for each group (in our example, that means summing three values – one for each condition).

activity %>% 
  select(pid, condition, DevA_MPA) %>% 
  group_by(condition) %>% 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         wg_diff_DevA = DevA_MPA - grp_mean_DevA,
         sq_wg_diff_DevA = wg_diff_DevA^2) %>% 
  summarize(SSwithin_grp = sum(sq_wg_diff_DevA)) %>% 
  ungroup() %>% 
  summarize(SSwithin = sum(SSwithin_grp))
SSwithin
46.66933

This value (46.7) represents a quantity officially called the Within Groups Sum of Squares. When analyzing variance across groups in an experiment (where people are randomly assigned to groups), this part of the variance is referred to as error because it is variation in the scores that is not due to the experimental condition or treatment. That is, to the extent that people differ within a group, that variability is clearly not due to experimental condition, since everyone in the group received the same manipulation/treatment. For example, within each of our groups, all people accumulated the same minutes of MPA. This quantity is also called the Sum of Squares Error (SSE).

Between group variability

The between groups variation captures the extent to which groups differ from one another. In this case, that is, the degree to which our three experimental conditions differ (15 minutes of MPA, 30 minutes of MPA and 45 minutes of MPA). To obtain the between groups variation, we compute the difference between each group’s average MPA score and the overall average MPA (i.e., the average score for all 45 people in the experiment), square these differences, and then sum these squared differences across all groups.

For example, the average minutes of MPA across the three conditions is 29.82 minutes.

activity %>% 
  select(DevA_MPA) %>% 
  skim()
Data summary
Name Piped data
Number of rows 45
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DevA_MPA 0 1 29.82 12.54 11.6 15.6 30 44.1 46.9 ▇▁▇▁▇

And, the average MPA for the 15 minute condition is 14.68 minutes.

activity %>% 
  filter(condition == "15 minutes") %>% 
  select(DevA_MPA) %>% 
  skim()
Data summary
Name Piped data
Number of rows 15
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DevA_MPA 0 1 14.68 1.31 11.6 14.35 15.3 15.55 16 ▂▁▂▃▇

We compute the difference between the 15-minute group’s average MPA score and the overall average MPA as 14.68 minutes - 29.82 minutes – which equals about -15.14. -15.14 squared equals ~229. Since there are 15 people in the group, we sum this value across the 15 people – about 229 * 15, or 3436 (the precise value without rounding). We then do this same work for the 30-minute group and the 45-minute group. Finally, we sum these three sums of squares across the three groups. This value (6876.6) represents the Between Groups Sum of Squares. The code chunk below does all of this work.

activity %>% 
  select(pid, condition, DevA_MPA) %>% 
  mutate(total_mean_DevA = mean(DevA_MPA)) %>% 
  group_by(condition) %>% 
  mutate(grp_mean_DevA = mean(DevA_MPA),
         bg_diff_DevA = grp_mean_DevA - total_mean_DevA,
         sq_bg_diff_DevA = bg_diff_DevA^2) %>% 
  summarize(SSbetween_grp = sum(sq_bg_diff_DevA)) %>% 
  ungroup() %>% 
  summarize(SSbetween = sum(SSbetween_grp))
SSbetween
6876.59

When analyzing variance across groups in an experiment (where people are randomly assigned to groups), this part of the variability is due to experimental condition. That is, to the extent that group means differ from one another, then that variability is due to the experimental condition or treatment. This quantity is also called the Model Sum of Squares (SSM) or Sum of Squares Regression (SSR).
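Together, the two components partition the total variability exactly: the sum of squared deviations of every score from the overall mean equals the Within Groups Sum of Squares plus the Between Groups Sum of Squares. A minimal sketch with made-up numbers (not the activity data) demonstrates this identity:

```r
# Toy data: three groups of three scores each (made-up values, not the study data)
scores <- c(14, 15, 16, 29, 30, 31, 44, 45, 46)
group  <- rep(c("g1", "g2", "g3"), each = 3)

grand_mean  <- mean(scores)          # overall mean of all scores
group_means <- ave(scores, group)    # replaces each score with its group mean

ss_total   <- sum((scores - grand_mean)^2)       # 1356
ss_within  <- sum((scores - group_means)^2)      # 6
ss_between <- sum((group_means - grand_mean)^2)  # 1350

ss_within + ss_between == ss_total  # TRUE
```

The same identity holds for our activity data: the Within Groups Sum of Squares (46.7) plus the Between Groups Sum of Squares (6876.6) equals the total sum of squared deviations of DevA_MPA from its overall mean.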

Contrasting within group and between groups variability

By contrasting the within groups and between groups variability we gain a sense of which component encompasses more of the variability.

For Device A, the Between Groups Sum of Squares is very large (~6877) and the Within Groups Sum of Squares is comparatively much smaller (~47).

This tells us that most of the variability in MPA scores for Device A is due to the different conditions people were assigned to (i.e., 15 minutes, 30 minutes, or 45 minutes), with very little variability within condition. As we saw in the jittered point plot of the data, the mean MPA is extremely close to the designated minutes of MPA for the condition (corresponding dashed line), denoting a high degree of accuracy for Device A. Furthermore, there is little person-to-person variability within condition, denoting a high degree of precision.

We don’t need to compute Within Groups and Between Groups Sum of Squares on our own as I have in this section. Instead, the aov() function in base R will do this work for us. Note that the value under Sum Sq for condition is 6877 – matching with our calculation of the Between Groups Sum of Squares, and the value for Residuals is 47 – matching with our calculation of Within Groups Sum of Squares. We’ll learn about the other elements of this output in Module 11.

aovA <- activity %$% aov(DevA_MPA ~ condition, data = .)
summary(aovA)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## condition    2   6877    3438    3094 <2e-16 ***
## Residuals   42     47       1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s finish up our exploration of within and between groups variability by assessing these same values for Device B. Recall from our plot of the data, that Device B had much greater within group variability.

aovB <- activity %$% aov(DevB_MPA ~ condition, data = .)
summary(aovB)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## condition    2   6883    3441   53.37 2.93e-12 ***
## Residuals   42   2708      64                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The value under Sum Sq for condition is 6883; this is the Between Groups Sum of Squares for Device B. It is quite close to the Between Groups Sum of Squares for Device A. The value under Sum Sq for Residuals is 2708; this is the Within Groups Sum of Squares for Device B. It is much larger than the same value for Device A.

Thus, the ratio of Between Groups variability to Within Groups variability for Device A is very large, while the ratio of Between Groups variability to Within Groups variability for Device B is substantially smaller. If your research team’s desire is to feel confident that the device is accurately and precisely measuring real differences in MPA, which device should you choose?
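One way to put this contrast on a common 0–1 scale is the proportion of the total sum of squares that lies between groups (a quantity often called eta-squared). Using the rounded sums of squares from the two aov() summaries above:

```r
# Sums of squares read off the aov() summaries above (rounded values)
ssb_A <- 6877; ssw_A <- 47     # Device A: between groups, within groups
ssb_B <- 6883; ssw_B <- 2708   # Device B: between groups, within groups

# Proportion of total variability that lies between conditions
ssb_A / (ssb_A + ssw_A)  # ~0.993 for Device A
ssb_B / (ssb_B + ssw_B)  # ~0.718 for Device B
```

Nearly all of Device A’s variability is attributable to condition, versus roughly 72% for Device B.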

Please click on the video below for further discussion of the ANOVA results for the two devices.

Inter-rater reliability

When comparing two ratings of a phenomenon, a researcher is often interested in a statistical quantity called inter-rater reliability (IRR). Our example of the comparison of two devices to measure exercise lends itself well to considering inter-rater reliability. Here, the two raters are the two devices, and we seek to determine how well the two devices agree.

We can use the agree() function from the irr package to calculate one simple type of inter-rater reliability – percent agreement. To use it, we need to subset our dataframe to include just the two variables that we want to compare (DevA_MPA and DevB_MPA). The agree() function takes just one argument: the tolerance. The value we provide (in the units of y – minutes in our example) is the maximum difference between two ratings that should still be regarded as rater agreement. I chose 10, indicating that the function will note “agreement” if the two scores are within 10 minutes of one another, and “disagreement” if they are further than 10 minutes apart.

activity %>% select(starts_with("Dev")) %>% irr::agree(tolerance = 10)
##  Percentage agreement (Tolerance=10)
## 
##  Subjects = 45 
##    Raters = 2 
##   %-agree = 73.3

The result of the agree() function indicates that the two scores agree (within our tolerance specification) about 73% of the time. That’s not so hot, and the ANOVA results combined with the IRR analysis cast doubt on the adequacy of Device B for our research study. If you’d like to learn more about inter-rater reliability, please see this article. It provides much more information and many additional ways to assess inter-rater reliability. If you’d like to learn more about conducting inter-rater reliability analyses in R, please click here.
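Conceptually, percent agreement with a tolerance amounts to the share of paired ratings that fall within the tolerance of one another. A sketch of that idea with made-up ratings (not the study data):

```r
# Made-up paired ratings from two hypothetical devices (not the study data)
dev_a <- c(15, 30, 45, 20, 35)
dev_b <- c(18, 27, 60, 21, 50)

tolerance <- 10
# A pair "agrees" when the absolute difference is no larger than the tolerance
pct_agree <- mean(abs(dev_a - dev_b) <= tolerance) * 100
pct_agree  # 60
```

Here three of the five pairs differ by 10 minutes or less, giving 60% agreement.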


Part 2: The regression approach to qualitative predictors

In this second part of the module we are going to work with the opioid data introduced in Module 3. Specifically, we’ll work with the dataframe called opioid2010_cat.Rds, which includes the opioid data at the county level for 2010, as well as the area disadvantage score (based on census records) that we joined with the opioid dataframe, matching on county.

The dataframe includes the following variables:

  • fips is a 5 digit code that identifies each county

  • county

  • state is an abbreviation

  • year

  • number_pills is the number of Oxycodone & Hydrocodone pills purchased by pharmacies in the county

  • population is the total number of people living in the county

  • pills_person is a ratio – calculated as number_pills/population to represent number of pills per capita

  • ADI is the area disadvantage index of the county – a higher score means more disadvantage

  • mincounty is a qualitative indicator of whether less than 50% of the residents of the county are non-Hispanic White (“minority county”), or 50% or more of the residents of the county are non-Hispanic White (“white county”)

Let’s import the dataframe

opioid <- here("data", "opioid2010_cat.Rds") %>% 
  read_rds() %>% 
  drop_na()

Here’s a glimpse of the data.

opioid %>% glimpse()
## Rows: 2,991
## Columns: 9
## $ fips         <chr> "45001", "22001", "51001", "16001", "19001", "21001", "29…
## $ county       <chr> "ABBEVILLE", "ACADIA", "ACCOMACK", "ADA", "ADAIR", "ADAIR…
## $ state        <chr> "SC", "LA", "VA", "ID", "IA", "KY", "MO", "OK", "CO", "IA…
## $ year         <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ number_pills <dbl> 539280, 2679687, 800550, 16145256, 222100, 1033160, 13321…
## $ population   <int> 25643, 61139, 34066, 380718, 7779, 18542, 25315, 22424, 4…
## $ pills_person <dbl> 21.0303007, 43.8294215, 23.4999706, 42.4073881, 28.551227…
## $ ADI          <dbl> 115.04469, 117.66953, 105.90245, 76.85215, 88.76101, 122.…
## $ mincounty    <fct> white county, white county, white county, white county, w…

Dummy coding

By simple recoding of the qualitative predictor(s) that we aim to study, we can easily incorporate these types of categorical variables in our regression models. In this unit, we will consider the most common method — dummy coding. There are other types that you may encounter as well (e.g., effect coding), but once you learn one method you can easily understand and apply the others.

Dummy-coded variables summarize all the information in a qualitative variable with k categories, via k-1 dummy-coded indicators. For example, a qualitative variable with two categories (e.g., sex — male or female) can be represented by one dummy-coded indicator. A qualitative variable with three categories (e.g., race/ethnicity — Black, Hispanic, non-Hispanic White) can be represented by two dummy-coded indicators.

Why k-1 dummy codes? One category serves as the reference group. This category is assigned a 0 for all dummy-coded indicators. The remaining categories have a 1 for the respective category and a 0 for all others.

The group coded 0 for all dummy-coded indicators is called the reference group – they’re the group to which we’ll compare all other groups. In the Modern Dive textbook, the authors refer to the reference group as the “baseline for comparison.”
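R’s factors make this concrete: when a factor enters a model formula, R builds the k-1 dummy-coded indicators automatically, treating the first factor level as the reference group. A small illustration with a made-up three-level variable:

```r
# A made-up three-level factor; levels default to alphabetical order,
# so "Democrat" becomes the reference group
party <- factor(c("Republican", "Democrat", "Independent", "Democrat"))

# model.matrix() shows the k - 1 = 2 dummy-coded indicators (plus the intercept column)
model.matrix(~ party)
```

Below, we build the dummy-coded indicators by hand instead, which makes the coding explicit and lets us choose the reference group directly.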

In this section, we’ll study two examples. First, we’ll consider a qualitative variable with two levels. Then, we’ll consider a qualitative variable with four levels.

Linear regression modeling with a binary qualitative predictor

To begin our exploration, we’re going to consider the race/ethnic make-up of the county (using our mincounty variable) as a predictor of number of pills per capita.

Let’s begin by creating a dummy variable to represent minority counties – we’ll call it mincountyDummy, and code counties where less than 50% of the residents are non-Hispanic White as 1, counties where 50% or more of the residents are non-Hispanic White as 0. To create the dummy-coded indicator, the case_when() function inside of the mutate() function is used.

Artwork by @allison_horst


To use the case_when() function in our example, we first define the new variable name (mincountyDummy), then indicate the conditions for assigning values to the new variable. Here the case_when() function for defining mincountyDummy indicates that if mincounty equals “white county” then assign a 0, otherwise if mincounty equals “minority county” assign a 1.

opioid <- opioid %>% 
  mutate(mincountyDummy = case_when(mincounty == "white county" ~ 0,
                                     mincounty == "minority county" ~ 1))

Note that we could have also written the code as follows and obtained the same result.

opioid <- opioid %>% 
  mutate(mincountyDummy = case_when(mincounty == "white county" ~ 0,
                                     TRUE ~ 1))

A cross tabulation of mincountyDummy and mincounty (produced by the tabyl() function from the janitor package) shows us that our coding is correct. When mincountyDummy = 0, mincounty = “white county”, and when mincountyDummy = 1, mincounty = “minority county.” There are 304 minority counties and 2687 white counties.

opioid %>% tabyl(mincountyDummy, mincounty)
mincountyDummy white county minority county
0 2687 0
1 0 304

First, let’s calculate the mean score for pills_person by our grouping variable – minority county.

opioid %>% 
  select(mincounty, pills_person) %>% 
  group_by(mincounty) %>% 
  skim()
Data summary
Name Piped data
Number of rows 2991
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables mincounty

Variable type: numeric

skim_variable mincounty n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pills_person white county 0 1 42.71 24.89 0.03 26.41 38.37 54.24 324.94 ▇▁▁▁▁
pills_person minority county 0 1 31.14 19.98 0.04 19.19 27.70 40.33 252.37 ▇▁▁▁▁

It looks like the number of pills per capita is substantially higher in white counties (mean = 42.7, median = 38.4) as compared to minority counties (mean = 31.1, median = 27.7). Next, let’s construct a density plot of pills per capita, one for each group.

opioid %>% 
  ggplot(aes(x = pills_person, fill = mincounty)) +
  geom_density(alpha = .5) +
  theme_bw() +
  labs(title = "Were more pills per capita purchased in majority White counties?",
       x = "Number of pills per capita",
       y = "Density",
       fill = "Race/ethnic make-up of county")

The distribution of pills_person for white counties is clearly more to the right than for minority counties – this is in line with the means/medians just examined.

Let’s create a jittered scatter plot to take an additional look at the distribution of pills per capita as a function of minority make-up of the county.

opioid %>% 
  ggplot(aes(x = mincounty, y = pills_person, color = mincounty)) + 
  geom_jitter(width = .1, show.legend = FALSE) +
  theme_bw() +
  stat_summary(fun = mean, size = 5, shape = 4, geom = "point", colour = "black", show.legend = FALSE) + 
  labs(title = "Were more pills per capita purchased in majority White counties?",
       subtitle = "Mean of each group is marked with an X",
       x = "",
       y = "Number of pills per capita")

This plot shows that the mean of pills per capita is higher in the white counties (the means for each group are marked with an X).

Now that we’ve visualized the data, let’s fit a simple linear regression model, regressing pills per capita (pills_person) on the dummy-coded indicator representing the county minority status (mincountyDummy).

cat_lm1 <- opioid %>% 
  lm(pills_person ~ mincountyDummy, data = .)

cat_lm1 %>% get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 42.708 0.471 90.581 0 41.784 43.633
mincountyDummy -11.566 1.479 -7.820 0 -14.466 -8.666

Let’s focus on the columns labeled “term” and “estimate.” The intercept of the regression model is, as usual, the predicted score when x = 0, that is, the predicted number of pills per capita in majority white counties (recall that mincountyDummy = 0 for white counties).

The slope of the regression model (i.e., the effect of mincountyDummy) is, as usual, the predicted change in y for a one-unit increase in x. Therefore, we can interpret this as the expected difference in pills per capita in minority counties compared to white counties. An increase of one unit for the mincountyDummy variable means contrasting a minority county (1) to a white county (0). This means that minority counties, on average, had 11.6 fewer pills per capita as compared to white counties.

We can write the equation from the fitted regression models as follows:

\[\hat{y} = b_{0}+(b_{1}\times mincountyDummy)\]
\[\hat{y} = 42.7+(-11.6\times mincountyDummy)\]


  • For the white counties: \(\hat{y} = 42.7+(-11.6\times\ 0) = 42.7\)
  • For the minority counties: \(\hat{y} = 42.7+(-11.6\times\ 1) = 31.1\)

We can map all of this onto the jitter plot that we created earlier:

So, you can see that with this simple linear regression, we can recreate the means of our continuous y variable (pills_person) in each of the groups. This may not seem all that interesting now, but it will become highly useful as we begin to build our regression models with both categorical and continuous predictors and begin testing for statistical significance.

Linear regression modeling with a polytomous qualitative predictor

Now, let’s consider a categorical variable with more than 2 categories.

Recall that the ADI is an index of the area disadvantage based on census indicators. It’s a continuous measure; however, for our purposes we will create quartiles (4 groupings) of ADI. Counties are binned based on their ADI score into the first quartile (labeled “1st Q” – which represents the least disadvantaged group), “2nd Q”, “3rd Q”, and “4th Q” (the most disadvantaged). We’ll subset to include only the majority white counties.

df_white <- opioid %>% 
  filter(mincounty == "white county") %>% 
  mutate(ADI_q = ntile(ADI, 4)) %>% 
  mutate(ADI_q = factor(ADI_q, levels = c(1,2,3,4), 
                        labels = c("1st Q", "2nd Q", "3rd Q", "4th Q"))) 

df_white %>% tabyl(ADI_q)
ADI_q n percent
1st Q 672 0.2500930
2nd Q 672 0.2500930
3rd Q 672 0.2500930
4th Q 671 0.2497209

Let’s take a look at descriptive statistics of pills_person by ADI quartile.

df_white %>% 
  select(ADI_q, pills_person) %>% 
  group_by(ADI_q) %>% 
  skim()
Data summary
Name Piped data
Number of rows 2687
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables ADI_q

Variable type: numeric

skim_variable ADI_q n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pills_person 1st Q 0 1 33.00 15.60 0.10 22.33 31.61 41.67 105.88 ▃▇▃▁▁
pills_person 2nd Q 0 1 35.80 18.27 0.13 22.44 33.31 46.63 109.10 ▅▇▃▁▁
pills_person 3rd Q 0 1 45.60 21.56 0.07 30.50 42.19 58.36 142.56 ▃▇▃▁▁
pills_person 4th Q 0 1 56.45 33.18 0.03 35.34 50.66 68.75 324.94 ▇▃▁▁▁

The descriptive statistics show that the number of pills per capita increases with each step up in ADI quartile. That is, the more disadvantaged counties tended to have more pills per capita. Next, let’s construct a density plot of pills per capita, one for each group.

df_white %>% 
  ggplot(aes(x = pills_person, fill = ADI_q)) +
  geom_density(alpha = .5) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Were more pills per capita purchased in counties with greater disadvantage?",
       x = "Number of pills per capita",
       y = "Density",
       fill = "Quartile of Area Disadvantage Index") +
  facet_wrap(~ADI_q, nrow = 1)

The same pattern is observed in the density plots: the higher-ADI counties have more of their distributions toward the right of the plot – that is, toward more pills per capita.

Let’s create one last plot, a boxplot of pills per capita, with ADI quartile as a grouping variable.

df_white %>% 
  ggplot(aes(x = ADI_q, y = pills_person)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "skyblue", fill = "skyblue") +
  theme_bw() +
  labs(title = "Were more pills per capita purchased in counties with greater disadvantage?",
       subtitle = "Mean of each group is depicted with blue circle",
       x = "Quartile of the Area Disadvantage Index (ADI)",
       y = "Number of pills per capita")

This boxplot further depicts the increase in pills per capita as ADI quartile increases.

Now that we’ve visualized the data, we’re ready to fit a model. Before fitting the linear regression model, we need to create a set of dummy-coded indicators to represent ADI quartile.

From this categorical variable (called ADI_q), we can create three dummy-coded indicators to represent the four groups. Let’s select the 1st quartile as the reference group. The reference group (1st quartile) has a 0 score for all three dummy-coded variables. Then for each of the dummy-coded indicators, one of the remaining groups is coded 1, while all others are coded 0. I named each dummy-coded indicator to reflect the group coded 1. This will help us when we interpret the fitted models.

df_white <- df_white %>% 
  mutate(ADIq2 = case_when(ADI_q == "2nd Q" ~ 1, TRUE ~ 0)) %>% 
  mutate(ADIq3 = case_when(ADI_q == "3rd Q" ~ 1, TRUE ~ 0)) %>% 
  mutate(ADIq4 = case_when(ADI_q == "4th Q" ~ 1, TRUE ~ 0))

The code below lets us double-check our work and confirm that the dummy-coded variables are what we intend.

df_white %>% tabyl(ADI_q, ADIq2)
ADI_q 0 1
1st Q 672 0
2nd Q 0 672
3rd Q 672 0
4th Q 671 0
df_white %>% tabyl(ADI_q, ADIq3)
ADI_q 0 1
1st Q 672 0
2nd Q 672 0
3rd Q 0 672
4th Q 671 0
df_white %>% tabyl(ADI_q, ADIq4)
ADI_q 0 1
1st Q 672 0
2nd Q 672 0
3rd Q 672 0
4th Q 0 671

Here, we see that:

  • when ADI_q is “1st Q”, all three dummies are 0,

  • when ADI_q is “2nd Q”, ADIq2 = 1, but ADIq3 = 0 and ADIq4 = 0

  • when ADI_q is “3rd Q”, ADIq3 = 1, but ADIq2 = 0 and ADIq4 = 0

  • when ADI_q is “4th Q”, ADIq4 = 1, but ADIq2 = 0 and ADIq3 = 0

Therefore, everything looks as it should.
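The rule those tables verify can be stated compactly: each non-reference level gets an indicator that is 1 for that level and 0 for every other level, and the reference level is 0 on all of them. A minimal Python sketch of that mapping (the function and its defaults are illustrative, not part of the R workflow above):

```python
# Return the (ADIq2, ADIq3, ADIq4) dummy codes for a quartile label.
# "1st Q" is the reference group, so it maps to all zeros.
def dummy_code(level, others=("2nd Q", "3rd Q", "4th Q")):
    return [1 if level == o else 0 for o in others]

for q in ["1st Q", "2nd Q", "3rd Q", "4th Q"]:
    print(q, dummy_code(q))
# -> 1st Q [0, 0, 0]
#    2nd Q [1, 0, 0]
#    3rd Q [0, 1, 0]
#    4th Q [0, 0, 1]
```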

We’ll regress pills per capita (pills_person) on the three dummy-coded indicators representing ADI quartile. The following equation represents the model we will fit; notice that three dummy-coded indicators are needed to predict the outcome.

\[\hat{y} = b_{0}+(b_{1}\times ADIq2) + (b_{2}\times ADIq3) + (b_{3}\times ADIq4)\]

Take a look at the code below to fit the regression model. Notice that we have three dummy-coded indicators, and we simply add all three as predictors to the regression model, separating each dummy-coded indicator with a plus sign.

Recall that quartile 1, the group with the least disadvantage, is the reference group.

cat_lm2 <- df_white %>% 
  lm(pills_person ~ ADIq2 + ADIq3 + ADIq4, data = .)

cat_lm2 %>% get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 33.005 0.893 36.971 0.000 31.254 34.755
ADIq2 2.797 1.262 2.215 0.027 0.321 5.272
ADIq3 12.592 1.262 9.974 0.000 10.116 15.067
ADIq4 23.445 1.263 18.563 0.000 20.968 25.921

Let’s focus on the columns labeled “term” and “estimate.” The intercept of the regression model is the predicted score when all x variables equal 0. In this example, that is the predicted number of pills per capita in the reference group. The reference group is ADI quartile 1 – which is the least disadvantaged group.

The three slopes compare each of the other quartiles to the reference quartile.

  • The estimate for ADIq2 indicates that the mean for quartile 2 is 2.8 units (i.e., pills per capita) higher than the mean for quartile 1.

  • The estimate for ADIq3 indicates that the mean for quartile 3 is 12.6 units higher than quartile 1.

  • The estimate for ADIq4 indicates that the mean for quartile 4 is 23.4 units higher than quartile 1.

The equation from the fitted regression model is written as follows:

\[\hat{y} = b_{0}+(b_{1}\times ADIq2) + (b_{2}\times ADIq3) + (b_{3}\times ADIq4)\]
\[\hat{y} = 33.0+(2.8\times ADIq2) + (12.6\times ADIq3) + (23.4\times ADIq4)\]

We can recover the mean for each group by using this equation:

For ADI quartile 1: \(\hat{y} = 33.0+2.8\times 0 + 12.6\times 0 + 23.4\times 0=33.0\)

For ADI quartile 2: \(\hat{y} = 33.0+2.8\times 1 + 12.6\times 0 + 23.4\times 0=35.8\)

For ADI quartile 3: \(\hat{y} = 33.0+2.8\times 0 + 12.6\times 1 + 23.4\times 0=45.6\)

For ADI quartile 4: \(\hat{y} = 33.0+2.8\times 0 + 12.6\times 0 + 23.4\times 1=56.4\)
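The same arithmetic with the unrounded estimates from the regression table recovers the group means from the descriptive statistics almost exactly (a quick Python check; the coefficient values are copied from the output above):

```python
# Coefficients from the fitted model's regression table.
b0, b1, b2, b3 = 33.005, 2.797, 12.592, 23.445

# Predicted mean for each quartile: the intercept plus that group's slope
# (the reference group, quartile 1, is just the intercept).
means = [b0, b0 + b1, b0 + b2, b0 + b3]
print(means)  # quartile 1 through quartile 4
```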


Let’s map all of this onto our earlier plot. In the enhanced plot below, the blue dots are the same as in the boxplot presented earlier, but I removed the boxes to make it easier to see all of the new elements. You can see how the intercept and slopes of the regression model line up with the graph:

  • the intercept is the mean for quartile 1,

  • the slope for the quartile 2 dummy code is the difference between quartile 2 and quartile 1,

  • the slope for the quartile 3 dummy code is the difference between quartile 3 and quartile 1, and

  • the slope for the quartile 4 dummy code is the difference between quartile 4 and quartile 1.

Finally, let’s take a look at the \(R^2\), labeled r_squared.

cat_lm2 %>% get_regression_summaries()
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.137 0.136 534.7494 23.12465 23.142 141.687 0 3 2687

\(R^2\) represents the proportion of variability in the outcome (pills_person) explained by all of the predictors in the model. As such, this represents the total variability in pills_person that can be predicted by ADI quartile. That is, nearly 14% of the variability in pills per capita can be predicted by ADI quartile.
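The F statistic in this summary is tied to \(R^2\) by the identity \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\), where k = 3 predictors and n = 2687 counties. Plugging in the rounded \(R^2\) of 0.137 reproduces the reported 141.687 to within rounding error (a sketch rather than an exact check, since \(R^2\) is rounded):

```python
# Recover the overall F statistic from R-squared, the number of
# predictors (k), and the sample size (n). R-squared is rounded to
# three decimals here, so the result is only approximate.
r2, k, n = 0.137, 3, 2687
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F)  # roughly 142, close to the reported 141.687
```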

For a video recap of fitting a linear regression model with a qualitative predictor, please click the video below.

Making additional comparisons

You might be wondering if we can compare quartiles 2 through 4 to one another. Our first model considered quartile 1 as the reference group. Therefore, we obtained a comparison of the 1st quartile to all others. But what if we want to compare the other quartiles to one another?

One way to accomplish this is to change the reference group and refit the model. For example, we could make the 4th quartile the reference group. We’d need to create the corresponding set of dummy indicators (making the 4th quartile the reference group) and then refit the model. Let’s take a look at the code to accomplish this.

df_white <- opioid %>% 
  filter(mincounty == "white county") %>% 
  mutate(ADI_q = ntile(ADI, 4)) %>% 
  mutate(ADI_q = factor(ADI_q, levels = c(1:4), 
                        labels = c("1st Q", "2nd Q",
                                   "3rd Q", "4th Q"))) %>% 
  select(ADI_q, pills_person) %>% 
  mutate(ADIq1 = case_when(ADI_q == "1st Q" ~ 1, TRUE ~ 0)) %>% 
  mutate(ADIq2 = case_when(ADI_q == "2nd Q" ~ 1, TRUE ~ 0)) %>% 
  mutate(ADIq3 = case_when(ADI_q == "3rd Q" ~ 1, TRUE ~ 0))

cat_lm2_rotate <- df_white %>% 
  lm(pills_person ~ ADIq1 + ADIq2 + ADIq3, data = .)

cat_lm2_rotate %>% get_regression_table()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 56.450 0.893 63.187 0 54.698 58.202
ADIq1 -23.445 1.263 -18.563 0 -25.921 -20.968
ADIq2 -20.648 1.263 -16.349 0 -23.125 -18.172
ADIq3 -10.853 1.263 -8.593 0 -13.330 -8.377

Now, the intercept represents quartile 4, and the slopes provide the difference between each of the other quartiles and quartile 4.

As noted earlier, none of this is terribly interesting right now because these regression models with a single categorical predictor simply reproduce the means of the groups. However, this will become more interesting later when we begin to include both categorical and quantitative variables, and study whether the groups are significantly different from one another (Module 11).
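One point worth noting: the rotated model’s point estimates could also have been obtained by differencing slopes from the first model, since (Q2 − Q1) − (Q4 − Q1) = Q2 − Q4; what the refit supplies is the standard errors and p-values for those new comparisons. A quick Python check using the first model’s estimates:

```python
# Slopes from the first model, where quartile 1 is the reference group.
b1_ADIq2, b3_ADIq4 = 2.797, 23.445

# Quartile 2 minus quartile 4, via the shared reference group:
# (Q2 - Q1) - (Q4 - Q1) = Q2 - Q4
diff = b1_ADIq2 - b3_ADIq4
print(round(diff, 3))  # matches the ADIq2 slope (-20.648) in the rotated model
```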