PSY350, Module 3

Data Wrangling

Module road map

Learning objectives

  1. Identify the dplyr verbs for wrangling data in the tidyverse
  2. Apply each dplyr verb to accomplish a data wrangling task
  3. Practice preparing data for analysis

Tools for mastery

  1. Read Chapter 3 of your Textbook
  2. Read Chapter 4 of your Textbook

Overview

Artwork by @allison_horst

Artwork by @allison_horst

To wrangle data is to clean and prepare data for analysis. Our goal is to get the data into a tidy shape that can be readily plotted and analyzed. I like to think about data analysis in terms of painting a room. In painting, an excellent result comes from careful and thorough preparation. About 90% of painting is preparation – cleaning the walls, taping off trim, protecting the floor, getting the right tools, etc. Data analysis is similar – and data wrangling is the preparation. For most projects, most of your time will be spent preparing the data for analysis. If you skip this step or do a shoddy job – your data analysis will be poor.


The dplyr functions to wrangle data

dplyr is a R package that is part of the tidyverse — it’s the workhorse for wrangling your data.

In this Module, we will explore the following wrangling functions:

  1. Choose certain cases (rows) in your dataframe with filter()

  2. Choose certain variables (columns) in your dataframe with select()

  3. Collapse many values down to summaries with summarize()

  4. Create or modify variables with mutate()

  5. Give variables a new name with rename()

  6. Order your cases (rows) with arrange()

  7. Merge two dataframes together with join()


Introduction to the data

In this Module we’re going to work with a very large data repository compiled by The Washington Post (WAPO) that tracked the path of every opioid pain pill, from manufacturer to pharmacy, in the United States between 2006 and 2014. The data scientists at WAPO created a package (called arcos) that makes the data publicly available.

The opioid epidemic in America has been devastating for individuals and communities. To provide context to the data exploration in this Module, please watch the following video from Last Week Tonight with John Oliver. Please note that the video contains some lewd language, you can skip the video without it affecting your grade if you like. You can also peruse this National Public Radio article and this Nature paper that also provide context.



We will work with the county level data provided by WAPO in this Module. I already downloaded it from the arcos package and saved it as a dataframe in our RStudio Cloud project. Let’s import the the dataframe.

opioid_counties <- here("data", "opioid_counties.Rds") %>% 
  read_rds()

Each county in the US has up to nine rows of data in this dataframe, one for each year from 2006 to 2014. The dataframe includes the following variables:

  • fips is a 5 digit code that identifies each county

  • county

  • state is the state abbreviation

  • year

  • number_pills is the number of pills purchased – specifically the number of Oxycodone & Hydrocodone pills purchased by pharmacies in the county

  • population is the total number of people living in the county

Let’s take a glimpse at the data.

opioid_counties %>% 
  glimpse()
## Rows: 27,758
## Columns: 6
## $ fips         <chr> "45001", "45001", "45001", "45001", "45001", "45001", "45…
## $ county       <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBE…
## $ state        <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "LA…
## $ year         <int> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 200…
## $ number_pills <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, 5…
## $ population   <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, 2…

Let’s also view the first set of rows. The head() function will display the first set of rows (we specify to see 18 rows in this case by including n = 18). If you don’t specify a number of rows, then the default number of rows (6) will be printed.

opioid_counties %>% 
  head(n = 18)
fips county state year number_pills population
45001 ABBEVILLE SC 2006 363620 25821
45001 ABBEVILLE SC 2007 402940 25745
45001 ABBEVILLE SC 2008 424590 25699
45001 ABBEVILLE SC 2009 467230 25347
45001 ABBEVILLE SC 2010 539280 25643
45001 ABBEVILLE SC 2011 566560 25515
45001 ABBEVILLE SC 2012 589010 25387
45001 ABBEVILLE SC 2013 596420 25233
45001 ABBEVILLE SC 2014 641350 25100
22001 ACADIA LA 2006 1969720 60522
22001 ACADIA LA 2007 2207310 60762
22001 ACADIA LA 2008 2246400 61115
22001 ACADIA LA 2009 2537040 59616
22001 ACADIA LA 2010 2679687 61139
22001 ACADIA LA 2011 2806270 61430
22001 ACADIA LA 2012 2640841 61611
22001 ACADIA LA 2013 2569300 61847
22001 ACADIA LA 2014 2541140 62031

The filter() function

The filter() dplyr verb is used to choose specific rows of data that meet certain conditions. R provides the standard set of selection operators:

  • >

  • >=

  • <

  • <=

  • != (not equal to)

  • == (equal to)

Boolean operators are also permitted:

  • & is “and”

  • | is “or”

  • ! is “not”

A single filter

First, let’s use filter() to create a new dataframe called opioid_counties2014, that includes only data from the year 2014.

opioid_counties2014 <- opioid_counties %>% 
  filter(year == 2014)

opioid_counties2014 %>% 
  head()
fips county state year number_pills population
45001 ABBEVILLE SC 2014 641350 25100
22001 ACADIA LA 2014 2541140 62031
51001 ACCOMACK VA 2014 796930 33165
16001 ADA ID 2014 18427220 409239
19001 ADAIR IA 2014 195120 7543
21001 ADAIR KY 2014 1087490 19032

Let’s consider some other options that might be of interest. You can replace filter(year == 2014) in the code above with the following code to accomplish different results:

  • filter(year == 2013 | year == 2014) to create a dataframe of years where year equals 2013 or 2014

  • filter(year != 2006) to create a dataframe that excludes year 2006 but keeps the others

  • filter(year < 2010) to create a dataframe that keeps all years less than 2010

  • filter(year >=2010 & year <=2013) to create a dataframe of years 2010, 2011, 2012, and 2013

  • filter(year %in% c(2010, 2012, 2014)) to create a dataframe of years 2010, 2012, and 2014

Notice that when requesting to filter, we must use == and not =.

Multiple filters

We can filter on more than one variable. Next, let’s create a new dataframe called opioid_counties2014_CO, that includes only counties in CO and only for the year 2014.

opioid_counties2014_CO <- opioid_counties %>% 
  filter(state == "CO" & year == 2014)

opioid_counties2014_CO %>% 
  head()
fips county state year number_pills population
08001 ADAMS CO 2014 12865600 461558
08003 ALAMOSA CO 2014 1292460 16111
08005 ARAPAHOE CO 2014 16246185 596684
08007 ARCHULETA CO 2014 308700 12132
08009 BACA CO 2014 101900 3735
08011 BENT CO 2014 171100 6000

We could make this even more specific by choosing counties in CO for years 2010 to 2014 by swapping the filter() statement above with the following:

  • filter(state == “CO” & (year >=2010 & year <= 2014))

Note that when referring to a character variable, like state, you must put the value of interest in quotation marks.

Tip: Always take a look at the resulting dataframe to make sure your code performed the right task!


The select() function

The select() dplyr verb is used to select only certain columns (i.e., variables) in your dataframe. Let’s choose just state, county, and number_pills.

opioid_counties_subset <- opioid_counties %>% 
  select(state, county, number_pills)

opioid_counties_subset %>% 
  head()
state county number_pills
SC ABBEVILLE 363620
SC ABBEVILLE 402940
SC ABBEVILLE 424590
SC ABBEVILLE 467230
SC ABBEVILLE 539280
SC ABBEVILLE 566560

You can also use select() to reorder variables. For example, if you simply wanted to move state, county, and number_pills to the front of your dataframe, but still keep the other variables, you could modify the select() function above as follows:

select(state, county, number_pills, everything())

The everything() function puts all the other variables after the three specifically mentioned.


The summarize function

The summarize() dplyr verb is used to summarize your data – for example, to create summary statistics.

Compute summary statistics

Let’s compute the mean and standard deviation of pills, ignoring year. In the code below, I use na.rm = TRUE to indicate that I want R to remove cases (i.e., rows of data) with missing data on the variable of interest (number_pills) and compute the statistic on the remainder of cases.

opioid_counties %>% 
  summarize(mean_number_pills = mean(number_pills, na.rm = TRUE), 
            sd_number_pills = sd(number_pills, na.rm = TRUE))
mean_number_pills sd_number_pills
3539313 9069074

Pair it up with the group_by() function

The group_by() dplyr verb is used to group your data by a salient variable. Once grouped, functions can be performed by group. Let’s compute summary statistics by a grouping variable. In this next example we’ll group by year and then compute the mean for number of pills for each year.

opioid_counties %>% 
  group_by(year) %>% 
  summarize(mean_number_pills = mean(number_pills, na.rm = TRUE)) 
year mean_number_pills
2006 2645410
2007 2992099
2008 3252162
2009 3526038
2010 3783656
2011 4035583
2012 3993064
2013 3861752
2014 3768738

We can save the object produced by our code. For example, let’s create a new dataframe that calculates the total number of pills sold each year. To create a dataframe that includes the sum of pills purchased by year, we simply need to provide the output with a name. Here, we call the new dataframe sum_pills_year.

sum_pills_year <- opioid_counties %>% 
  group_by(year) %>% 
  summarize(total_pills = sum(number_pills, na.rm = TRUE))

sum_pills_year
year total_pills
2006 8200770093
2007 9236611007
2008 10010153796
2009 10849619018
2010 11634741390
2011 12441702286
2012 12330581308
2013 11913504244
2014 11626556851

This is a very useful function of summarize(). For example, once the summary statistics by year are produced, we can create a line graph of the result to visualize the change in pills purchased across time.

Create a line graph of the result

sum_pills_year %>% 
  ggplot(aes(x = year)) +
  geom_line(aes(y = total_pills)) +
  theme_bw() +
  labs(title = "Change in total number of opioid pills purchased in the US from 2006 to 2014?",
       y = "Number of opioid pills puchased",
       x = "Year") 


This graph shows a precipitous increase in pills flooding into the US from 2006 to 2011.


The mutate() function

The mutate() dplyr verb is used to create new variables.

Create a new variable that is a transformation of another

Let’s create a new variable in our dataframe to calculate the number of pills per capita, for each county and year.

opioid_counties <- opioid_counties %>% 
  mutate(pills_person = number_pills/population) 

opioid_counties %>% 
  head()
fips county state year number_pills population pills_person
45001 ABBEVILLE SC 2006 363620 25821 14.08234
45001 ABBEVILLE SC 2007 402940 25745 15.65119
45001 ABBEVILLE SC 2008 424590 25699 16.52165
45001 ABBEVILLE SC 2009 467230 25347 18.43335
45001 ABBEVILLE SC 2010 539280 25643 21.03030
45001 ABBEVILLE SC 2011 566560 25515 22.20498

The rename() function

The rename() dplyr verb is used to rename a variable. Let’s imagine that we want to change the name of number_pills to opioid_pills.

opioid_counties_rename <- opioid_counties %>% 
  rename(opioid_pills = number_pills)

opioid_counties_rename %>% 
  head()
fips county state year opioid_pills population pills_person
45001 ABBEVILLE SC 2006 363620 25821 14.08234
45001 ABBEVILLE SC 2007 402940 25745 15.65119
45001 ABBEVILLE SC 2008 424590 25699 16.52165
45001 ABBEVILLE SC 2009 467230 25347 18.43335
45001 ABBEVILLE SC 2010 539280 25643 21.03030
45001 ABBEVILLE SC 2011 566560 25515 22.20498

The arrange() function

The arrange() dplyr verb will sort a dataframe by a variable or set of variables. Let’s filter to choose only CO counties in 2010 and sort by number of pills purchased per person.

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
  arrange(pills_person)
fips county state year number_pills population pills_person
08109 SAGUACHE CO 2010 600 6161 0.0973868
08079 MINERAL CO 2010 100 1020 0.0980392
08027 CUSTER CO 2010 500 3899 0.1282380
08053 HINSDALE CO 2010 1300 489 2.6584867
08047 GILPIN CO 2010 42100 5126 8.2130316
08093 PARK CO 2010 134120 16286 8.2352941
08113 SAN MIGUEL CO 2010 86100 7299 11.7961365
08097 PITKIN CO 2010 214960 16389 13.1161145
08095 PHILLIPS CO 2010 60000 4394 13.6549841
08039 ELBERT CO 2010 336930 22712 14.8348890
08063 KIT CARSON CO 2010 146400 8156 17.9499755
08007 ARCHULETA CO 2010 223520 12136 18.4179301
08103 RIO BLANCO CO 2010 132300 6494 20.3726517
08031 DENVER CO 2010 11962630 578087 20.6934769
08009 BACA CO 2010 81800 3833 21.3409862
08051 GUNNISON CO 2010 327450 15136 21.6338531
08035 DOUGLAS CO 2010 6024770 273440 22.0332431
08125 YUMA CO 2010 222800 9896 22.5141471
08049 GRAND CO 2010 355400 14526 24.4664739
08091 OURAY CO 2010 109160 4319 25.2743691
08067 LA PLATA CO 2010 1272240 50149 25.3691998
08117 SUMMIT CO 2010 726095 27105 26.7882310
08005 ARAPAHOE CO 2010 15248752 552860 27.5815794
08013 BOULDER CO 2010 8049114 290177 27.7386354
08011 BENT CO 2010 170160 6125 27.7812245
08017 CHEYENNE CO 2010 62400 2194 28.4412033
08037 EAGLE CO 2010 1453880 50793 28.6236292
08001 ADAMS CO 2010 12211730 425330 28.7111890
08019 CLEAR CREEK CO 2010 267800 9088 29.4674296
08105 RIO GRANDE CO 2010 355440 11926 29.8037900
08107 ROUTT CO 2010 685151 22924 29.8879340
08065 LAKE CO 2010 216200 7039 30.7145901
08021 CONEJOS CO 2010 254100 8220 30.9124088
08075 LOGAN CO 2010 710600 22278 31.8969387
08041 EL PASO CO 2010 19247950 599988 32.0805583
08123 WELD CO 2010 7820510 242860 32.2017212
08087 MORGAN CO 2010 927730 27911 33.2388664
08069 LARIMER CO 2010 9728210 291162 33.4116746
08025 CROWLEY CO 2010 208900 5897 35.4247923
08045 GARFIELD CO 2010 1960088 54761 35.7935027
08085 MONTROSE CO 2010 1443710 40266 35.8543188
08115 SEDGWICK CO 2010 89930 2412 37.2844113
08077 MESA CO 2010 5420700 142284 38.0977482
08059 JEFFERSON CO 2010 21151398 528614 40.0129357
08061 KIOWA CO 2010 66000 1643 40.1704200
08083 MONTEZUMA CO 2010 1086800 25279 42.9922070
08043 FREMONT CO 2010 2040300 46941 43.4652010
08014 BROOMFIELD CO 2010 2323780 52872 43.9510516
08029 DELTA CO 2010 1404500 30533 45.9994105
08099 PROWERS CO 2010 601600 12734 47.2435998
08015 CHAFFEE CO 2010 843570 17540 48.0940707
08081 MOFFAT CO 2010 663000 13519 49.0420889
08071 LAS ANIMAS CO 2010 785200 15675 50.0925040
08119 TELLER CO 2010 1148040 22821 50.3062968
08073 LINCOLN CO 2010 276300 5476 50.4565376
08055 HUERFANO CO 2010 388500 6948 55.9153713
08101 PUEBLO CO 2010 11658140 156244 74.6149612
08089 OTERO CO 2010 1446400 18830 76.8135953
08003 ALAMOSA CO 2010 1381550 15293 90.3387171

If instead you wanted to sort from high to low, then use the desc() function:

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
  arrange(desc(pills_person))
fips county state year number_pills population pills_person
08003 ALAMOSA CO 2010 1381550 15293 90.3387171
08089 OTERO CO 2010 1446400 18830 76.8135953
08101 PUEBLO CO 2010 11658140 156244 74.6149612
08055 HUERFANO CO 2010 388500 6948 55.9153713
08073 LINCOLN CO 2010 276300 5476 50.4565376
08119 TELLER CO 2010 1148040 22821 50.3062968
08071 LAS ANIMAS CO 2010 785200 15675 50.0925040
08081 MOFFAT CO 2010 663000 13519 49.0420889
08015 CHAFFEE CO 2010 843570 17540 48.0940707
08099 PROWERS CO 2010 601600 12734 47.2435998
08029 DELTA CO 2010 1404500 30533 45.9994105
08014 BROOMFIELD CO 2010 2323780 52872 43.9510516
08043 FREMONT CO 2010 2040300 46941 43.4652010
08083 MONTEZUMA CO 2010 1086800 25279 42.9922070
08061 KIOWA CO 2010 66000 1643 40.1704200
08059 JEFFERSON CO 2010 21151398 528614 40.0129357
08077 MESA CO 2010 5420700 142284 38.0977482
08115 SEDGWICK CO 2010 89930 2412 37.2844113
08085 MONTROSE CO 2010 1443710 40266 35.8543188
08045 GARFIELD CO 2010 1960088 54761 35.7935027
08025 CROWLEY CO 2010 208900 5897 35.4247923
08069 LARIMER CO 2010 9728210 291162 33.4116746
08087 MORGAN CO 2010 927730 27911 33.2388664
08123 WELD CO 2010 7820510 242860 32.2017212
08041 EL PASO CO 2010 19247950 599988 32.0805583
08075 LOGAN CO 2010 710600 22278 31.8969387
08021 CONEJOS CO 2010 254100 8220 30.9124088
08065 LAKE CO 2010 216200 7039 30.7145901
08107 ROUTT CO 2010 685151 22924 29.8879340
08105 RIO GRANDE CO 2010 355440 11926 29.8037900
08019 CLEAR CREEK CO 2010 267800 9088 29.4674296
08001 ADAMS CO 2010 12211730 425330 28.7111890
08037 EAGLE CO 2010 1453880 50793 28.6236292
08017 CHEYENNE CO 2010 62400 2194 28.4412033
08011 BENT CO 2010 170160 6125 27.7812245
08013 BOULDER CO 2010 8049114 290177 27.7386354
08005 ARAPAHOE CO 2010 15248752 552860 27.5815794
08117 SUMMIT CO 2010 726095 27105 26.7882310
08067 LA PLATA CO 2010 1272240 50149 25.3691998
08091 OURAY CO 2010 109160 4319 25.2743691
08049 GRAND CO 2010 355400 14526 24.4664739
08125 YUMA CO 2010 222800 9896 22.5141471
08035 DOUGLAS CO 2010 6024770 273440 22.0332431
08051 GUNNISON CO 2010 327450 15136 21.6338531
08009 BACA CO 2010 81800 3833 21.3409862
08031 DENVER CO 2010 11962630 578087 20.6934769
08103 RIO BLANCO CO 2010 132300 6494 20.3726517
08007 ARCHULETA CO 2010 223520 12136 18.4179301
08063 KIT CARSON CO 2010 146400 8156 17.9499755
08039 ELBERT CO 2010 336930 22712 14.8348890
08095 PHILLIPS CO 2010 60000 4394 13.6549841
08097 PITKIN CO 2010 214960 16389 13.1161145
08113 SAN MIGUEL CO 2010 86100 7299 11.7961365
08093 PARK CO 2010 134120 16286 8.2352941
08047 GILPIN CO 2010 42100 5126 8.2130316
08053 HINSDALE CO 2010 1300 489 2.6584867
08027 CUSTER CO 2010 500 3899 0.1282380
08079 MINERAL CO 2010 100 1020 0.0980392
08109 SAGUACHE CO 2010 600 6161 0.0973868

We can also create a chart of the same data – a bar chart with county on the y-axis and pill_person on the x-axis will do the trick. If we’d like to order the counties by the pills per capita, then we need to use the fct_reorder function (see code below). Rather than y = county, we need y = fct_reorder(county, pills_person). fct_reorder in this instance needs two arguments, the variable that you want to put on the y-axis (county), and the ordering that you desire (we want to order by pills_person).

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
ggplot(aes(x = pills_person, y = fct_reorder(county, pills_person))) +
  geom_col(fill = "#E24E42") +
  labs(title = "Number of pills purchased per person in 2010 by county in CO",
       x = "Number of pills per person",
       y = "County",
       caption = "Source: The Washington Post, ARCOS") +
  theme_minimal()


The join functions

The join dplyr verbs are used to join/merge two dataframes together. Let’s read in a related dataframe, called adi_counties.Rds, that provides the Area Disadvantage Index (ADI) for each county. The ADI is based on a measure created by the Health Resources & Services Administration (HRSA). It is based on US Census records, and serves as a measure of concentrated socio-economic disadvantage in the county. In the dataframe, there is one row of data per county in the US, and then a series of variables that describe the county. In this example, we’ll use the following variables:

  • fips is the county ID.

  • ADI is the area disadvantage index – a higher score means more disadvantage

  • WhiteNonHispanicPct2010 is the percent of county residents who identify as non-Hispanic White.

For this example, we will dichotomize WhiteNonHispanicPct2010, naming it mincounty. This binary indicator will be used to compare counties where fewer than 50% of the residents identify as non-Hispanic White to all others. We’ll do this using the case_when() function in dplyr. Each case_when() statement has two parts – the part before the tilde (~) and the part after the tilde. The part before the tilde is called the left hand side (LHS) and the part after the tilde is called the right hand side (RHS). case_when() uses the LHS part of the statement as the condition to test for and the RHS as the value to return. case_when() works in sequence through the conditions, finding when the LHS is true for each case (i.e., row of data), and then returning the value associated with that condition.

In the example below, mincounty is a new variable that is created using case_when(). Each row in the dataframe will be evaluated, if WhiteNonHispanicPct2010 < 50, then “minority county” will be recorded for mincounty. The second case_when() statement is only evaluated if the first case_when() statement is false. In the second case_when() statement, if WhiteNonHispanicPct2010 is greater than or equal to 50, then “white county” will be recorded for mincounty. Any rows that don’t match any of the conditions will be assigned NA (NA is R’s missing data indicator) for mincounty.

adi <- here("data", "adi_counties.Rds") %>% 
  read_rds() %>% 
  mutate(mincounty = case_when(WhiteNonHispanicPct2010 < 50 ~ "minority county", 
                               WhiteNonHispanicPct2010 >= 50 ~ "white county")) %>%
  select(fips, ADI, WhiteNonHispanicPct2010, mincounty)

adi %>% 
  head()
fips ADI WhiteNonHispanicPct2010 mincounty
01001 87.89380 77.24616 white county
01003 85.95352 83.50479 white county
01005 122.40521 46.75310 minority county
01007 105.98061 75.02073 white county
01009 101.91368 88.88734 white county
01011 130.58507 21.91680 minority county

Now, let’s merge the adi dataframe with the 2010 opioid data, since fips is in both dataframes, and it uniquely identifies each county, we can join the dataframes together using fips as the key.

opioid_counties2010 <-
  opioid_counties %>% 
  filter(year == 2010) %>% 
  left_join(adi, by = "fips")

opioid_counties2010 %>% 
  head()
fips county state year number_pills population pills_person ADI WhiteNonHispanicPct2010 mincounty
45001 ABBEVILLE SC 2010 539280 25643 21.03030 115.04469 69.10336 white county
22001 ACADIA LA 2010 2679687 61139 43.82942 117.66953 78.60230 white county
51001 ACCOMACK VA 2010 800550 34066 23.49997 105.90245 61.10843 white county
16001 ADA ID 2010 16145256 380718 42.40739 76.85215 86.48376 white county
19001 ADAIR IA 2010 222100 7779 28.55123 88.76101 97.72195 white county
21001 ADAIR KY 2010 1033160 18542 55.71999 122.46864 94.11449 white county

Advanced tip: There are many different types of joins available, and the best choice depends on the qualities of the dataframes to be joined and the analyst’s needs. You can explore the types here.

With these data compiled, let’s take the opportunity to determine if counties with more socio-economic disadvantage had more pills coming into their communities. This is an important public health question. The socio-economic vitality of a community plays an important role in promoting health, well-being, and prosperity for residents. On the flip side, communities characterized by socio-economic strain pose threats to the health, well-being, and prosperity of residents. As such, public health scientists recognize socio-economic advantage vs. disadvantage, of both an individual and his/her community, as a key social determinant of health. HealthyPeople.gov defines social determinants of health and describes their importance:

Social determinants of health are conditions in the environments in which people are born, live, learn, work, play, worship, and age that affect a wide range of health, functioning, and quality-of-life outcomes and risks. Conditions (e.g., social, economic, and physical) in these various environments and settings (e.g., school, church, workplace, and neighborhood) have been referred to as “place.” In addition to the more material attributes of “place,” the patterns of social engagement and sense of security and well-being are also affected by where people live. Resources that enhance quality of life can have a significant influence on population health outcomes. Examples of these resources include safe and affordable housing, access to education, public safety, availability of healthy foods, local emergency/health services, and environments free of life-threatening toxins. Understanding the relationship between how population groups experience “place” and the impact of “place” on health is fundamental to the social determinants of health—including both social and physical determinants.

You probably know from the news cycle over the past decade that the opioid epidemic hit poor White communities with few resources especially hard. If you’d like to read more about the opioid epidemic and the role of neighborhood disadvantage, particularly in White communities, please read this interesting report by the Brookings Institute entitled “How can Policy Address the Opioid Crisis and Despair in America”.

Let’s use our data to take a look at the relationship between ADI and number of pills per person purchased. Given our knowledge that the opioid pandemic was especially wide spread in predominantly non-Hispanic White communities, we will facet our scatterplot by the mincounty indicator created earlier.

Create a scatter plot

opioid_counties2010 %>% 
  select(ADI, pills_person, mincounty) %>% 
  drop_na() %>% 
  ggplot(mapping = aes(x = ADI, y = pills_person)) +
  geom_jitter(color = "dark gray") + 
  geom_smooth(method = "loess", se = FALSE, formula = y~x) +
  facet_wrap(~mincounty) +
  theme_bw() +
  labs(title = "Is county-level socio-economic disadvantage associated with more opioid pills per capita?",
       x = "Area Disadvantage Index",
       y = "Number of pills purchased per person")

Among predominantly White counties, we see evidence that greater socio-economic disadvantage was associated with more pills per person. This trend is not apparent in predominantly minority communities, and provides some interesting information about the correlation of socio-economic hardship and substance abuse across different types of communities.


Importing and exporting data in R

Before we close out this Module, let’s practice importing (reading in) and exporting (writing out) data files.

Import data

Import R dataframe

So far in this course, we have imported R dataframes.

adi <- here("data", "adi_counties.Rds") %>% 
  read_rds()

But, you can import many different types of dataframes to use in R. Commonly, data is entered in excel and saved as a comma delimited file (.csv file extension). There is an example of this in our data folder – the file called adi_counties.csv is of this type. We can read it using the read_csv() function, which is part of the tidyverse.

Import an csv file

adi <- here("data", "adi_counties.csv") %>% 
  read_csv()

Import other types of files

The haven package, which is part of the tidyverse, allows for the import of many different types of dataframes – like SPSS, SAS, and much more. If you have a file of this type, you can get instructions here.

Export data

Sometimes, we may want to save a new version of our dataframe. For example, let’s imagine that we want to save the opioid_counties2010 dataframe that we used in the last section. We can do that with the following code that makes use of the saveRDS() function (to save as a R data file) or the write_csv() function (to save as a csv file). Notice the use of the here() function so the data is directed to be saved in our data folder. You should never overwrite the original data file though – if you make changes to the source (i.e., original) data file – it’s important to save it with a different name so that the original file remains intact.

Export a Rds file

opioid_counties2010 %>% 
  saveRDS(here("data", "opioid_counties2010.Rds"))

Export a csv file

opioid_counties2010 %>% 
  write_csv(here("data", "opioid_counties2010.csv"))

Export other types of files

Using the haven package mentioned above, you can write out a dataframe to any other type. For example, to save your R datafile as a SPSS data file.