PSY350, Module 3

Data Wrangling

Module road map

Learning objectives

Identify the dplyr verbs for wrangling data in the tidyverse
Apply each dplyr verb to accomplish a data wrangling task
Practice preparing data for analysis

Tools for mastery

Read Chapter 3 of your Textbook
Read Chapter 4 of your Textbook

Overview

$Artwork by @allison_horst$

Artwork by @allison_horst

To wrangle data is to clean and prepare data for analysis. Our goal is to get the data into a tidy shape that can be readily plotted and analyzed. I like to think about data analysis in terms of painting a room. In painting, an excellent result comes from careful and thorough preparation. About 90% of painting is preparation – cleaning the walls, taping off trim, protecting the floor, getting the right tools, etc. Data analysis is similar – and data wrangling is the preparation. For most projects, most of your time will be spent preparing the data for analysis. If you skip this step or do a shoddy job – your data analysis will be poor.

The dplyr functions to wrangle data

dplyr is a R package that is part of the tidyverse — it’s the workhorse for wrangling your data.

In this Module, we will explore the following wrangling functions:

Choose certain cases (rows) in your dataframe with filter()
Choose certain variables (columns) in your dataframe with select()
Collapse many values down to summaries with summarize()
Create or modify variables with mutate()
Give variables a new name with rename()
Order your cases (rows) with arrange()
Merge two dataframes together with join()

Introduction to the data

In this Module we’re going to work with a very large data repository compiled by The Washington Post (WAPO) that tracked the path of every opioid pain pill, from manufacturer to pharmacy, in the United States between 2006 and 2014. The data scientists at WAPO created a package (called arcos) that makes the data publicly available.

The opioid epidemic in America has been devastating for individuals and communities. To provide context to the data exploration in this Module, please watch the following video from Last Week Tonight with John Oliver. Please note that the video contains some lewd language, you can skip the video without it affecting your grade if you like. You can also peruse this National Public Radio article and this Nature paper that also provide context.

We will work with the county level data provided by WAPO in this Module. I already downloaded it from the arcos package and saved it as a dataframe in our RStudio Cloud project. Let’s import the the dataframe.

opioid_counties <- here("data", "opioid_counties.Rds") %>% 
  read_rds()

Each county in the US has up to nine rows of data in this dataframe, one for each year from 2006 to 2014. The dataframe includes the following variables:

fips is a 5 digit code that identifies each county
county
state is the state abbreviation
year
number_pills is the number of pills purchased – specifically the number of Oxycodone & Hydrocodone pills purchased by pharmacies in the county
population is the total number of people living in the county

Let’s take a glimpse at the data.

opioid_counties %>% 
  glimpse()

## Rows: 27,758
## Columns: 6
## $ fips         <chr> "45001", "45001", "45001", "45001", "45001", "45001", "45…
## $ county       <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBE…
## $ state        <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "LA…
## $ year         <int> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 200…
## $ number_pills <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, 5…
## $ population   <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, 2…

Let’s also view the first set of rows. The head() function will display the first set of rows (we specify to see 18 rows in this case by including n = 18). If you don’t specify a number of rows, then the default number of rows (6) will be printed.

opioid_counties %>% 
  head(n = 18)

fips	county	state	year	number_pills	population
45001	ABBEVILLE	SC	2006	363620	25821
45001	ABBEVILLE	SC	2007	402940	25745
45001	ABBEVILLE	SC	2008	424590	25699
45001	ABBEVILLE	SC	2009	467230	25347
45001	ABBEVILLE	SC	2010	539280	25643
45001	ABBEVILLE	SC	2011	566560	25515
45001	ABBEVILLE	SC	2012	589010	25387
45001	ABBEVILLE	SC	2013	596420	25233
45001	ABBEVILLE	SC	2014	641350	25100
22001	ACADIA	LA	2006	1969720	60522
22001	ACADIA	LA	2007	2207310	60762
22001	ACADIA	LA	2008	2246400	61115
22001	ACADIA	LA	2009	2537040	59616
22001	ACADIA	LA	2010	2679687	61139
22001	ACADIA	LA	2011	2806270	61430
22001	ACADIA	LA	2012	2640841	61611
22001	ACADIA	LA	2013	2569300	61847
22001	ACADIA	LA	2014	2541140	62031

The filter() function

The filter() dplyr verb is used to choose specific rows of data that meet certain conditions. R provides the standard set of selection operators:

>
>=
<
<=
!= (not equal to)
== (equal to)

Boolean operators are also permitted:

& is “and”
| is “or”
! is “not”

A single filter

First, let’s use filter() to create a new dataframe called opioid_counties2014, that includes only data from the year 2014.

opioid_counties2014 <- opioid_counties %>% 
  filter(year == 2014)

opioid_counties2014 %>% 
  head()

fips	county	state	year	number_pills	population
45001	ABBEVILLE	SC	2014	641350	25100
22001	ACADIA	LA	2014	2541140	62031
51001	ACCOMACK	VA	2014	796930	33165
16001	ADA	ID	2014	18427220	409239
19001	ADAIR	IA	2014	195120	7543
21001	ADAIR	KY	2014	1087490	19032

Let’s consider some other options that might be of interest. You can replace filter(year == 2014) in the code above with the following code to accomplish different results:

filter(year == 2013 | year == 2014) to create a dataframe of years where year equals 2013 or 2014
filter(year != 2006) to create a dataframe that excludes year 2006 but keeps the others
filter(year < 2010) to create a dataframe that keeps all years less than 2010
filter(year >=2010 & year <=2013) to create a dataframe of years 2010, 2011, 2012, and 2013
filter(year %in% c(2010, 2012, 2014)) to create a dataframe of years 2010, 2012, and 2014

Notice that when requesting to filter, we must use == and not =.

Multiple filters

We can filter on more than one variable. Next, let’s create a new dataframe called opioid_counties2014_CO, that includes only counties in CO and only for the year 2014.

opioid_counties2014_CO <- opioid_counties %>% 
  filter(state == "CO" & year == 2014)

opioid_counties2014_CO %>% 
  head()

fips	county	state	year	number_pills	population
08001	ADAMS	CO	2014	12865600	461558
08003	ALAMOSA	CO	2014	1292460	16111
08005	ARAPAHOE	CO	2014	16246185	596684
08007	ARCHULETA	CO	2014	308700	12132
08009	BACA	CO	2014	101900	3735
08011	BENT	CO	2014	171100	6000

We could make this even more specific by choosing counties in CO for years 2010 to 2014 by swapping the filter() statement above with the following:

filter(state == “CO” & (year >=2010 & year <= 2014))

Note that when referring to a character variable, like state, you must put the value of interest in quotation marks.

Tip: Always take a look at the resulting dataframe to make sure your code performed the right task!

The select() function

The select() dplyr verb is used to select only certain columns (i.e., variables) in your dataframe. Let’s choose just state, county, and number_pills.

opioid_counties_subset <- opioid_counties %>% 
  select(state, county, number_pills)

opioid_counties_subset %>% 
  head()

state	county	number_pills
SC	ABBEVILLE	363620
SC	ABBEVILLE	402940
SC	ABBEVILLE	424590
SC	ABBEVILLE	467230
SC	ABBEVILLE	539280
SC	ABBEVILLE	566560

You can also use select() to reorder variables. For example, if you simply wanted to move state, county, and number_pills to the front of your dataframe, but still keep the other variables, you could modify the select() function above as follows:

select(state, county, number_pills, everything())

The everything() function puts all the other variables after the three specifically mentioned.

The summarize function

The summarize() dplyr verb is used to summarize your data – for example, to create summary statistics.

Compute summary statistics

Let’s compute the mean and standard deviation of pills, ignoring year. In the code below, I use na.rm = TRUE to indicate that I want R to remove cases (i.e., rows of data) with missing data on the variable of interest (number_pills) and compute the statistic on the remainder of cases.

opioid_counties %>% 
  summarize(mean_number_pills = mean(number_pills, na.rm = TRUE), 
            sd_number_pills = sd(number_pills, na.rm = TRUE))

mean_number_pills	sd_number_pills
3539313	9069074

Pair it up with the group_by() function

The group_by() dplyr verb is used to group your data by a salient variable. Once grouped, functions can be performed by group. Let’s compute summary statistics by a grouping variable. In this next example we’ll group by year and then compute the mean for number of pills for each year.

opioid_counties %>% 
  group_by(year) %>% 
  summarize(mean_number_pills = mean(number_pills, na.rm = TRUE))

year	mean_number_pills
2006	2645410
2007	2992099
2008	3252162
2009	3526038
2010	3783656
2011	4035583
2012	3993064
2013	3861752
2014	3768738

We can save the object produced by our code. For example, let’s create a new dataframe that calculates the total number of pills sold each year. To create a dataframe that includes the sum of pills purchased by year, we simply need to provide the output with a name. Here, we call the new dataframe sum_pills_year.

sum_pills_year <- opioid_counties %>% 
  group_by(year) %>% 
  summarize(total_pills = sum(number_pills, na.rm = TRUE))

sum_pills_year

year	total_pills
2006	8200770093
2007	9236611007
2008	10010153796
2009	10849619018
2010	11634741390
2011	12441702286
2012	12330581308
2013	11913504244
2014	11626556851

This is a very useful function of summarize(). For example, once the summary statistics by year are produced, we can create a line graph of the result to visualize the change in pills purchased across time.

Create a line graph of the result

sum_pills_year %>% 
  ggplot(aes(x = year)) +
  geom_line(aes(y = total_pills)) +
  theme_bw() +
  labs(title = "Change in total number of opioid pills purchased in the US from 2006 to 2014?",
       y = "Number of opioid pills puchased",
       x = "Year")

This graph shows a precipitous increase in pills flooding into the US from 2006 to 2011.

The mutate() function

The mutate() dplyr verb is used to create new variables.

Create a new variable that is a transformation of another

Let’s create a new variable in our dataframe to calculate the number of pills per capita, for each county and year.

opioid_counties <- opioid_counties %>% 
  mutate(pills_person = number_pills/population) 

opioid_counties %>% 
  head()

fips	county	state	year	number_pills	population	pills_person
45001	ABBEVILLE	SC	2006	363620	25821	14.08234
45001	ABBEVILLE	SC	2007	402940	25745	15.65119
45001	ABBEVILLE	SC	2008	424590	25699	16.52165
45001	ABBEVILLE	SC	2009	467230	25347	18.43335
45001	ABBEVILLE	SC	2010	539280	25643	21.03030
45001	ABBEVILLE	SC	2011	566560	25515	22.20498

The rename() function

The rename() dplyr verb is used to rename a variable. Let’s imagine that we want to change the name of number_pills to opioid_pills.

opioid_counties_rename <- opioid_counties %>% 
  rename(opioid_pills = number_pills)

opioid_counties_rename %>% 
  head()

fips	county	state	year	opioid_pills	population	pills_person
45001	ABBEVILLE	SC	2006	363620	25821	14.08234
45001	ABBEVILLE	SC	2007	402940	25745	15.65119
45001	ABBEVILLE	SC	2008	424590	25699	16.52165
45001	ABBEVILLE	SC	2009	467230	25347	18.43335
45001	ABBEVILLE	SC	2010	539280	25643	21.03030
45001	ABBEVILLE	SC	2011	566560	25515	22.20498

The arrange() function

The arrange() dplyr verb will sort a dataframe by a variable or set of variables. Let’s filter to choose only CO counties in 2010 and sort by number of pills purchased per person.

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
  arrange(pills_person)

fips	county	state	year	number_pills	population	pills_person
08109	SAGUACHE	CO	2010	600	6161	0.0973868
08079	MINERAL	CO	2010	100	1020	0.0980392
08027	CUSTER	CO	2010	500	3899	0.1282380
08053	HINSDALE	CO	2010	1300	489	2.6584867
08047	GILPIN	CO	2010	42100	5126	8.2130316
08093	PARK	CO	2010	134120	16286	8.2352941
08113	SAN MIGUEL	CO	2010	86100	7299	11.7961365
08097	PITKIN	CO	2010	214960	16389	13.1161145
08095	PHILLIPS	CO	2010	60000	4394	13.6549841
08039	ELBERT	CO	2010	336930	22712	14.8348890
08063	KIT CARSON	CO	2010	146400	8156	17.9499755
08007	ARCHULETA	CO	2010	223520	12136	18.4179301
08103	RIO BLANCO	CO	2010	132300	6494	20.3726517
08031	DENVER	CO	2010	11962630	578087	20.6934769
08009	BACA	CO	2010	81800	3833	21.3409862
08051	GUNNISON	CO	2010	327450	15136	21.6338531
08035	DOUGLAS	CO	2010	6024770	273440	22.0332431
08125	YUMA	CO	2010	222800	9896	22.5141471
08049	GRAND	CO	2010	355400	14526	24.4664739
08091	OURAY	CO	2010	109160	4319	25.2743691
08067	LA PLATA	CO	2010	1272240	50149	25.3691998
08117	SUMMIT	CO	2010	726095	27105	26.7882310
08005	ARAPAHOE	CO	2010	15248752	552860	27.5815794
08013	BOULDER	CO	2010	8049114	290177	27.7386354
08011	BENT	CO	2010	170160	6125	27.7812245
08017	CHEYENNE	CO	2010	62400	2194	28.4412033
08037	EAGLE	CO	2010	1453880	50793	28.6236292
08001	ADAMS	CO	2010	12211730	425330	28.7111890
08019	CLEAR CREEK	CO	2010	267800	9088	29.4674296
08105	RIO GRANDE	CO	2010	355440	11926	29.8037900
08107	ROUTT	CO	2010	685151	22924	29.8879340
08065	LAKE	CO	2010	216200	7039	30.7145901
08021	CONEJOS	CO	2010	254100	8220	30.9124088
08075	LOGAN	CO	2010	710600	22278	31.8969387
08041	EL PASO	CO	2010	19247950	599988	32.0805583
08123	WELD	CO	2010	7820510	242860	32.2017212
08087	MORGAN	CO	2010	927730	27911	33.2388664
08069	LARIMER	CO	2010	9728210	291162	33.4116746
08025	CROWLEY	CO	2010	208900	5897	35.4247923
08045	GARFIELD	CO	2010	1960088	54761	35.7935027
08085	MONTROSE	CO	2010	1443710	40266	35.8543188
08115	SEDGWICK	CO	2010	89930	2412	37.2844113
08077	MESA	CO	2010	5420700	142284	38.0977482
08059	JEFFERSON	CO	2010	21151398	528614	40.0129357
08061	KIOWA	CO	2010	66000	1643	40.1704200
08083	MONTEZUMA	CO	2010	1086800	25279	42.9922070
08043	FREMONT	CO	2010	2040300	46941	43.4652010
08014	BROOMFIELD	CO	2010	2323780	52872	43.9510516
08029	DELTA	CO	2010	1404500	30533	45.9994105
08099	PROWERS	CO	2010	601600	12734	47.2435998
08015	CHAFFEE	CO	2010	843570	17540	48.0940707
08081	MOFFAT	CO	2010	663000	13519	49.0420889
08071	LAS ANIMAS	CO	2010	785200	15675	50.0925040
08119	TELLER	CO	2010	1148040	22821	50.3062968
08073	LINCOLN	CO	2010	276300	5476	50.4565376
08055	HUERFANO	CO	2010	388500	6948	55.9153713
08101	PUEBLO	CO	2010	11658140	156244	74.6149612
08089	OTERO	CO	2010	1446400	18830	76.8135953
08003	ALAMOSA	CO	2010	1381550	15293	90.3387171

If instead you wanted to sort from high to low, then use the desc() function:

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
  arrange(desc(pills_person))

fips	county	state	year	number_pills	population	pills_person
08003	ALAMOSA	CO	2010	1381550	15293	90.3387171
08089	OTERO	CO	2010	1446400	18830	76.8135953
08101	PUEBLO	CO	2010	11658140	156244	74.6149612
08055	HUERFANO	CO	2010	388500	6948	55.9153713
08073	LINCOLN	CO	2010	276300	5476	50.4565376
08119	TELLER	CO	2010	1148040	22821	50.3062968
08071	LAS ANIMAS	CO	2010	785200	15675	50.0925040
08081	MOFFAT	CO	2010	663000	13519	49.0420889
08015	CHAFFEE	CO	2010	843570	17540	48.0940707
08099	PROWERS	CO	2010	601600	12734	47.2435998
08029	DELTA	CO	2010	1404500	30533	45.9994105
08014	BROOMFIELD	CO	2010	2323780	52872	43.9510516
08043	FREMONT	CO	2010	2040300	46941	43.4652010
08083	MONTEZUMA	CO	2010	1086800	25279	42.9922070
08061	KIOWA	CO	2010	66000	1643	40.1704200
08059	JEFFERSON	CO	2010	21151398	528614	40.0129357
08077	MESA	CO	2010	5420700	142284	38.0977482
08115	SEDGWICK	CO	2010	89930	2412	37.2844113
08085	MONTROSE	CO	2010	1443710	40266	35.8543188
08045	GARFIELD	CO	2010	1960088	54761	35.7935027
08025	CROWLEY	CO	2010	208900	5897	35.4247923
08069	LARIMER	CO	2010	9728210	291162	33.4116746
08087	MORGAN	CO	2010	927730	27911	33.2388664
08123	WELD	CO	2010	7820510	242860	32.2017212
08041	EL PASO	CO	2010	19247950	599988	32.0805583
08075	LOGAN	CO	2010	710600	22278	31.8969387
08021	CONEJOS	CO	2010	254100	8220	30.9124088
08065	LAKE	CO	2010	216200	7039	30.7145901
08107	ROUTT	CO	2010	685151	22924	29.8879340
08105	RIO GRANDE	CO	2010	355440	11926	29.8037900
08019	CLEAR CREEK	CO	2010	267800	9088	29.4674296
08001	ADAMS	CO	2010	12211730	425330	28.7111890
08037	EAGLE	CO	2010	1453880	50793	28.6236292
08017	CHEYENNE	CO	2010	62400	2194	28.4412033
08011	BENT	CO	2010	170160	6125	27.7812245
08013	BOULDER	CO	2010	8049114	290177	27.7386354
08005	ARAPAHOE	CO	2010	15248752	552860	27.5815794
08117	SUMMIT	CO	2010	726095	27105	26.7882310
08067	LA PLATA	CO	2010	1272240	50149	25.3691998
08091	OURAY	CO	2010	109160	4319	25.2743691
08049	GRAND	CO	2010	355400	14526	24.4664739
08125	YUMA	CO	2010	222800	9896	22.5141471
08035	DOUGLAS	CO	2010	6024770	273440	22.0332431
08051	GUNNISON	CO	2010	327450	15136	21.6338531
08009	BACA	CO	2010	81800	3833	21.3409862
08031	DENVER	CO	2010	11962630	578087	20.6934769
08103	RIO BLANCO	CO	2010	132300	6494	20.3726517
08007	ARCHULETA	CO	2010	223520	12136	18.4179301
08063	KIT CARSON	CO	2010	146400	8156	17.9499755
08039	ELBERT	CO	2010	336930	22712	14.8348890
08095	PHILLIPS	CO	2010	60000	4394	13.6549841
08097	PITKIN	CO	2010	214960	16389	13.1161145
08113	SAN MIGUEL	CO	2010	86100	7299	11.7961365
08093	PARK	CO	2010	134120	16286	8.2352941
08047	GILPIN	CO	2010	42100	5126	8.2130316
08053	HINSDALE	CO	2010	1300	489	2.6584867
08027	CUSTER	CO	2010	500	3899	0.1282380
08079	MINERAL	CO	2010	100	1020	0.0980392
08109	SAGUACHE	CO	2010	600	6161	0.0973868

We can also create a chart of the same data – a bar chart with county on the y-axis and pill_person on the x-axis will do the trick. If we’d like to order the counties by the pills per capita, then we need to use the fct_reorder function (see code below). Rather than y = county, we need y = fct_reorder(county, pills_person). fct_reorder in this instance needs two arguments, the variable that you want to put on the y-axis (county), and the ordering that you desire (we want to order by pills_person).

opioid_counties %>% 
  filter(year == 2010 & state == "CO") %>% 
ggplot(aes(x = pills_person, y = fct_reorder(county, pills_person))) +
  geom_col(fill = "#E24E42") +
  labs(title = "Number of pills purchased per person in 2010 by county in CO",
       x = "Number of pills per person",
       y = "County",
       caption = "Source: The Washington Post, ARCOS") +
  theme_minimal()

The join functions

The join dplyr verbs are used to join/merge two dataframes together. Let’s read in a related dataframe, called adi_counties.Rds, that provides the Area Disadvantage Index (ADI) for each county. The ADI is based on a measure created by the Health Resources & Services Administration (HRSA). It is based on US Census records, and serves as a measure of concentrated socio-economic disadvantage in the county. In the dataframe, there is one row of data per county in the US, and then a series of variables that describe the county. In this example, we’ll use the following variables:

fips is the county ID.
ADI is the area disadvantage index – a higher score means more disadvantage
WhiteNonHispanicPct2010 is the percent of county residents who identify as non-Hispanic White.

For this example, we will dichotomize WhiteNonHispanicPct2010, naming it mincounty. This binary indicator will be used to compare counties where fewer than 50% of the residents identify as non-Hispanic White to all others. We’ll do this using the case_when() function in dplyr. Each case_when() statement has two parts – the part before the tilde (~) and the part after the tilde. The part before the tilde is called the left hand side (LHS) and the part after the tilde is called the right hand side (RHS). case_when() uses the LHS part of the statement as the condition to test for and the RHS as the value to return. case_when() works in sequence through the conditions, finding when the LHS is true for each case (i.e., row of data), and then returning the value associated with that condition.

In the example below, mincounty is a new variable that is created using case_when(). Each row in the dataframe will be evaluated, if WhiteNonHispanicPct2010 < 50, then “minority county” will be recorded for mincounty. The second case_when() statement is only evaluated if the first case_when() statement is false. In the second case_when() statement, if WhiteNonHispanicPct2010 is greater than or equal to 50, then “white county” will be recorded for mincounty. Any rows that don’t match any of the conditions will be assigned NA (NA is R’s missing data indicator) for mincounty.

adi <- here("data", "adi_counties.Rds") %>% 
  read_rds() %>% 
  mutate(mincounty = case_when(WhiteNonHispanicPct2010 < 50 ~ "minority county", 
                               WhiteNonHispanicPct2010 >= 50 ~ "white county")) %>%
  select(fips, ADI, WhiteNonHispanicPct2010, mincounty)

adi %>% 
  head()

fips	ADI	WhiteNonHispanicPct2010	mincounty
01001	87.89380	77.24616	white county
01003	85.95352	83.50479	white county
01005	122.40521	46.75310	minority county
01007	105.98061	75.02073	white county
01009	101.91368	88.88734	white county
01011	130.58507	21.91680	minority county

Now, let’s merge the adi dataframe with the 2010 opioid data, since fips is in both dataframes, and it uniquely identifies each county, we can join the dataframes together using fips as the key.

opioid_counties2010 <-
  opioid_counties %>% 
  filter(year == 2010) %>% 
  left_join(adi, by = "fips")

opioid_counties2010 %>% 
  head()

fips	county	state	year	number_pills	population	pills_person	ADI	WhiteNonHispanicPct2010	mincounty
45001	ABBEVILLE	SC	2010	539280	25643	21.03030	115.04469	69.10336	white county
22001	ACADIA	LA	2010	2679687	61139	43.82942	117.66953	78.60230	white county
51001	ACCOMACK	VA	2010	800550	34066	23.49997	105.90245	61.10843	white county
16001	ADA	ID	2010	16145256	380718	42.40739	76.85215	86.48376	white county
19001	ADAIR	IA	2010	222100	7779	28.55123	88.76101	97.72195	white county
21001	ADAIR	KY	2010	1033160	18542	55.71999	122.46864	94.11449	white county

Advanced tip: There are many different types of joins available, and the best choice depends on the qualities of the dataframes to be joined and the analyst’s needs. You can explore the types here.

With these data compiled, let’s take the opportunity to determine if counties with more socio-economic disadvantage had more pills coming into their communities. This is an important public health question. The socio-economic vitality of a community plays an important role in promoting health, well-being, and prosperity for residents. On the flip side, communities characterized by socio-economic strain pose threats to the health, well-being, and prosperity of residents. As such, public health scientists recognize socio-economic advantage vs. disadvantage, of both an individual and his/her community, as a key social determinant of health. HealthyPeople.gov defines social determinants of health and describes their importance:

Social determinants of health are conditions in the environments in which people are born, live, learn, work, play, worship, and age that affect a wide range of health, functioning, and quality-of-life outcomes and risks. Conditions (e.g., social, economic, and physical) in these various environments and settings (e.g., school, church, workplace, and neighborhood) have been referred to as “place.” In addition to the more material attributes of “place,” the patterns of social engagement and sense of security and well-being are also affected by where people live. Resources that enhance quality of life can have a significant influence on population health outcomes. Examples of these resources include safe and affordable housing, access to education, public safety, availability of healthy foods, local emergency/health services, and environments free of life-threatening toxins. Understanding the relationship between how population groups experience “place” and the impact of “place” on health is fundamental to the social determinants of health—including both social and physical determinants.

You probably know from the news cycle over the past decade that the opioid epidemic hit poor White communities with few resources especially hard. If you’d like to read more about the opioid epidemic and the role of neighborhood disadvantage, particularly in White communities, please read this interesting report by the Brookings Institute entitled “How can Policy Address the Opioid Crisis and Despair in America”.

Let’s use our data to take a look at the relationship between ADI and number of pills per person purchased. Given our knowledge that the opioid pandemic was especially wide spread in predominantly non-Hispanic White communities, we will facet our scatterplot by the mincounty indicator created earlier.

Create a scatter plot

opioid_counties2010 %>% 
  select(ADI, pills_person, mincounty) %>% 
  drop_na() %>% 
  ggplot(mapping = aes(x = ADI, y = pills_person)) +
  geom_jitter(color = "dark gray") + 
  geom_smooth(method = "loess", se = FALSE, formula = y~x) +
  facet_wrap(~mincounty) +
  theme_bw() +
  labs(title = "Is county-level socio-economic disadvantage associated with more opioid pills per capita?",
       x = "Area Disadvantage Index",
       y = "Number of pills purchased per person")

Among predominantly White counties, we see evidence that greater socio-economic disadvantage was associated with more pills per person. This trend is not apparent in predominantly minority communities, and provides some interesting information about the correlation of socio-economic hardship and substance abuse across different types of communities.

Importing and exporting data in R

Before we close out this Module, let’s practice importing (reading in) and exporting (writing out) data files.

Import data

Import R dataframe

So far in this course, we have imported R dataframes.

adi <- here("data", "adi_counties.Rds") %>% 
  read_rds()

But, you can import many different types of dataframes to use in R. Commonly, data is entered in excel and saved as a comma delimited file (.csv file extension). There is an example of this in our data folder – the file called adi_counties.csv is of this type. We can read it using the read_csv() function, which is part of the tidyverse.

Import an csv file

adi <- here("data", "adi_counties.csv") %>% 
  read_csv()

Import other types of files

The haven package, which is part of the tidyverse, allows for the import of many different types of dataframes – like SPSS, SAS, and much more. If you have a file of this type, you can get instructions here.

Export data

Sometimes, we may want to save a new version of our dataframe. For example, let’s imagine that we want to save the opioid_counties2010 dataframe that we used in the last section. We can do that with the following code that makes use of the saveRDS() function (to save as a R data file) or the write_csv() function (to save as a csv file). Notice the use of the here() function so the data is directed to be saved in our data folder. You should never overwrite the original data file though – if you make changes to the source (i.e., original) data file – it’s important to save it with a different name so that the original file remains intact.

Export a Rds file

opioid_counties2010 %>% 
  saveRDS(here("data", "opioid_counties2010.Rds"))

Export a csv file

opioid_counties2010 %>% 
  write_csv(here("data", "opioid_counties2010.csv"))

Export other types of files

Using the haven package mentioned above, you can write out a dataframe to any other type. For example, to save your R datafile as a SPSS data file.

Resources

Go further, learn more: