PSY350, Module 3
Data Wrangling
Module road map
Learning objectives
- Identify the dplyr verbs for wrangling data in the tidyverse
- Apply each dplyr verb to accomplish a data wrangling task
- Practice preparing data for analysis
Tools for mastery
- Read Chapter 3 of your Textbook
- Read Chapter 4 of your Textbook
Overview
To wrangle data is to clean and prepare data for analysis. Our goal is to get the data into a tidy shape that can be readily plotted and analyzed. I like to think about data analysis in terms of painting a room. In painting, an excellent result comes from careful and thorough preparation. About 90% of painting is preparation – cleaning the walls, taping off trim, protecting the floor, getting the right tools, etc. Data analysis is similar – and data wrangling is the preparation. For most projects, most of your time will be spent preparing the data for analysis. If you skip this step or do a shoddy job – your data analysis will be poor.
The dplyr functions to wrangle data
dplyr is a R package that is part of the tidyverse — it’s the workhorse for wrangling your data.
In this Module, we will explore the following wrangling functions:
Choose certain cases (rows) in your dataframe with filter()
Choose certain variables (columns) in your dataframe with select()
Collapse many values down to summaries with summarize()
Create or modify variables with mutate()
Give variables a new name with rename()
Order your cases (rows) with arrange()
Merge two dataframes together with join()
Introduction to the data
In this Module we’re going to work with a very large data repository compiled by The Washington Post (WAPO) that tracked the path of every opioid pain pill, from manufacturer to pharmacy, in the United States between 2006 and 2014. The data scientists at WAPO created a package (called arcos) that makes the data publicly available.
The opioid epidemic in America has been devastating for individuals and communities. To provide context to the data exploration in this Module, please watch the following video from Last Week Tonight with John Oliver. Please note that the video contains some lewd language, you can skip the video without it affecting your grade if you like. You can also peruse this National Public Radio article and this Nature paper that also provide context.
We will work with the county level data provided by WAPO in this Module. I already downloaded it from the arcos package and saved it as a dataframe in our RStudio Cloud project. Let’s import the the dataframe.
<- here("data", "opioid_counties.Rds") %>%
opioid_counties read_rds()
Each county in the US has up to nine rows of data in this dataframe, one for each year from 2006 to 2014. The dataframe includes the following variables:
fips is a 5 digit code that identifies each county
county
state is the state abbreviation
year
number_pills is the number of pills purchased – specifically the number of Oxycodone & Hydrocodone pills purchased by pharmacies in the county
population is the total number of people living in the county
Let’s take a glimpse at the data.
%>%
opioid_counties glimpse()
## Rows: 27,758
## Columns: 6
## $ fips <chr> "45001", "45001", "45001", "45001", "45001", "45001", "45…
## $ county <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBE…
## $ state <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "LA…
## $ year <int> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 200…
## $ number_pills <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, 5…
## $ population <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, 2…
Let’s also view the first set of rows. The head() function will display the first set of rows (we specify to see 18 rows in this case by including n = 18). If you don’t specify a number of rows, then the default number of rows (6) will be printed.
%>%
opioid_counties head(n = 18)
fips | county | state | year | number_pills | population |
---|---|---|---|---|---|
45001 | ABBEVILLE | SC | 2006 | 363620 | 25821 |
45001 | ABBEVILLE | SC | 2007 | 402940 | 25745 |
45001 | ABBEVILLE | SC | 2008 | 424590 | 25699 |
45001 | ABBEVILLE | SC | 2009 | 467230 | 25347 |
45001 | ABBEVILLE | SC | 2010 | 539280 | 25643 |
45001 | ABBEVILLE | SC | 2011 | 566560 | 25515 |
45001 | ABBEVILLE | SC | 2012 | 589010 | 25387 |
45001 | ABBEVILLE | SC | 2013 | 596420 | 25233 |
45001 | ABBEVILLE | SC | 2014 | 641350 | 25100 |
22001 | ACADIA | LA | 2006 | 1969720 | 60522 |
22001 | ACADIA | LA | 2007 | 2207310 | 60762 |
22001 | ACADIA | LA | 2008 | 2246400 | 61115 |
22001 | ACADIA | LA | 2009 | 2537040 | 59616 |
22001 | ACADIA | LA | 2010 | 2679687 | 61139 |
22001 | ACADIA | LA | 2011 | 2806270 | 61430 |
22001 | ACADIA | LA | 2012 | 2640841 | 61611 |
22001 | ACADIA | LA | 2013 | 2569300 | 61847 |
22001 | ACADIA | LA | 2014 | 2541140 | 62031 |
The filter() function
The filter() dplyr verb is used to choose specific rows of data that meet certain conditions. R provides the standard set of selection operators:
>
>=
<
<=
!= (not equal to)
== (equal to)
Boolean operators are also permitted:
& is “and”
| is “or”
! is “not”
A single filter
First, let’s use filter() to create a new dataframe called opioid_counties2014, that includes only data from the year 2014.
<- opioid_counties %>%
opioid_counties2014 filter(year == 2014)
%>%
opioid_counties2014 head()
fips | county | state | year | number_pills | population |
---|---|---|---|---|---|
45001 | ABBEVILLE | SC | 2014 | 641350 | 25100 |
22001 | ACADIA | LA | 2014 | 2541140 | 62031 |
51001 | ACCOMACK | VA | 2014 | 796930 | 33165 |
16001 | ADA | ID | 2014 | 18427220 | 409239 |
19001 | ADAIR | IA | 2014 | 195120 | 7543 |
21001 | ADAIR | KY | 2014 | 1087490 | 19032 |
Let’s consider some other options that might be of interest. You can replace filter(year == 2014) in the code above with the following code to accomplish different results:
filter(year == 2013 | year == 2014) to create a dataframe of years where year equals 2013 or 2014
filter(year != 2006) to create a dataframe that excludes year 2006 but keeps the others
filter(year < 2010) to create a dataframe that keeps all years less than 2010
filter(year >=2010 & year <=2013) to create a dataframe of years 2010, 2011, 2012, and 2013
filter(year %in% c(2010, 2012, 2014)) to create a dataframe of years 2010, 2012, and 2014
Notice that when requesting to filter, we must use == and not =.
Multiple filters
We can filter on more than one variable. Next, let’s create a new dataframe called opioid_counties2014_CO, that includes only counties in CO and only for the year 2014.
<- opioid_counties %>%
opioid_counties2014_CO filter(state == "CO" & year == 2014)
%>%
opioid_counties2014_CO head()
fips | county | state | year | number_pills | population |
---|---|---|---|---|---|
08001 | ADAMS | CO | 2014 | 12865600 | 461558 |
08003 | ALAMOSA | CO | 2014 | 1292460 | 16111 |
08005 | ARAPAHOE | CO | 2014 | 16246185 | 596684 |
08007 | ARCHULETA | CO | 2014 | 308700 | 12132 |
08009 | BACA | CO | 2014 | 101900 | 3735 |
08011 | BENT | CO | 2014 | 171100 | 6000 |
We could make this even more specific by choosing counties in CO for years 2010 to 2014 by swapping the filter() statement above with the following:
- filter(state == “CO” & (year >=2010 & year <= 2014))
Note that when referring to a character variable, like state, you must put the value of interest in quotation marks.
Tip: Always take a look at the resulting dataframe to make sure your code performed the right task!
The select() function
The select() dplyr verb is used to select only certain columns (i.e., variables) in your dataframe. Let’s choose just state, county, and number_pills.
<- opioid_counties %>%
opioid_counties_subset select(state, county, number_pills)
%>%
opioid_counties_subset head()
state | county | number_pills |
---|---|---|
SC | ABBEVILLE | 363620 |
SC | ABBEVILLE | 402940 |
SC | ABBEVILLE | 424590 |
SC | ABBEVILLE | 467230 |
SC | ABBEVILLE | 539280 |
SC | ABBEVILLE | 566560 |
You can also use select() to reorder variables. For example, if you simply wanted to move state, county, and number_pills to the front of your dataframe, but still keep the other variables, you could modify the select() function above as follows:
select(state, county, number_pills, everything())
The everything() function puts all the other variables after the three specifically mentioned.
The summarize function
The summarize() dplyr verb is used to summarize your data – for example, to create summary statistics.
Compute summary statistics
Let’s compute the mean and standard deviation of pills, ignoring year. In the code below, I use na.rm = TRUE to indicate that I want R to remove cases (i.e., rows of data) with missing data on the variable of interest (number_pills) and compute the statistic on the remainder of cases.
%>%
opioid_counties summarize(mean_number_pills = mean(number_pills, na.rm = TRUE),
sd_number_pills = sd(number_pills, na.rm = TRUE))
mean_number_pills | sd_number_pills |
---|---|
3539313 | 9069074 |
Pair it up with the group_by() function
The group_by() dplyr verb is used to group your data by a salient variable. Once grouped, functions can be performed by group. Let’s compute summary statistics by a grouping variable. In this next example we’ll group by year and then compute the mean for number of pills for each year.
%>%
opioid_counties group_by(year) %>%
summarize(mean_number_pills = mean(number_pills, na.rm = TRUE))
year | mean_number_pills |
---|---|
2006 | 2645410 |
2007 | 2992099 |
2008 | 3252162 |
2009 | 3526038 |
2010 | 3783656 |
2011 | 4035583 |
2012 | 3993064 |
2013 | 3861752 |
2014 | 3768738 |
We can save the object produced by our code. For example, let’s create a new dataframe that calculates the total number of pills sold each year. To create a dataframe that includes the sum of pills purchased by year, we simply need to provide the output with a name. Here, we call the new dataframe sum_pills_year.
<- opioid_counties %>%
sum_pills_year group_by(year) %>%
summarize(total_pills = sum(number_pills, na.rm = TRUE))
sum_pills_year
year | total_pills |
---|---|
2006 | 8200770093 |
2007 | 9236611007 |
2008 | 10010153796 |
2009 | 10849619018 |
2010 | 11634741390 |
2011 | 12441702286 |
2012 | 12330581308 |
2013 | 11913504244 |
2014 | 11626556851 |
This is a very useful function of summarize(). For example, once the summary statistics by year are produced, we can create a line graph of the result to visualize the change in pills purchased across time.
Create a line graph of the result
%>%
sum_pills_year ggplot(aes(x = year)) +
geom_line(aes(y = total_pills)) +
theme_bw() +
labs(title = "Change in total number of opioid pills purchased in the US from 2006 to 2014?",
y = "Number of opioid pills puchased",
x = "Year")
This graph shows a precipitous increase in pills flooding into the US from 2006 to 2011.
The mutate() function
The mutate() dplyr verb is used to create new variables.
Create a new variable that is a transformation of another
Let’s create a new variable in our dataframe to calculate the number of pills per capita, for each county and year.
<- opioid_counties %>%
opioid_counties mutate(pills_person = number_pills/population)
%>%
opioid_counties head()
fips | county | state | year | number_pills | population | pills_person |
---|---|---|---|---|---|---|
45001 | ABBEVILLE | SC | 2006 | 363620 | 25821 | 14.08234 |
45001 | ABBEVILLE | SC | 2007 | 402940 | 25745 | 15.65119 |
45001 | ABBEVILLE | SC | 2008 | 424590 | 25699 | 16.52165 |
45001 | ABBEVILLE | SC | 2009 | 467230 | 25347 | 18.43335 |
45001 | ABBEVILLE | SC | 2010 | 539280 | 25643 | 21.03030 |
45001 | ABBEVILLE | SC | 2011 | 566560 | 25515 | 22.20498 |
The rename() function
The rename() dplyr verb is used to rename a variable. Let’s imagine that we want to change the name of number_pills to opioid_pills.
<- opioid_counties %>%
opioid_counties_rename rename(opioid_pills = number_pills)
%>%
opioid_counties_rename head()
fips | county | state | year | opioid_pills | population | pills_person |
---|---|---|---|---|---|---|
45001 | ABBEVILLE | SC | 2006 | 363620 | 25821 | 14.08234 |
45001 | ABBEVILLE | SC | 2007 | 402940 | 25745 | 15.65119 |
45001 | ABBEVILLE | SC | 2008 | 424590 | 25699 | 16.52165 |
45001 | ABBEVILLE | SC | 2009 | 467230 | 25347 | 18.43335 |
45001 | ABBEVILLE | SC | 2010 | 539280 | 25643 | 21.03030 |
45001 | ABBEVILLE | SC | 2011 | 566560 | 25515 | 22.20498 |
The arrange() function
The arrange() dplyr verb will sort a dataframe by a variable or set of variables. Let’s filter to choose only CO counties in 2010 and sort by number of pills purchased per person.
%>%
opioid_counties filter(year == 2010 & state == "CO") %>%
arrange(pills_person)
fips | county | state | year | number_pills | population | pills_person |
---|---|---|---|---|---|---|
08109 | SAGUACHE | CO | 2010 | 600 | 6161 | 0.0973868 |
08079 | MINERAL | CO | 2010 | 100 | 1020 | 0.0980392 |
08027 | CUSTER | CO | 2010 | 500 | 3899 | 0.1282380 |
08053 | HINSDALE | CO | 2010 | 1300 | 489 | 2.6584867 |
08047 | GILPIN | CO | 2010 | 42100 | 5126 | 8.2130316 |
08093 | PARK | CO | 2010 | 134120 | 16286 | 8.2352941 |
08113 | SAN MIGUEL | CO | 2010 | 86100 | 7299 | 11.7961365 |
08097 | PITKIN | CO | 2010 | 214960 | 16389 | 13.1161145 |
08095 | PHILLIPS | CO | 2010 | 60000 | 4394 | 13.6549841 |
08039 | ELBERT | CO | 2010 | 336930 | 22712 | 14.8348890 |
08063 | KIT CARSON | CO | 2010 | 146400 | 8156 | 17.9499755 |
08007 | ARCHULETA | CO | 2010 | 223520 | 12136 | 18.4179301 |
08103 | RIO BLANCO | CO | 2010 | 132300 | 6494 | 20.3726517 |
08031 | DENVER | CO | 2010 | 11962630 | 578087 | 20.6934769 |
08009 | BACA | CO | 2010 | 81800 | 3833 | 21.3409862 |
08051 | GUNNISON | CO | 2010 | 327450 | 15136 | 21.6338531 |
08035 | DOUGLAS | CO | 2010 | 6024770 | 273440 | 22.0332431 |
08125 | YUMA | CO | 2010 | 222800 | 9896 | 22.5141471 |
08049 | GRAND | CO | 2010 | 355400 | 14526 | 24.4664739 |
08091 | OURAY | CO | 2010 | 109160 | 4319 | 25.2743691 |
08067 | LA PLATA | CO | 2010 | 1272240 | 50149 | 25.3691998 |
08117 | SUMMIT | CO | 2010 | 726095 | 27105 | 26.7882310 |
08005 | ARAPAHOE | CO | 2010 | 15248752 | 552860 | 27.5815794 |
08013 | BOULDER | CO | 2010 | 8049114 | 290177 | 27.7386354 |
08011 | BENT | CO | 2010 | 170160 | 6125 | 27.7812245 |
08017 | CHEYENNE | CO | 2010 | 62400 | 2194 | 28.4412033 |
08037 | EAGLE | CO | 2010 | 1453880 | 50793 | 28.6236292 |
08001 | ADAMS | CO | 2010 | 12211730 | 425330 | 28.7111890 |
08019 | CLEAR CREEK | CO | 2010 | 267800 | 9088 | 29.4674296 |
08105 | RIO GRANDE | CO | 2010 | 355440 | 11926 | 29.8037900 |
08107 | ROUTT | CO | 2010 | 685151 | 22924 | 29.8879340 |
08065 | LAKE | CO | 2010 | 216200 | 7039 | 30.7145901 |
08021 | CONEJOS | CO | 2010 | 254100 | 8220 | 30.9124088 |
08075 | LOGAN | CO | 2010 | 710600 | 22278 | 31.8969387 |
08041 | EL PASO | CO | 2010 | 19247950 | 599988 | 32.0805583 |
08123 | WELD | CO | 2010 | 7820510 | 242860 | 32.2017212 |
08087 | MORGAN | CO | 2010 | 927730 | 27911 | 33.2388664 |
08069 | LARIMER | CO | 2010 | 9728210 | 291162 | 33.4116746 |
08025 | CROWLEY | CO | 2010 | 208900 | 5897 | 35.4247923 |
08045 | GARFIELD | CO | 2010 | 1960088 | 54761 | 35.7935027 |
08085 | MONTROSE | CO | 2010 | 1443710 | 40266 | 35.8543188 |
08115 | SEDGWICK | CO | 2010 | 89930 | 2412 | 37.2844113 |
08077 | MESA | CO | 2010 | 5420700 | 142284 | 38.0977482 |
08059 | JEFFERSON | CO | 2010 | 21151398 | 528614 | 40.0129357 |
08061 | KIOWA | CO | 2010 | 66000 | 1643 | 40.1704200 |
08083 | MONTEZUMA | CO | 2010 | 1086800 | 25279 | 42.9922070 |
08043 | FREMONT | CO | 2010 | 2040300 | 46941 | 43.4652010 |
08014 | BROOMFIELD | CO | 2010 | 2323780 | 52872 | 43.9510516 |
08029 | DELTA | CO | 2010 | 1404500 | 30533 | 45.9994105 |
08099 | PROWERS | CO | 2010 | 601600 | 12734 | 47.2435998 |
08015 | CHAFFEE | CO | 2010 | 843570 | 17540 | 48.0940707 |
08081 | MOFFAT | CO | 2010 | 663000 | 13519 | 49.0420889 |
08071 | LAS ANIMAS | CO | 2010 | 785200 | 15675 | 50.0925040 |
08119 | TELLER | CO | 2010 | 1148040 | 22821 | 50.3062968 |
08073 | LINCOLN | CO | 2010 | 276300 | 5476 | 50.4565376 |
08055 | HUERFANO | CO | 2010 | 388500 | 6948 | 55.9153713 |
08101 | PUEBLO | CO | 2010 | 11658140 | 156244 | 74.6149612 |
08089 | OTERO | CO | 2010 | 1446400 | 18830 | 76.8135953 |
08003 | ALAMOSA | CO | 2010 | 1381550 | 15293 | 90.3387171 |
If instead you wanted to sort from high to low, then use the desc() function:
%>%
opioid_counties filter(year == 2010 & state == "CO") %>%
arrange(desc(pills_person))
fips | county | state | year | number_pills | population | pills_person |
---|---|---|---|---|---|---|
08003 | ALAMOSA | CO | 2010 | 1381550 | 15293 | 90.3387171 |
08089 | OTERO | CO | 2010 | 1446400 | 18830 | 76.8135953 |
08101 | PUEBLO | CO | 2010 | 11658140 | 156244 | 74.6149612 |
08055 | HUERFANO | CO | 2010 | 388500 | 6948 | 55.9153713 |
08073 | LINCOLN | CO | 2010 | 276300 | 5476 | 50.4565376 |
08119 | TELLER | CO | 2010 | 1148040 | 22821 | 50.3062968 |
08071 | LAS ANIMAS | CO | 2010 | 785200 | 15675 | 50.0925040 |
08081 | MOFFAT | CO | 2010 | 663000 | 13519 | 49.0420889 |
08015 | CHAFFEE | CO | 2010 | 843570 | 17540 | 48.0940707 |
08099 | PROWERS | CO | 2010 | 601600 | 12734 | 47.2435998 |
08029 | DELTA | CO | 2010 | 1404500 | 30533 | 45.9994105 |
08014 | BROOMFIELD | CO | 2010 | 2323780 | 52872 | 43.9510516 |
08043 | FREMONT | CO | 2010 | 2040300 | 46941 | 43.4652010 |
08083 | MONTEZUMA | CO | 2010 | 1086800 | 25279 | 42.9922070 |
08061 | KIOWA | CO | 2010 | 66000 | 1643 | 40.1704200 |
08059 | JEFFERSON | CO | 2010 | 21151398 | 528614 | 40.0129357 |
08077 | MESA | CO | 2010 | 5420700 | 142284 | 38.0977482 |
08115 | SEDGWICK | CO | 2010 | 89930 | 2412 | 37.2844113 |
08085 | MONTROSE | CO | 2010 | 1443710 | 40266 | 35.8543188 |
08045 | GARFIELD | CO | 2010 | 1960088 | 54761 | 35.7935027 |
08025 | CROWLEY | CO | 2010 | 208900 | 5897 | 35.4247923 |
08069 | LARIMER | CO | 2010 | 9728210 | 291162 | 33.4116746 |
08087 | MORGAN | CO | 2010 | 927730 | 27911 | 33.2388664 |
08123 | WELD | CO | 2010 | 7820510 | 242860 | 32.2017212 |
08041 | EL PASO | CO | 2010 | 19247950 | 599988 | 32.0805583 |
08075 | LOGAN | CO | 2010 | 710600 | 22278 | 31.8969387 |
08021 | CONEJOS | CO | 2010 | 254100 | 8220 | 30.9124088 |
08065 | LAKE | CO | 2010 | 216200 | 7039 | 30.7145901 |
08107 | ROUTT | CO | 2010 | 685151 | 22924 | 29.8879340 |
08105 | RIO GRANDE | CO | 2010 | 355440 | 11926 | 29.8037900 |
08019 | CLEAR CREEK | CO | 2010 | 267800 | 9088 | 29.4674296 |
08001 | ADAMS | CO | 2010 | 12211730 | 425330 | 28.7111890 |
08037 | EAGLE | CO | 2010 | 1453880 | 50793 | 28.6236292 |
08017 | CHEYENNE | CO | 2010 | 62400 | 2194 | 28.4412033 |
08011 | BENT | CO | 2010 | 170160 | 6125 | 27.7812245 |
08013 | BOULDER | CO | 2010 | 8049114 | 290177 | 27.7386354 |
08005 | ARAPAHOE | CO | 2010 | 15248752 | 552860 | 27.5815794 |
08117 | SUMMIT | CO | 2010 | 726095 | 27105 | 26.7882310 |
08067 | LA PLATA | CO | 2010 | 1272240 | 50149 | 25.3691998 |
08091 | OURAY | CO | 2010 | 109160 | 4319 | 25.2743691 |
08049 | GRAND | CO | 2010 | 355400 | 14526 | 24.4664739 |
08125 | YUMA | CO | 2010 | 222800 | 9896 | 22.5141471 |
08035 | DOUGLAS | CO | 2010 | 6024770 | 273440 | 22.0332431 |
08051 | GUNNISON | CO | 2010 | 327450 | 15136 | 21.6338531 |
08009 | BACA | CO | 2010 | 81800 | 3833 | 21.3409862 |
08031 | DENVER | CO | 2010 | 11962630 | 578087 | 20.6934769 |
08103 | RIO BLANCO | CO | 2010 | 132300 | 6494 | 20.3726517 |
08007 | ARCHULETA | CO | 2010 | 223520 | 12136 | 18.4179301 |
08063 | KIT CARSON | CO | 2010 | 146400 | 8156 | 17.9499755 |
08039 | ELBERT | CO | 2010 | 336930 | 22712 | 14.8348890 |
08095 | PHILLIPS | CO | 2010 | 60000 | 4394 | 13.6549841 |
08097 | PITKIN | CO | 2010 | 214960 | 16389 | 13.1161145 |
08113 | SAN MIGUEL | CO | 2010 | 86100 | 7299 | 11.7961365 |
08093 | PARK | CO | 2010 | 134120 | 16286 | 8.2352941 |
08047 | GILPIN | CO | 2010 | 42100 | 5126 | 8.2130316 |
08053 | HINSDALE | CO | 2010 | 1300 | 489 | 2.6584867 |
08027 | CUSTER | CO | 2010 | 500 | 3899 | 0.1282380 |
08079 | MINERAL | CO | 2010 | 100 | 1020 | 0.0980392 |
08109 | SAGUACHE | CO | 2010 | 600 | 6161 | 0.0973868 |
We can also create a chart of the same data – a bar chart with county on the y-axis and pill_person on the x-axis will do the trick. If we’d like to order the counties by the pills per capita, then we need to use the fct_reorder function (see code below). Rather than y = county, we need y = fct_reorder(county, pills_person). fct_reorder in this instance needs two arguments, the variable that you want to put on the y-axis (county), and the ordering that you desire (we want to order by pills_person).
%>%
opioid_counties filter(year == 2010 & state == "CO") %>%
ggplot(aes(x = pills_person, y = fct_reorder(county, pills_person))) +
geom_col(fill = "#E24E42") +
labs(title = "Number of pills purchased per person in 2010 by county in CO",
x = "Number of pills per person",
y = "County",
caption = "Source: The Washington Post, ARCOS") +
theme_minimal()
The join functions
The join dplyr verbs are used to join/merge two dataframes together. Let’s read in a related dataframe, called adi_counties.Rds, that provides the Area Disadvantage Index (ADI) for each county. The ADI is based on a measure created by the Health Resources & Services Administration (HRSA). It is based on US Census records, and serves as a measure of concentrated socio-economic disadvantage in the county. In the dataframe, there is one row of data per county in the US, and then a series of variables that describe the county. In this example, we’ll use the following variables:
fips is the county ID.
ADI is the area disadvantage index – a higher score means more disadvantage
WhiteNonHispanicPct2010 is the percent of county residents who identify as non-Hispanic White.
For this example, we will dichotomize WhiteNonHispanicPct2010, naming it mincounty. This binary indicator will be used to compare counties where fewer than 50% of the residents identify as non-Hispanic White to all others. We’ll do this using the case_when() function in dplyr. Each case_when() statement has two parts – the part before the tilde (~) and the part after the tilde. The part before the tilde is called the left hand side (LHS) and the part after the tilde is called the right hand side (RHS). case_when() uses the LHS part of the statement as the condition to test for and the RHS as the value to return. case_when() works in sequence through the conditions, finding when the LHS is true for each case (i.e., row of data), and then returning the value associated with that condition.
In the example below, mincounty is a new variable that is created using case_when(). Each row in the dataframe will be evaluated, if WhiteNonHispanicPct2010 < 50, then “minority county” will be recorded for mincounty. The second case_when() statement is only evaluated if the first case_when() statement is false. In the second case_when() statement, if WhiteNonHispanicPct2010 is greater than or equal to 50, then “white county” will be recorded for mincounty. Any rows that don’t match any of the conditions will be assigned NA (NA is R’s missing data indicator) for mincounty.
<- here("data", "adi_counties.Rds") %>%
adi read_rds() %>%
mutate(mincounty = case_when(WhiteNonHispanicPct2010 < 50 ~ "minority county",
>= 50 ~ "white county")) %>%
WhiteNonHispanicPct2010 select(fips, ADI, WhiteNonHispanicPct2010, mincounty)
%>%
adi head()
fips | ADI | WhiteNonHispanicPct2010 | mincounty |
---|---|---|---|
01001 | 87.89380 | 77.24616 | white county |
01003 | 85.95352 | 83.50479 | white county |
01005 | 122.40521 | 46.75310 | minority county |
01007 | 105.98061 | 75.02073 | white county |
01009 | 101.91368 | 88.88734 | white county |
01011 | 130.58507 | 21.91680 | minority county |
Now, let’s merge the adi dataframe with the 2010 opioid data, since fips is in both dataframes, and it uniquely identifies each county, we can join the dataframes together using fips as the key.
<-
opioid_counties2010 %>%
opioid_counties filter(year == 2010) %>%
left_join(adi, by = "fips")
%>%
opioid_counties2010 head()
fips | county | state | year | number_pills | population | pills_person | ADI | WhiteNonHispanicPct2010 | mincounty |
---|---|---|---|---|---|---|---|---|---|
45001 | ABBEVILLE | SC | 2010 | 539280 | 25643 | 21.03030 | 115.04469 | 69.10336 | white county |
22001 | ACADIA | LA | 2010 | 2679687 | 61139 | 43.82942 | 117.66953 | 78.60230 | white county |
51001 | ACCOMACK | VA | 2010 | 800550 | 34066 | 23.49997 | 105.90245 | 61.10843 | white county |
16001 | ADA | ID | 2010 | 16145256 | 380718 | 42.40739 | 76.85215 | 86.48376 | white county |
19001 | ADAIR | IA | 2010 | 222100 | 7779 | 28.55123 | 88.76101 | 97.72195 | white county |
21001 | ADAIR | KY | 2010 | 1033160 | 18542 | 55.71999 | 122.46864 | 94.11449 | white county |
Advanced tip: There are many different types of joins available, and the best choice depends on the qualities of the dataframes to be joined and the analyst’s needs. You can explore the types here.
With these data compiled, let’s take the opportunity to determine if counties with more socio-economic disadvantage had more pills coming into their communities. This is an important public health question. The socio-economic vitality of a community plays an important role in promoting health, well-being, and prosperity for residents. On the flip side, communities characterized by socio-economic strain pose threats to the health, well-being, and prosperity of residents. As such, public health scientists recognize socio-economic advantage vs. disadvantage, of both an individual and his/her community, as a key social determinant of health. HealthyPeople.gov defines social determinants of health and describes their importance:
Social determinants of health are conditions in the environments in which people are born, live, learn, work, play, worship, and age that affect a wide range of health, functioning, and quality-of-life outcomes and risks. Conditions (e.g., social, economic, and physical) in these various environments and settings (e.g., school, church, workplace, and neighborhood) have been referred to as “place.” In addition to the more material attributes of “place,” the patterns of social engagement and sense of security and well-being are also affected by where people live. Resources that enhance quality of life can have a significant influence on population health outcomes. Examples of these resources include safe and affordable housing, access to education, public safety, availability of healthy foods, local emergency/health services, and environments free of life-threatening toxins. Understanding the relationship between how population groups experience “place” and the impact of “place” on health is fundamental to the social determinants of health—including both social and physical determinants.
You probably know from the news cycle over the past decade that the opioid epidemic hit poor White communities with few resources especially hard. If you’d like to read more about the opioid epidemic and the role of neighborhood disadvantage, particularly in White communities, please read this interesting report by the Brookings Institute entitled “How can Policy Address the Opioid Crisis and Despair in America”.
Let’s use our data to take a look at the relationship between ADI and number of pills per person purchased. Given our knowledge that the opioid pandemic was especially wide spread in predominantly non-Hispanic White communities, we will facet our scatterplot by the mincounty indicator created earlier.
Create a scatter plot
%>%
opioid_counties2010 select(ADI, pills_person, mincounty) %>%
drop_na() %>%
ggplot(mapping = aes(x = ADI, y = pills_person)) +
geom_jitter(color = "dark gray") +
geom_smooth(method = "loess", se = FALSE, formula = y~x) +
facet_wrap(~mincounty) +
theme_bw() +
labs(title = "Is county-level socio-economic disadvantage associated with more opioid pills per capita?",
x = "Area Disadvantage Index",
y = "Number of pills purchased per person")
Among predominantly White counties, we see evidence that greater socio-economic disadvantage was associated with more pills per person. This trend is not apparent in predominantly minority communities, and provides some interesting information about the correlation of socio-economic hardship and substance abuse across different types of communities.
Importing and exporting data in R
Before we close out this Module, let’s practice importing (reading in) and exporting (writing out) data files.
Import data
Import R dataframe
So far in this course, we have imported R dataframes.
<- here("data", "adi_counties.Rds") %>%
adi read_rds()
But, you can import many different types of dataframes to use in R. Commonly, data is entered in excel and saved as a comma delimited file (.csv file extension). There is an example of this in our data folder – the file called adi_counties.csv is of this type. We can read it using the read_csv() function, which is part of the tidyverse.
Import an csv file
<- here("data", "adi_counties.csv") %>%
adi read_csv()
Import other types of files
The haven package, which is part of the tidyverse, allows for the import of many different types of dataframes – like SPSS, SAS, and much more. If you have a file of this type, you can get instructions here.
Export data
Sometimes, we may want to save a new version of our dataframe. For example, let’s imagine that we want to save the opioid_counties2010 dataframe that we used in the last section. We can do that with the following code that makes use of the saveRDS() function (to save as a R data file) or the write_csv() function (to save as a csv file). Notice the use of the here() function so the data is directed to be saved in our data folder. You should never overwrite the original data file though – if you make changes to the source (i.e., original) data file – it’s important to save it with a different name so that the original file remains intact.
Export a Rds file
%>%
opioid_counties2010 saveRDS(here("data", "opioid_counties2010.Rds"))
Export a csv file
%>%
opioid_counties2010 write_csv(here("data", "opioid_counties2010.csv"))
Export other types of files
Using the haven package mentioned above, you can write out a dataframe to any other type. For example, to save your R datafile as a SPSS data file.