PSY350, Module 2
Data Visualization
Module road map
Learning objectives
- Describe common measures of central tendency and dispersion
- Describe the graphical layers of a plot
- Identify the ggplot2 arguments needed to use each of the graph layers
- Create the most commonly used graphics in data science
- Apply graphics to illustrate important information and create new insight
Readings
- Read Chapter 2 of your Textbook
- Read Appendix A of your Textbook
Overview
Data visualization is the act of producing meaningful and aesthetically pleasing graphics of your data. Visualizing your data is the first thing you should do when you begin to work with a new data set. It’s the best way to get familiar with the data, find errors in the data, and develop a deeper understanding of the phenomena at play.
In this course, we will use data visualization for exploration (for ourselves/our research team) and explanation (for the readers and consumers of our research).
Before we dive into data visualizations, let’s begin by brushing up on common statistics used to describe data.
Measures of central tendency
When we approach a new analysis, we typically begin by examining the distribution of our variables. For a continuous variable, assessment of central tendency and dispersion is a good place to start.
A measure of central tendency captures a central or typical value from the distribution of the variable. Common measures of central tendency include the mean, the median, and the mode.
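In R, the mean and median come from base functions; base R has no built-in function for the statistical mode, so the sketch below defines a small helper (get_mode is a made-up name for this illustration, not a base R function):

```r
scores <- c(2, 3, 3, 4, 5, 5, 5, 9)

mean(scores)    # arithmetic average: 4.5
median(scores)  # middle value of the sorted scores: 4.5

# Base R has no mode function, so we tabulate the values and
# return the most frequent one (get_mode is a hypothetical name)
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
get_mode(scores)  # most frequent value: 5
```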
Please watch the video by Crash Course: Statistics on measures of central tendency.
Measures of dispersion/spread
Dispersion (or spread) describes the extent to which a distribution is stretched or squeezed. Common measures of dispersion are standard deviation and the range.
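Both measures are one-liners in base R. Note that range() returns the minimum and maximum, so the range as a single number is their difference:

```r
scores <- c(2, 3, 3, 4, 5, 5, 5, 9)

sd(scores)           # standard deviation (n - 1 denominator), about 2.14
range(scores)        # minimum and maximum: 2 and 9
diff(range(scores))  # the range as a single number: 9 - 2 = 7
```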
Please watch the video below from Crash Course: Statistics on Measures of Spread.
Introduction to data visualization
With common ways of describing data in mind, please watch the following two videos from Crash Course on visualization of data.
Creating graphics in R
ggplot2 is an R package and part of the tidyverse. R has several plotting systems (including base R graphics and other packages), but ggplot2 is the best all-purpose plotting system, and there are many wonderful resources for mastering it. We will walk through the fundamentals of using ggplot2 to create data visualizations, but to really master it I highly recommend the R graph library, which is linked at the bottom of this handout.
Mastering R takes time and practice, and that includes becoming proficient with data visualization techniques with ggplot2. Start with simple graphs, and slowly try out new features. Before you know it you will be creating beautiful, publication-quality graphics to display meaningful information.
New users of R often find that searching for an example graph (like in the R graph library), copying and pasting the code to create the example graph, and then modifying the example to fit their needs is a great way to start. All of the available arguments can be a bit overwhelming at first. However, by exploring what different options do, you will slowly build your knowledge base and skill level.
Introduction to ggplot2
ggplot2 was developed to implement the graphing framework put forth by Leland Wilkinson in his book The Grammar of Graphics. The grammar of graphics and ggplot2 are built around the seven layers of a graph listed in the table below. Think of each layer as providing a piece of the graph – when they come together, they tell the complete story of the data.
The layers of a ggplot2 graphic
Layer | Description |
---|---|
Data | The data to be plotted. |
Aesthetics | The mapping of variables to elements of the plot. |
Geometrics | The visual elements to display the data – e.g., bar graph, scatterplot. |
Facets | Plotting of small groupings within the plot area – i.e., subplots. |
Statistics and Transformations | Additional statistics imposed onto plot or numerical transformations of variables to aid interpretation. |
Coordinate | The space on which the data are plotted – e.g., Cartesian coordinate system (x and y axis), polar coordinates (pie chart). |
Themes | All non-data elements – e.g., background color, labels. |
Let’s apply the grammar of graphics to the plot below, which appeared in a 2020 issue of the Centers for Disease Control and Prevention’s Morbidity and Mortality Weekly Report.
The scientists aimed to determine if adoption of more COVID-19 mitigation policies was associated with fewer COVID-19 deaths in European countries during the first half of 2020. The graphic below presents their findings.
Early policy stringency* and cumulative mortality† from COVID-19 — 37 European countries, January 23–June 30, 2020
Plot footnotes:
Abbreviations: ALB = Albania; AUT = Austria; BEL = Belgium; BGR = Bulgaria; BIH = Bosnia and Herzegovina; BLR = Belarus; CHE = Switzerland; CI = confidence interval; COVID-19 = coronavirus disease 2019; CYP = Cyprus; CZE = Czechia; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HRV = Croatia; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; LTU = Lithuania; LUX = Luxembourg; LVA = Latvia; MDA = Moldova; NLD = Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; ROU = Romania; SRB = Serbia; SVK = Slovakia; SVN = Slovenia; SWE = Sweden; TUR = Turkey; UKR = Ukraine.
* Based on the Oxford Stringency Index (OSI) on the date the country reached the mortality threshold. The OSI is a composite index ranging from 0–100, based on the following nine mitigation policies: 1) cancellation of public events, 2) school closures, 3) gathering restrictions, 4) workplace closures, 5) border closures, 6) internal movement restrictions, 7) public transport closure, 8) stay-at-home recommendations, and 9) stay-at-home orders. The mortality threshold is the first date that each country reached a daily rate of 0.02 new COVID-19 deaths per 100,000 population, based on a 7-day moving average of the daily death rate. The color gradient represents the calendar date that each country reached the mortality threshold.
† Deaths per 100,000 population.
___
In this table, I list each of the possible layers of a graph, describe them, and then indicate if and how the layer is represented in the example graph above.
The layers of a ggplot2 graphic – mapped to the example MMWR plot
Layer | Description | Example |
---|---|---|
Data | The data to be plotted. | Data compiled by the CDC |
Aesthetics | The mapping of variables to elements of the plot. | OSI is mapped to the x-axis, cumulative mortality is mapped to the y-axis, population size is mapped to the size of the points, date the mortality threshold was reached is mapped to the color of the points, country abbreviations are mapped as text to the points. |
Geometrics | The visual elements to display the data – e.g., bar graph, scatterplot. | Scatterplot |
Facets | Plotting of small groupings within the plot area – i.e., subplots. | There are no facets in this example – there is just a single plot. |
Statistics and Transformations | Additional statistics imposed onto plot or numerical transformations of variables to aid interpretation. | A best fit line, corresponding confidence interval |
Coordinate | The space on which the data are plotted – e.g., Cartesian coordinate system (x and y axis), polar coordinates (pie chart). | Cartesian coordinate system |
Themes | All non-data elements – e.g., background color, labels. | Axis labels for x and y axis. |
Please watch this video that further describes the graph.
With this introduction to graphing in our pocket, let’s begin our journey to learn about data visualization and the power of ggplot2! All of the graphs in this module make use of the wealth of data compiled to examine the course and consequences of the COVID-19 pandemic during its first year. Before studying the graphs in this module, please read this compelling report by Amy Maxmen that illustrates how the COVID-19 pandemic hit some communities and some groups far harder than others, particularly during the early days of the pandemic before vaccinations and treatments were readily available. Throughout the pandemic, pre-existing health inequities and social injustices have become even more apparent, and the pandemic has engendered even deeper inequities and injustices.
We’re going to take a look at several different types of graphs. For each type of graph, I will first show you a relatively simple version. It’s this simple version that you will need to master for this course. I will also show you a more advanced version of the graph type – one that incorporates more features, more layers, more information. My intent is to get you excited about what’s possible, and show you a path to next steps if you choose to extend your data science journey beyond this course.
Create a series of graphics
Let’s begin by loading the necessary libraries. We will need the here package (to refer to the data files that will be read in) and the tidyverse package (which includes ggplot2 – the package used to plot data).
library(here)
library(tidyverse)
Bar graphs
A bar chart is a graph that presents grouped data with rectangular bars. The heights of the bars are proportional to the values that they represent. For example, perhaps an analyst aims to present the number of deaths from COVID-19 for each month of 2020. There are two methods for creating bar charts, and the choice between the two depends on the structure of the data.
In the first, the heights of the bars represent values in the data – for example, summaries of some outcome by group. The dataframe has one row per group, with a variable holding the summary score (e.g., the mean or count of some outcome). For this method we use the geom_col() geometry of ggplot2.
In the second, the heights of the bars are proportional to the number of cases in each group, and the dataframe has one row per case, with each case belonging to a group. For this method we use the geom_bar() geometry of ggplot2.
Bar graph, method 1 – geom_col()
We are going to create a graph that displays the COVID-19 death rate (expressed as the number of deaths per 100,000 people) by race/ethnicity for residents of Colorado. Identifying health disparities and acting to eliminate them is critical, and data science is a key tool for identifying them. Systemic racism and social injustices that began centuries ago in the US and have persisted over time cause Black, Latino, and Native people to be at greater risk for contracting COVID-19. Emerging evidence also suggests that Black, Latino, and Native people suffer greater morbidity and mortality as a result of COVID-19, due to both underlying health issues (which are also driven by systemic racism) and reduced access to health care. For more information about the role of race/ethnicity in the COVID-19 pandemic, see this CDC report.
These data were compiled from the COVID-19 Health Equity Interactive Dashboard at Emory University. I pulled these data on August 10, 2021, and in the code chunk below, I enter them manually. This is efficient since there are only five values to consider, and the tibble() function makes it easy. To create a dataframe with the tibble() function, list the variable names that you’d like (here, race_eth and deaths_per_100000 for the race/ethnic group and the death rate, respectively), and after each variable name record the values. The c() function stands for concatenate – it combines the listed values into a vector. So, for example, the race_eth variable will have five levels – Black, American Indian, Latino, White, and Asian. Because the values are characters (not numbers), we must list them in quotation marks. The deaths_per_100000 variable lists the death rate for each of these race/ethnic groups – in the same order as the race_eth values above (e.g., the death rate for Black residents is 108, for American Indian residents 189, etc.). Once this code is executed, a new dataframe called df_bar1 will be created, with two columns (i.e., variables – race_eth, deaths_per_100000) and five rows (one for each race/ethnic group).
df_bar1 <- tibble(
  race_eth = c("Black", "American Indian", "Latino", "White", "Asian"),
  deaths_per_100000 = c(108, 189, 107, 90, 88)
)
We can use the glimpse() function to see the structure of the dataframe. The glimpse() function is used to describe the variable types in the dataframe and take a peek at the first few rows.
df_bar1 %>%
  glimpse()
## Rows: 5
## Columns: 2
## $ race_eth <chr> "Black", "American Indian", "Latino", "White", "Asia…
## $ deaths_per_100000 <dbl> 108, 189, 107, 90, 88
In this data frame there are two variables:
race_eth is a character variable (i.e., a string of letters, numbers, and/or symbols)
deaths_per_100000 is a double (which is a numeric variable)
Since this is such a small dataframe, we can easily print the entire set.
df_bar1
race_eth | deaths_per_100000 |
---|---|
Black | 108 |
American Indian | 189 |
Latino | 107 |
White | 90 |
Asian | 88 |
Notice that the dataframe consists of one row per group (race_eth), with a variable denoting the value that we wish to plot (deaths_per_100000).
Create a simple plot
For our first bar graph, we will create a relatively simple version. Here is the R code needed to create our graph.
df_bar1 %>%
  ggplot(mapping = aes(x = race_eth, y = deaths_per_100000)) +
  geom_col() +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = "Are certain race/ethnic groups in Colorado disproportionately affected by COVID-19?",
       subtitle = "Reflects cumulative mortality rates calculated through August, 2021",
       x = "Race/ethnic group",
       y = "Number of deaths from COVID-19 per 100,000 people")
Let’s break down the code in the chunk above that created our bar graph.
- We take the dataframe (df_bar1) and feed it into the ggplot() function.
- Next, we need to map the aesthetics of the graph to our variables. First, we map the variable called race_eth to the x-axis, and we map the variable called deaths_per_100000 to the y-axis.
- Notice that after each layer, we add a + sign.
- The function geom_col() indicates that we want a bar graph with columns.
- For all of the graphics in this module, I will use a black and white theme for the background, therefore, I choose theme_bw(). To explore other themes, click here.
- Finally, we add labels – an overall title, a subtitle, and descriptions of the x and y axes. Notice that all of these labels must be in quotation marks.
This plot shows that the mortality rate is higher for individuals who are Black, Latino, or American Indian compared to White and Asian individuals. Particularly large disparities exist between American Indians and the other groups considered.
Alternative codings
This same graph could be produced with some alternative codings – and since you will come across different methods in resources used to help you develop your skills – let’s take a moment to consider some of them.
# list data frame as the first argument to ggplot
ggplot(data = df_bar1, mapping = aes(x = race_eth, y = deaths_per_100000)) +
geom_col() +
theme_bw() +
theme(legend.position = "none") +
labs(title = "Are certain race/ethnic groups in Colorado disproportionately affected by COVID-19?",
subtitle = "Reflects cumulative mortality rates calculated through August, 2021",
x = "Race/ethnic group",
y = "Number of deaths from COVID-19 per 100,000 people")
# start off the ggplot call with the data, then add aesthetics to the geom
ggplot(data = df_bar1) +
geom_col(mapping = aes(x = race_eth, y = deaths_per_100000)) +
theme_bw() +
theme(legend.position = "none") +
labs(title = "Are certain race/ethnic groups in Colorado disproportionately affected by COVID-19?",
subtitle = "Reflects cumulative mortality rates calculated through August, 2021",
x = "Race/ethnic group",
y = "Number of deaths from COVID-19 per 100,000 people")
# and, R is smart enough to know where the data come in and where the mappings come in, so you
# can omit those indications if you like
ggplot(df_bar1) +
geom_col(aes(x = race_eth, y = deaths_per_100000)) +
theme_bw() +
theme(legend.position = "none") +
labs(title = "Are certain race/ethnic groups in Colorado disproportionately affected by COVID-19?",
subtitle = "Reflects cumulative mortality rates calculated through August, 2021",
x = "Race/ethnic group",
y = "Number of deaths from COVID-19 per 100,000 people")
Enhance the plot
Our basic plot provides helpful and interesting information. However, we can do a few things to improve our basic plot:
1. Use geom_text() to add the actual number (of deaths_per_100000) to the top of each bar
2. Reorder the x-axis so the race/ethnic groups are sorted by death rate, using the reorder() and desc() functions
Compare the code in the chunk below with the prior code chunk to identify each of the additions.
df_bar1 %>%
  ggplot(mapping = aes(x = reorder(race_eth, desc(deaths_per_100000)),
                       y = deaths_per_100000)) +
  geom_col() +
  geom_text(aes(label = deaths_per_100000), vjust = 2, color = "white", fontface = "bold") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = "Are certain race/ethnic groups in Colorado disproportionately affected by COVID-19?",
       subtitle = "Reflects cumulative mortality rates calculated through August, 2021",
       x = "Race/ethnic group",
       y = "Number of deaths from COVID-19 per 100,000 people")
This graph presents the same general information, but is more visually interesting and may aid the reader in quickly digesting the material.
Bar graph, method 2 – geom_bar()
To illustrate the second method for creating a bar graph, let’s plot the number of COVID-19 deaths per month in Colorado during 2020. These data were pulled from the Colorado Department of Public Health and Environment website. I downloaded and saved the data as an R dataframe called covid_deaths_CO_2020.Rds; this dataframe is saved in the data folder of our project directory.
Let’s read in the dataframe.
df_bar2 <- here("data", "covid_deaths_CO_2020.Rds") %>%
  read_rds()
There are two variables in the dataframe:
casenum is the case number of the individual who passed away from COVID-19
month is the month of their passing
The glimpse() function will provide some information about the dataframe structure.
df_bar2 %>%
  glimpse()
## Rows: 5,041
## Columns: 2
## $ casenum <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ month <ord> March, March, March, March, March, March, March, March, March,…
We can also look at the first few rows of the dataframe (the argument n = 10 in the head() function below requests the first 10 rows).
df_bar2 %>%
  head(n = 10)
casenum | month |
---|---|
1 | March |
2 | March |
3 | March |
4 | March |
5 | March |
6 | March |
7 | March |
8 | March |
9 | March |
10 | March |
There are 5,041 rows in the dataframe – each corresponding to an individual death from COVID-19. The glimpse() output shows that casenum is an integer (a whole-number numeric variable) and month is an ordered factor (ordered because months have a natural ordering, i.e., March = 3, April = 4).
The dataframe consists of one row per deceased individual and is indexed by month. By adding up the number of rows with March listed for month, we arrive at the total number of people who died of COVID-19 in March of 2020; we can do the same for each month of the year. The geom_bar() function performs this addition and then plots the result.
The code to create a plot using geom_bar() is very similar to geom_col(); however, we don’t need to specify a variable for the y-axis because geom_bar() counts the rows in each group and places that count on the y-axis.
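To see the counts that geom_bar() computes behind the scenes, you can tally the rows yourself with dplyr’s count(). The tiny dataframe below is a made-up stand-in for df_bar2 (toy month labels, not the real death records):

```r
library(tidyverse)

# A toy stand-in for df_bar2: one row per case, with a grouping variable
toy <- tibble(month = c("March", "March", "March", "April", "April"))

# count() tallies the rows in each group -- the same totals geom_bar() plots
toy %>%
  count(month)
```

Running this shows 2 rows for April and 3 for March – exactly the bar heights geom_bar() would draw.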
Create a basic plot
df_bar2 %>%
  ggplot(aes(x = month)) +
  geom_bar() +
  theme_bw() +
  labs(title = "During which months in 2020 did most Coloradoans die of COVID-19?",
       x = "Month",
       y = "Number of deaths")
This plot shows that there was a spike of deaths in April, toward the beginning of the pandemic. The situation then improved through the late spring and summer. However, in November and December, Colorado saw the greatest numbers of deaths.
Enhance the plot
We can do a few things to enhance our basic plot:
1. Add color as an argument to the geom_bar() function to color the bars
2. Use geom_text() to add the actual number of deaths to the top of each bar. In this instance we need the stat argument to compute the count first, since the count doesn’t exist as a variable in the dataframe.
df_bar2 %>%
  ggplot(aes(x = month)) +
  geom_bar(fill = "deepskyblue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 1.5,
            color = "black", fontface = "bold") +
  theme_bw() +
  labs(title = "Number of deaths from COVID-19 by month in Colorado during 2020",
       x = "Month",
       y = "Number of deaths")
As you see in this example, with a few tweaks, we can arrive at a more visually appealing and informative graphic.
Histograms and density plots
A histogram is similar to a bar plot, but is used for continuous variables rather than categorical groupings (like race/ethnicity and month in our bar plot examples). A histogram bins the data into pre-specified intervals and then counts the number of cases in each bin.
To illustrate, we will consider the cumulative death rate from COVID-19 across all counties in the US. The New York Times compiles various data sets related to the COVID-19 pandemic and makes them available through their GitHub page – you can peruse it here. On August 11, 2021, I pulled the cumulative counts and took the most recent day’s count.
In order to create a death rate, we need the total population of each county. Therefore, I pulled this information from the US Census data repository, using data from the 2019 census.
I then merged the two dataframes together and computed the death rate per 100,000 people for each county (calculated as the number of deaths in the county divided by the total population of the county – I then multiplied this proportion by 100,000). Therefore, the computed variable, called covid_death_rate, represents each county’s cumulative number of deaths from COVID-19 as of August 11, 2021 per 100,000 residents.
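The rate computation can be sketched directly in R. The numbers below are the deaths and 2019 population for one county in the dataset (Autauga County, Alabama, the first row of the dataframe):

```r
deaths  <- 114    # cumulative COVID-19 deaths in Autauga County, AL
pop2019 <- 55869  # 2019 population of the county

# deaths divided by population, scaled to a rate per 100,000 residents
covid_death_rate <- (deaths / pop2019) * 100000
round(covid_death_rate, 1)  # 204
```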
The data are stored in a file called county_covid_svi.Rds in the data folder of our project directory. You can read more about calculating various types of rates common in public health at this CDC webpage.
First, let’s import the data for the histogram.
df_hist <- here("data", "county_covid_svi.Rds") %>%
  read_rds()
There are twelve variables in the dataframe:
fips is the county fips ID, a federal ID that uniquely identifies each county
state is the state name
county is the county name
pop2019 is the number of people living in the county in 2019 based on census records
cases is the cumulative number of COVID-19 cases in the county as of August 11, 2021
covid_case_rate is computed as (cases \(\div\) pop2019) \(\times\) 100,000
deaths is the cumulative number of COVID-19 deaths in the county as of August 11, 2021
covid_death_rate is computed as (deaths \(\div\) pop2019) \(\times\) 100,000
svi is the CDC’s social vulnerability index
svi_quartile provides the quartiles of svi, ordered from least vulnerable to most vulnerable
ph_spending_2019 is the per person expenditure (USD) on public health in 2019 in the corresponding state
ph_quartile provides the quartiles of ph_spending_2019, ordered from lowest spending to highest spending
df_hist %>%
  glimpse()
## Rows: 3,131
## Columns: 12
## $ fips <chr> "01001", "01003", "01005", "01007", "01009", "01011",…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "B…
## $ pop2019 <dbl> 55869, 223234, 24686, 22394, 57826, 10101, 19448, 113…
## $ cases <dbl> 7854, 28020, 2681, 3056, 7658, 1291, 2534, 15995, 418…
## $ covid_case_rate <dbl> 14057.885, 12551.851, 10860.407, 13646.512, 13243.178…
## $ deaths <dbl> 114, 338, 64, 67, 140, 41, 72, 336, 125, 48, 117, 25,…
## $ covid_death_rate <dbl> 204.0488, 151.4106, 259.2563, 299.1873, 242.1056, 405…
## $ svi <dbl> 0.4354, 0.2162, 0.9959, 0.6003, 0.4242, 0.8898, 0.865…
## $ svi_quartile <fct> 2nd Q, 1st Q, 4th Q, 3rd Q, 2nd Q, 4th Q, 4th Q, 4th …
## $ ph_spending_2019 <dbl> 54.25, 54.25, 54.25, 54.25, 54.25, 54.25, 54.25, 54.2…
## $ ph_quartile <fct> 3rd Q, 3rd Q, 3rd Q, 3rd Q, 3rd Q, 3rd Q, 3rd Q, 3rd …
# the skim() function comes from the skimr package
library(skimr)

df_hist %>%
  skim()
Name | Piped data |
Number of rows | 3131 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 3 |
factor | 2 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
fips | 0 | 1 | 5 | 5 | 0 | 3131 | 0 |
state | 0 | 1 | 4 | 14 | 0 | 50 | 0 |
county | 0 | 1 | 3 | 33 | 0 | 1842 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
svi_quartile | 0 | 1 | FALSE | 4 | 1st: 786, 2nd: 782, 4th: 782, 3rd: 781 |
ph_quartile | 0 | 1 | FALSE | 4 | 1st: 1157, 3rd: 857, 2nd: 670, 4th: 447 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
pop2019 | 0 | 1 | 101933.21 | 326660.65 | 86.00 | 10911.50 | 25652.00 | 67647.00 | 10039107.00 | ▇▁▁▁▁ |
cases | 0 | 1 | 11123.50 | 39398.62 | 1.00 | 1180.50 | 2860.00 | 7306.00 | 1336029.00 | ▇▁▁▁▁ |
covid_case_rate | 0 | 1 | 11086.38 | 3150.59 | 943.38 | 9246.83 | 11184.76 | 12910.62 | 45218.67 | ▃▇▁▁▁ |
deaths | 0 | 1 | 184.06 | 679.99 | 0.00 | 21.00 | 53.00 | 125.00 | 24833.00 | ▇▁▁▁▁ |
covid_death_rate | 0 | 1 | 211.83 | 116.33 | 0.00 | 131.94 | 197.63 | 273.82 | 865.80 | ▇▇▂▁▁ |
svi | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
ph_spending_2019 | 0 | 1 | 35.44 | 23.71 | 7.13 | 17.16 | 30.21 | 48.23 | 139.60 | ▇▆▁▁▁ |
Histograms
Create a basic plot
We will create a histogram of the COVID-19 death rate across the 3,131 US counties represented in the dataframe. By choosing a binwidth of 10, we tell ggplot2 to create bins 10 units wide (10 deaths per 100,000 people). Each rectangle corresponds to an interval of this width (e.g., between 10 and 20, or between 100 and 110). Typically, the analyst tries a few different binwidths – plotting each candidate and picking the one that best represents the data (not so many bars that the plot looks sparse and noisy, but not so few that important information is disguised).
df_hist %>%
  ggplot(aes(x = covid_death_rate)) +
  geom_histogram(binwidth = 10) +
  theme_bw() +
  labs(title = "Do US counties vary in the COVID-19 death rate?",
       subtitle = "Cumulative death rate per 100,000 people through August 11, 2021",
       x = "COVID-19 cumulative death rate per 100,000 people",
       y = "Number of counties in bin")
The histogram shows quite a lot of variability in the COVID-19 death rate across counties. The largest concentration of counties is around 200 (i.e., 200 deaths per 100,000 people), but there is a long tail extending to the right (referred to as right-skew or positive-skew). This illustrates that a small number of counties had extremely high death rates.
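Right-skew also shows up numerically: the long upper tail pulls the mean above the median. A quick sketch with simulated right-skewed data (not the county data themselves):

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 200)  # right-skewed draws with a mean near 200

mean(x) > median(x)  # TRUE: the upper tail drags the mean above the median
```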
By opening the dataframe and scrolling to find Larimer County, we see that its death rate was about 70. In mid-August 2021, our county was at the very low end of the distribution.
Enhance the plot
To enhance this plot, let’s try out the facet_wrap() function – which creates a separate histogram for each group. The faceting (i.e., grouping) variable is listed after a tilde (~), and the argument ncol = 1 indicates that we want just one column of graphs. Try changing this to 2 in RStudio Cloud to see what happens.
We’ll group/facet our graph by state. While we could plot all of the states, it would be difficult to view so many histograms at once – so instead let’s select a few states. I will use the filter() function to do this – we’ll learn more about filtering in Module 3 (data wrangling).
df_hist %>%
  filter(state == "Colorado" | state == "Florida" | state == "Louisiana" | state == "Texas" |
         state == "California" | state == "Pennsylvania") %>%
  ggplot(aes(x = covid_death_rate)) +
geom_histogram(fill = "skyblue", binwidth = 10) +
facet_wrap(~state, ncol = 1) +
theme_bw() +
labs(title = "Do US counties vary in their COVID-19 death rate both within and between states?",
subtitle = "Cumulative death rate per 100,000 people through August 11, 2021",
x = "COVID-19 cumulative death rate per 100,000 people",
y = "Number of counties in bin",
fill = "state")
This plot shows some interesting information. Within each state, we see substantial variability in the death rates across counties, indicating that within each state, some counties have much higher death rates than other counties. Texas has an interestingly wide range of scores.
Density plot
A density plot presents the probability density of a variable by drawing a continuous curve, rather than binning and counting (as in the histograms we just created). The density curve is scaled so that the area under the curve equals one. Density plots have an advantage over histograms in determining the distributional shape because they don’t depend on the number of bins selected.
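You can verify the area-under-the-curve property numerically with base R’s density() function, approximating the integral as the sum of the curve heights times the (uniform) grid spacing:

```r
set.seed(1)
x <- rnorm(500, mean = 200, sd = 50)  # simulated values on a death-rate-like scale

d <- density(x)                    # kernel density estimate on a uniform grid
area <- sum(d$y) * diff(d$x[1:2])  # Riemann-sum approximation of the area
area                               # very close to 1
```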
Create a basic plot
Let’s present the same information that we presented in our first histogram above, but as a density plot.
df_hist %>%
  ggplot(aes(x = covid_death_rate)) +
  geom_density(fill = "gray") +
  theme_bw() +
  labs(title = "Do US counties vary in their COVID-19 death rate?",
       subtitle = "Cumulative death rate per 100,000 people through August 11, 2021",
       x = "COVID-19 cumulative death rate per 100,000 people",
       y = "Density")
The density plot shows that the highest density of cases is found between about 125 and 225 deaths per 100,000 people.
Enhance the plot
To enhance this plot, let’s overlay multiple density plots to compare the distribution across states. That is, rather than faceting (creating a separate plot for each state as we did in our enhanced histogram), we will create a single plot with a separate density curve for each specified group. Because the distributions overlap, we need to make the fill color transparent so we can see them all clearly. The alpha argument in the geom_density() function does this; an alpha of .5 corresponds to 50% opacity.
df_hist %>%
  filter(state == "Colorado" | state == "Florida" | state == "Louisiana" | state == "Texas" |
         state == "California" | state == "Pennsylvania") %>%
  ggplot(aes(x = covid_death_rate)) +
geom_density(aes(group = state, fill = state), alpha = .5) +
theme_bw() +
labs(title = "Do US counties vary in their COVID-19 death rate both within and between states?",
subtitle = "Cumulative death rate per 100,000 people through August 11, 2021",
x = "COVID-19 cumulative death rate per 100,000 people",
y = "Density",
fill = "state")
The overlaid curves help us to see that Colorado and California have the largest share of counties at the low end of the range. There are pros and cons to faceting versus grouping multiple curves on a single plot. As an analyst, you get to choose which type of graph best represents the data. Try a few specifications and choose the one that best conveys the information, is easiest to grasp, and is visually appealing.
Box plots
Like histograms and density plots, a box plot also presents the distribution of a variable, but does so by displaying the five number summary:
- minimum,
- first quartile (Q1, or the 25% mark),
- median (50th percentile),
- third quartile (Q3, or the 75% mark),
- maximum.
A box plot shows quite a few statistics of interest. The interquartile range, abbreviated IQR, is the length of the box (i.e., IQR = Q3 – Q1). The whiskers extend to the most extreme values, but no further than 1.5 times the IQR above Q3 or 1.5 times the IQR below Q1. An outlier is a value that falls beyond the whiskers, and each one is represented by an individual point.
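These pieces can be computed directly. The sketch below uses base R’s quantile() and IQR() with made-up numbers to find the box edges and the default whisker limits (quantile() is a close stand-in for the hinge computation geom_boxplot() actually uses):

```r
x <- c(1, 2, 4, 5, 6, 7, 8, 9, 10, 30)  # 30 looks like an outlier

q1  <- quantile(x, 0.25)  # lower edge of the box
q3  <- quantile(x, 0.75)  # upper edge of the box
iqr <- IQR(x)             # length of the box: q3 - q1

# Whiskers extend no further than 1.5 * IQR beyond the box
upper_limit <- q3 + 1.5 * iqr
lower_limit <- q1 - 1.5 * iqr

x[x > upper_limit | x < lower_limit]  # values plotted as outlier points: 30
```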
To illustrate a box plot, let’s consider how the COVID-19 death rate across the US counties differs as a function of quartiles of the Social Vulnerability Index (SVI). But first, let’s learn a bit more about the SVI, an index created by the Centers for Disease Control and Prevention.
As described by the authors of the index:
CDC’s SVI uses US Census data to determine the social vulnerability of every census tract. Census tracts are subdivisions of counties for which the Census collects statistical data. The SVI ranks each tract on 15 social factors, including poverty, lack of vehicle access, and crowded housing, and groups them into four related themes. Each tract receives a separate ranking for each of the four themes, as well as an overall ranking. The SVI can help public health officials and local planners better prepare for and respond to emergency events like hurricanes, disease outbreaks, or exposure to dangerous chemicals. Click here for more information.
The SVI can be aggregated up to the county level; that’s the version we will use in this module. We’ll use the overall ranking (as described in the box above). This overall ranking for each county is called svi in our dataframe; a higher score indicates more vulnerability. To facilitate its use with a box plot, we’ll use a categorized version of the svi, called svi_quartile in our dataframe.
To create this variable, I divided the scores into four parts, or quarters, of similar size. The quartiles are ordered by their degree of vulnerability: the 1st Q is the lowest quartile (the least vulnerable group), the 2nd Q is the second quartile, the 3rd Q is the third quartile, and the 4th Q is the most vulnerable group.
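A categorized variable like this can be built with dplyr’s ntile() function, which splits a ranked variable into groups of roughly equal size. The sketch below shows the general idea with toy data; the labels and approach are assumptions for illustration, not the original code used to create svi_quartile:

```r
library(dplyr)

# Toy data standing in for the county-level SVI scores
df <- tibble(svi = runif(100))

# ntile() assigns each case to one of four equal-sized groups
df <- df %>%
  mutate(svi_quartile = factor(ntile(svi, 4),
                               labels = c("1st Q", "2nd Q", "3rd Q", "4th Q")))

count(df, svi_quartile)  # four groups of similar size
```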
Let’s read in the data frame; it’s the same dataframe used in the histograms section of this module.
df_box <- county_covid_svi
Create a basic plot
df_box %>%
  ggplot(aes(x = svi_quartile, y = covid_death_rate)) +
geom_boxplot() +
theme_bw() +
labs(title = "Does the COVID-19 death rate differ by county social vulnerability?",
subtitle = "Cumulative death rate per 100,000 people through August 11, 2021",
x = "Quartile for the Social Vulnerability Index",
y = "COVID-19 cumulative death rate per 100,000 people")
The box plot shows that the median COVID-19 death rate is higher for counties with greater social vulnerability. This is particularly obvious when you compare the 1st quartile to the 4th quartile.
Enhance the plot
To enhance this plot, let’s add an additional grouping variable. We will explore the corresponding state’s investment in public health, as measured by public health spending per person in 2019. We will determine if the COVID-19 death rate can be further differentiated by considering both social vulnerability and public health spending. We can add a second grouping variable by mapping it to an aesthetic in the geom_boxplot() function.
Examination of public health spending is of interest. Numerous agencies and individuals have been sounding the alarm over the underfunding of public health agencies in the US. This underfunding can have drastic consequences in times of catastrophe, and we witnessed this devastation with our own eyes during the COVID-19 pandemic. A recent report by the Trust for America’s Health describes the state of underfunding in our public health systems, explains the consequences of failing to invest in public health, and outlines the steps that we need to take to be better prepared in the future. To be sure, these issues will be a primary topic of study for years to come as we emerge from the COVID-19 pandemic.
To explore the potential role of public health investments, I compiled the per person expenditure on public health in each state in 2019, prior to the pandemic. These data are publicly available through the State Health Access Data Assistance Center (SHADAC). The compiled variable is called ph_spending in our dataframe, and the graph below displays the data.
Similar to the SVI quartiles considered earlier, I created ordered quartiles of public health spending. Quartile 1 (1st Q) represents the group with the lowest public health spending per capita (i.e., per person), and Quartile 4 (4th Q) represents the group with the highest public health spending per capita. The graph below maps these quartiles onto each state.
Let’s now create a box plot that presents the COVID-19 death rate for each county by SVI quartile and public health spending quartile.
df_box %>%
  ggplot(aes(x = svi_quartile, y = covid_death_rate)) +
geom_boxplot(aes(fill = ph_quartile)) +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Distribution of the COVID-19 death rate across US counties",
subtitle = "The intersection of county-level social vulnerability and state-level public health spending",
x = "Quartile of the Social Vulnerability Index",
y = "COVID-19 death rate (per 100,000) as of August 11, 2021",
fill = "Quartile of Public Health Spending in 2019")
In this plot, we can still see the trend for a greater death rate as social vulnerability increases. An interesting pattern emerges in the second through fourth quartiles of the SVI. In these quartiles, the death rate is lower if the state is in the upper quartile for public health spending. This pattern is not observed for the lowest quartile of social vulnerability (i.e., counties where people are not highly vulnerable to disaster).
Line plot
Line plots can be useful for examining change over some continuous variable (often change over time). For example, a line plot could be used to track the height of a child from ages 0 to 18. To illustrate line plots, we will consider excess mortality during each week of the COVID-19 pandemic. We’ll use data compiled by Our World in Data (https://ourworldindata.org/excess-mortality-covid). As described by the authors:
Excess mortality is a term used in epidemiology and public health that refers to the number of deaths from all causes during a crisis above and beyond what we would have expected to see under ‘normal’ conditions. In this case, we’re interested in how deaths during the COVID-19 pandemic compare to the average number of deaths over the same period in previous years. Excess mortality is a more comprehensive measure of the total impact of the pandemic on deaths than the confirmed COVID-19 death count alone. In addition to confirmed deaths, excess mortality captures COVID-19 deaths that were not correctly diagnosed and reported as well as deaths from other causes that are attributable to the overall crisis conditions.
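In our dataframe, excess mortality for a given week is simply the gap between observed and expected deaths. A minimal sketch of that calculation, using three Austrian rows from the table shown later in this section (the excess_deaths and pct_above_expected column names are my own, not variables in the dataframe):

```r
library(dplyr)

# Example rows taken from the Austrian data
example <- tibble(
  start_date      = as.Date(c("2020-03-09", "2020-03-16", "2020-03-23")),
  deaths          = c(1718, 1836, 1765),
  expected_deaths = c(1717, 1678, 1640)
)

# Excess deaths = observed deaths minus the 2015-2019 average
example <- example %>%
  mutate(excess_deaths = deaths - expected_deaths,
         pct_above_expected = round(100 * excess_deaths / expected_deaths, 1))

example
```

For the week of March 16, 2020, Austria saw 158 more deaths than expected, about 9.4% above the historical baseline.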
I pulled these data from the New York Times data repository on excess deaths due to COVID-19. They are stored in a datafile called covid_excess_mortality.Rds in the data folder of our project directory.
Let’s begin by importing the data.
df_line <- here("data", "covid_excess_mortality.Rds") %>%
  read_rds()
There are four variables in the dataframe:
- country
- start_date is the date corresponding to the comparison week
- deaths tallies deaths during the corresponding week in 2020
- expected_deaths is the average deaths during corresponding weeks for 2015 to 2019
Let’s explore the dataframe a bit with glimpse(), skim(), and head().
df_line %>% glimpse()
## Rows: 1,159
## Columns: 4
## $ country <chr> "Austria", "Austria", "Austria", "Austria", "Austria",…
## $ start_date <date> 2020-01-06, 2020-01-13, 2020-01-20, 2020-01-27, 2020-…
## $ deaths <dbl> 1702, 1797, 1779, 1947, 1681, 1721, 1718, 1768, 1744, …
## $ expected_deaths <dbl> 1806, 1819, 1831, 1837, 1837, 1829, 1812, 1786, 1753, …
df_line %>% skim()
Name | Piped data |
Number of rows | 1159 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 5 | 14 | 0 | 24 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
start_date | 0 | 1 | 2020-01-04 | 2020-12-23 | 2020-06-22 | 198 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
deaths | 0 | 1 | 8585.10 | 13643.71 | 473 | 1440 | 2397 | 10767.5 | 79058 | ▇▁▁▁▁ |
expected_deaths | 0 | 1 | 7550.29 | 11853.09 | 548 | 1370 | 2236 | 10395.0 | 60821 | ▇▁▁▁▁ |
df_line %>% head(n = 20)
country | start_date | deaths | expected_deaths |
---|---|---|---|
Austria | 2020-01-06 | 1702 | 1806 |
Austria | 2020-01-13 | 1797 | 1819 |
Austria | 2020-01-20 | 1779 | 1831 |
Austria | 2020-01-27 | 1947 | 1837 |
Austria | 2020-02-03 | 1681 | 1837 |
Austria | 2020-02-10 | 1721 | 1829 |
Austria | 2020-02-17 | 1718 | 1812 |
Austria | 2020-02-24 | 1768 | 1786 |
Austria | 2020-03-02 | 1744 | 1753 |
Austria | 2020-03-09 | 1718 | 1717 |
Austria | 2020-03-16 | 1836 | 1678 |
Austria | 2020-03-23 | 1765 | 1640 |
Austria | 2020-03-30 | 1828 | 1603 |
Austria | 2020-04-06 | 1793 | 1570 |
Austria | 2020-04-13 | 1704 | 1539 |
Austria | 2020-04-20 | 1588 | 1512 |
Austria | 2020-04-27 | 1480 | 1487 |
Austria | 2020-05-04 | 1520 | 1467 |
Austria | 2020-05-11 | 1489 | 1450 |
Austria | 2020-05-18 | 1481 | 1437 |
Create a basic plot
Let’s begin by viewing excess deaths in the US. We’ll create two lines: one for the actual deaths (colored red) and one for expected deaths based on the prior five years (colored gray, with a dashed line type). The lwd argument sets the line width; play around with it in RStudio Cloud to see how the look of the graph changes as you make it larger (e.g., lwd = 3). The scale_x_date() function lets you control how dates are displayed on the x-axis.
df_line %>%
  filter(country == "United States") %>%
ggplot(aes(x = start_date)) +
geom_line(aes(y = deaths), color = "red", lwd = 1) +
geom_line(aes(y = expected_deaths), color = "gray", linetype = "dashed", lwd = 1) +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Were there more deaths than expected in 2020?",
subtitle = "red line = observed deaths, gray line = expected deaths",
y = "number of deaths",
x = "",
caption = "Note: Expected deaths are average deaths for corresponding weeks in 2015-2019")
We can clearly see that the number of deaths for all months from March onward is substantially higher than what would have been expected based on data from the previous five years.
Enhance the plot
To enhance the plot, let’s consider multiple countries using the facet_wrap() function. I will select three countries with a similar population size since the faceted plots share a common range for the x-axis and y-axis.
x <- df_line %>%
  filter(country == "Germany" | country == "France" | country == "Spain") %>%
ggplot(aes(x = start_date)) +
geom_line(aes(y = deaths), color = "red", lwd = 1) +
geom_line(aes(y = expected_deaths), color = "gray", linetype = "dashed", lwd = 1) +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Were there more deaths than expected in 2020?",
subtitle = "red line = observed deaths, gray line = expected deaths",
y = "number of deaths",
x = "",
caption = "Note: Expected deaths are average deaths for corresponding weeks in 2015 to 2019") +
facet_wrap(~country, ncol = 1)
These plots are useful for comparing excess deaths across countries, as well as the timing of the spread and consequences of the virus over the course of the year.
Scatter plot
A scatter plot is a useful graph for examining the relationship between two continuous variables. Each point on the plot represents a case, positioned by that case’s scores on the two variables designated as x and y. Let’s create the MMWR scatter plot we examined at the very start of this unit.
The dataframe includes six variables:
- country
- country_code
- population_2020 is the population size in 2020
- mort_cumulative_june30 is the number of deaths from COVID-19 per 100,000 people by June 30, 2020
- date_death_threshold is the date that the country first reached a defined threshold of daily mortality from COVID-19 (mortality threshold). The selected threshold was a daily rate of 0.02 new COVID-19 deaths per 100,000 population (based on a 7-day moving average). The mortality threshold is used to identify a common epidemiologic point early in the pandemic in each country, aligning countries by the progression of their epidemic rather than by calendar date.
- stringency_index_death_threshold is a composite index based on nine mitigation policies (cancellation of public events, school closures, gathering restrictions, workplace closures, border closures, internal movement restrictions, public transport closure, recommendations to stay at home, and stay-at-home orders). The index ranges from 0 to 100. For each country, the value of the index was extracted on the date that the country first reached the mortality threshold.
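To make the mortality-threshold idea concrete, here is a hedged sketch of how such a threshold date could be found from daily data. The toy data, variable names, and approach are assumptions for illustration; this is not the MMWR authors’ code:

```r
library(dplyr)

# Toy daily death counts for one country with a population of 10 million
daily <- tibble(
  date       = as.Date("2020-03-01") + 0:13,
  new_deaths = c(0, 0, 0, 1, 1, 2, 3, 5, 8, 12, 15, 20, 26, 30)
)

daily <- daily %>%
  mutate(
    rate_per_100k = new_deaths / 10e6 * 1e5,
    # Trailing 7-day moving average of the daily death rate
    # (stats::filter, not dplyr::filter)
    ma7 = as.numeric(stats::filter(rate_per_100k, rep(1/7, 7), sides = 1))
  )

# First date the smoothed rate reaches 0.02 deaths per 100,000
threshold_date <- daily$date[which(daily$ma7 >= 0.02)[1]]
threshold_date
```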
First, we need to import the data.
df_scatter <- here("data", "covid_mmwr_interventions.Rds") %>%
  read_rds()
Let’s explore the dataframe a bit with glimpse(), skim(), and head().
df_scatter %>% glimpse()
## Rows: 37
## Columns: 6
## $ country <chr> "Albania", "Austria", "Belarus", "Bel…
## $ country_code <chr> "ALB", "AUT", "BLR", "BEL", "BIH", "B…
## $ population_2020 <dbl> 3074579, 8859449, 9477918, 11720716, …
## $ mort_cumulative_june30 <dbl> 2.016536, 7.935031, 4.083175, 82.1878…
## $ date_death_threshold <date> 2020-03-24, 2020-03-20, 2020-04-08, …
## $ stringency_index_death_threshold <dbl> 84.26, 81.48, 18.52, 23.15, 89.81, 71…
df_scatter %>% skim()
Name | Piped data |
Number of rows | 37 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
Date | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 5 | 22 | 0 | 37 | 0 |
country_code | 0 | 1 | 3 | 3 | 0 | 37 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date_death_threshold | 0 | 1 | 2020-03-02 | 2020-04-18 | 2020-03-23 | 22 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
population_2020 | 0 | 1 | 18519973.43 | 24246187.31 | 350734.00 | 3835586.00 | 8403994.00 | 17280397.00 | 82017514.00 | ▇▁▁▁▁ |
mort_cumulative_june30 | 0 | 1 | 16.37 | 21.29 | 0.51 | 3.20 | 5.89 | 17.51 | 82.19 | ▇▁▁▁▁ |
stringency_index_death_threshold | 0 | 1 | 64.91 | 22.87 | 16.67 | 50.93 | 72.22 | 81.48 | 100.00 | ▂▂▂▇▃ |
df_scatter %>% head(n = 20)
country | country_code | population_2020 | mort_cumulative_june30 | date_death_threshold | stringency_index_death_threshold |
---|---|---|---|---|---|
Albania | ALB | 3074579 | 2.016536 | 2020-03-24 | 84.26 |
Austria | AUT | 8859449 | 7.935031 | 2020-03-20 | 81.48 |
Belarus | BLR | 9477918 | 4.083175 | 2020-04-08 | 18.52 |
Belgium | BEL | 11720716 | 82.187812 | 2020-03-13 | 23.15 |
Bosnia and Herzegovina | BIH | 3835586 | 4.771109 | 2020-03-27 | 89.81 |
Bulgaria | BGR | 6966899 | 3.200850 | 2020-04-01 | 71.30 |
Croatia | HRV | 4227746 | 2.530899 | 2020-03-27 | 96.30 |
Cyprus | CYP | 1266676 | 1.499989 | 2020-03-22 | 51.85 |
Czechia | CZE | 10702498 | 3.251577 | 2020-03-27 | 82.41 |
Denmark | DNK | 5869410 | 10.307680 | 2020-03-18 | 72.22 |
Estonia | EST | 1228624 | 5.127688 | 2020-03-27 | 72.22 |
Finland | FIN | 5571665 | 5.886930 | 2020-03-26 | 64.81 |
France | FRA | 67848156 | 43.818435 | 2020-03-13 | 43.98 |
Germany | DEU | 80159662 | 11.193910 | 2020-03-21 | 68.06 |
Greece | GRC | 10607051 | 1.800689 | 2020-03-22 | 74.07 |
Hungary | HUN | 9771827 | 5.986598 | 2020-03-31 | 76.85 |
Iceland | ISL | 350734 | 2.851164 | 2020-03-17 | 50.93 |
Ireland | IRL | 5176569 | 33.516408 | 2020-03-24 | 48.15 |
Italy | ITA | 62402659 | 55.677115 | 2020-03-02 | 69.91 |
Latvia | LVA | 1881232 | 1.594700 | 2020-04-10 | 69.44 |
Create a basic plot
Let’s examine the relationship between the number of mitigation strategies employed by the country and the cumulative death rate through June 30, 2020.
df_scatter %>%
  ggplot(mapping = aes(x = stringency_index_death_threshold, y = mort_cumulative_june30)) +
geom_point(aes(size = population_2020)) +
theme_bw() +
labs(title = "Early policy mitigation and cumulative mortality from COVID-19 in Europe",
x = "Oxford Stringency Index when mortality threshold was reached",
y = "Cumulative mortality through June 30, 2020",
size = "Population",
colour = "")
We see a negative association: the more policies a country adopted, the lower its death rate tended to be.
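One way to quantify this association is the Pearson correlation between the two variables. This quick check is my own addition, not part of the MMWR analysis; the sketch below uses a handful of rows from the table above so it runs on its own:

```r
# A few example rows from the table above (Albania, Austria, Belarus,
# Belgium, Croatia, France)
example <- data.frame(
  stringency = c(84.26, 81.48, 18.52, 23.15, 96.30, 43.98),
  mortality  = c(2.016536, 7.935031, 4.083175, 82.187812, 2.530899, 43.818435)
)

cor(example$stringency, example$mortality)  # negative correlation

# With the full dataframe:
# df_scatter %>%
#   summarize(r = cor(stringency_index_death_threshold, mort_cumulative_june30))
```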
Enhance the plot
Let’s enhance the graph to reproduce the plot from the MMWR publication. This is a very full plot with many additional aesthetics and much more information. Though this isn’t a plot I expect you to be able to make on your own right now, with time and practice you can create graphics of this quality and comprehensiveness.
library(ggrepel)
library(scales)
df_scatter %>%
  ggplot() +
stat_smooth(aes(x = stringency_index_death_threshold,
y = mort_cumulative_june30,
color = "Linear Fit\n& 95% CI"),
formula = y ~ x, method = "lm", se = TRUE, level = 0.95, fill = "grey70") +
geom_point(aes(x = stringency_index_death_threshold,
y = mort_cumulative_june30,
size = population_2020,
fill = date_death_threshold), shape = 21, color = "grey50") +
geom_label_repel(aes(x = stringency_index_death_threshold,
y = mort_cumulative_june30,
label = country_code),
color = "grey35", fill = "white", size = 2, box.padding = 0.4,
label.padding = 0.1) +
scale_size_continuous(range = c(3,10), labels = comma) +
scale_fill_gradientn(colors = c("#d7191c","#ffffbf","#2b83ba"),
name = "Date mortality\nthreshold was\nreached", trans = "date") +
scale_color_manual(name = NULL, values = "grey40") +
guides(fill_gradientn = guide_legend(order = 1), size = guide_legend(order = 2)) +
theme_classic() +
coord_cartesian(ylim = c(0,100), expand = TRUE) +
labs(title = "Early policy mitigation and cumulative mortality from COVID-19 in Europe",
x = "Oxford Stringency Index when mortality threshold was reached",
y = "Cumulative mortality through June 30, 2020",
size = "Population")
A taste of what’s possible
Hopefully the interesting graphics that we created in this unit have you excited about what’s possible with R and ggplot2. Good data visualization should be the first step you take when working with data. Indeed, it’s one of the most important aspects of data analysis.
In this last section, I’d like to show you some more advanced options. You aren’t required to apply these techniques for this course, but I hope you are intrigued by the possibilities.
Advanced option 1 – scrape data from the web
In this world of big data – there are many interesting and useful data sources available on the web. R and the tidyverse provide easy techniques for pulling in data from these types of sources. Let’s take a look at an example.
The COVID Tracking Project was one of the key organizations compiling data on the spread of the virus and related measures. Throughout the pandemic they made their data available directly from their website. Reading in the data is as simple as one line of code. Many websites provide csv files of their data, and you can readily pull the data into RStudio.
df <- read_csv("https://covidtracking.com/data/download/national-history.csv")
df %>%
  glimpse()
## Rows: 420
## Columns: 17
## $ date <date> 2021-03-07, 2021-03-06, 2021-03-05, 2021-03-…
## $ death <dbl> 515151, 514309, 512629, 510408, 508665, 50621…
## $ deathIncrease <dbl> 842, 1680, 2221, 1743, 2449, 1728, 1241, 1051…
## $ inIcuCumulative <dbl> 45475, 45453, 45373, 45293, 45214, 45084, 449…
## $ inIcuCurrently <dbl> 8134, 8409, 8634, 8970, 9359, 9465, 9595, 980…
## $ hospitalizedIncrease <dbl> 726, 503, 2781, 1530, 2172, 1871, 1024, 879, …
## $ hospitalizedCurrently <dbl> 40199, 41401, 42541, 44172, 45462, 46388, 467…
## $ hospitalizedCumulative <dbl> 776361, 775635, 775132, 772351, 770821, 76864…
## $ negative <dbl> 74582825, 74450990, 74307155, 74035238, 73857…
## $ negativeIncrease <dbl> 131835, 143835, 271917, 177957, 267001, 25577…
## $ onVentilatorCumulative <dbl> 4281, 4280, 4275, 4267, 4260, 4257, 4252, 425…
## $ onVentilatorCurrently <dbl> 2802, 2811, 2889, 2973, 3094, 3169, 3171, 324…
## $ positive <dbl> 28756489, 28714654, 28654639, 28585852, 28520…
## $ positiveIncrease <dbl> 41835, 60015, 68787, 65487, 66836, 54248, 480…
## $ states <dbl> 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 5…
## $ totalTestResults <dbl> 363825123, 362655064, 361224072, 359479655, 3…
## $ totalTestResultsIncrease <dbl> 1170059, 1430992, 1744417, 1590984, 1406795, …
Advanced option 2 – make use of packages
One of the key benefits of mastering R is that you get to benefit from the incredible scientists all over the world who create open source packages to collate, manipulate, visualize, and analyze data. As one example, let’s take a look at the tidycovid19 package.
As described by the developer:
The key objective of the {tidycovid19} package is to provide transparent access to various data sources at the country-day level, including data on governmental interventions and on behavioral response of the public. It does not contain any data per se. Instead, it provides functions to pull data from publicly available authoritative sources. The sources and the data are documented by additional data frames included in the package.
One of the most interesting applications is displayed below. The package pulls the most recent COVID-19 data, and creates the cumulative case graph that you have surely seen many times since the pandemic began.
library(tidycovid19)
merged <- download_merged_data(cached = TRUE, silent = TRUE)
plot_covid19_spread(merged, highlight = "USA", edate_cutoff = 600)
Advanced option 3 – consider geography-based plots
Through a variety of extension packages, R has a truly amazing capability to display geography-related data on maps. To see one example of this, let’s continue with the tidycovid19 package and plot COVID-19 spread in Europe. Again, the package pulls the most recent data available and plots it based on your specifications.
map_covid19(merged, type = "confirmed", region = "Europe")