PSY350, Module 1
Introduction to the course and getting started with R
Module road map
Learning objectives
- Access and utilize the RStudio Cloud for computing
- Use RStudio to create a basic R Notebook
- Use several R functions to accomplish a specified task
- Write code to import a data file and produce basic descriptive statistics
- Render the R Notebook to produce a well-organized and reproducible html report
- Use resources to get help with R programming when needed
- Identify the ways in which data science can help social scientists produce insight
Readings
- Read Chapter 1 of your Textbook
Data Science for Psychologists
Welcome to Module 1 of PSY350!
In this course, you are going to learn the fundamentals of data science. Data science is an interdisciplinary field that aims to produce insight from data. Our focus will be on using data science for social good – that is, to help illuminate, understand, and solve the problems in our society that are related to health inequity and social injustice. This will take several forms.
First, we’ll describe, visualize, and analyze data sets that concern health and well-being in our society. We’ll identify areas where inequities and injustices among racial, ethnic, geographic, socioeconomic, and other groups exist. As described by the Centers for Disease Control and Prevention, the first step to mitigating disparities is to “shine a bright light on the problems to be solved.” Only when inequities and injustices are identified and understood can the barriers to health equity and social justice be removed.
Second, we’ll study and practice the data science techniques that social and behavioral scientists use to identify key ingredients needed to develop programs, practices, and policies to promote health and well-being for all. Effective initiatives aren’t born from good ideas alone; they’re rooted in theory and empirical work. Developing strong skills to use data for decision making and action is critical.
Third, we’ll examine and apply analytic models to determine if initiatives designed to promote health and well-being are effective. Once initiatives to promote health and prosperity are implemented, we must carefully study them to determine if they work. If they are effective, we also need to understand for whom they work. By conducting these types of evaluation studies, we allow for continual improvement of our strategies, we ensure that we are spending time and resources on the most effective initiatives, and we grow the evidence-base for research and practice.
All three of these data science applications are critical tools for social and behavioral scientists working to promote health and well-being. Many students in the social and behavioral sciences don’t see a role for their skill set in data science. However, social scientists are in an ideal position to make substantial contributions to the field of data science.
Data science has been described as the intersection of statistics, technology, and substantive expertise, and is depicted in the Venn diagram below.
In order for data to be used effectively, one needs to have a good foundation in statistics, to be proficient in using modern tools to work with data, and to understand the substantive problems that need to be solved.
In your Psychology training – you are developing your substantive training and domain knowledge. In this course, you are going to have the chance to build your skill set in all three areas – statistics, technology, and domain expertise in using psychological principles to improve health for all.
When designing this course, I spent a lot of time reflecting on my own academic training. My early opportunities to use statistics, data science, and mathematical thinking to solve real-world problems were transformative – and ultimately led me to pursue a career in social science. I designed this course so that you could have these same types of experiences and hopefully develop a love for applying statistics to work toward solving the wicked problems in our society.
Please watch this video introduction to the course.
Introduction to Statistics
Throughout this course you will watch short video segments from the Crash Course series on Statistics. These are fun and informative videos that introduce the fundamental topics of statistics and, more broadly, data science. These videos will serve as an introduction to what you’ll learn in the PSY 350 modules.
To begin, please watch the following two videos that introduce the field of statistics.
Mapping out our journey
We’ll follow the journey to developing your statistics and technology skills using the map defined in our online textbook – Modern Dive. Instead of focusing only on the generic datasets that usually accompany statistics textbooks, we will focus on data sets that elucidate how data science is used for social good.
Introduction to the tools
R and R Studio
We will use R for data science in this course. R is a powerful language, computing, and graphics environment. It is made up of a base software program and thousands of packages that can be added to extend R’s capabilities. We will use R through a program called RStudio. RStudio is an integrated development environment (IDE) for R. R and RStudio are both available for Windows and Mac, and both are open source and free. While you can install both R and RStudio on your computer, in this course we will instead use an RStudio server called the RStudio Cloud, which runs R and RStudio remotely. I chose this method for PSY350 because it eliminates the need for you to install the software on your computer and provides an efficient way to begin using the programs.
Before we begin digging into the RStudio project files for PSY350, please watch the following three videos that provide an overview of RStudio and R Markdown Notebooks (the type of files that we will use in this course).
Accessing the RStudio Cloud
To begin, please create an RStudio Cloud account at this website. There is a free option that you can use, which includes 15 hours of server time per month. Depending on your workflow, this may be enough. If it’s not, at any time you can add an additional 50 hours per month for $5 per month. You will pay RStudio Cloud directly for this service.
Importing PSY350 materials
Once you have an RStudio Cloud account set up, please copy the PSY350Live R project from my RStudio Cloud account to your account. Please follow these directions to accomplish this task:
- Click on this link to go to the PSY350Live Project on my RStudio account.
- Clicking on the project should launch RStudio Cloud and open the PSY350Live project in your account. You should see that the project is a TEMPORARY COPY (there will be a red warning at the top of your window).
- YOU MUST now save your own permanent copy. Click on “Save a Permanent Copy” at the top of the window.
- Now, click back on Your Workspace (on the top left side of the window). You should now see that PSY350Live is a project owned by you – and listed under Your Projects. See the screen shot below for an example.
- It’s very important that you perform Steps 1 through 4 above ONLY ONCE. Do not keep downloading the PSY350Live project from my account – from here on out, make sure that you are always working in your own PSY350Live project. Your name should appear in the upper right-hand corner of the screen anytime you log into RStudio Cloud.
Next, I’d like for you to change the global settings for your project so that we’re all working under the same settings. THIS IS VERY IMPORTANT! Enter the PSY350Live project by clicking on it. Please click on Tools and then choose Global Options. This will bring up a dialog box (shown below).
In the dialog box, please do the following:
- Uncheck all boxes that are checked except the very last one (i.e., “Wrap around when navigating to previous/next tab”).
- Under Workspace, change “Save workspace to .RData on exit” to Never.
- When done, click Apply and then OK.
Your dialog box should look like this:
Take a screen shot of the Global Options dialog box. Here are some directions for taking a screen shot on a Mac and on a Windows machine. You will upload this to Canvas as part of your first Apply and Practice Assignment; directions are provided on Canvas (see Application #1 under Apply and Practice Exercises in Assignments).
Now that we’ve logged on to the RStudio Cloud, opened up our RStudio project, and configured our project – the fun can begin! Let’s begin working in our session.
Explore the PSY350Live Project
Organization of project files
When you first open your project, you are going to see a screen that looks like this:
There are many features and components of RStudio. We’ll start exploring these slowly, and by the end of the semester you will have a good handle on the most important tools for conducting data science with RStudio.
Notice the bottom right section of the window titled Files. If it’s not already selected, click on project in the breadcrumbs, so that the address is Cloud > project. Here, all of the files that are part of your project are listed. Data science requires excellent organization, and organizing files is a big part of this process. I have created three folders to organize the materials. The first is data. Inside this folder are the data files needed for the project. The second is documentation. Inside this folder are all of the documents that are needed to describe the project. For a research study, this might include descriptions of the study design or important papers that describe study measures. For our class project, this includes additional materials that you will need while using RStudio for this course. The third folder is programs. Inside this folder are the R Markdown files that you can use to study and practice using the code chunks in the Modules for this course.
R Markdown Notebooks
In this course, we will use R Markdown files to conduct our analyses. Specifically, we’ll use R Markdown Notebooks. We’ll use these notebooks to wrangle, plot, and analyze data in this course. Once our R Markdown Notebook is complete, we will be able to knit it to a beautiful and reproducible report that displays all of our work. For example, I created the html document that you’re reading now using an R Markdown Notebook.
Let’s open the module_1.Rmd R Markdown file. Be sure that you are in your personal PSY350Live project, then follow the steps in the video below.
Elements of an R Markdown Notebook
Let’s break down the elements of the code in the module_1.Rmd file in our PSY350Live project.
YAML metadata
The very top part of the R Markdown file (between the two sets of three dashes) is called the YAML metadata. All R Markdown files will begin with this section. This code provides the directions for how the R Markdown file should be set up. This section MUST start and end with the three dashes.
Section header
Take a look at the next section below the YAML metadata which starts with a single hashtag and then a section header title called Load libraries
This section displays the three main elements of an R Markdown Notebook. The line that begins with a hashtag is called a section header. These headers work like an indented outline that you’d create for a Microsoft Word report.
One hashtag: # section header title – denotes a first level header,
Two hashtags: ## section header title – denotes a second level header,
and so on…
It’s smart to label each section with informative section header titles; this will help you find sections easily when your notebook becomes very long. And these headers can be used to build a table of contents (like the scrolling table of contents that appears on the left side of the document you are reading now).
Text
Just below the load packages section header you will find plain text: “In this section…”
When you write plain text in the white part of the R Markdown file, this will appear in your outputted report (also called a rendered report or a knitted report) as plain text. This is a good place to explain what you’re doing in each section, or to provide interpretations of your output.
Code chunk
Just after the text, there is a gray area.
This is called a code chunk. We write our R code to wrangle, visualize and analyze data inside these code chunks. In this first code chunk, we are simply loading a few packages that we will use in this session.
R functions are stored in packages. When a package is loaded into a session, its contents are available for use. Packages add to the capability of R by enhancing existing base R functionality and, in many cases, adding brand new capabilities. To use a package you must first install it. This is done using the install.packages() function – I have already installed all of the packages we will need for PSY 350, so this step is not necessary for students. Once a package is installed on your system, when you want to use it in a particular session, you call it into the session by typing library(package_name), where package_name is replaced with the package you want to use.
This is a good spot to introduce R functions. A function is a set of code used to carry out a specified task. A function has a name, is followed by parentheses, and inside the parentheses are any needed arguments for the function. Arguments are the directions that tell the function what to do. Sometimes no arguments are needed because the function does one task by default; sometimes the arguments are very simple, and sometimes they are more complex. We’ll start with simple arguments. For example, in the code chunk above, library() is a function. Its purpose is to carry out the task of loading a package into memory for use in a session. There is just one argument to this function: which package to load.
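As a quick illustration, here are a few base R functions (always available, no package needed) that show how arguments change what a function does:

```r
# round() has a default behavior: with one argument it rounds to a whole number
round(3.14159)              # 3

# a second, named argument (digits) changes the task
round(3.14159, digits = 2)  # 3.14

# sqrt() needs just one simple argument
sqrt(16)                    # 4
```

Try typing these into the Console pane of RStudio to see the results for yourself.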
The first library function loads the skimr package, which we will use to obtain descriptive statistics.
The second library function loads the here package, which we will use to tell R where to look for files (e.g., the directory where our data are found).
The third library function loads the tidyverse package. The tidyverse is actually an umbrella package for a set of core packages that we will use to wrangle, visualize and analyze data (e.g., dplyr for data wrangling, ggplot2 for visualization, etc.).
Once we click the green arrow to run this code section, all of these packages will be loaded and ready for us to use their functions. As an example, skimr is a package that we will use to describe data. We can load it into the session by running the code library(skimr). Once loaded, we can use its functions. For instance, skim is a function that provides basic descriptive statistics for the variables in a dataframe. So, if we wanted to get descriptive statistics for all of the variables in a dataset called mtcars, we could write the following code:
library(skimr)
skim(mtcars)
For the vast majority of our work, we will have a code chunk at the top of our R Notebook that loads the packages that we will use in the session. Once that code chunk is run, we can use the associated functions throughout our Notebook.
I do want to make you aware, however, of another method for using a function from a package, as you will see this technique from time to time. Rather than loading a package in the session, you can use the double colon operator – for a package called pkg and a function called fnct, one writes pkg::fnct(). For example, to use the skim function in the skimr package, we could equivalently write:
skimr::skim(mtcars)
This is useful if you’re only going to use the function once or twice in a session, but it becomes tedious to use the double colon operator over and over. Therefore, for commonly used packages, it’s more efficient to load them into the session using the library() function. That way, you can set it and forget it.
In this course, at first we’ll rely on code chunks that I have created. Soon, however, you will create your own code chunks. To create a new code chunk, put your cursor where you want the code chunk to go, then click on the green C button (with a plus sign) toward the top section of the RStudio window, choose R (the first choice). To execute (i.e., run or submit) the code in the code chunk, press the green side arrow at the top of the code chunk. The video below demonstrates this.
A quick intro to data analysis in R
Now that we have a sense of the elements of an R Markdown Notebook, let’s see how we add to our file to perform data analysis. You can follow along with each of these sections in your own module_1.Rmd file. Before we begin, I want to emphasize that R is case sensitive. Moreover, all of the code that we write needs to be precise, otherwise R won’t understand our instructions.
Import data
First let’s import a data file. The code chunk under the section header labeled # Import data in your module_1.Rmd Notebook imports a data file called covid_mmwr_interventions.Rds (a file with a .Rds extension is an R data frame – i.e., a data set or a data file). Let’s break down this code:
df <- here("data", "covid_mmwr_interventions.Rds") %>%
  read_rds()
First, <- in R is called an assignment operator, it assigns an object to a name. In this case, the object is our dataframe, and the name is df. You could assign any name you like. For example, by changing
df <-
to
my_data <-
you would call your dataframe my_data instead of df.
Notice that there are two functions being used in the remainder of code: here() and read_rds().
here() is the function to tell R where to find the data.
read_rds() is the function needed to import a .Rds dataframe.
Let’s start with here(). The arguments to here are the directory where the data exist (data in our example), and the name of the dataframe in the directory (covid_mmwr_interventions.Rds in our example). here() establishes the root directory as your project folder. Inside this project (i.e., your PSY350Live project on the RStudio Cloud), there is a folder called data – inside that data folder is the covid_mmwr_interventions.Rds data set. Therefore, the here() function is telling R to find the covid_mmwr_interventions.Rds data file in the data folder.
The here() function is followed by a strange looking set of characters: %>%. This is called a pipe operator. Essentially, it takes the set of instructions on the left side of the operator and feeds it to the set of instructions on the right side of the operator. When you see the pipe operator, think “and then.” Always include the pipe operator at the end of the first set of instructions, then return to a new line (RStudio will automatically indent for you), and then include the next set of instructions.
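As a small sketch of the idea (shown here with base R’s built-in `|>` pipe, which behaves like `%>%` for simple cases, and with made-up numbers rather than our covid data), piping replaces nested, inside-out code with a left-to-right sequence of steps:

```r
x <- c(4, 9, 16)

# nested: read from the inside out
round(mean(sqrt(x)), digits = 1)

# piped: read left to right as "take x, and then sqrt, and then mean, and then round"
x |>
  sqrt() |>
  mean() |>
  round(digits = 1)
```

Both versions produce the same result; the piped version is simply easier to read as a recipe of steps.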
The next set of instructions after here() is read_rds(). read_rds() simply tells R to read in the covid_mmwr_interventions.Rds data file as a Rds data file type. That’s the native data file type for R.
In summary, this set of code defines an object named df. It first identifies the location of the dataframe to be imported (using here()) and then reads in the Rds dataframe using read_rds().
Once you run this code chunk, in the upper right section of your RStudio screen, under the Environment tab, you will see an object called df. Click on it and it will open. It’s a dataframe that includes all of the variables and their values. Take a look at the variables (columns) and entries – what do you think the variables represent? After viewing, click the x beside the df tab to close it.
Examine the structure of the dataframe
Now that we have our dataframe imported into our session, we can use it. The next code chunk in your R Markdown file is labeled with the section header # Get some information about the variables
df %>%
  glimpse()
## Rows: 37
## Columns: 6
## $ country <chr> "Albania", "Austria", "Belarus", "Bel…
## $ country_code <chr> "ALB", "AUT", "BLR", "BEL", "BIH", "B…
## $ population_2020 <dbl> 3074579, 8859449, 9477918, 11720716, …
## $ mort_cumulative_june30 <dbl> 2.016536, 7.935031, 4.083175, 82.1878…
## $ date_death_threshold <date> 2020-03-24, 2020-03-20, 2020-04-08, …
## $ stringency_index_death_threshold <dbl> 84.26, 81.48, 18.52, 23.15, 89.81, 71…
The glimpse() function is used to describe the variable types in the dataframe and take a peek at the first few rows. It’s a simple function that requires no additional arguments. You can read the pipeline as saying, “take the df dataframe and then feed it to the glimpse() function to take a look at the data.”
We’ll use the covid_mmwr_interventions.Rds data in Module 2 and will learn about each variable then. But, for now, notice that glimpse() tells us that there are 37 rows of data (these correspond to 37 European countries), and 6 columns (i.e., 6 variables). Two of the variables, country and country_code, are character variables (labeled <chr>, i.e., a string of letters, numbers, and/or symbols), and three of the variables are doubles (labeled <dbl>, i.e., a numeric variable that can have places after the decimal point). One of the variables, date_death_threshold, is a date. This is a good illustration of the different types of variables common in data science projects. We’ll work with a wide variety of data types in this course, including quantitative or numeric data, qualitative or character data, factors (a special type of qualitative data), and dates.
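To make these types concrete, here is a small, made-up sketch (the vectors below are illustrative, not part of our covid dataframe) showing how R labels each type:

```r
# character: a string of letters, numbers, and/or symbols
country <- c("Albania", "Austria")
class(country)    # "character"

# double: a numeric value that can have places after the decimal point
mortality <- c(2.02, 7.94)
class(mortality)  # "numeric"

# date: a calendar date
threshold <- as.Date(c("2020-03-24", "2020-03-20"))
class(threshold)  # "Date"

# factor: a special type for categorical (qualitative) data
group <- factor(c("low", "high", "low"))
class(group)      # "factor"
```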
Descriptive statistics
The next code chunk in your R Markdown file starts with: # Produce some descriptive statistics
df %>%
  skim()
| | |
|:---|:---|
| Name | Piped data |
| Number of rows | 37 |
| Number of columns | 6 |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 3 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:---|---:|---:|---:|---:|---:|---:|---:|
| country | 0 | 1 | 5 | 22 | 0 | 37 | 0 |
| country_code | 0 | 1 | 3 | 3 | 0 | 37 | 0 |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|:---|---:|---:|:---|:---|:---|---:|
| date_death_threshold | 0 | 1 | 2020-03-02 | 2020-04-18 | 2020-03-23 | 22 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| population_2020 | 0 | 1 | 18519973.43 | 24246187.31 | 350734.00 | 3835586.00 | 8403994.00 | 17280397.00 | 82017514.00 | ▇▁▁▁▁ |
| mort_cumulative_june30 | 0 | 1 | 16.37 | 21.29 | 0.51 | 3.20 | 5.89 | 17.51 | 82.19 | ▇▁▁▁▁ |
| stringency_index_death_threshold | 0 | 1 | 64.91 | 22.87 | 16.67 | 50.93 | 72.22 | 81.48 | 100.00 | ▂▂▂▇▃ |
In this code chunk, we use the skim() function to request descriptive statistics for all of the variables in the dataframe. Read the pipeline as saying, “take the df dataframe and then feed it to the skim() function.” This function gives us quite a bit more information than glimpse(). Carefully read through all of the available output.
The skim() function tells us how many cases (i.e., rows – which represent countries in this example) have missing data (n_missing). There are no missing data in this data file. For each variable, it also gives the minimum and maximum values. For the character variables, these are the minimum and maximum number of characters in the values. For the numeric variables, these are the minimum and maximum scores observed in the dataframe. Also, for the numeric variables we get the mean, standard deviation, quartiles, and a mini histogram. Note that the final table is very wide; hover your cursor over the table and scroll to the right to see the full output. We’ll learn more about these summaries later in the course.
Create a plot
The last section has a section header labeled: # Create a simple plot. In this code chunk a simple scatter plot using the ggplot() function is created. We’ll learn all about this function and how to create wonderful graphics in Module 2. For now, just notice that the ggplot() function has many more arguments than the others that we’ve seen – these arguments allow us to define how we want the plot to look.
df %>%
  ggplot(mapping = aes(x = stringency_index_death_threshold, y = mort_cumulative_june30)) +
geom_point(aes(size = population_2020)) +
labs(title = "Early policy mitigation and cumulative mortality from COVID-19 in Europe",
x = "Oxford Stringency Index when mortality threshold was reached",
y = "Cumulative mortality through June 30, 2020",
size = "Population")
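If you’d like to experiment before Module 2, here is a minimal scatter plot with the same structure, built on R’s built-in mtcars dataset so it runs without any project files (this assumes the ggplot2 package, part of the tidyverse, is loaded):

```r
library(ggplot2)

# same recipe as above: map variables to the axes, add a point layer, label the plot
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp)) +
  labs(title = "Car weight and fuel efficiency",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       size = "Horsepower")

p  # printing the object draws the plot
```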
Knit your notebook
Once we’ve completed our R Markdown Notebook, we can render a report. Before doing so, please add your own name to the author: section of the YAML metadata. Click the Knit icon at the top of the RStudio window, and choose knit to html. This will create your report, which will automatically be saved in the same folder where the .Rmd file is saved. Once knit, you can export the .html output file of your results from RStudio Cloud to your computer. To do this, navigate to the .html file in the Files tab (lower right quadrant of your session). Once you find the module_1.html file, check the box to its left, click on the More button (the blue gear), and choose Export. This will export the .html file from RStudio Cloud to your computer. You can then drag it to your desktop or put it into a folder of your choosing on your hard drive. If you do this multiple times, note that your computer might start indexing the names, e.g., module_1.html (1). It’s bad practice to leave names like this. Once you drag the file to your desktop, rename it to remove the space and (1), so that it’s called module_1.html.
In general, avoid naming files, or variables in files, with names that contain spaces. For example, don’t give any type of data file (e.g., a .Rds or Excel file) or Notebook file a name with spaces in it! Use mydata.Rds or my_data.xlsx – do not use my data.xlsx. Likewise, do not put spaces in variable names – notice that in the example dataframe (i.e., data set or data file) we looked at in this unit, none of the variable names have spaces (e.g., country, country_code).
One of the things I love about R Markdown is that you can be creative and customize the look of your reports if you desire. There are many possibilities to try. If you’re interested, click on the icon that looks like a gear beside the Knit icon, then choose Output Options. See how the report changes when you click to add/subtract a table of contents. Or, apply a different syntax highlighting style or a different theme. Re-knit your R Markdown file and see how it looks.
The video below walks through the steps covered in this module. Please complete these steps and then upload your knitted html report to Canvas as part of your first Apply and Practice Exercise (Application #1).
Alternative coding with the pipe
In this course, I will be consistent with feeding a dataframe into a pipe. But there is an alternative way that the code can be written. I’d like to show you this alternative as you will see it from time to time in your textbook and on DataCamp. In the code chunk below I show two ways of using glimpse(). First, is the initial way that I showed you, the second is an alternative way in which the dataframe is the first argument of the glimpse() function.
# method 1
df %>%
  glimpse()
## Rows: 37
## Columns: 6
## $ country <chr> "Albania", "Austria", "Belarus", "Bel…
## $ country_code <chr> "ALB", "AUT", "BLR", "BEL", "BIH", "B…
## $ population_2020 <dbl> 3074579, 8859449, 9477918, 11720716, …
## $ mort_cumulative_june30 <dbl> 2.016536, 7.935031, 4.083175, 82.1878…
## $ date_death_threshold <date> 2020-03-24, 2020-03-20, 2020-04-08, …
## $ stringency_index_death_threshold <dbl> 84.26, 81.48, 18.52, 23.15, 89.81, 71…
# method 2
glimpse(df)
## Rows: 37
## Columns: 6
## $ country <chr> "Albania", "Austria", "Belarus", "Bel…
## $ country_code <chr> "ALB", "AUT", "BLR", "BEL", "BIH", "B…
## $ population_2020 <dbl> 3074579, 8859449, 9477918, 11720716, …
## $ mort_cumulative_june30 <dbl> 2.016536, 7.935031, 4.083175, 82.1878…
## $ date_death_threshold <date> 2020-03-24, 2020-03-20, 2020-04-08, …
## $ stringency_index_death_threshold <dbl> 84.26, 81.48, 18.52, 23.15, 89.81, 71…
Most functions that we’ll explore in this course (e.g., glimpse(), skim(), ggplot(), etc.) can work with either approach.
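For example, with a base R function and the built-in mtcars dataset (shown with base R’s built-in `|>` pipe so the example is self-contained), both styles give identical results:

```r
# method 1: pipe the data in
result1 <- mtcars$mpg |> mean()

# method 2: data as the first argument
result2 <- mean(mtcars$mpg)

identical(result1, result2)  # TRUE
```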
I’d also like to point out another convenient feature of R code chunks. Notice in the code chunk above that I have a hashtag and then a comment (i.e., # method 1). When you use a single hashtag inside a code chunk, the text following the hashtag serves as a comment and will not be evaluated by R. In this way, you can write helpful notes to yourself about the code you have included.
Mathematical thinking
Throughout this course you will develop your mathematical thinking. To close out the learning objectives of this module, please watch the Crash Course video below that introduces mathematical thinking.
Resources
Learning R takes time. In large part, it’s about being conscientious and thoughtful. And practicing a bit every day (or most days at least) is the key to success. There are many wonderful resources to help you on your journey. Learning how to tap into these resources is a very important first step. Here are a few of my favorites:
Search for error messages (and how to fix them) on stack overflow
DataCamp: besides the assigned courses, there are dozens of other R courses – check them out, you can take any of them for free this semester.
Andy Field’s Intro to R and RStudio includes great short videos and fun practice activities.
When logged into the RStudio Cloud, click on the three horizontal lines beside Your Workspace. A menu should appear along the left side of the window. Under the header Learn, you will see several resources. One is called Primers. These are short tutorials on a variety of topics that we will cover this semester. These are great activities to solidify your understanding and skills. A second is called Cheatsheets. These are one page resources for a variety of different packages – they quickly and succinctly describe the key arguments of the functions in the package.
That concludes Module 1. Congratulations for completing your first module for PSY350. I hope that you have learned a bit about R and RStudio and are excited about what is to come in the remainder of the course!