PSY350, Module 4

Introduction to Probability, Probability Distributions, and some Descriptive Statistics

Module road map

Learning objectives

  1. Calculate probabilities of events
  2. Describe the properties of a binomial distribution and a normal distribution
  3. Explain the empirical rule
  4. Contrast the qnorm() and pnorm() functions in R for evaluating the probability density function
  5. Use qnorm() and pnorm() to describe normal distributions
  6. Describe standardized scores
  7. Contrast the normal distribution and the standard normal distribution

Overview

In the first three modules of this course, you grew your skills for working with data – getting your feet wet with R and RStudio, wrangling data, and plotting data. We’ve also explored some basic techniques to summarize data. In this module, we will start to go a bit beyond describing data – and begin to focus more on interpreting data.

Taking the jump from simply summarizing data to using data to infer something about the world we live in requires thinking about probability. Let’s consider an example. The Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. They conduct public opinion polling, demographic research, content analysis and other data-driven social science research. One of their primary goals is to understand how the public feels about critical issues in our society. Their scientists continuously draw random samples of US residents and poll them on a variety of topics. For example, a few years ago they created the graph below – which shows the distribution of scores on a measure of political ideology (a scale ranging from very liberal to very conservative) across Democrats and Republicans. Random samples of people over time were drawn, polled, and analyzed to examine the distribution of scores on political ideology in the US, and potential changes over time. Based on the survey results, a growing divide between the two groups over time is apparent.

When researchers draw a sample of individuals, poll them about an important phenomenon, and present the results – the intention is that the data are telling them something about not just the sample (e.g., the samples of individuals polled for a study) but the population of adults from which these samples were drawn. If the study is done properly, then what we learn from a sample of individuals can help us to understand what is happening in the larger population – and that is an extremely useful and powerful paradigm. And, it’s a paradigm that we as social science researchers rely on heavily to figure out where problems in society exist, how we might approach solving them, and whether programs, practices and policies that we implement are effective.

Beginning with Module 8 – we’ll really dig into these ideas, that is, the part of statistics called inferential statistics. For now, I want us to start simply. In this module, we will explore the ways in which we can consider the distribution of a variable (e.g., political ideology of US adults, or infant birth weights) – and formulate ideas about how probable events might be based on the observed distribution. For example, what is the probability that a newborn baby will weigh less than 5 pounds 8 oz (the threshold for classifying low birth weight infants)?

Notice the word “probability” in the last sentence – that is the focus of this module. We must have some foundation in probability theory in order to progress to understand inferential statistics. This module is all about growing that foundation. You don’t need to be an expert in probability – but it’s important for us to have a general understanding of the basic rules of probability and some intuition about how probability distributions can be used.

We’ll study two different types of probability distributions in this module – the binomial distribution and the normal distribution. We studied variable distributions a bit in Module 2 – for example, recall that several of the graphs we created were designed to display the distribution of a variable – e.g., histograms and density plots. Let’s start off by refreshing our memory of the distributions of variables by watching this video on the shapes of variable distributions from Crash Course: Statistics.



The role of probability in data science

In order to make use of these distributions – and what they might tell us about the phenomenon at play, we need to learn a bit more about the basic tenets of probability. To do so, please watch the following two Crash Course: Statistics videos on probability.




Types of distributions

The binomial distribution

Now, we’re ready to begin learning about two specific types of probability distributions. Let’s start with a distribution appropriate for certain kinds of binary variables – for example, we might be interested in the probability of contracting COVID-19 on a particular day or the probability of getting into a car accident on the way to work. In both of these examples, there are two, discrete outcomes – either we contract COVID-19 or we don’t; or either we get into a car crash on the way to work or we don’t. Please watch the following Crash Course: Statistics video to learn a bit about binomial variables and the binomial distribution.
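As a small preview of how this plays out in R, the sketch below shows how dbinom() and pbinom() compute binomial probabilities. The accident probability and number of commutes are made-up numbers, used only for illustration.

# Purely illustrative: suppose the chance of a car accident on any single commute is 1 in 1,000,
# and commutes are independent. The number of accidents across 250 commutes is then binomial.
dbinom(x = 0, size = 250, prob = .001)                       # probability of zero accidents, ~.78
pbinom(q = 0, size = 250, prob = .001, lower.tail = FALSE)   # probability of at least one accident, ~.22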



Let’s apply what you learned about probability in the videos so far. Let’s see how probability works in the context of COVID-19.

Imagine two statistics from a Local County Health Department:

  • One from March 2021: “A vaccinated person has a 4% chance of getting COVID-19.”

  • A second from August 2021: “A vaccinated person has a 10% chance of getting COVID-19.”

Do these two statistics necessarily mean that the vaccine became less effective from March to August of 2021?

Let’s use what we’ve learned about probability to consider this question. In doing so, let’s consider a couple scenarios.

Scenario 1

In this scenario, an unvaccinated person has a 10% chance of getting COVID-19. And, the vaccine is 60% effective at preventing infection.

Knowing this information, we can then calculate that a vaccinated person in this scenario has a 4% chance of getting COVID-19. We do this by multiplying the probability of contracting COVID-19 for the unvaccinated (.10) by (1-.6), or .4, which represents the 60% reduction produced by the vaccine – i.e., because the vaccine is 60% effective, a vaccinated person has 40% of the chance of contracting COVID-19 compared to an unvaccinated person. Thus, \(.10 \times(1-.6) = .04\). The probability of contracting COVID-19 for a vaccinated person in this scenario is .04 – that is, a 4% chance.

Now, let’s consider a second scenario.

Scenario 2

In this scenario, an unvaccinated person has a 25% chance of getting COVID-19 (i.e., there is much more virus spreading in the community). And, the vaccine is 60% effective at preventing infection. Notice this is the same effectiveness as the first scenario.

Knowing this information, we can then calculate that a vaccinated person in this scenario has a 10% chance of getting COVID-19. Using the same mathematics as before: \(.25 \times(1-.6) = .10\).

Notice that the vaccine effectiveness in Scenario 1 and 2 is identical (60% effectiveness in both cases) – however, because the rate of COVID-19 infection is higher among the unvaccinated (because more virus is present in the community), then the rate of COVID-19 infection will also be higher among the vaccinated even though the vaccine is equally effective.
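To make the arithmetic in the two scenarios concrete, here is a minimal R sketch. The helper function below is purely illustrative (it is not part of the module’s code); it simply multiplies the unvaccinated infection probability by one minus the vaccine effectiveness.

# Illustrative helper: infection probability for a vaccinated person, given the
# unvaccinated infection probability and the vaccine effectiveness
p_infect_vaccinated <- function(p_unvax, effectiveness) {
  p_unvax * (1 - effectiveness)
}

p_infect_vaccinated(p_unvax = .10, effectiveness = .60)  # Scenario 1: 0.04, or 4%
p_infect_vaccinated(p_unvax = .25, effectiveness = .60)  # Scenario 2: 0.10, or 10%

The effectiveness is the same in both calls; only the base rate of infection among the unvaccinated changes.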

This phenomenon is called the Base Rate Fallacy. The vaccine effectiveness didn’t change but the chance of infection for vaccinated people increased because the overall chance of infection increased.


Let’s build on this example to consider another COVID-19 related example where our ability to comprehend and compute probabilities is critically important.

Recall that in the summer of 2021, there were many stories in the news stating that in areas where vaccination rates were high, most of the people contracting COVID-19 (and being hospitalized) were vaccinated – and this prompted some to conclude that the vaccine wasn’t working. Let’s consider a few scenarios to explore why calculating and presenting the percent of COVID-19 cases who are vaccinated is misleading. And, then we’ll learn how to present the data in a way that is not misleading.

Scenario 1

Imagine we are epidemiologists studying a city of 1,000,000 people. In this city there is a 50% vaccination rate (50% of people are not vaccinated, 50% are vaccinated).

Across the scenarios we will consider in this section, we’ll hold several variables constant:

If a person is unvaccinated, then there is:

  • a 10% chance of being infected

  • a 10% chance of severe illness if infected

In addition, we’ll hold constant the vaccine effectiveness – we’ll assume the vaccine is 60% effective against infection, and the vaccine is 80% effective against severe illness if infected.

Knowing the probability of infection and serious illness for the unvaccinated, and the effectiveness of the vaccine against infection and against serious illness, we can calculate that:

If a person is vaccinated, then there is:

  • a 4% chance of being infected: \(.10 \times(1-.6) = .04\) or 4%.

  • a 2% chance of severe illness if infected: \(.10 \times(1-.8) = .02\) or 2%.

Using this information, we can calculate the number of unvaccinated people who get infected and get severe illness (the left boxes in the figure) and the number of vaccinated people who get infected and get severe illness (the right boxes in the figure). Then, we can determine the percent of severe illness cases who are vaccinated.

Working through the math – you can see that in Scenario 1 – 7.4% of people with severe illness were vaccinated.

Now, let’s consider a second scenario. We’ll leave all parameters from Scenario 1 the same – but we’ll change one important thing – the percent of the population vaccinated. While 50% of the population was vaccinated in Scenario 1, in Scenario 2 only 25% of the population is vaccinated. Notice how the calculated numbers of people change.

Working through the math, you see that in this scenario the vaccine is equally effective against contracting COVID-19 and against getting seriously ill if infected, but because the vaccination rate is substantially lower, the percent of severe illness cases who were vaccinated is much smaller (2.6% of people with severe illness were vaccinated).

Let’s consider one last scenario. Again, we’ll hold all parameters constant – but change the percent of the population who is vaccinated. Here, we’ll consider a scenario where 90% of the population is vaccinated.

Working through the math, you see that in this scenario, where the vast majority of people are vaccinated, the percent of severe illness cases who were vaccinated is much, much larger (42% of people with severe illness were vaccinated). This is so even though the effectiveness of the vaccine was held constant in all three scenarios.

To gain intuition about what is happening, we must recognize that the probability of being vaccinated given severe illness is not the same as the probability of severe illness given vaccination. The former is deceiving because, as demonstrated in this activity, this value changes depending on what proportion of the population is vaccinated. It’s the latter that provides helpful information. Consider the image below, where, for each of the three scenarios considered, in the top box I present the misleading statistic that we just calculated (the percent of severe illness cases who were vaccinated) and in the bottom box I present the percent of vaccinated people who got severe illness AND the percent of unvaccinated people who got severe illness.

When calculated the correct way – irrespective of the percent of the population who is vaccinated, we find that the vaccine effectiveness against severe illness is the same – that is, unvaccinated people were 12.5 times more likely to suffer severe illness from COVID-19 than vaccinated people.
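If you would like to check these numbers yourself, here is a minimal R sketch of the three scenarios. The function and its argument names are hypothetical (they do not appear elsewhere in the module); it simply multiplies out the counts in each branch of the figures above and returns the proportion of severe illness cases who were vaccinated.

# Hypothetical helper that works through the counts for one scenario
prop_severe_vaccinated <- function(pct_vaccinated, pop = 1e6,
                                   p_infect_unvax = .10, p_severe_unvax = .10,
                                   ve_infect = .60, ve_severe = .80) {
  n_vax   <- pop * pct_vaccinated
  n_unvax <- pop - n_vax
  # severe illness cases among the unvaccinated
  severe_unvax <- n_unvax * p_infect_unvax * p_severe_unvax
  # severe illness cases among the vaccinated (both probabilities reduced by the vaccine)
  severe_vax <- n_vax * (p_infect_unvax * (1 - ve_infect)) * (p_severe_unvax * (1 - ve_severe))
  severe_vax / (severe_vax + severe_unvax)
}

prop_severe_vaccinated(.50)  # Scenario 1: ~.074, or 7.4% of severe cases vaccinated
prop_severe_vaccinated(.25)  # Scenario 2: ~.026, or 2.6%
prop_severe_vaccinated(.90)  # Scenario 3: ~.419, or about 42%

# The informative comparison: risk of severe illness by vaccination status
(.10 * .10) / ((.10 * (1 - .60)) * (.10 * (1 - .80)))  # unvaccinated risk is 12.5 times the vaccinated risk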

The take home message is that probabilities can be quite confusing to our human minds, and what seems intuitive often leads us astray. Sometimes this confusion can have dire consequences – as in the case of individuals’ ability to comprehend the risks associated with not being vaccinated in a pandemic. These confusions even crop up among highly knowledgeable people. For example, in April of 2022, the Washington Post ran a story that included some misleading statistics stemming from the base rate fallacy.


Though they also ran a graph alongside that calculated the estimates in a way that does account for the base rate fallacy.


The normal distribution

One of the most commonly used probability distributions in Statistics, and one that we will rely on heavily in this course, is the normal distribution. While the binomial distribution that you just learned about is appropriate for a binary variable with two discrete outcomes, a normal distribution is appropriate for a continuous variable (e.g., the birth weight of newborn babies, or the Area Disadvantage Index of counties in the US). Please watch the following Crash Course: Statistics video on the normal distribution.



Now, with a foundation in probability and probability distributions, let’s apply what we’ve learned to some data. We’ll focus on the normal distribution since that will be the primary focus in our subsequent Modules.


Introduction to the data for this Module

We will explore data from the National Health and Nutrition Examination Study (NHANES). This national study is conducted by the Centers for Disease Control and Prevention, and is one of their surveillance initiatives to monitor the health and morbidity of people living in the US. We will use NHANES data collected during 2011-2012. NHANES data are publicly available for download.

The 5,000 individuals in the dataframe used in this module are resampled from the larger NHANES study population to mimic a simple random sample of the US population.

Let’s import the dataframe.

library(tidyverse)  # provides read_rds() and the pipe; may already be loaded in a setup chunk
library(here)       # provides here() for building file paths relative to the project root
nhanes <- here("data", "nhanes.Rds") %>% 
  read_rds()

We’re going to use a subset of the data. Let’s construct a dataframe that includes just the cases and variables that we need for this module. We have a variety of tasks that we need to accomplish:

  1. First, we’re going to focus on people 20 years or older; we can accomplish this with filter().

  2. Then we’ll transform Height, which is expressed in cm, to a new variable called ht_inches that is expressed in inches.

  3. We will also create a new variable called sex, that is a copy of Gender, and we will specify it to be a factor. Factors are similar to character variables, but have some additional benefits. One that we’ll see later in this course is the ability to specify an order to the levels of the factor (see the short factor illustration after the code block below).

  4. We will subset to keep just the variables needed.

  5. Finally, we’ll use drop_na() to exclude cases missing on ht_inches or sex (the two variables we are using in this module).

Let’s do all of this with a pipe. Can you map each step above to the corresponding code?

height <- nhanes %>% 
  filter(Age >= 20) %>% 
  mutate(ht_inches = Height/2.54, 
         sex = factor(Gender)) %>% 
  select(ht_inches, sex) %>%
  drop_na()
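
As a brief aside on step 3: the toy example below (not NHANES data) illustrates how a factor lets you specify the order of its levels.

# A minimal illustration of ordering factor levels (toy data)
education <- factor(c("college", "high school", "graduate", "high school"),
                    levels = c("high school", "college", "graduate"))
levels(education)  # "high school" "college" "graduate"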

Visualize data

Let’s begin our descriptive analyses by examining the distribution of heights by sex.

height %>% 
  ggplot(aes(x = ht_inches, group = sex, fill = sex)) +
  geom_density() +
  theme_bw() +
  labs(title = "What is the distribution of heights for males and females?",
       x = "Height in Inches")

The highest density of data points for males is around 69 inches and for females around 64 inches. But, we have a reasonably large range in both groups, representing the large variation in heights of people in our population. As we would expect, males are, on average, taller than females.

Calculate central tendency and dispersion statistics in R

For a thorough display of measures of central tendency and dispersion, I like the skim() function from skimr.

library(skimr)  # provides skim(); may already be loaded in a setup chunk

height %>% 
  group_by(sex) %>% 
  skim()
Data summary
Name Piped data
Number of rows 3561
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables sex

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches female 0 1 63.76 2.91 52.95 61.80 63.82 65.63 72.64 ▁▂▇▅▁
ht_inches male 0 1 69.19 3.01 59.88 67.32 69.02 71.06 78.90 ▁▃▇▃▁

Let’s consider males. The mean height is 69.2 inches and the standard deviation is 3.0 inches. The median is listed under p50, referring to the 50th percentile score – 69 inches in our example. The range is ascertained through the p0 and p100 scores, referring to the 0th (lowest) and 100th (highest) percentile score – 59.9 inches to 78.9 inches in our example.

To make these percentile scores a little more concrete, let’s randomly select 11 males from the dataframe. Once 11 are selected – I am going to sort by height, and create a value denoting the rank order from shortest to tallest of these 11 males.

set.seed(8675309)

pick11 <- height %>% 
  filter(sex == "male") %>% 
  select(ht_inches) %>% 
  sample_n(11) %>% 
  arrange(ht_inches) %>% 
  mutate(ranking = rank(ht_inches))

pick11
ht_inches ranking
64.84252 1
65.15748 2
66.29921 3
67.24409 4
67.32283 5
69.29134 6
69.72441 7
69.88189 8
70.78740 9
71.53543 10
74.25197 11

We see that the shortest male in this random sample is 64.8 inches, and the tallest male is 74.3 inches. Let’s request the skim output for this sample.

skim(pick11)
Data summary
Name pick11
Number of rows 11
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches 0 1 68.76 2.88 64.84 66.77 69.29 70.33 74.25 ▇▅▇▅▂
ranking 0 1 6.00 3.32 1.00 3.50 6.00 8.50 11.00 ▇▅▅▅▅

Take a look at the row for ht_inches. The p0 score is the 0th percentile of the data points – which represents the shortest male in the sample (64.8 inches). The p100 score is the 100th percentile of the data points – which represents the tallest male in the sample (74.3 inches). Notice that the 50th percentile is the middle score (a ranking of 6 – where 5 scores fall below and 5 scores fall above) – that is, 69.3 inches (see that this height corresponds to ranking = 6 in the table of the 11 male heights printed above). Note that 50% of the scores in our subsample are below the 50th percentile and 50% of the scores in our subsample are above the 50th percentile.
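If you want to verify these percentile scores directly, base R’s quantile() function computes them from the raw data. This quick check is not part of the module’s original code; the values should match the p0 through p100 columns in the skim() output (up to rounding).

quantile(pick11$ht_inches, probs = c(0, .25, .50, .75, 1))
# approximately: 64.84  66.77  69.29  70.33  74.25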

In the Modules, I primarily use the skim() function to calculate the mean and standard deviation. However, in viewing R code online, in your textbook, and in DataCamp, you will occasionally see the mean and standard deviation calculated by using the extraction operator. The $ is the extraction operator, which is used to view a single variable/column in a dataframe. As an example, the code:

mean(height$ht_inches) 
sd(height$ht_inches) 

will compute the mean and standard deviation of the variable ht_inches which resides in the height dataframe.

Application of the normal distribution

A variable with a Gaussian (i.e., bell-shaped) distribution is said to be normally distributed. A normal distribution is a symmetrical distribution. The mean, median and mode are in the same location and at the center of the distribution.

The empirical rule provides information about the spread of a normally distributed variable given the mean and standard deviation. In the graph below, the mean of a variable called z is 0 and the standard deviation is 1 (denoted on the x-axis). The empirical rule states that when a variable is normally distributed, we can expect that (see the quick simulation check after this list):

  • 68% of the data will fall within about one standard deviation of the mean
  • 95% of the data will fall within about two standard deviations of the mean
  • Almost all (99.7%) of the data will fall within about three standard deviations of the mean
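
As an approximate check of these percentages, we can simulate a large sample from a normal distribution with mean 0 and sd 1 and count the proportion of scores within one, two, and three standard deviations of the mean. This sketch is not from the module; rnorm() simply draws random scores from a normal distribution.

set.seed(1234)
z <- rnorm(100000, mean = 0, sd = 1)  # simulate 100,000 scores from a standard normal
mean(z > -1 & z < 1)                  # ~0.68, within one standard deviation of the mean
mean(z > -2 & z < 2)                  # ~0.95, within two standard deviations
mean(z > -3 & z < 3)                  # ~0.997, within three standard deviations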

Let’s use the empirical rule to describe the distribution of female heights. First, we need to ensure that the distribution is normal. I constructed a histogram of the heights of all females in the dataframe and overlaid a perfect normal curve over this histogram. The resulting graph, shown below, shows that the distribution of heights for females is quite close to normal. Thus, we can use the empirical rule to describe the distribution.
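The exact code for that figure is not shown in the module, but here is one way such a graph could be built from the height dataframe created above. The mean and sd supplied to dnorm() are the female values from the skim() output; the binwidth is an arbitrary choice.

height %>% 
  filter(sex == "female") %>% 
  ggplot(aes(x = ht_inches)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "grey", color = "white") +
  stat_function(fun = dnorm, args = list(mean = 63.76, sd = 2.91), color = "blue") +
  theme_bw() +
  labs(title = "Do female heights follow a normal distribution?",
       x = "Height in Inches")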

The average height of females is 63.8 inches, with a standard deviation of 2.9 inches. The empirical rule states that 95% of the data will fall within about two standard deviations of the mean. Therefore, we expect 95% of all females to be between 58.0 inches and 69.6 inches tall (i.e., 63.8 \(\pm\) 2 \(\times\) 2.9). Likewise, only about 2.5% of females are expected to be shorter than 58.0 inches, and only about 2.5% of females are expected to be taller than 69.6 inches.

Please watch the video below for an application of the concepts of the normal distribution and the empirical rule to female heights.


Probability density function (pdf)

If we know the mean and standard deviation of a normally distributed variable, we can use that information to gain an understanding of the variable’s distribution – and the probability that a score from that distribution will be of a certain size or in a certain interval.

A continuous random variable, like height, can have an infinite number of possible values – think about a very precise height of an individual rather than a value rounded to the nearest inch (e.g. 74.25197 inches tall). Thus, the probability that a continuous random variable takes on any particular value is essentially 0. Therefore, in describing heights, it’s not useful to calculate the probability of an individual having some very precise height, but rather it’s more useful to find the probability that a continuous random variable falls in some interval (e.g., greater than 70 inches, or between 60 and 65 inches, or even between 70 and 71 inches). We can use a probability density function (pdf) to calculate the probability of a score falling within any desired interval.

To summarize, a probability density function (pdf) is a function associated with a continuous random variable (for example, heights of females in the US). It helps us to understand the probability of a continuous random variable falling in some interval of the density curve. In a pdf, the area under the curve is equal to 1, akin to the density plots that we studied in Module 2. The graph below approximates a pdf for female heights.

We can use the pdf of a normal distribution to understand the probability of a case having a score within some interval given the variable is normally distributed. This is where the empirical rule is derived. Using the empirical rule, we determined that 95% of all female heights are in the interval between 58.0 inches and 69.6 inches. In other words, only about 2.5% of females will be shorter than 58 inches, and only about 2.5% of females will be taller than 69.6 inches. This is demonstrated in the graph below.
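Although pnorm() is formally introduced a little later in this module, here is a quick preview of how that 95% area can be computed directly from the pdf, using the rounded mean and sd from earlier.

# area under the normal pdf between 58.0 and 69.6 inches (~0.95, consistent with the empirical rule)
pnorm(q = 69.6, mean = 63.8, sd = 2.9) - pnorm(q = 58.0, mean = 63.8, sd = 2.9)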

Before moving to the next section, please note that there is a correspondence between percentages and probabilities. Probability quantifies the likelihood that an event will occur. The probability that an event will happen can span from impossible (a probability of 0) to certain (a probability of 1). Therefore, a probability is always a number from 0 to 1. We can also write a probability as a percentage, which is a number from 0 to 100 percent. The higher the probability or percentage of an event, the more likely the event is to occur.

The empirical rule gave us the information needed to find the middle 95% of heights. There’s a function in R that will also calculate this range. It’s called qnorm(), which stands for the quantile of the normal distribution. Given a normal distribution with a certain mean and standard deviation, qnorm() finds the quantile value such that the probability of observing a value equal to or less than that quantile is p.

To gain intuition for this, take a look at the graph below. It displays a normal distribution, and the intervals associated with the empirical rule are included. But, there is also some additional information. Identify the percentiles in the graph. These provide the percentile rank of a given score on the x-axis of the density plot, that is, the percentage of scores in its frequency distribution that are less than the score. Percentile ranks are common in everyday life. For example, when you were a child, your family physician probably calculated your percentile rank for your height (e.g., if you were at the 75th percentile – then 75% of children your age were shorter than you and 25% were taller). As another example, when you took the SAT, your percentile score was calculated (e.g., if you were at the 60th percentile – then 60% of test takers scored lower than you and 40% scored higher than you).

qnorm() utilizes these percentile ranks. Let’s try an example. To find the score for female heights that would mark the 2.5th percentile, we could use the following code. Note that 2.5/100 = .025 – so p = .025 in the code below indicates that we want the score for height in which 2.5% of scores will fall below (and 97.5% will lie above). Note also that the mean and sd in the code below are just the mean and standard deviation of our normally distributed variable of female heights in the population (calculated earlier in the module).

qnorm(p = .025, mean = 63.8, sd = 2.9)
## [1] 58.1161

Executing this function yields 58.1, so we expect 2.5% of females to be shorter than 58.1 inches, and 97.5% to be taller than 58.1 inches. This quantity is depicted in the graph below.

Let’s try another. Here, we’ll find the score for height in which 97.5% of scores will fall below. In this example, we set p to .975 (again 97.5/100 = .975).

qnorm(p = .975, mean = 63.8, sd = 2.9)
## [1] 69.4839

Execution of the function yields 69.5, indicating that we expect 97.5% of females to be shorter than 69.5 inches tall (and 2.5% will be taller). This quantity is depicted in the graph below.

We can compare these to the values we calculated using the empirical rule (that is, by computing 63.8 \(\pm\) 2 \(\times\) 2.9). Using the empirical rule and calculating the range by hand, we obtained 58.0 for the lower bound and 69.6 for the upper bound. Using qnorm(), we obtained 58.1 for the lower bound and 69.5 for the upper bound. You can see they match up quite closely – the difference is because qnorm() is more precise. The hand calculations using the empirical rule specified that 95% of the distribution falls within 2 standard deviations of the mean. The more precise value is actually ~1.96 standard deviations. If we had used that more precise value to multiply by the standard deviation in our hand calculations, we would have obtained the same values produced by qnorm(). Please see the graph below for a summary.

We can use qnorm() in this way to calculate any interval that we wish.

Please watch the video below for an introduction to the pdf.



Now, let’s try a few more examples.

pdf example 1

How tall does a female need to be in order to be taller than 75% of females in the population?

qnorm(p = .75, mean = 63.8, sd = 2.9)
## [1] 65.75602

A female needs to be about 65.8 inches tall in order to be taller than 75% of the female population.

pdf example 2

What range of heights encompasses the middle 50% of females?

We need two values here, the lower and upper bound of the specified interval.

qnorm(p = .25, mean = 63.8, sd = 2.9)
## [1] 61.84398
qnorm(p = .75, mean = 63.8, sd = 2.9)
## [1] 65.75602

25% of females are shorter than ~61.8 inches, and 25% are taller than ~65.8 inches. Thus, the interval for the middle 50% is about 61.8 inches to 65.8 inches tall.

pdf example 3

We might also be interested in framing a question like this: “What is the probability that a randomly selected female from the population will be shorter than 65 inches tall?” This is a little different from the previous examples. We need a different function to solve this problem – the pnorm() function in R can help us calculate this. While qnorm() finds the corresponding quantile, pnorm() finds the corresponding probability. Supply the value of the variable of interest (65), the mean and sd of the normal distribution of the variable, and whether or not you want the lower tail (here we want the probability of being less than 65 inches – so we want the lower tail of the distribution).

pnorm(q = 65, mean = 63.8, sd = 2.9, lower.tail = TRUE)
## [1] 0.6604872

The probability is .66 that a randomly selected female would be less than 65 inches tall. Or, we can expect that about 66% of females are shorter than 65 inches. You could equivalently say that a female who is 65 inches tall is at the 66th percentile for the distribution of female heights.

pdf example 4

Let’s try one more. Let’s ask the question: “What is the probability that a randomly selected female is taller than 70 inches?” Here, we want the proportion of the distribution above 70, so the upper tail. By specifying lower.tail = FALSE, we get the upper tail.

pnorm(q = 70, mean = 63.8, sd = 2.9, lower.tail = FALSE)
## [1] 0.01626117

The probability is about .016 that a randomly selected female would be taller than 70 inches. Or, we can expect that about 1.6% of females are taller than 70 inches.

In contrasting qnorm() and pnorm(): qnorm() returns the score on the normally distributed variable that marks off a specified area of the distribution (e.g., the minimum height needed to be in the top 10%), while pnorm() returns the area of the distribution that is encompassed by some interval of scores (e.g., the percent of the distribution that is above 65 inches).

Standardized scores (z-scores)

Let’s cover one more related idea – transformations of raw scores to z-scores. A z-score is a standardized score based on some raw variable (e.g., height in inches, ht_inches) that has been transformed by first subtracting the mean, and then dividing by the standard deviation. Thus, a z-score has a mean of 0 and a standard deviation of 1.
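Written as a formula, where \(x\) is a raw score, \(\bar{x}\) is the mean of the raw variable, and \(s\) is its standard deviation: \(z = \dfrac{x - \bar{x}}{s}\).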

Before applying z-scores to our example with heights in the NHANES dataset, please watch the video below from Crash Course: Statistics on z-scores and percentiles.



Now, let’s apply the concept of z-scores to the NHANES example. Among females, the mean height is 63.8 inches, and the standard deviation is 2.9 inches. Let’s consider my height (70 inches): z-score = (70 - 63.8) / 2.9 ~ 2.1

This means that I am about 2.1 standard deviations above the mean height for females.

What is your z-score for height?

We can use the code below to convert the raw scores to z-scores in the NHANES data. Because we use group_by() first, the sex-specific mean and sd are used to form the z-scores for males and females, respectively.

zheight <- height %>%
  group_by(sex) %>%
  mutate(zht_inches = (ht_inches - mean(ht_inches))/sd(ht_inches)) %>%
  ungroup() %>% 
  select(sex, ht_inches, zht_inches)

zheight %>% head()
sex ht_inches zht_inches
female 67.71654 1.3598140
male 66.18110 -0.9971722
male 66.18110 -0.9971722
female 67.55906 1.3056485
male 68.30709 -0.2916693
male 70.35433 0.3877038
zheight %>%  group_by(sex) %>% skim()
Data summary
Name Piped data
Number of rows 3561
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables sex

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ht_inches female 0 1 63.76 2.91 52.95 61.80 63.82 65.63 72.64 ▁▂▇▅▁
ht_inches male 0 1 69.19 3.01 59.88 67.32 69.02 71.06 78.90 ▁▃▇▃▁
zht_inches female 0 1 0.00 1.00 -3.72 -0.67 0.02 0.64 3.05 ▁▂▇▅▁
zht_inches male 0 1 0.00 1.00 -3.09 -0.62 -0.06 0.62 3.22 ▁▃▇▃▁

We can use either the unstandardized (raw heights) or standardized (z-scores of heights) scores with qnorm() and pnorm(). For example: What is the probability that a randomly selected female will be shorter than me? In the code below, I am going to be more precise with the mean and standard deviation, as well as my personal z-score so that you can see the close match.

pnorm(q = 70, mean = 63.76, sd = 2.91, lower.tail = TRUE)
## [1] 0.9839968

You can obtain the same answer using z-scores (notice the change to mean and sd – the mean and sd of a z-score are 0 and 1, respectively). My z-score for height, rounded to two decimal places, is 2.14.

pnorm(q = 2.14, mean = 0, sd = 1, lower.tail = TRUE)
## [1] 0.9838226

The values are just slightly different because of rounding – if I used even more decimal places to record the precise scores for the raw mean and sd, then the two would match perfectly.
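To see the exact match, we can compute the unrounded female mean and sd from the height dataframe and feed the corresponding z-score to pnorm(). The object names below (f_mean, f_sd) are just illustrative.

# unrounded female mean and sd, pulled with the $ extraction operator
f_mean <- mean(height$ht_inches[height$sex == "female"])
f_sd   <- sd(height$ht_inches[height$sex == "female"])

pnorm(q = 70, mean = f_mean, sd = f_sd, lower.tail = TRUE)            # raw-score version
pnorm(q = (70 - f_mean) / f_sd, mean = 0, sd = 1, lower.tail = TRUE)  # z-score version: identical result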

Continuing with z-scores, how many standard deviations above the mean does a person need to be so that they are in the top 2.5% of the distribution?

qnorm(p = .975, mean = 0, sd = 1)
## [1] 1.959964

How many standard deviations below the mean does a person need to be so that they are in the bottom 2.5% of the distribution?

qnorm(p = .025, mean = 0, sd = 1)
## [1] -1.959964

We can map these last two values (-1.96 and +1.96) onto a graph. Notice the similarities with the empirical rule. The empirical rule is derived from the probability density function for the standard normal (z-score) distribution.

This concludes Module 4. So far in this course you’ve learned how to use R to wrangle, plot and describe data. You’ve also now brushed up on probability and probability distributions – which will be very important later in the course. In the next few Modules – we’ll begin using statistical models to describe relationships between variables.

Go further, learn more

Do you want to dig further into probability theory? Check out Chapter 9 of Learning Statistics with R by Danielle Navarro – a notable Cognitive Psychologist and Data Scientist.