Learning Objectives

  • Knitting R Markdown to different formats

  • Understanding fundamental Markdown syntax

  • Using code chunks in R Markdown

  • Using code chunk options

  • Including plots & tables in R Markdown

  • Fitting basic models with R

Getting Started

In this activity, we will learn some fundamental concepts related to R and R Markdown. As we progress through the activity, the ➡️ symbols will prompt us to complete a task to facilitate a more interactive learning session.

➡️ First, create a new R Markdown document with File > New File > R Markdown…, and then click OK in the dialog that opens. Knit the file by clicking the Knit button (top left).

➡️ Create one new R Markdown document for each of the three most common formats: HTML, PDF and Word. Knit each of the three documents, noting the differences in style but consistency in the information contained.

Note: If knitting to PDF causes an error, you may need to install \(\LaTeX\), a markup language popular in mathematics. To install \(\LaTeX\), first install the tinytex package by submitting install.packages("tinytex") to the Console (bottom left of RStudio) and pressing return / enter, and then submitting tinytex::install_tinytex() into the Console as well.

For the rest of the activity HTML will be the recommended output format, but you are welcome to use whichever output format you prefer.
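The Knit button calls rmarkdown::render() behind the scenes, so the same outputs can also be produced from the Console. The sketch below wraps this in a helper; the function name knit_all_formats and the file name in the commented call are placeholders chosen for illustration, and PDF output still requires a working \(\LaTeX\) installation.

```r
# Hypothetical helper: render one R Markdown file to several output formats
knit_all_formats <- function(input,
                             formats = c("html_document",
                                         "pdf_document",
                                         "word_document")) {
  # rmarkdown::render() returns the path of the file it creates
  vapply(formats,
         function(fmt) rmarkdown::render(input, output_format = fmt, quiet = TRUE),
         character(1))
}

# Example call (not run; "my_report.Rmd" is a placeholder file name)
# knit_all_formats("my_report.Rmd")
```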

Markdown

R Markdown is a tool for data analysis reports that integrates Markdown, a lightweight markup language for formatting plain text files, with R code and output. Markdown emphasizes simplicity: it is designed to be easy to read and write, at the cost of being less customizable than other markup languages.

Headings

Headings and subheadings can be used to organize one’s document using Markdown syntax.

  • # 1st Level Header

  • ## 2nd Level Header

  • ### 3rd Level Header

➡️ Include a first-level header and second-level header in your R Markdown document naming them Section 1 and Subsection 1.1, respectively.

Text Formatting

Basic text formatting, such as italics, bold, superscripts, subscripts, and strikethrough, can be implemented as well.

| Raw text input | Output |
|---|---|
| *italic* | italic |
| **bold** | bold |
| `code` | code |
| superscript^2^ | superscript² |
| subscript~2~ | subscript₂ |
| ~~strikethrough~~ | strikethrough |

➡️ Include an italicized word or phrase in your R Markdown document.

➡️ Include a bolded word or phrase in your R Markdown document.

Code Chunks

The real power of R Markdown comes from the ability to integrate simple formatting of text with code included in code chunks, which allow us to run R code inside our document:

```{r}
# Defining radius
radius <- 1

# Calculating area
area <- pi * radius^2

# Displaying result
area
```

There are three main ways to insert a code chunk:

  1. The keyboard shortcut Cmd/Ctrl + Alt + I (recommended)

  2. The “Insert” button icon in the editor toolbar (located at the top right of RStudio).

  3. By manually typing the chunk delimiters ```{r} and ```.

Backticks, ```, are used to enclose code chunks in R Markdown documents, separating code from the rest of the document. Note that we will only use R as the desired language for our code chunks, but other programming languages are possible as well.

➡️ Include a code chunk containing the R code below. Then run the code chunk by pressing the green Run Current Chunk button at the top right of the chunk.

# Important message
stats_message <- "All models are wrong, but some are useful."

# Printing message
print(stats_message)

Loading R Packages

R packages are collections of R code and functions that others have made available. R packages most commonly reside on the Comprehensive R Archive Network (CRAN). To use an R package from CRAN, it must be installed using the install.packages() function and then loaded using the library() function.

For example, we can install the janitor R package by submitting install.packages("janitor") to the Console pane in the bottom of RStudio. We can then include a code chunk with the code library(janitor) to load the package when we knit our R Markdown file.

➡️ Submit the following R code to the Console to install packages for the remainder of the activity.

# Creating vector of package names
packages <- c("tidyverse", "skimr", "flextable",
              "ggfortify", "broom", "janitor")

# Installing package if not already installed
for(p in packages) {
  if(length(find.package(p, quiet = TRUE)) == 0) {
    install.packages(p)
  }
}

Typically, it is best practice to load any R packages at the top of an R Markdown file or script for clarity and organization.

➡️ Insert a new code chunk to load R packages for this activity using the code below.

# Loading R packages
library(tidyverse)
library(skimr)
library(flextable)
library(ggfortify)
library(broom)
library(janitor)

Note that we have already installed these R packages for use in this activity. In general, if this code yields an error such as Error in library(package_name) : there is no package called ‘package_name’, first submit install.packages('package_name') to the Console, replacing package_name with the name of the desired R package.

Palmer Penguins Data

For the remainder of this activity, we will analyze the Palmer Penguins data set, which consists of measurements collected on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network [1].

Artwork by @allison_horst

Data dictionary for Palmer Penguins data set

| Variable | Description |
|---|---|
| species | Species of the penguin |
| island | Island the penguin was found on |
| bill_length_mm | Bill length (mm) |
| bill_depth_mm | Bill depth (mm) |
| flipper_length_mm | Flipper length (mm) |
| body_mass_g | Body mass (g) |
| sex | Sex of the penguin |
| year | Year data was collected |

External data sets can be imported into R in several ways. Since Comma Separated Value (CSV) files are one of the most common file types for data, let’s import the Palmer Penguins data from a CSV file located at the following URL: https://raw.githubusercontent.com/dilernia/STA323/main/Data/penguins.csv

➡️ Add a new code chunk to import the Palmer Penguins data set using the read_csv() function creating an object called penguins, and run the code to import the data.
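One way to complete this step is sketched below. It assumes the readr package is installed (it is loaded as part of the tidyverse) and that an internet connection is available; penguins_url is simply an illustrative name for the URL given above.

```r
library(readr)

# URL of the CSV file containing the Palmer Penguins data
penguins_url <- "https://raw.githubusercontent.com/dilernia/STA323/main/Data/penguins.csv"

# Importing the data and storing it as an object called penguins
penguins <- read_csv(penguins_url)

# Checking the number of rows and columns
dim(penguins)
```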

Here we imported a CSV file directly from a URL into R. More commonly, one would import a CSV file by specifying the file path to the data file on your own machine. A file path can be found on a Windows machine using the File Explorer, or on a Mac using Finder and the Terminal.
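Importing from a local file works the same way, with a file path in place of the URL. The sketch below is self-contained: it writes a tiny made-up data set to a temporary CSV file and reads it back. The object names and values are purely illustrative and are not part of the penguins data.

```r
library(readr)

# Small made-up data set used only for this illustration
toy_data <- data.frame(id = 1:3,
                       weight_g = c(3750, 3800, 3250))

# Writing the data to a temporary CSV file on this machine
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_data, csv_path)

# Importing the CSV file by specifying its file path
toy_imported <- read_csv(csv_path, show_col_types = FALSE)
toy_imported
```

In practice, csv_path would be replaced by the path to one's own data file, such as a path copied from File Explorer or Finder.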

➡️ Explore the data by clicking on penguins in the Environment pane (top right of RStudio).

The skim() function from the skimr package provides a concise method for summarizing and conducting some basic exploratory data analysis of a data set in R. Note that skim() will produce output for any output format (HTML, PDF, or MS Word), but it is primarily intended for HTML documents.

➡️ Use the skim() function to explore high-level characteristics about the penguins data.

skimr::skim(penguins)

Code chunks have multiple options which control how code and output are displayed or evaluated. Some of the most commonly used options are:

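  • eval: whether the code in the chunk is run

  • echo: whether the source code is displayed in the output document

  • include: whether the code and its output appear in the output document (the code is still run)

  • warning and error: whether warnings and errors are displayed in the output document; with error = FALSE, knitting stops when an error occurs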

➡️ Modify the most recently included code chunk by setting the echo chunk option to be FALSE.

➡️ Toggle the include and eval chunk options for the code chunk using the skim() function, and see what happens.

To see all code chunk options and their default values, submit knitr::opts_chunk$get() to the Console.
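Default values can also be changed for every chunk in a document at once with knitr::opts_chunk$set(), typically called in a setup chunk near the top of the file. A minimal sketch, with option values chosen only for illustration:

```r
library(knitr)

# Storing the current defaults so they can be restored afterwards
old_options <- opts_chunk$get()

# Hiding code and warnings in all subsequent chunks
opts_chunk$set(echo = FALSE, warning = FALSE)

# Confirming the new default values
new_defaults <- opts_chunk$get(c("echo", "warning"))
new_defaults

# Restoring the original defaults
opts_chunk$restore(old_options)
```

Options set in an individual chunk header still override these document-wide defaults.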

Including Plots

Using R code inside code chunks that produce plots allows graphics to be directly included in the output document.

For example, let’s visualize the relationship between the flipper lengths (mm) and the body masses (g) of the penguins from the Palmer penguins data set. There are multiple packages for data visualization in R, but we will use the most popular package, ggplot2, created by Hadley Wickham.

➡️ Reproduce the scatter plot provided using the R code below.

# Creating scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm)) + 
  geom_point()

We can also color the points based on the species of the penguins as in the scatter plot below.

# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"), 
                           c("Chinstrap", "Gentoo", "Adelie"))

# Creating scatter plot with color
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = species)) + 
  geom_point() +
  scale_color_manual(values = penguin_colors)

Moreover, we can modify the plot labels using the labs() function as below.

# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"), 
                           c("Chinstrap", "Gentoo", "Adelie"))

# Creating scatter plot with color
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = species)) + 
  geom_point() +
  scale_color_manual(values = penguin_colors) +
  labs(title = "Palmer penguins data set",
       x = "Body mass (g)",
       y = "Flipper length (mm)")

➡️ Reproduce the side-by-side box plots below showing the distribution of penguin flipper lengths for each species.

# Creating side-by-side box plots
penguins |> 
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  geom_boxplot() +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")

Inline Code

R Markdown also permits use of R code outside of code chunks to facilitate dynamic report generation using what is called inline R code. For example, we can calculate the longest and shortest observed flipper lengths for penguins in the Palmer Penguins data set and summarize these results dynamically in our report.

# Calculating maximum and minimum flipper lengths
max_flip <- max(penguins$flipper_length_mm, na.rm = TRUE)
min_flip <- min(penguins$flipper_length_mm, na.rm = TRUE)

Inline R code can be included using the following R Markdown syntax:

The longest flipper any penguin had was `r max_flip`mm, while the shortest was `r min_flip`mm.

which produces:

The longest flipper any penguin had was 231mm, while the shortest was 172mm.

➡️ At the bottom of the document, include a chunk of R code followed by inline R code to reproduce the sentence above reporting the longest and shortest flipper lengths in this data set.

Using inline R code is not always necessary, but it can facilitate more efficient analyses if one’s data is ever updated, and it can also reduce the chances of mistakes in one’s report.
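To see why this helps when data change, the sketch below knits the same sentence twice with knitr::knit(text = ...), which evaluates inline R code in a character string. The second pair of values is made up purely to mimic an updated data set, and report_sentence is an illustrative name.

```r
library(knitr)

# Sentence template containing inline R code
report_sentence <- "The longest flipper any penguin had was `r max_flip`mm, while the shortest was `r min_flip`mm."

# Values from the penguins data reported in this activity
max_flip <- 231
min_flip <- 172
original_text <- knit(text = report_sentence, quiet = TRUE)

# Made-up values standing in for an updated data set
max_flip <- 235
min_flip <- 170
updated_text <- knit(text = report_sentence, quiet = TRUE)

original_text
updated_text
```

The sentence itself never changes; only the objects it refers to do, which is what keeps a report in sync with its data.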

Tables

R Markdown can incorporate nicely displayed tables for all output formats (even Microsoft Word!). Let’s explore a few of the most common R functions for nicely displaying tables in R. To demonstrate these functions, let’s consider the mean and standard deviation of each penguin species’ flipper lengths.

# Summary statistics for flipper length by species
flipper_summary <- penguins |> 
  group_by(species) |> 
  summarize(Average = mean(flipper_length_mm, na.rm = TRUE),
            SD = sd(flipper_length_mm, na.rm = TRUE))

# Displaying the table
flipper_summary
## # A tibble: 3 × 3
##   species   Average    SD
##   <chr>       <dbl> <dbl>
## 1 Adelie       190.  6.54
## 2 Chinstrap    196.  7.13
## 3 Gentoo       217.  6.48

By default, tables do not look the nicest when displayed in R Markdown. However, there are several R packages to facilitate displaying of tables, one of the most versatile being the flextable package.

flextable

The flextable() function from the flextable package displays tables consistently across all output formats and offers a large number of customization methods.

# Displaying table of summary statistics
flipper_summary |> 
  flextable() |> 
  set_caption(caption = "Table 1. Summary statistics for penguin flipper lengths in mm.") |> 
  colformat_double(digits = 2) |> 
  autofit()

Table 1. Summary statistics for penguin flipper lengths in mm.

| species | Average | SD |
|---|---|---|
| Adelie | 189.95 | 6.54 |
| Chinstrap | 195.82 | 7.13 |
| Gentoo | 217.19 | 6.48 |

➡️ Reproduce the table below containing the minimum, median, and maximum of each species’ body masses.

# Summary statistics for body mass by species
body_summary <- penguins |> 
  group_by(species) |> 
  summarize(Min = min(body_mass_g, na.rm = TRUE),
            Median = median(body_mass_g, na.rm = TRUE),
            Max = max(body_mass_g, na.rm = TRUE))

# Displaying table of summary statistics
body_summary |> 
  flextable() |> 
  set_caption(caption = "Table 2. Summary statistics for penguin body masses in grams.") |> 
  colformat_double(digits = 0) |> 
  autofit()

Table 2. Summary statistics for penguin body masses in grams.

| species | Min | Median | Max |
|---|---|---|---|
| Adelie | 2,850 | 3,700 | 4,775 |
| Chinstrap | 2,700 | 3,700 | 4,800 |
| Gentoo | 3,950 | 5,000 | 6,300 |

➡️ Reproduce the table below containing counts and percentages for the number of penguins of each species using the code below.

# Table of counts and percentages
penguins |> 
  janitor::tabyl(species) |> 
  flextable() |> 
  set_caption(caption = "Table 3. Number and proportion of penguins of each species.") |> 
  colformat_double(digits = 3) |> 
  autofit()

Table 3. Number and proportion of penguins of each species.

| species | n | percent |
|---|---|---|
| Adelie | 152 | 0.442 |
| Chinstrap | 68 | 0.198 |
| Gentoo | 124 | 0.360 |

Table Themes

There are many options for customizing the appearance of tables displayed in R Markdown. One method is to apply a complete theme, which restyles a table with minimal code.

# Displaying table of summary statistics with applied theme
flipper_summary |> 
  flextable() |> 
  set_caption(caption = "Table 4. Summary statistics for penguin flipper lengths in mm displayed with flextable().") |> 
  colformat_double(digits = 2) |> 
  autofit() |> 
  theme_zebra()

Table 4. Summary statistics for penguin flipper lengths in mm displayed with flextable().

| species | Average | SD |
|---|---|---|
| Adelie | 189.95 | 6.54 |
| Chinstrap | 195.82 | 7.13 |
| Gentoo | 217.19 | 6.48 |

➡️ Display the flipper_summary table using a complete theme of your choice by viewing the available themes here: https://davidgohel.github.io/flextable/reference/index.html#flextable-themes.

Statistical Models

Linear Regression Model

There are a large number of statistical models available in R, with one of the most commonly used methods being linear regression. Using the penguins data, we can model the flipper lengths of penguins using their body masses with a linear regression model implemented via the lm() function as below.

# Fitting a simple linear regression model
slr_model <- lm(flipper_length_mm ~ body_mass_g, data = penguins)

We can also obtain diagnostic plots for the fitted model using the autoplot() function from the ggfortify package.

# Creating grid of diagnostic plots for the SLR model
autoplot(slr_model)

In addition to diagnostic plots for checking model assumptions, we can display the model estimates and model fit metrics as well using the broom package.

# Displaying model estimates
slr_model |> 
  tidy(conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 4)) |> 
  flextable() |> 
  colformat_double(digits = 4) |> 
  set_caption("Table 5. Linear regression estimates.") |> 
  autofit()

Table 5. Linear regression estimates.

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 136.7296 | 1.9968 | 68.4731 | < 2.2e-16 | 132.8019 | 140.6573 |
| body_mass_g | 0.0153 | 0.0005 | 32.7222 | < 2.2e-16 | 0.0144 | 0.0162 |
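As a quick illustration of how to read Table 5, the estimates define the fitted line: predicted flipper length = intercept + slope × body mass. The sketch below plugs in the rounded estimates from the table for a penguin weighing 4,000 g, a value chosen only as an example. With the fitted model object available, predict(slr_model, newdata = ...) would give the unrounded prediction.

```r
# Rounded estimates copied from Table 5
intercept <- 136.7296
slope <- 0.0153

# Body mass (g) chosen only for illustration
example_mass_g <- 4000

# Predicted flipper length (mm) from the fitted line
predicted_flipper_mm <- intercept + slope * example_mass_g
predicted_flipper_mm
```

Read the same way, the slope suggests that each additional 100 g of body mass is associated with a flipper roughly 1.5 mm longer, on average.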

# Displaying model summary metrics
slr_model |> 
  glance() |> 
  mutate(p.value = format.pval(p.value, digits = 3),
         df = as.integer(df)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 6. Linear model summary metrics.") |> 
  autofit()

Table 6. Linear model summary metrics.

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.759 | 0.758 | 6.913 | 1,070.745 | <2e-16 | 1 | -1,145.518 | 2,297.035 | 2,308.540 | 16,250.301 | 340 | 342 |
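Several of the metrics in Table 6 are related to one another. As a sketch, the AIC and BIC can be recovered from the log-likelihood using the standard definitions AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n), where k = 3 counts the intercept, the slope, and the residual variance. The inputs below are the rounded values from the table, so the results agree with it only up to rounding.

```r
# Rounded values copied from Table 6
log_lik <- -1145.518
n_obs <- 342

# Number of estimated parameters: intercept, slope, and residual variance
k <- 3

# Information criteria from their standard definitions
aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + k * log(n_obs)

round(c(AIC = aic, BIC = bic), 2)
```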

We can also visualize the least squares regression line using the geom_smooth() function.

➡️ Reproduce the scatter plot provided using the R code below containing the geom_smooth() function which adds a line of best fit to the plot.

# Creating scatter plot with line of best fit
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Two-sample \(t\)-test

Another common statistical method is the two-sample \(t\)-test. Let’s implement this for the penguins data as well, comparing the Adelie and Chinstrap penguins in terms of their average flipper lengths in millimeters.

# Subsetting to Adelie and Chinstrap penguins
adelie_chinstrap <- penguins |> 
  dplyr::filter(species %in% c("Adelie", "Chinstrap"))

# Creating faceted histogram
adelie_chinstrap |> 
  ggplot(aes(x = flipper_length_mm, fill = species)) + 
  geom_histogram(color = "black") +
  facet_grid(species ~ .) +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")

# Implementing the two-sample t-test
t_result <- t.test(flipper_length_mm ~ species, 
                    alternative = "two.sided",
                    var.equal = FALSE,
                    data = adelie_chinstrap)


# Displaying results of the t-test
t_result |> 
  tidy(conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 3)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 7. Output for two-sample t-test.") |> 
  autofit()

Table 7. Output for two-sample t-test.

| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| -5.870 | 189.954 | 195.824 | -5.780 | 6.05e-08 | 119.677 | -7.881 | -3.859 | Welch Two Sample t-test | two.sided |
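The statistic and the degrees of freedom (the parameter column) in Table 7 can be reproduced approximately from group summaries using the Welch formulas. The sketch below assumes 151 Adelie and 68 Chinstrap penguins with recorded flipper lengths (one Adelie measurement is assumed to be missing) and uses the rounded standard deviations from Table 1, so small discrepancies from the table are expected.

```r
# Group summaries: means from Table 7, SDs from Table 1, sample sizes assumed
mean_adelie <- 189.954
sd_adelie <- 6.54
n_adelie <- 151
mean_chinstrap <- 195.824
sd_chinstrap <- 7.13
n_chinstrap <- 68

# Squared standard error contributed by each group
se2_adelie <- sd_adelie^2 / n_adelie
se2_chinstrap <- sd_chinstrap^2 / n_chinstrap

# Welch t statistic
t_stat <- (mean_adelie - mean_chinstrap) / sqrt(se2_adelie + se2_chinstrap)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_adelie + se2_chinstrap)^2 /
  (se2_adelie^2 / (n_adelie - 1) + se2_chinstrap^2 / (n_chinstrap - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(statistic = t_stat, parameter = df_welch), 2)
```

Because var.equal = FALSE was used, the degrees of freedom are not an integer; they depend on how unequal the two group variances and sample sizes are.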

Much More!

  • This activity was made using R Markdown

  • This R Markdown Cheat Sheet describes additional features and fundamentals of R Markdown

\(\LaTeX\) with R Markdown

For those familiar with the typesetting language \(\LaTeX\), it can be used in R Markdown documents for any output format as below. Note that LaTeX code is used outside of code chunks.

LaTeX code:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
  
$$\Sigma =  \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}$$

Output:

\[\hat{\beta} = (X^TX)^{-1}X^Ty\]

\[\Sigma = \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}\]

There are a few other R packages that facilitate use of \(\LaTeX\) with R Markdown as well.

  • xtable: facilitates converting data frames and matrices into \(\LaTeX\) tables to integrate R output with \(\LaTeX\) documents.

  • stargazer: facilitates displaying regression output and model comparisons (especially with nested models), as well as summary statistics tables, vectors, matrices, and data frames.

Word Templates

  • Organizations, such as the US Department of Agriculture, can have weekly or monthly reports that change as data / other inputs are updated

  • R Markdown to Word can use Word doc templates for consistent formatting, headers, etc. while updating charts & tables
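In the rmarkdown package, a Word template is supplied through the reference_docx argument of word_document(). The sketch below wraps this in a hypothetical helper; the function name and the file names in the commented call are placeholders.

```r
# Hypothetical helper: render a report to Word using a styled template
render_with_template <- function(input, template) {
  rmarkdown::render(
    input,
    output_format = rmarkdown::word_document(reference_docx = template)
  )
}

# Example call (not run; both file names are placeholders)
# render_with_template("weekly_report.Rmd", template = "agency_template.docx")
```

Rerunning such a call after the data are updated would regenerate the charts and tables while keeping the template's styles.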

Quarto

  • Quarto is another tool for reproducible documents in RStudio that Posit introduced in 2020, whereas R Markdown was introduced in 2012.

  • The syntax for Quarto is very similar to that of R Markdown, but the main differences are that Quarto has a simplified YAML, chunk options are specified slightly differently, and Quarto has more of a multilingual focus than R Markdown.
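For instance, where R Markdown usually sets chunk options inside the chunk header, such as {r, echo=FALSE}, Quarto conventionally places them on special comment lines beginning with #| at the top of the chunk. The lines below show the contents of such a chunk; because the option lines are ordinary R comments, the code also runs as plain R. The built-in cars data set is used only for illustration.

```r
#| echo: false
#| warning: false

# Summarizing the built-in cars data set
cars_summary <- summary(cars)
cars_summary
```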

R for SAS Users

For those familiar with SAS who would like to learn more about R, there are a few resources to help.

Post R Workshop Survey (Optional)

A link for a survey to gather feedback related to this workshop is included below. Note that this survey is optional, and that it is not anonymous (due to the relatively small size of the workshop).

https://forms.gle/Edtxfcat6FuAH3Gt9


References


  1. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.