Learning Objectives

Knitting R Markdown to different formats
Understanding fundamental Markdown syntax
Using code chunks in R Markdown
Using code chunk options
Including plots & tables in R Markdown
Fitting basic models with R

Getting Started

In this activity, we will learn some fundamental concepts related to R and R Markdown. As we progress through the activity, the ➡️ symbols will prompt us to complete a task to facilitate a more interactive learning session.

➡️ First, create a new R Markdown document with File > New File > R Markdown…, and then click OK. Knit the file by clicking the Knit button (top left).

➡️ Create one new R Markdown document for each of the three most common formats: HTML, PDF and Word. Knit each of the three documents, noting the differences in style but consistency in the information contained.

Note: If knitting to PDF causes an error, you may need to install $\LaTeX$, a typesetting system popular in mathematics. To install a $\LaTeX$ distribution, first install the tinytex package by submitting install.packages("tinytex") to the Console (bottom left of RStudio) and pressing return / enter, and then submit tinytex::install_tinytex() to the Console as well.

For the rest of the activity, HTML will be the recommended output format, but you are welcome to use whichever output format you prefer.

Markdown

R Markdown is a tool for data analysis reports that integrates Markdown, a lightweight markup language for formatting plain text files, with R code and output. Markdown is designed with an emphasis on simplicity: it is easy to read and write, at the cost of limited customization.

Headings

Headings and subheadings can be used to organize one’s document using Markdown syntax.

# 1st Level Header
## 2nd Level Header
### 3rd Level Header

➡️ Include a first-level header and second-level header in your R Markdown document naming them Section 1 and Subsection 1.1, respectively.

Links and Images

Markdown can be used to include hyperlinks and external images / GIFs in R Markdown documents as well.

<http://example.com>
[linked phrase](http://example.com)
![optional caption text for image or GIF](path/to/img.png)

➡️ Include a hyperlink in the R Markdown document linking to the Google homepage or your favorite website.

➡️ Include the following GIF at this URL (or another GIF you like from https://www.giphy.com)

Text Formatting

Basic formatting of text, such as italicizing, bolding, superscripts, subscripts, and striking of text can be implemented as well.

Raw text input	Output
`*italic*`	*italic*
`**bold**`	**bold**
`` `code` ``	`code`
`superscript^2^`	superscript²
`subscript~2~`	subscript₂
`~~strikethrough~~`	~~strikethrough~~

➡️ Include an italicized word or phrase in your R Markdown document.

➡️ Include a bolded word or phrase in your R Markdown document.

Code Chunks

The real power of R Markdown comes from the ability to integrate simple formatting of text with code included in code chunks, which allow us to run R code inside our document:

```{r}
# Defining radius
radius <- 1

# Calculating area
area <- pi * radius^2

# Displaying result
area
```

There are three main ways to insert a code chunk:

The keyboard shortcut Cmd/Ctrl + Alt + I (recommended)
The “Insert” button icon in the editor toolbar (located at the top right of RStudio).
By manually typing the chunk delimiters ```{r} and ```.

Backticks, ```, are used to enclose code chunks in R Markdown documents, separating code from the rest of the document. Note that we will only use R as the desired language for our code chunks, but other programming languages are possible as well.

➡️ Include a code chunk containing the R code below. Then run the code chunk by pressing the green Run Current Chunk (▶) button at the top right of the chunk.

# Important message
stats_message <- "All models are wrong, but some are useful."

# Printing message
paste0(stats_message)

Loading R Packages

R packages are collections of R code and functions that others have made available. R packages most commonly reside on the Comprehensive R Archive Network (CRAN). To use an R package from CRAN, it must be installed using the install.packages() function and then loaded using the library() function.

For example, we can install the janitor R package by submitting install.packages("janitor") to the Console pane in the bottom of RStudio. We can then include a code chunk with the code library(janitor) to load the package when we knit our R Markdown file.

➡️ Submit the following R code to the Console to install packages for the remainder of the activity.

# Creating vector of package names
packages <- c("tidyverse", "skimr", "flextable",
              "ggfortify", "broom", "janitor")

# Installing package if not already installed
for(p in packages) {
  if(length(find.package(p, quiet = TRUE)) == 0) {
        install.packages(p)
  }
}
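As an alternative to the loop above, the check for missing packages can be separated from the installation step. The sketch below is only one possible approach: find_missing() is a hypothetical helper written for this illustration (it is not part of any package), and it relies on requireNamespace() to test whether each package is available.

```r
# Hypothetical helper (not from any package): returns the subset of `pkgs`
# that cannot be loaded from the current library paths
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Same package names as used in this activity
packages <- c("tidyverse", "skimr", "flextable",
              "ggfortify", "broom", "janitor")

# Character vector of packages that still need installing (possibly empty)
to_install <- find_missing(packages)
to_install

# Uncomment to install everything that is missing in a single call
# install.packages(to_install)
```

Keeping install.packages() commented out means the chunk can be run repeatedly without triggering any downloads.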

Typically, it is best practice to load any R packages at the top of an R Markdown file or script for clarity and organization.

➡️ Insert a new code chunk to load R packages for this activity using the code below.

# Loading R packages
library(tidyverse)
library(skimr)
library(flextable)
library(ggfortify)
library(broom)
library(janitor)

Note that we have already installed these R packages for use in this activity. In general, if this code yields an error such as Error in library(package_name) : there is no package called ‘package_name’, first submit install.packages('package_name') to the Console, replacing package_name with the name of the desired R package.

Palmer Penguins Data

For the remainder of this activity, we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network¹.

Artwork by @allison_horst

Data dictionary for Palmer Penguins data set
Variable	Description
species	Species of the penguin
island	Island the penguin was found on
bill_length_mm	Bill length (mm)
bill_depth_mm	Bill depth (mm)
flipper_length_mm	Flipper length (mm)
body_mass_g	Body mass (g)
sex	Sex of the penguin
year	Year data was collected

External data sets can be imported into R in several ways. Since Comma Separated Value (CSV) files are one of the most common file types for data, let’s import the Palmer Penguins data from a CSV file located at the following URL: https://raw.githubusercontent.com/dilernia/STA323/main/Data/penguins.csv

➡️ Add a new code chunk to import the Palmer Penguins data set using the read_csv() function creating an object called penguins, and run the code to import the data.

Here we imported a CSV file directly from a URL into R. More commonly, one would import a CSV file by specifying the file path to the data file on your own machine. A file path can be found on a Windows machine using the File Explorer, or on a Mac using Finder and the Terminal.
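To illustrate importing from a local file path, here is a minimal sketch. It assumes the readr package (part of the tidyverse) is installed, and it writes a tiny stand-in CSV to a temporary file so that the example is self-contained; the two rows of values are made up for illustration and are not the real penguins data.

```r
library(readr)

# Writing a tiny stand-in CSV to a temporary file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("species,body_mass_g",
             "Adelie,3750",
             "Gentoo,5000"),
           csv_path)

# Importing from a local file path works the same way as importing from a URL
example_data <- read_csv(csv_path, show_col_types = FALSE)

# Displaying the imported data
example_data
```

In practice, csv_path would be replaced by the path to a data file on your own machine.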

➡️ Explore the data by clicking on penguins in the Environment pane (top right of RStudio).

The skim() function from the skimr package provides a concise method for summarizing and conducting some basic exploratory data analysis of a data set in R. Note that skim() will produce output for any output format (HTML, PDF, or MS Word), but it is primarily intended for HTML documents.

➡️ Use the skim() function to explore high-level characteristics about the penguins data.

skimr::skim(penguins)

Code chunks have multiple options which control how code and output are displayed or evaluated. Some of the most commonly used options are:

eval
echo
include
warning and error
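In brief, eval controls whether the code is run, echo whether the source code is shown, include whether the chunk's code and results appear in the output document at all (the code is still run), and warning and error how warnings and errors are handled. An option can be set for a single chunk inside its braces, as in {r, echo = FALSE}. As a sketch of an alternative, defaults for all later chunks can be set with knitr::opts_chunk$set(), typically in a chunk near the top of the document; the particular values below are only an example.

```r
# Setting default chunk options for every chunk that follows
knitr::opts_chunk$set(
  eval    = TRUE,   # run the code in each chunk
  echo    = FALSE,  # hide the source code in the output document
  include = TRUE,   # keep each chunk's results in the output document
  warning = FALSE,  # do not print warnings in the output document
  error   = FALSE   # stop knitting when a chunk throws an error
)

# Checking one of the defaults just set
knitr::opts_chunk$get("echo")
```

Options given in an individual chunk's braces still override these defaults for that chunk.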

➡️ Modify the most recently included code chunk by setting the echo chunk option to be FALSE.

➡️ Toggle the include and eval chunk options for the code chunk using the skim() function, and see what happens.

To see all code chunk options and their default values, submit knitr::opts_chunk$get() to the Console.

Including Plots

Using R code inside code chunks that produce plots allows graphics to be directly included in the output document.

For example, let’s visualize the relationship between the flipper lengths (mm) and the body masses (g) of the penguins from the Palmer penguins data set. There are multiple packages for data visualization in R, but we will use the most popular package, ggplot2, created by Hadley Wickham.

➡️ Reproduce the scatter plot provided using the R code below.

# Creating scatter plot
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm)) + 
  geom_point()

We can also color the points based on the species of the penguins as in the scatter plot below.

# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"), 
                           c("Chinstrap", "Gentoo", "Adelie"))

# Creating scatter plot with color
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = species)) + 
  geom_point() +
  scale_color_manual(values = penguin_colors)

Moreover, we can modify the plot labels using the labs() function as below.

# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"), 
                           c("Chinstrap", "Gentoo", "Adelie"))

# Creating scatter plot with color
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = species)) + 
  geom_point() +
  scale_color_manual(values = penguin_colors) +
  labs(title = "Palmer penguins data set",
       x = "Body mass (g)",
       y = "Flipper length (mm)")

➡️ Reproduce the side-by-side box plots below showing the distribution of penguin flipper lengths for each species.

# Creating side-by-side box plots
penguins |> 
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  geom_boxplot() +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")

Inline Code

R Markdown also permits use of R code outside of code chunks to facilitate dynamic report generation using what is called inline R code. For example, we can calculate the longest and shortest observed flipper lengths for penguins in the Palmer Penguins data set and summarize these results dynamically in our report.

# Calculating maximum and minimum flipper lengths
max_flip <- max(penguins$flipper_length_mm, na.rm = TRUE)
min_flip <- min(penguins$flipper_length_mm, na.rm = TRUE)

Inline R code can be included using the following R Markdown syntax:

The longest flipper any penguin had was `r max_flip`mm, while the shortest was `r min_flip`mm.

which produces:

The longest flipper any penguin had was 231mm, while the shortest was 172mm.

➡️ At the bottom of the document, include a chunk of R code followed by inline R code to reproduce the sentence above about the longest and shortest flippers in this data set.

Using inline R code is not always necessary, but it can facilitate more efficient analyses if one’s data is ever updated, and it can also reduce the chances of mistakes in one’s report.
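To preview what an inline sentence will look like before knitting, the same text can be assembled in the Console. The sketch below uses a short made-up vector of flipper lengths (not the real penguins data) that includes a missing value, which is why na.rm = TRUE is needed.

```r
# Made-up flipper lengths (mm), including a missing value
flipper_lengths <- c(172, 181, NA, 231)

# Without na.rm = TRUE, max() and min() would both return NA here
max_flip <- max(flipper_lengths, na.rm = TRUE)
min_flip <- min(flipper_lengths, na.rm = TRUE)

# Building the same sentence that the inline R code would produce when knitted
flipper_sentence <- paste0("The longest flipper any penguin had was ", max_flip,
                           "mm, while the shortest was ", min_flip, "mm.")
flipper_sentence
```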

Tables

R Markdown can incorporate nicely displayed tables for all output formats (even Microsoft Word!). Let’s explore a few of the most common R functions for nicely displaying tables in R. To demonstrate these functions, let’s consider the mean and standard deviation of each penguin species’ flipper lengths.

# Summary statistics for flipper length by species
flipper_summary <- penguins |> 
  group_by(species) |> 
  summarize(Average = mean(flipper_length_mm, na.rm = TRUE),
            SD = sd(flipper_length_mm, na.rm = TRUE))

# Displaying the table
flipper_summary

## # A tibble: 3 × 3
##   species   Average    SD
##   <chr>       <dbl> <dbl>
## 1 Adelie       190.  6.54
## 2 Chinstrap    196.  7.13
## 3 Gentoo       217.  6.48

By default, tables do not look the nicest when displayed in R Markdown. However, there are several R packages to facilitate displaying of tables, one of the most versatile being the flextable package.

`flextable`

The flextable() function from the flextable package displays tables for all output formats in a consistent manner, with a large number of methods for customization that work for all output formats.

# Displaying table of summary statistics
flipper_summary |> 
  flextable() |> 
  set_caption(caption = "Table 1. Summary statistics for penguin flipper lengths in mm.") |> 
  colformat_double(digits = 2) |> 
  autofit()

Table 1. Summary statistics for penguin flipper lengths in mm.
species	Average	SD
Adelie	189.95	6.54
Chinstrap	195.82	7.13
Gentoo	217.19	6.48

➡️ Reproduce the table below containing the minimum, median, and maximum of each species’ body masses.

# Summary statistics for body mass by species
body_summary <- penguins |> 
  group_by(species) |> 
  summarize(Min = min(body_mass_g, na.rm = TRUE),
            Median = median(body_mass_g, na.rm = TRUE),
            Max = max(body_mass_g, na.rm = TRUE))

# Displaying table of summary statistics
body_summary |> 
  flextable() |> 
  set_caption(caption = "Table 2. Summary statistics for penguin body masses in grams.") |> 
  colformat_double(digits = 0) |> 
  autofit()

Table 2. Summary statistics for penguin body masses in grams.
species	Min	Median	Max
Adelie	2,850	3,700	4,775
Chinstrap	2,700	3,700	4,800
Gentoo	3,950	5,000	6,300

➡️ Use the code below to reproduce the table of counts and proportions of penguins of each species.

# Table of counts and percentages
penguins |> 
  janitor::tabyl(species) |> 
  flextable() |> 
  set_caption(caption = "Table 3. Number and proportion of penguins of each species.") |> 
  colformat_double(digits = 3) |> 
  autofit()

Table 3. Number and proportion of penguins of each species.
species	n	percent
Adelie	152	0.442
Chinstrap	68	0.198
Gentoo	124	0.360
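As a quick check on Table 3, the percent column (which holds proportions) can be recomputed from the counts alone. This is a small base R sketch that types in the counts from the table rather than using the penguins object.

```r
# Counts of each species, copied from Table 3
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Total number of penguins
sum(counts)

# Proportion of penguins of each species, rounded to 3 decimal places
proportions <- round(prop.table(counts), 3)
proportions
```

The counts sum to 344, the number of penguins in the data set, and the rounded proportions match the table.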

Table Themes

There are many options for customizing aspects about tables displayed in R Markdown. One method is to apply complete themes to customize the appearance of tables with minimal code.

# Displaying table of summary statistics with applied theme
flipper_summary |> 
  flextable() |> 
  set_caption(caption = "Table 4. Summary statistics for penguin flipper lengths in mm displayed with flextable().") |> 
  colformat_double(digits = 2) |> 
  autofit() |> 
  theme_zebra()

Table 4. Summary statistics for penguin flipper lengths in mm displayed with flextable().
species	Average	SD
Adelie	189.95	6.54
Chinstrap	195.82	7.13
Gentoo	217.19	6.48

➡️ Display the flipper_summary table using a complete theme of your choice by viewing the available themes here: https://davidgohel.github.io/flextable/reference/index.html#flextable-themes.

Statistical Models

Linear Regression Model

There are a large number of statistical models available in R, with one of the most commonly used methods being linear regression. Using the penguins data, we can model the flipper lengths of penguins using their body masses with a linear regression model implemented via the lm() function as below.

# Fitting a simple linear regression model
slr_model <- lm(flipper_length_mm ~ body_mass_g, data = penguins)

We can also obtain diagnostic plots for the fitted model using the autoplot() function from the ggfortify package.

# Creating grid of diagnostic plots for the SLR model
autoplot(slr_model)

In addition to diagnostic plots for checking model assumptions, we can display the model estimates and model fit metrics as well using the broom package.

# Displaying model estimates
slr_model |> 
  tidy(conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 4)) |> 
  flextable() |> 
  colformat_double(digits = 4) |> 
  set_caption("Table 5. Linear regression estimates.") |> 
  autofit()

Table 5. Linear regression estimates.
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	136.7296	1.9968	68.4731	< 2.2e-16	132.8019	140.6573
body_mass_g	0.0153	0.0005	32.7222	< 2.2e-16	0.0144	0.0162
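To interpret the estimates in Table 5, we can compute a predicted flipper length by hand. The sketch below types in the rounded estimates from the table and uses a body mass of 4000 g, a value chosen only for illustration.

```r
# Rounded estimates copied from Table 5
intercept <- 136.7296
slope <- 0.0153

# Illustrative body mass in grams (an arbitrary choice)
body_mass_g <- 4000

# Predicted flipper length in mm: intercept + slope * body mass
predicted_flipper_mm <- intercept + slope * body_mass_g
predicted_flipper_mm

# With the fitted model object, the equivalent call would be:
# predict(slr_model, newdata = data.frame(body_mass_g = 4000))
```

Because the slope is rounded to four decimal places, the result from predict() may differ slightly from this hand calculation.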

# Displaying model summary metrics
slr_model |> 
  glance() |> 
  mutate(p.value = format.pval(p.value, digits = 3),
         df = as.integer(df)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 6. Linear model summary metrics.") |> 
  autofit()

Table 6. Linear model summary metrics.
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.759	0.758	6.913	1,070.745	<2e-16	1	-1,145.518	2,297.035	2,308.540	16,250.301	340	342
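Several of the metrics in Table 6 are related to one another, which we can verify with a little arithmetic. The sketch below types in values from the table and uses the standard formulas for adjusted R-squared and the residual standard error; for a linear model, the deviance is the residual sum of squares.

```r
# Values copied from Table 6
r_squared <- 0.759
n_obs <- 342
n_predictors <- 1
residual_ss <- 16250.301   # the deviance column
df_residual <- 340

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
round(adj_r_squared, 3)

# Residual standard error (sigma) from the residual sum of squares
sigma_hat <- sqrt(residual_ss / df_residual)
round(sigma_hat, 3)
```

Both rounded values agree with the adj.r.squared and sigma columns of the table.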

We can also visualize the least squares regression line using the geom_smooth() function.

➡️ Reproduce the scatter plot provided using the R code below containing the geom_smooth() function which adds a line of best fit to the plot.

# Creating scatter plot with line of best fit
penguins |> 
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Two-sample $t$-test

Another common statistical method is the two-sample $t$-test. Let’s implement this for the penguins data as well, comparing the Adelie and Chinstrap penguins in terms of their average flipper lengths in millimeters.

# Subsetting to Adelie and Chinstrap penguins
adelie_chinstrap <- penguins |> 
  dplyr::filter(species %in% c("Adelie", "Chinstrap"))

# Creating faceted histogram
adelie_chinstrap |> 
  ggplot(aes(x = flipper_length_mm, fill = species)) + 
  geom_histogram(color = "black") +
  facet_grid(species ~ .) +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")

# Implementing the two-sample t-test
t_result <- t.test(flipper_length_mm ~ species, 
                    alternative = "two.sided",
                    var.equal = FALSE,
                    data = adelie_chinstrap)


# Displaying results of the t-test
t_result |> 
  tidy(conf.int = TRUE) |> 
  mutate(p.value = format.pval(p.value, digits = 3)) |> 
  flextable() |> 
  colformat_double(digits = 3) |> 
  set_caption("Table 7. Output for two-sample t-test.") |> 
  autofit()

Table 7. Output for two-sample t-test.
estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-5.870	189.954	195.824	-5.780	6.05e-08	119.677	-7.881	-3.859	Welch Two Sample t-test	two.sided
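In Table 7, estimate is the difference between the two group means (estimate1 minus estimate2), and setting var.equal = FALSE requests the Welch version of the test. To show what that test statistic measures, the sketch below applies t.test() to two small made-up samples (not the penguins data) and recomputes the statistic by hand.

```r
# Made-up flipper lengths (mm) for two small groups
group_a <- c(185, 190, 188, 192, 187)
group_b <- c(195, 198, 193, 200, 196, 197)

# Welch two-sample t-test (variances not assumed equal)
welch_result <- t.test(group_a, group_b,
                       alternative = "two.sided",
                       var.equal = FALSE)

# Welch statistic by hand: difference in means over its standard error
standard_error <- sqrt(var(group_a) / length(group_a) +
                         var(group_b) / length(group_b))
t_by_hand <- (mean(group_a) - mean(group_b)) / standard_error

# Comparing the two
welch_result$method
unname(welch_result$statistic)
t_by_hand
```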

Much More!

This activity was made using R Markdown
This R Markdown Cheat Sheet describes additional features and fundamentals of R Markdown

$\LaTeX$ with R Markdown

For those familiar with the typesetting language $\LaTeX$, it can be used in R Markdown documents for any output format as below. Note that LaTeX code is used outside of code chunks.

LaTeX code:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
  
$$\Sigma =  \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}$$

Output:

\[\hat{\beta} = (X^TX)^{-1}X^Ty\]

\[\Sigma = \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}\]
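The first formula above is the least squares estimator, which connects back to the lm() function used earlier. As a sketch, we can compute it with matrix algebra and compare it against lm(); the built-in mtcars data set is used here only as a convenient assumption so that the example runs without importing anything.

```r
# Design matrix with a column of 1s for the intercept, and the response vector
X <- cbind(1, mtcars$wt)
y <- mtcars$mpg

# Least squares estimates via the matrix formula (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Same model fit with lm()
lm_fit <- lm(mpg ~ wt, data = mtcars)

# Comparing the two sets of estimates
as.numeric(beta_hat)
unname(coef(lm_fit))
```

Note that lm() does not form the inverse explicitly; it uses a QR decomposition, which is more numerically stable but yields the same estimates here.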

There are a few other R packages that facilitate use of $\LaTeX$ with R Markdown as well.

xtable: facilitates converting data frames and matrices into $\LaTeX$ tables to integrate R output with $\LaTeX$ documents.
stargazer: facilitates displaying regression output and model comparisons (especially with nested models), tables of summary statistics, vectors, matrices, and data frames.

Word Templates

Organizations, such as the US Department of Agriculture, can have weekly or monthly reports that change as data / other inputs are updated
R Markdown to Word can use Word doc templates for consistent formatting, headers, etc. while updating charts & tables

Quarto

Quarto is another tool for reproducible documents, usable in RStudio, that Posit publicly released in 2022, whereas R Markdown was introduced in 2012.
The syntax for Quarto is very similar to that of R Markdown, but the main differences are that Quarto has a simplified YAML, chunk options are specified slightly differently, and Quarto has more of a multilingual focus than R Markdown.

R for SAS Users

For those familiar with SAS who would like to learn more about R, there are a few resources to help.

Post R Workshop Survey (Optional)

A link for a survey to gather feedback related to this workshop is included below. Note that this survey is optional, and that it is not anonymous (due to the relatively small size of the workshop).

https://forms.gle/Edtxfcat6FuAH3Gt9

References

¹ Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

Introduction to R & R Markdown

Andrew DiLernia

Grand Valley State University