Knitting R Markdown to different formats
Understanding fundamental Markdown syntax
Using code chunks in R Markdown
Using code chunk options
Including plots & tables in R Markdown
Fitting basic models with R
In this activity, we will learn some fundamental concepts related to R and R Markdown. As we progress through the activity, the ➡️ symbols will prompt us to complete a task to facilitate a more interactive learning session.
➡️ First, create a new R Markdown document with File > New File > R Markdown…, and then click OK. Knit the file by clicking the Knit button (top left).
➡️ Create one new R Markdown document for each of the three most common formats: HTML, PDF and Word. Knit each of the three documents, noting the differences in style but consistency in the information contained.
Note: If knitting to PDF causes an error, you may need to install \(\LaTeX\), a typesetting system popular in mathematics. To install \(\LaTeX\), first install the `tinytex` package by submitting `install.packages("tinytex")` to the Console (bottom left of RStudio) and pressing return / enter, and then submit `tinytex::install_tinytex()` to the Console as well.
For the rest of the activity HTML will be the recommended output format, but you are welcome to use whichever output format you prefer.
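The Knit button is not the only way to produce output. As a minimal sketch, assuming the `rmarkdown` package and pandoc (bundled with RStudio) are available, the same source file can be knit to a chosen format from the Console with `rmarkdown::render()`. The file contents below are made up purely for illustration.

```r
# Minimal sketch: knit a source file to a chosen format from the Console
library(rmarkdown)

# Write a tiny, hypothetical R Markdown file to a temporary location
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Knit demo\"",
  "---",
  "",
  "Some *Markdown* text."
), rmd_file)

# Knit to HTML; swap in "word_document" or "pdf_document" for other formats
html_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)

# Path of the knitted file
html_file
```

The source file stays the same across formats; only the `output_format` argument changes.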
R Markdown is a tool for data analysis reports that integrates Markdown, a lightweight markup language for formatting plain text files, with R code and output. Markdown is designed to be easy to read and write, so it emphasizes simplicity over customizability.
Headings and subheadings can be used to organize one’s document using Markdown syntax.
# 1st Level Header
## 2nd Level Header
### 3rd Level Header
➡️ Include a first-level header and second-level header in your R Markdown document naming them Section 1 and Subsection 1.1, respectively.
Markdown can be used to include hyperlinks and external images / GIFs in R Markdown documents as well.
<http://example.com>
[linked phrase](http://example.com)
![optional caption text for image or GIF](path/to/img.png)
➡️ Include a hyperlink in the R Markdown document linking to the Google homepage or your favorite website.
➡️ Include the following GIF at this URL (or another GIF you like from https://www.giphy.com)
Basic text formatting, such as italics, bold, superscripts, subscripts, and strikethrough, can be applied as well.
Raw text input | Output |
---|---|
*italic* | italic |
**bold** | bold |
`code` | code |
superscript^2^ | superscript2 |
subscript~2~ | subscript2 |
~~strikethrough~~ | strikethrough |
➡️ Include an italicized word or phrase in your R Markdown document.
➡️ Include a bolded word or phrase in your R Markdown document.
The real power of R Markdown comes from the ability to integrate simple formatting of text with code included in code chunks, which allow us to run R code inside our document:
```{r}
# Defining radius
radius <- 1
# Calculating area
area <- pi * radius^2
# Displaying result
area
```
There are three main ways to insert a code chunk:
The keyboard shortcut Cmd/Ctrl + Alt + I (recommended)
The “Insert” button icon in the editor toolbar (located at the top right of RStudio).
By manually typing the chunk delimiters ```{r} and ```.
Backticks, ```, are used to enclose code chunks in R Markdown documents, separating code from the rest of the document. Note that we will only use R as the language for our code chunks, but other programming languages are possible as well.
➡️ Include a code chunk containing the R code below. Then run the code chunk by pressing the green Run Current Chunk button at the top right of the chunk.
# Important message
stats_message <- "All models are wrong, but some are useful."
# Printing message
paste0(stats_message)
R packages are collections of R code and functions that others have made available. R packages most commonly reside on the Comprehensive R Archive Network (CRAN). To use an R package from CRAN, it must be installed using the `install.packages()` function and then loaded using the `library()` function.
For example, we can install the `janitor` R package by submitting `install.packages("janitor")` to the Console pane in the bottom of RStudio. We can then include a code chunk with the code `library(janitor)` to load the package when we knit our R Markdown file.
➡️ Submit the following R code to the Console to install packages for the remainder of the activity.
# Creating vector of package names
packages <- c("tidyverse", "skimr", "flextable",
              "ggfortify", "broom", "janitor")
# Installing package if not already installed
for(p in packages) {
  if(length(find.package(p, quiet = TRUE)) == 0) {
    install.packages(p)
  }
}
Typically, it is best practice to load any R packages at the top of an R Markdown file or script for clarity and organization.
➡️ Insert a new code chunk to load R packages for this activity using the code below.
# Loading R packages
library(tidyverse)
library(skimr)
library(flextable)
library(ggfortify)
library(broom)
library(janitor)
Note that we have already installed these R packages for use in this activity, but in general you may need to submit `install.packages('package_name')` to the Console first to install a package called `package_name` (replacing `package_name` with the name of the desired R package) if this code yields an error saying that there is no package called `package_name`.
For the remainder of this activity, we will analyze the Palmer Penguins data set: a data set consisting of measurements collected on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica. Specifically, data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Network [1].
Artwork by @allison_horst
Variable | Description |
---|---|
species | Species of the penguin |
island | Island the penguin was found on |
bill_length_mm | Bill length (mm) |
bill_depth_mm | Bill depth (mm) |
flipper_length_mm | Flipper length (mm) |
body_mass_g | Body mass (g) |
sex | Sex of the penguin |
year | Year data was collected |
External data sets can be imported into R in several ways. Since Comma Separated Value (CSV) files are one of the most common file types for data, let’s import the Palmer Penguins data from a CSV file located at the following URL: https://raw.githubusercontent.com/dilernia/STA323/main/Data/penguins.csv
➡️ Add a new code chunk to import the Palmer Penguins data set using the `read_csv()` function, creating an object called `penguins`, and run the code to import the data.
Here we imported a CSV file directly from a URL into R. More commonly, one imports a CSV file by specifying the file path to the data file on one's own machine. A file path can be found on a Windows machine using File Explorer, or on a Mac using Finder and the Terminal.
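Importing from a local file path works the same way as importing from a URL. The sketch below assumes the `readr` package (part of the tidyverse) is installed. To stay self-contained, it first writes a small, made-up CSV file to a temporary folder; the file name and values are hypothetical stand-ins for a real data file.

```r
library(readr)

# Create a small, made-up CSV file in a temporary folder
# (a stand-in for a real data file on your machine)
csv_path <- file.path(tempdir(), "example_penguins.csv")
write_csv(data.frame(species = c("Adelie", "Gentoo"),
                     body_mass_g = c(3750, 5000)),
          csv_path)

# Import the file by supplying its path, just as we would supply a URL
example_data <- read_csv(csv_path, show_col_types = FALSE)
example_data
```

In practice, `csv_path` would simply be the path to the data file, written with forward slashes (for example, a folder name followed by the file name).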
➡️ Explore the data by clicking on `penguins` in the Environment pane (top right of RStudio).
The `skim()` function from the `skimr` package provides a concise method for summarizing and conducting some basic exploratory data analysis of a data set in R. Note that `skim()` will produce output for any output format (HTML, PDF, or MS Word), but it is primarily intended for HTML documents.
➡️ Use the `skim()` function to explore high-level characteristics of the `penguins` data.
Code chunks have multiple options which control how code and output are displayed or evaluated. Some of the most commonly used options are:
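eval: whether the code in the chunk is run
echo: whether the code itself is displayed in the output document
include: whether the chunk's code and output appear in the output document (the code is still run)
warning and error: whether warnings are displayed, and whether knitting continues and displays the error message rather than stopping when an error occurs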
➡️ Modify the most recently included code chunk by setting the `echo` chunk option to be `FALSE`.
➡️ Toggle the `include` and `eval` chunk options for the code chunk using the `skim()` function, and see what happens.
To see all code chunk options and their default values, submit `knitr::opts_chunk$get()` to the Console.
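Chunk options can also be set once for the whole document instead of chunk by chunk. As a minimal sketch, assuming the `knitr` package is installed (it is installed along with `rmarkdown`), the defaults can be changed with `knitr::opts_chunk$set()`, typically in a setup chunk near the top of the file.

```r
library(knitr)

# Hide code and warnings for every chunk that follows in the document
opts_chunk$set(echo = FALSE, warning = FALSE)

# Confirm the new default values
opts_chunk$get("echo")
opts_chunk$get("warning")
```

Options given in an individual chunk header still take precedence over these document-wide defaults.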
Using R code inside code chunks that produce plots allows graphics to be directly included in the output document.
For example, let’s visualize the relationship between the flipper lengths (mm) and the body masses (g) of the penguins from the Palmer penguins data set. There are multiple packages for data visualization in R, but we will use the most popular package, `ggplot2`, created by Hadley Wickham.
➡️ Reproduce the scatter plot provided using the R code below.
# Creating scatter plot
penguins |>
  ggplot(aes(x = body_mass_g,
             y = flipper_length_mm)) +
  geom_point()
We can also color the points based on the species of the penguins as in the scatter plot below.
# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"),
                           c("Chinstrap", "Gentoo", "Adelie"))
# Creating scatter plot with color
penguins |>
  ggplot(aes(x = body_mass_g,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  scale_color_manual(values = penguin_colors)
Moreover, we can modify the plot labels using the `labs()` function as below.
# Creating vector of penguin colors
penguin_colors <- setNames(c("#c65dcb", "#067276", "#ff7b00"),
                           c("Chinstrap", "Gentoo", "Adelie"))
# Creating scatter plot with color
penguins |>
  ggplot(aes(x = body_mass_g,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  scale_color_manual(values = penguin_colors) +
  labs(title = "Palmer penguins data set",
       x = "Body mass (g)",
       y = "Flipper length (mm)")
➡️ Reproduce the side-by-side box plots below showing the distribution of penguin flipper lengths for each species.
# Creating side-by-side box plots
penguins |>
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot() +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")
R Markdown also permits use of R code outside of code chunks to facilitate dynamic report generation using what is called inline R code. For example, we can calculate the longest and shortest observed flipper lengths for penguins in the Palmer Penguins data set and summarize these results dynamically in our report.
# Calculating maximum and minimum flipper lengths
max_flip <- max(penguins$flipper_length_mm, na.rm = TRUE)
min_flip <- min(penguins$flipper_length_mm, na.rm = TRUE)
Inline R code can be included using the following R Markdown syntax:
The longest flipper any penguin had was `r max_flip`mm, while the shortest was `r min_flip`mm.
which produces:
The longest flipper any penguin had was 231mm, while the shortest was 172mm.
➡️ At the bottom of the document, include a chunk of R code followed by inline R code to reproduce the sentence above about the most extreme penguins in this data set.
Using inline R code is not always necessary, but it can facilitate more efficient analyses if one’s data is ever updated, and it can also reduce the chances of mistakes in one’s report.
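Before knitting, it can help to preview what an inline sentence will say by assembling it in the Console. The sketch below uses a short, hypothetical vector of flipper lengths (not the real data), chosen so that its extremes match the values reported above.

```r
# Hypothetical flipper lengths (mm), including a missing value
flipper_lengths <- c(181, 172, NA, 231, 195)

# Same calculations as for the real data
max_flip <- max(flipper_lengths, na.rm = TRUE)
min_flip <- min(flipper_lengths, na.rm = TRUE)

# Assemble the sentence that the inline R code would produce
preview <- paste0("The longest flipper any penguin had was ", max_flip,
                  "mm, while the shortest was ", min_flip, "mm.")
preview
```

Without `na.rm = TRUE`, both `max()` and `min()` would return `NA` here, and the knitted sentence would display `NA` in place of the numbers.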
R Markdown can incorporate nicely displayed tables for all output formats (even Microsoft Word!). Let’s explore a few of the most common R functions for nicely displaying tables in R. To demonstrate these functions, let’s consider the mean and standard deviation of each penguin species’ flipper lengths.
# Summary statistics for flipper length by species
flipper_summary <- penguins |>
  group_by(species) |>
  summarize(Average = mean(flipper_length_mm, na.rm = TRUE),
            SD = sd(flipper_length_mm, na.rm = TRUE))
# Displaying the table
flipper_summary
## # A tibble: 3 × 3
##   species   Average    SD
##   <chr>       <dbl> <dbl>
## 1 Adelie       190.  6.54
## 2 Chinstrap    196.  7.13
## 3 Gentoo       217.  6.48
By default, tables are displayed as plain console output in R Markdown, which does not look very polished. However, there are several R packages that facilitate displaying tables, one of the most versatile being the `flextable` package.
flextable
The `flextable()` function from the `flextable` package displays tables consistently across all output formats, and its large number of customization methods work for every output format.
# Displaying table of summary statistics
flipper_summary |>
  flextable() |>
  set_caption(caption = "Table 1. Summary statistics for penguin flipper lengths in mm.") |>
  colformat_double(digits = 2) |>
  autofit()
species | Average | SD |
---|---|---|
Adelie | 189.95 | 6.54 |
Chinstrap | 195.82 | 7.13 |
Gentoo | 217.19 | 6.48 |
➡️ Reproduce the table below containing the minimum, median, and maximum of each species’ body masses.
# Summary statistics for body mass by species
body_summary <- penguins |>
  group_by(species) |>
  summarize(Min = min(body_mass_g, na.rm = TRUE),
            Median = median(body_mass_g, na.rm = TRUE),
            Max = max(body_mass_g, na.rm = TRUE))
# Displaying table of summary statistics
body_summary |>
  flextable() |>
  set_caption(caption = "Table 2. Summary statistics for penguin body masses in grams.") |>
  colformat_double(digits = 0) |>
  autofit()
species | Min | Median | Max |
---|---|---|---|
Adelie | 2,850 | 3,700 | 4,775 |
Chinstrap | 2,700 | 3,700 | 4,800 |
Gentoo | 3,950 | 5,000 | 6,300 |
➡️ Reproduce the table below containing counts and percentages for the number of penguins of each species using the code below.
# Table of counts and percentages
penguins |>
  janitor::tabyl(species) |>
  flextable() |>
  set_caption(caption = "Table 3. Number and proportion of penguins of each species.") |>
  colformat_double(digits = 3) |>
  autofit()
species | n | percent |
---|---|---|
Adelie | 152 | 0.442 |
Chinstrap | 68 | 0.198 |
Gentoo | 124 | 0.360 |
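The `percent` column reported by `tabyl()` is each count divided by the total count, so it holds proportions between 0 and 1. As a quick check, the values in Table 3 can be recomputed in base R from the counts alone:

```r
# Counts of penguins by species, taken from Table 3
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Proportion of penguins belonging to each species
proportions <- counts / sum(counts)

# Rounded to 3 decimal places, as in the table
round(proportions, 3)
```

Multiplying `proportions` by 100 would give percentages on the usual 0 to 100 scale.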
There are many options for customizing the appearance of tables displayed in R Markdown. One method is to apply a complete theme, which restyles a table with minimal code.
# Displaying table of summary statistics with applied theme
flipper_summary |>
  flextable() |>
  set_caption(caption = "Table 4. Summary statistics for penguin flipper lengths in mm displayed with flextable().") |>
  colformat_double(digits = 2) |>
  autofit() |>
  theme_zebra()
species | Average | SD |
---|---|---|
Adelie | 189.95 | 6.54 |
Chinstrap | 195.82 | 7.13 |
Gentoo | 217.19 | 6.48 |
➡️ Display the `flipper_summary` table using a complete theme of your choice by viewing the available themes here: https://davidgohel.github.io/flextable/reference/index.html#flextable-themes.
There are a large number of statistical models available in R, with one of the most commonly used methods being linear regression. Using the `penguins` data, we can model the flipper lengths of penguins using their body masses with a linear regression model implemented via the `lm()` function as below.
# Fitting a simple linear regression model
slr_model <- lm(flipper_length_mm ~ body_mass_g, data = penguins)
We can also obtain diagnostic plots for the fitted model using the `autoplot()` function from the `ggfortify` package.
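As a minimal sketch of what such a call can look like, the code below fits a stand-in model to the built-in `cars` data set (not the penguins model) so that it runs on its own, assuming the `ggplot2` and `ggfortify` packages are installed. The `which` argument selects which diagnostic plots to draw.

```r
library(ggplot2)
library(ggfortify)

# Stand-in model fit to the built-in cars data set (not the penguins model)
demo_model <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted values and normal Q-Q plots
diagnostic_plots <- autoplot(demo_model, which = 1:2)
diagnostic_plots
```

For the model in this activity, the analogous call would be `autoplot(slr_model)`, which shows a default set of diagnostic plots.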
In addition to diagnostic plots for checking model assumptions, we can display the model estimates and model fit metrics as well using the `broom` package.
# Displaying model estimates
slr_model |>
  tidy(conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 4)) |>
  flextable() |>
  colformat_double(digits = 4) |>
  set_caption("Table 5. Linear regression estimates.") |>
  autofit()
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 136.7296 | 1.9968 | 68.4731 | < 2.2e-16 | 132.8019 | 140.6573 |
body_mass_g | 0.0153 | 0.0005 | 32.7222 | < 2.2e-16 | 0.0144 | 0.0162 |
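To see what the estimates in Table 5 mean in practice, we can plug them into the fitted line by hand. The sketch below uses the rounded estimates from the table and a hypothetical body mass of 4,000 g chosen for illustration, so the result is approximate.

```r
# Rounded estimates taken from Table 5
intercept <- 136.7296
slope <- 0.0153

# Hypothetical body mass (g) chosen for illustration
body_mass_g <- 4000

# Predicted flipper length (mm): intercept + slope * body mass
predicted_flipper_mm <- intercept + slope * body_mass_g
predicted_flipper_mm
```

This gives a predicted flipper length of about 198 mm. With the fitted model object available, `predict(slr_model, newdata = data.frame(body_mass_g = 4000))` would give the prediction based on the unrounded estimates.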
# Displaying model summary metrics
slr_model |>
  glance() |>
  mutate(p.value = format.pval(p.value, digits = 3),
         df = as.integer(df)) |>
  flextable() |>
  colformat_double(digits = 3) |>
  set_caption("Table 6. Linear model summary metrics.") |>
  autofit()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.759 | 0.758 | 6.913 | 1,070.745 | <2e-16 | 1 | -1,145.518 | 2,297.035 | 2,308.540 | 16,250.301 | 340 | 342 |
We can also visualize the least squares regression line using the `geom_smooth()` function.
➡️ Reproduce the scatter plot provided using the R code below containing the `geom_smooth()` function, which adds a line of best fit to the plot.
# Creating scatter plot with line of best fit
penguins |>
  ggplot(aes(x = body_mass_g,
             y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Another common statistical method is the two-sample \(t\)-test. Let’s implement this for the penguins data as well, comparing the Adelie and Chinstrap penguins in terms of their average flipper lengths in millimeters.
# Subsetting to Adelie and Chinstrap penguins
adelie_chinstrap <- penguins |>
  dplyr::filter(species %in% c("Adelie", "Chinstrap"))
# Creating faceted histogram
adelie_chinstrap |>
  ggplot(aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(color = "black") +
  facet_grid(species ~ .) +
  scale_fill_manual(values = penguin_colors) +
  theme(legend.position = "none")
# Implementing the two-sample t-test
t_result <- t.test(flipper_length_mm ~ species,
                   alternative = "two.sided",
                   var.equal = FALSE,
                   data = adelie_chinstrap)
# Displaying results of the t-test
t_result |>
  tidy(conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 3)) |>
  flextable() |>
  colformat_double(digits = 3) |>
  set_caption("Table 7. Output for two-sample t-test.") |>
  autofit()
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-5.870 | 189.954 | 195.824 | -5.780 | 6.05e-08 | 119.677 | -7.881 | -3.859 | Welch Two Sample t-test | two.sided |
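To see where the `statistic` and `parameter` values in Table 7 come from, the Welch test can be approximately reproduced by hand from the summary statistics reported earlier. This is a sketch under stated assumptions: the means and standard deviations are the rounded values from the tables above, and the sample sizes assume that one Adelie penguin and no Chinstrap penguins are missing a flipper length.

```r
# Rounded summary statistics taken from the tables above
mean_adelie <- 189.954
mean_chinstrap <- 195.824
sd_adelie <- 6.54
sd_chinstrap <- 7.13

# Assumed numbers of non-missing flipper lengths per species
n_adelie <- 151
n_chinstrap <- 68

# Squared standard error contributed by each group
se2_adelie <- sd_adelie^2 / n_adelie
se2_chinstrap <- sd_chinstrap^2 / n_chinstrap

# Welch test statistic
t_stat <- (mean_adelie - mean_chinstrap) / sqrt(se2_adelie + se2_chinstrap)

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (se2_adelie + se2_chinstrap)^2 /
  (se2_adelie^2 / (n_adelie - 1) + se2_chinstrap^2 / (n_chinstrap - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(statistic = t_stat, parameter = df_welch, p.value = p_value)
```

Because the inputs are rounded, the results should land close to, but not exactly on, the values in Table 7.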
This activity was made using R Markdown
This R Markdown Cheat Sheet describes additional features and fundamentals of R Markdown
For those familiar with the typesetting language \(\LaTeX\), it can be used in R Markdown documents for any output format as below. Note that LaTeX code is used outside of code chunks.
LaTeX code:
`$$\hat{\beta} = (X^TX)^{-1}X^Ty$$`
`$$\Sigma = \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}$$`
Output:
\[\hat{\beta} = (X^TX)^{-1}X^Ty\]
\[\Sigma = \begin{bmatrix} 1 & -2 \\ -2 & 1 \end{bmatrix}\]
There are a few other R packages that facilitate use of \(\LaTeX\) with R Markdown as well.
xtable: facilitates converting data frames and matrices into \(\LaTeX\) tables to integrate R output with \(\LaTeX\) documents.
stargazer: facilitates displaying regression output and model comparisons (especially with nested models), tables of summary statistics, vectors, matrices, and data frames.
Organizations, such as the US Department of Agriculture, can have weekly or monthly reports that change as data / other inputs are updated
R Markdown to Word can use Word doc templates for consistent formatting, headers, etc. while updating charts & tables
Quarto is another tool for reproducible documents in RStudio that Posit introduced in 2020, whereas R Markdown was introduced in 2012.
The syntax for Quarto is very similar to that of R Markdown, but the main differences are that Quarto has a simplified YAML, chunk options are specified slightly differently, and Quarto has more of a multilingual focus than R Markdown.
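As a small illustration of the difference in chunk options, Quarto places options on special `#|` comment lines at the top of the chunk, written in YAML style, rather than in the chunk header. The chunk contents below are a hypothetical example using the built-in `cars` data set.

```r
#| echo: false
#| warning: false

# In R Markdown, the same options would instead go in the chunk header:
# {r, echo=FALSE, warning=FALSE}
summary(cars)
```

Because the `#|` lines are ordinary R comments, the code itself runs the same way in either tool.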
For those familiar with SAS who would like to learn more about R, there are a few resources to help.
A link for a survey to gather feedback related to this workshop is included below. Note that this survey is optional, and that it is not anonymous (due to the relatively small size of the workshop).
https://forms.gle/Edtxfcat6FuAH3Gt9
[1] Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.