02-overview.Rmd

# Conceptual Overview {#conceptual-overview}

The goal of this text is to showcase many tips and tricks for getting the most value from R Markdown. The following chapters demonstrate techniques to write more efficient and succinct code and to customize your output. Before we begin, it may be helpful to learn just a bit more about how R Markdown works, to help you to understand, remember, apply, and "remix" these tricks. In this section, we provide a brief overview of the process of knitting a document and the "key levers to pull" to change the output. This material is not necessary to understand the subsequent chapters (so feel free to skip ahead!), but it may help you to build a richer mental model for how all the pieces fit together.

## What happens when we render? {#rmarkdown-process}

R Markdown combines several different processes together to create documents, and one of the main sources of confusion from R Markdown is how all the components work together.^[Allison Horst has created an amusing artwork that describes the R Markdown process as wizardry: https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/rmarkdown_wizards.png. As a matter of fact, the cover image of this book was adapted from this artwork.] Fortunately, as a user, it is not essential to understand all the inner workings of these processes to be able to create documents. However, as a user who may be seeking to alter the behavior of a document, it is important to understand which component is responsible for what. This makes it a lot easier to seek help as you can target your searches on the correct area.

The basic workflow structure for an R Markdown document is shown in Figure \@ref(fig:rmdworkflow), highlighting the steps (arrows) and the intermediate files that are created before producing the output. The whole process is implemented via the function `rmarkdown::render()`. Each stage is explained in further detail below.

```{r rmdworkflow, echo = FALSE, fig.cap = "A diagram illustrating how an R Markdown document is converted to the final output document.", out.width='100%'}
knitr::include_graphics("images/workflow.png", dpi = NA)
```

The `.Rmd` document is the original format of the document. It contains a combination of YAML (metadata)\index{YAML}, text (narratives), and code chunks\index{code chunk}.

First, the `knit()` function in **knitr**\index{knitr} [@R-knitr] is used to execute all code embedded within the `.Rmd` file, and prepare the code output to be displayed within the output document. All these results are converted into the correct markup language to be contained within the temporary `.md` file.

Then the `.md` file is processed by Pandoc\index{Pandoc}, a multipurpose tool designed to convert files from one markup language to another. It takes any parameters specified within the YAML frontmatter of the document (e.g., `title`, `author`, and `date`) to convert the document to the output format specified in the `output` parameter (such as `html_document` for HTML output).

If the output format is PDF, there is an additional layer of processing, as Pandoc will convert the intermediate `.md` file into an intermediate `.tex` file. This file is then processed by LaTeX to form the final PDF document. As we mentioned in Section \@ref(install-latex), the **rmarkdown** package calls the `latexmk()` function in the **tinytex** package [@R-tinytex], which in turn calls LaTeX to compile `.tex` to `.pdf`.

In short, `rmarkdown::render()` = `knitr::knit()` + Pandoc (+ LaTeX for PDF output only).

Robin Linacre has written a nice summary of the relationship between R Markdown, **knitr**, and Pandoc at https://stackoverflow.com/q/40563479/559676, which contains more technical details than the above overview.

Note that not all R Markdown documents are eventually compiled through Pandoc. The intermediate `.md` file could be compiled by other Markdown renderers. Below are two examples:

- The **xaringan**\index{xaringan} package [@R-xaringan] passes the `.md` output to a JavaScript library, which renders the Markdown content in the web browser.

- The **blogdown**\index{blogdown} package [@R-blogdown] supports the `.Rmarkdown` document format, which is knitted to `.markdown`, and this Markdown document is usually rendered to HTML by an external site generator.

## R Markdown anatomy {#rmarkdown-anatomy}

We can dig one level deeper by considering the different components of an R Markdown. Specifically, let's look at when and how these are altered during the rendering workflow. 

### YAML metadata

The YAML metadata\index{YAML} (also called the YAML header) is processed in many stages of the rendering process and can influence the final document in many different ways. It is placed at the very beginning of the document and is read by each of Pandoc, **rmarkdown**, and **knitr**. Along the way, the information that it contains can affect the code, content, and the rendering process. 

A typical YAML header looks like this, and contains basic metadata about the document and rendering instructions:

```yaml
---
title: My R Markdown Report
author: Yihui Xie
output: html_document
---
```

In this case, the `title` and `author` fields are processed by Pandoc to set the values of template variables. With the default template, the title and author information will appear at the beginning of the resulting document. More details on how Pandoc uses information from the YAML header are included in the Pandoc manual's section on the [YAML metadata block.](https://pandoc.org/MANUAL.html#extension-yaml_metadata_block)

In contrast, the `output` field is used by **rmarkdown** to apply the output format function `rmarkdown::html_document()` in the rendering process. We can further influence the rendering process by passing arguments to the output format that we are specifying in `output`. For example, writing:

```yaml
output:
  html_document:
    toc: true
    toc_float: true
```

is the equivalent of telling `rmarkdown::render()` to apply the output format `rmarkdown::html_document(toc = TRUE, toc_float = TRUE)`. To find out what these options do, and to learn about other possible options, you may run `?rmarkdown::html_document` in your R console and read the help page. Note that `output: html_document` is equivalent to `output: rmarkdown::html_document`. When an output format does not have a qualifier like `rmarkdown::`, it is assumed that it is from the **rmarkdown** package, otherwise it must be prefixed with the R package name, e.g., `bookdown::html_document2`.

The YAML header can also influence our content and code if we choose to use parameters in YAML, as described in Section \@ref(parameterized-reports). In short, we can include variables and R expressions in this header that can be referenced throughout our R Markdown document. For example, the following header defines `start_date` and `end_date` parameters, which will be reflected in a list called `params` later in the R Markdown document. Thus, if we want to use these values in our R code, we can access them via `params$start_date` and `params$end_date`.

```yaml
---
title: My RMarkdown
author: Yihui Xie
output: html_document
params:
  start_date: '2020-01-01'
  end_date: '2020-06-01'
---
```

### Narrative {#narrative}

The narrative textual elements of R Markdown may be simpler to understand than the YAML metadata and code chunks. Typically, this will feel quite a bit like writing in a text editor. However, this Markdown content can be more powerful and interesting than simple text---both in how its content is made, and how the document structure is made from it.

While much of our narrative is human-written, many R Markdown documents will likely wish to reference the code and analysis being used. For this reason, Chapter \@ref(document-elements) demonstrates the many ways that code can help generate parts of the text, such as combining words into a list (Section \@ref(combine-words)) or writing a bibliography (Section \@ref(bibliography)). This conversion is handled by **knitr**\index{knitr} as we convert from `.Rmd` to `.md`.

Our Markdown text can also provide structure to the document. While we do not have enough space here to review the Markdown syntax,^[Instead, for a review of the Markdown syntax, please see https://bookdown.org/yihui/bookdown/markdown-syntax.html.] one particularly relevant concept is section headers, which are denoted by one or more hashes (`#`) corresponding to different levels, e.g.,

```md
# First-level header

## Second-level header

### Third-level header
```

These headers give structure to our entire document as **rmarkdown** converts the `.md` to our final output format. This structure is useful for referencing and formatting these sections by appending certain attributes to them. To create such references, Pandoc syntax allows us to provide a unique identifier by following the header notation with `{#id}`, or attach one or more classes to a section with `{.class-name}`, e.g.,

```md
## Second-level header {#introduction .important .large}
```

We can then access this section with many of the tools that you will learn, e.g., by referencing it with its ID or class. As examples, Section \@ref(cross-ref) demonstrates how to use the section ID to make cross-references throughout your document, and Section \@ref(html-tabs) introduces the `.tabset` class to help reorganize subsections.

The final interesting type of content that we might find in the textual part of our R Markdown is raw content written specifically for our desired output format, e.g., raw LaTeX code for LaTeX output (Section \@ref(raw-latex)), raw HTML code for HTML output, and so on (Section \@ref(raw-content)). Raw content may help you achieve things that cannot be done with plain Markdown, but please keep in mind that it is usually ignored when the output format is a different format, e.g., raw LaTeX commands in Markdown will be ignored when the output format is HTML.

### Code chunks

Code chunks\index{code chunk} are the beating heart of our R Markdown. The code in these chunks is run by **knitr**, and its output is translated to Markdown to dynamically keep our reports in sync with our current scripts and data. Each code chunk consists of a language engine (Chapter \@ref(other-languages)), an optional label, chunk options (Chapter \@ref(chunk-options)), and code.

To understand some of the modifications that we can make to code chunks, it is worth understanding the **knitr** process in slightly more detail. For each chunk, a **knitr** language engine gets three pieces of input: the knitting environment (`knitr::knit_global()`), the code input, and a list of chunk options. It returns the formatted representations of the code as well as its output. As a side effect, the knitting environment may also be modified, e.g., new variables may have been created in this environment via the source code in the code chunk. This process is illustrated in Figure \@ref(fig:knitr-workflow).

```{r knitr-workflow, echo = FALSE, fig.cap = 'A flowchart of inputs and outputs to a language engine.', out.width = '100%', fig.dim=c(7, 3.5), fig.align='center', cache=TRUE}
nomnoml::nomnoml(
  "
  [<frame>Code chunk|
  [Code]->[Language Engine]
  [Chunk Options]->[Language Engine]
  [Environment]->[Language Engine]
  [Language Engine]->[Formatted Code]
  [Language Engine]->[Formatted Output]
  [Language Engine]->[(Modified) Environment]
  ]")
```

We can modify this process by:

- changing our language engine;

- modifying chunk options, which can be global, local, or engine-specific;

- and by using hooks (Chapter \@ref(output-hooks) and Chapter \@ref(chunk-hooks)) to further process these inputs and outputs.

For example, in Section \@ref(hook-hide), you will learn how to add a hook to post-process the code output to redact certain lines in the source code.

Code chunks also have analogous concepts to the classes and unique identifiers that we explored for narratives in Section \@ref(narrative). A code chunk can specify an optional identifier (often called the "chunk label") immediately after its language engine. It can set classes for code and text output blocks in the output document via the chunk options `class.source` and `class.output`, respectively (see Section \@ref(chunk-styling)). For example, the chunk header ```` ```{r summary-stats, class.output = 'large-text'}```` gives this chunk a label `summary-stats`, and the class `large-text` for the text output blocks. A chunk can have only one label, but can have multiple classes.

### Document body

One important thing to understand when authoring and modifying a document is how code and narrative pieces create different sections, or containers within the document. For example, suppose we have a document that looks like this:

````md
# Title

## Section X

This is my introduction.

```{r chunk-x}`r ''`
x <- 1
print(x)
```

### Subsection 1

Here are some details.

### Subsection 2

These are more details.

## Section Y

This is another section.

```{r chunk-y}`r ''`
y <- 2
print(y)
```
````

When writing this document, we might think of each piece as linear with independent sections of text and code following in a sequence one after the other. However, what we are actually doing is creating a set of nested containers that conceptually^[In reality, there are many more containers than shown. For example, for a knitted code chunk, the code and output exist in separate containers that share a common parent.] looks more like Figure \@ref(fig:rmd-containers).

```{r rmd-containers, echo = FALSE, fig.cap = 'A simple R Markdown document illustrated as a set of nested containers.', out.width = '50%', fig.align='center', cache=TRUE}
nomnoml::nomnoml(
  "
  [Title (Level 1)|
  
    [Section X (Level 2)| - Text | - Code (chunk-x) | - Subsection 1 | - Subsection 2]
    [Section Y (Level 2)| - Text | - Code (chunk-y) ]

  ]")
```

Two key features of this diagram are (1) every section of text or code is its own discrete container, and (2) containers can be nested within one another. This nesting is particularly apparent if you are authoring your R Markdown document in the RStudio IDE and expand the document outline.

Note that in Figure \@ref(fig:rmd-containers), headers of the same level represent containers at the same level of nesting. Lower-level headers exist inside of the container of higher-level headers. In this case, it is common to call the higher-level sections the "parent" and the minor sections the "child." For example, a subsection is the child of a section. Besides headers, you can also create divisions in your text using `:::`, as demonstrated in Section \@ref(multi-column).

This structure has important implications as we attempt to apply some of the formatting and styling options that are described in this text. For example, we will see this nested structure when we learn about how Pandoc represents our document as an abstract syntax tree (Section \@ref(lua-filters)), or when we use CSS selectors (Section \@ref(html-css), among others) to style our HTML output. 

Formatting and styling can be applied to either containers of similar types (e.g., all code blocks), or all containers that exist inside of another container (e.g., everything under "Section Y"). Additionally, as explained in Section \@ref(narrative), we can apply the same classes to certain sections to designate them as being similar, and in this case, the common class names denote the common properties or intent of these sections.

As you read through this cookbook, it may be useful to quiz yourself and think about what sort of container the specific "recipe" is acting upon.

## What can we change to change the results? {#what-to-change}

Let's summarize what we have seen so far and preview what is to come.

Rendering R Markdown documents with **rmarkdown** consists of converting `.Rmd` to `.md` with **knitr**, and then `.md` to our desired output with Pandoc (typically).

The `.Rmd`-to-`.md` step handles the execution and "translation" of all code within our report, so most changes to *content* involve editing the `.Rmd` with code for **knitr** to translate. Tools that we have control over these steps include **knitr** chunk options and hooks.

Our `.md` is a plain text file with no formatting. This is where Pandoc comes in to convert to the final output format such as HTML, PDF, or Word. Along the way, we add structure and style. A wide range of tools to help us in this process include style sheets (CSS), raw LaTeX or HTML code, Pandoc templates, and Lua filters. By understanding the nested structure of an R Markdown document, and by thoughtfully using identifiers and classes, we can apply some of these tools selectively to targeted parts of our output.

Finally, our YAML metadata may help us toggle any of these steps. Changing parameters can change how our code runs, changing metadata alters the text content, and changing output options provides the `render()` function with a different set of instructions.

Of course, these are all rough heuristics and should not be taken as absolute facts. Ultimately, there is not a completely clear division of labor. Throughout this book, you will see that there are often multiple valid paths to achieving many of the outcomes described in this book, and these may enter different stages of the pipeline. For example, for the simple task of inserting an image in your document, you may either use the R code `knitr::include_graphics()`, which would execute in the `.Rmd` to `.md` stage, or directly use Markdown syntax (`![]()`). This may seem confusing, and sometimes different approaches will have different advantages. However, do not be concerned---if anything, this often means there are many valid ways to solve your problem, and you can follow whichever approach makes the most sense to you.

And that's that! In the rest of the book, you can now color in this rough sketch with many more concrete examples of ways to modify all of the components that we have discussed to get the most value out of R Markdown.