baselinetables.qmd

---
title: "Make a baseline table"
bibliography: references.bib
# format: pdf
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE,
                      message= FALSE)
options(huxtable.knitr_output_format="md")
```

::: callout-warning
This page is under development. Changes should be expected.
:::

# Background

Epidemiologic and clinical research papers often describe the study sample in a first, baseline table. Providing a baseline table is one of the recommendations of the CONSORT (@MOHER201228) and STROBE (@VONELM20071453) statements. If well-executed, it provides a rapid, objective, and coherent grasp of the data and can illuminate potential threats to internal and external validity (@HAYESLARSON2019125).

# Main important considerations

Baseline tables can be presented in different ways, but certain features are common and some recommendations need be followed. In the next paragraph and figure we provide an overview of the main important considerations, for more detailed review and guidelines see (e.g. @MOHER201228, @HAYESLARSON2019125).

Characteristics are generally presented on the left as rows and groupings at the top as columns. In RCTs, groups are defined by treatment allocation. If appropriated, the overall population's characteristics may be presented in a separate column. Characteristics should include information about the study participants (e.g., demographic, clinical, social) and information on exposures and any particular characteristics that may be predictive of the studied outcome (e.g., health status at baseline). Inside each cell, descriptive statistics are typically given as frequency and percentage for categorical variables and mean (standard deviation) or median (25th-75th percentile or minimum-maximum) for continuous variables. The type of statistical measurement used should be indicated (e.g., together with the characteristic, in a footnote). The number of participants with missing data should be reported for each variable of interest.

**Important note: Baseline table are descriptive tables, significance tests and inferential measures (e.g. standard errors and confidence intervals) should be avoided in observational studies as well as in RCTs (see items 14a and 15a and of the STROBE and CONSORT statements, respectively, @Vandenbroucke_strobe\@MOHER201228)**. In RCTs, testing for differences between randomized groups is irrelevant because baseline differences are, by definition, caused by chance.

![Basic baseline structure and analysis-specific considerations affecting columns, rows, and cells. Figure from @HAYESLARSON2019125](doc/bl_fig1.jpg)

```{r gg-oz-gapminder, echo=FALSE, fig.cap="Figure 1: Basic baseline structure and analysis-specific considerations affecting columns, rows, and cells. Figure from (@HAYESLARSON2019125)", message=FALSE, include=FALSE}
# library(knitr)
# library(jpeg)
# img1_path <- "Figure/fig1.jpg"
# img1 <- readJPEG(img1_path, native = TRUE)
# include_graphics(img1_path)
```

# How to program a baseline table

Many developers have published tools for building baselines tables. When choosing one please consider the flexibility of the tool. Indeed, requirements for the contents and formatting of baseline tables may vary depending on the project, the authors, the target journal, etc. Here we present few (very flexible) options for R and stata.

## In R

### Data

```{r r_data, eval=TRUE, include=TRUE}
library(dplyr)
library(Hmisc)
data(mtcars)
mtcars$am_f <- factor(mtcars$am, 0:1, c("Manual", "Automatic")) 
mtcars$vs_f <- factor(mtcars$vs, 0:1, c("V", "Straight")) 
d <- mtcars %>%  select(mpg, cyl, am_f,vs_f)
Hmisc::label(d$mpg) <- "Miles per gallon"
Hmisc::label(d$cyl) <- "Cylinders"
Hmisc::label(d$am_f) <- "Transmission"
Hmisc::label(d$vs_f) <- "Engine"
```

### Using gtsummary

```{r gt_summary_example1}
#Make a table stratified by transmission
library(gtsummary)
d %>%  tbl_summary(by = am_f)
```

This is the basic usage:

-   variable types are automatically detected so that appropriate descriptive statistics are calculated,
-   label attributes from the data set are automatically printed,
-   missing values are listed as "Unknown" in the table,
-   variable levels are indented and footnotes are added.

Defaults options may be customized.

```{r gt_summary_example2}
# declare cylinders as a continuous variable,
# for this variable calculate the mean and sd value,
# add an overall column, 
# change the missing text
d %>%
        tbl_summary(by = am_f, 
                    type = list(cyl ~ 'continuous'),
                    statistic = list(cyl ~ "{mean} ({sd})"),
                    missing_text = "Missing") %>%
        add_overall()
```

#### Additional information

Once produced gtsummary tables can be converted to your favorite format (e.g. html/pdf/word).

For more information see [here](https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html). For detailed tutorial and additional options see the very complete [vignette](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html) and [website](https://www.danieldsjoberg.com/gtsummary/index.html)

### Using atable

```{r atable1}
library(atable)
table1=atable(d,
       target_cols = c("mpg" , "cyl" , "vs_f"),
       group_col = "am_f",
       format_to="Word")

#Or similar using the formula interface
## Not run: 
table1=atable(mpg+cyl+ vs_f ~ am_f, d,format_to="Word")
## End(Not run)

table1

```

By default atable is printing p-values, test statistics as well as effect sizes with a 95% confidence interval. As stated above, baseline tables are descriptive tables and should not contain this type of information. Don't forget to remove the columns.

```{r atable1c}
table1=table1%>%select(-"p",-"stat",-"Effect Size (CI)")
table1
```

The table may also be split up by strata. For example, we can decide to present separately the characteristics of car with a "V" or a "Straight" engine.

```{r atable2}
table1=atable(mpg+cyl  ~ am_f|vs_f , d,
              format_to="Word")
table1=table1%>%select(-"p",-"stat",-"Effect Size (CI)")
table1
```

As gtsummary, atable may be exported in different formats (e.g. LATEX, HTML, Word) and it is intended that some parts of atable can be altered by the user. For more details see @Strbel2019atableCT as well as the package [vignette](https://cran.r-project.org/web/packages/atable/vignettes/atable_usage.pdf). An other informative vignette can be found by typing the following command in R:

```{r atable_vignette, eval=FALSE, include=TRUE}
vignette("modifying", package = "atable")
```

## In Stata

### Using btable

The table is constructed in a two-step approach using two functions: btable() produces an unformatted, raw table, which is then formatted by btable_format to produce a final, publication-ready table. By default, the raw table contains all summary measures, and---if there are two groups---effect measures and p-values. Optionally, the table can be restricted to effect measures of choice and a number of alternative calculations for confidence intervals are available.

## Instalation

```{r btable_setup, eval=FALSE, include=TRUE}
#In order to install btable from github the github-package is required:
net install github, from("https://haghish.github.io/github/")
#You can then install the development version of btable with:
github install CTU-Bern/btable
```

## Example

```{r btable1_code, eval=FALSE, include=TRUE}
# load example dataset
sysuse auto2
# generate table
btable price mpg rep78 headroom, by(foreign) saving("excars") denom(nonmiss)
# format table (default formatting, removing the effect measures and P-values)
btable_format using "excars", clear drop(effect test)
```

```{r, include = FALSE}
dat <- readr::read_csv("doc/btab1.csv")
dat[is.na(dat)] <- "" # replace NAs
names(dat)[1] <- " "  # set missing var name
dat %>%
  flextable::flextable() %>%
  flextable::autofit() #%>%
  # flextable::save_as_image( path = "Table/btabl1.png")
```

<!-- <img src="Table/btabl1.png" align="center" width="90%"/> -->

The formatting option can be modified. For example we can decide we may want to

-   present the median and lower and upper quartiles instead of the mean and standard deviation
-   remove the overall column, and the information column

```{r btable2_code, eval=FALSE, include=TRUE}
#If we want to display median [lq, up] for all the continuous variables
btable_format using "excars", clear descriptive(conti median [lq, uq]) drop(effect test total info)
#If we want to display mean (sd) for the mpg variable and median [lq, up] for all the other continuous variables
btable_format using "excars", clear desc(conti median [lq, uq] mpg mean (sd)) drop(effect test total info)
```

```{r, echo = FALSE}
dat <- readr::read_csv("doc/btab2.csv")
dat[is.na(dat)] <- "" # replace NAs
names(dat)[1] <- " "  # set missing var name
dat %>%
  flextable::flextable() %>%
  flextable::autofit() #%>%
  # flextable::save_as_image( path = "Table/btabl2.png")
```

<!-- <img src="Table/btabl2.png" align="center" width="80%"/> -->

# Quiz

Here is a short quiz to check your understanding...

###### Question 1:

Which summary statistics can I give to describe continuous and categorical variables?

<details>

<summary>

Answer

</summary>

Descriptive statistics are typically given as frequency (percentage) for categorical or binary variables, mean and (standard deviation) for continuous normal variables and median (25th-75th percentile or minimum-maximum) for non-normal continuous variables.

</details>

# References