
# Probability distribution {#sec-chap04}

```{r}
#| label: setup
#| include: false

base::source(file = "R/helper.R")
```


## Achievements to unlock

::: {#obj-chap04}
::: {.my-objectives}
::: {.my-objectives-header}
Objectives for chapter 04
:::

::: {.my-objectives-container}
**SwR Achievements**

- **Achievement 1**: Defining and using probability distributions to infer from a sample (@sec-chap04-achievement1)
- **Achievement 2**: Understanding the characteristics and uses of a binomial distribution of a binary variable (@sec-chap04-achievement2)
- **Achievement 3**: Understanding the characteristics and uses of the normal distribution of a continuous variable (@sec-chap04-achievement3)
- **Achievement 4**: Computing and interpreting z-scores to compare observations to groups (@sec-chap04-achievement4)
- **Achievement 5**: Estimating population means from sample means using the normal distribution (@sec-chap04-achievement5)
- **Achievement 6**: Computing and interpreting confidence intervals around means and proportions (@sec-chap04-achievement6)

:::
:::
Achievements for chapter 04
:::


## The opioid overdose problem

Drug overdoses in the United States have increased alarmingly in recent years (see the [County Health Rankings & Roadmaps website](https://www.countyhealthrankings.org/findings-and-insights/2023-county-health-rankings-national-findings-report) and its [Data & Documentation](https://www.countyhealthrankings.org/health-data/methodology-and-sources/rankings-data-documentation#main)).

The `r glossary("CDC")` WONDER website has data on the underlying cause of each death in the United States. For drug deaths, the CDC WONDER data include the drug implicated in each death, if available.

States have begun to adopt policies to combat the opioid epidemic. State-level policy solutions for addressing the increasing number of opioid overdoses include:

- Imposition of quantity limits 
- Required prior authorization for opioids 
- Use of clinical criteria for prescribing opioids 
- Step therapy requirements 
- Required use of prescription drug monitoring programs.

The Kaiser Family Foundation (`r glossary("KFF")`) keeps track of the adoption of these policies across all 50 states and the District of Columbia.

Treatment programs as well as policies depend partly on the distance people have to travel to the nearest health facility. `r glossary("amfAR")`, the Foundation for AIDS Research, maintains an Opioid & Health Indicators Database (https://opioid.amfar.org). The data in amfAR's database include the distance to the nearest substance abuse treatment facility that offers medication-assisted therapies (MAT).

## Resources & Chapter Outline

### Data, codebook, and R packages {#sec-chap04-data-codebook-packages}

::: {.my-resource}
::: {.my-resource-header}
:::::: {#lem-chap04-resources}
: Data, codebook, and R packages for learning about probability distributions
::::::
:::

::: {.my-resource-container}

**Data**

1.  Download clean data sets `pdmp_2017_kff_ch4.csv` and `opioid_dist_to_facility_2017_ch4.csv` from
    <https://edge.sagepub.com/harris1e>.
2.  Download the county-level distance data files directly from the amfAR website (https://opioid.amfar.org/indicator/dist_MAT)
3.  Import and clean the data for 2017 from Table 19 in the online report on the [KFF website](https://www.kff.org/report-section/implementing-coverage-and-payment-initiatives-benefits-and-pharmacy/)

**Codebook**

Two options:

1.  Download the codebook file `opioid_county_codebook.xlsx` from
    <https://edge.sagepub.com/harris1e>.
2.  Use the online version of the codebook from the amfAR Opioid & Health Indicators Database website (https://opioid.amfar.org)


**Packages**

1. Packages used with the book (sorted alphabetically)

-   {**tidyverse**}: @sec-tidyverse (Hadley Wickham)

    
2. My additional packages (sorted alphabetically)


:::
:::

### Get data

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-get-data}
: Get data for chapter 4
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### PDMP

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-get-pdmp-book}
: Get the cleaned PDMP data from the book `.csv` file
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: get-pdmp-book
#| results: hold
#| eval: false

## run code only once manually ##########

## get pdmp data from .csv file of the book
pdmp_2017_book <- readr::read_csv("data/chap04/pdmp_2017_kff_ch4.csv")
save_data_file("chap04", pdmp_2017_book, "pdmp_2017_book.rds")

```

***

(*This R code chunk produces no output*)

::::
:::::


###### amfAR

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-amfAR}
: Get the cleaned amfAR distance data from the book `.csv` file
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: get-amfar-data
#| eval: false

## run only once, manually ############
amfar_file <- "data/chap04/opioid_dist_to_facility_2017_ch4.csv"

dist_mat <- readr::read_csv(amfar_file)
save_data_file("chap04", dist_mat, "dist_mat.rds")
```

***

(*This R code chunk produces no output*)

::::
:::::

:::

::::
:::::

***


### Show raw data

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-show-data}
: Show raw data for chapter 4
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### PDMP

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-show-pdmp-data}
: Show data for the prescription drug monitoring programs (PDMPs)
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: show-pdmp-data
#| results: hold

pdmp_2017_book <- base::readRDS("data/chap04/pdmp_2017_book.rds")

glue::glue("********************* Show summary *******************")
base::summary(pdmp_2017_book)

glue::glue("")
glue::glue("****************** Show selected data ****************")
my_glance_data(pdmp_2017_book)
```

::::
:::::


###### dist-mat

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-show-dist-mat}
: Show the distances to nearest substance abuse facility providing medication assisted treatment (MAT)
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: show-dist-mat
#| results: hold

dist_mat <- base::readRDS("data/chap04/dist_mat.rds")

glue::glue("********************* Show summary *******************")
base::summary(dist_mat)

glue::glue("")
glue::glue("****************** Show selected data ****************")
my_glance_data(dist_mat)

```

::::
:::::

:::

::::
:::::

***

### Recode data

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-recode}
: Recode data for chapter 4
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Transform amfAR

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-code-name-a}
: Extend amfAR data with transformed values
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: transform-amfar-distances
#| results: hold
#| cache: true

dist_mat_clean <- dist_mat |> 
    dplyr::mutate(square_root = sqrt(VALUE),
                  cube_root = VALUE^(1/3),
                  log = log(VALUE),
                  inverse = 1/VALUE
    )

save_data_file("chap04", dist_mat_clean, "dist_mat_clean.rds")

base::summary(dist_mat_clean)
my_glance_data(dist_mat_clean)
```

::::
:::::


###### Rename `VALUE` in amfAR

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-rename-amfar-distance}
: Rename amfAR `VALUE` to `distance`
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: rename-amfar-distance
#| cache: true

dist_mat_clean2 <- dist_mat |> 
    dplyr::rename(distance = VALUE) 

save_data_file("chap04", dist_mat_clean2, "dist_mat_clean2.rds")
```

***

(*This R code chunk produces no output*)

::::
:::::

###### Prepare PDMP data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-prepare-pdmp}
: Rename column in PDMP and recode Yes/No to 1/0
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: prepare-bdmp

pdmp_2017_book <- base::readRDS("data/chap04/pdmp_2017_book.rds")

## recode Yes to 1 and No to 0
pdmp_2017_book_clean <- pdmp_2017_book |> 
    dplyr::rename(PDMP = 6) |> 
    dplyr::mutate(PDMP =
          dplyr::if_else(PDMP == "Yes", 1, 0)
          ) |> 
    dplyr::mutate(PDMP = as.numeric(PDMP))

save_data_file("chap04", pdmp_2017_book_clean, "pdmp_2017_book_clean.rds")
    

```

***

(*This R code chunk produces no output*)

::::
:::::


:::

::::
:::::

***


## Achievement 1: Probability distributions to infer from a sample {#sec-chap04-achievement1}

A `r glossary("probability distribution")` is the set of probabilities that each possible value (or range of values) of a variable occurs.

Probability distributions have two characteristics:

1. The probability of each real value of some variable is non-negative; it is either zero or positive.
2. The sum of the probabilities of all possible values of a variable is 1.

There are two categories of probability distributions:

1. Discrete probability distributions: An example is the binomial distribution.
2. Continuous probability distributions: An example is the normal distribution.
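
Both characteristics can be checked numerically. The following is a minimal sketch, assuming a binomial variable with 20 observations and a success probability of .51 as the discrete example and the standard normal distribution as the continuous example; the object names are my own.

```{r}
#| label: sketch-distribution-properties

## discrete example: all possible values of a binomial variable
## (size and prob are illustrative assumptions)
binom_probs <- stats::dbinom(x = 0:20, size = 20, prob = .51)

## characteristic 1: no probability is negative
base::all(binom_probs >= 0)

## characteristic 2: the probabilities of all possible values sum to 1
base::sum(binom_probs)

## continuous example: the total area under the density curve is 1
normal_area <- stats::integrate(stats::dnorm, lower = -Inf, upper = Inf)
normal_area$value
```

For a continuous variable the probabilities of single values are replaced by areas under the density curve, which is why the second check integrates instead of summing.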


## Achievement 2: Binomial distribution of a binary variable {#sec-chap04-achievement2}

### Characteristics of binomial random variables

:::{#bul-chap04-binomial-random-variable}
:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}

- A variable is measured in the same way n times. 
- There are only two possible values of the variable, often called “success” and “failure.” 
- Each observation is independent of the others. 
- The probability of “success” is the same for each observation. 
- The random variable is the number of successes in n measurements.

The binomial distribution is defined by two things: 

- **n**, which is the number of observations (e.g., coin flips, people surveyed, states selected) 
- **p**, which is the probability of success (e.g., 50% chance of heads for a coin flip, 51% chance of a state having a PDMP)

::::
:::::
Characteristics of binomial random variables
:::
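
The characteristics listed above can be turned into a small simulation. This is only a sketch under assumed values (`n = 20`, `p = .51`); the helper `count_successes()` is hypothetical and not part of the book.

```{r}
#| label: sketch-binomial-from-trials

base::set.seed(42)

n_trials <- 20    # n: number of observations per sample
p_success <- .51  # p: probability of success for each observation

## one binomial observation = number of successes in n independent
## observations that all have the same probability of success
count_successes <- function(n, p) {
    base::sum(stats::runif(n) < p)
}

## repeat the experiment many times
simulated <- base::replicate(n = 1e4, count_successes(n_trials, p_success))

## the simulated mean should be close to the theoretical mean n * p
base::mean(simulated)
n_trials * p_success
```

`stats::rbinom()` produces the same kind of counts directly; the explicit version only makes the independent yes/no observations visible.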

***

### dbinom() & pbinom()

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-binomial-distributions}
: Statistical properties of binomial distributions
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### `dbinom()` with exact `x`

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-comp-dbinomial-exact}
: Compute binomial probability with exact number of successes
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: comp-dbinomial-exact

## exact 5 successes from 20 selections 
## with 51% probability of success 
stats::dbinom(x = 5, size = 20, prob = .51) * 100
```
***

`stats::dbinom()` computes the probability given

- the number of successes (`x`), 
- the sample size (`size =`), and 
- the probability of success (`prob =`).

::::
:::::

The probability of getting *exactly* 5 states with `r glossary("PDMP")`s in a sample of 20 is very small (about 1.2%). The `r glossary("cumulative distribution function")` for the binomial distribution can determine the probability of getting some range of values, which is often more useful than finding the probability of one specific number of successes.
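
What "cumulative" means can be shown by adding up exact probabilities. A minimal sketch with the same assumed values as above (20 selections, 51% probability of success); the object names are my own.

```{r}
#| label: sketch-cdf-as-sum

## probabilities of exactly 0, 1, ..., 5 successes
exact_probs <- stats::dbinom(x = 0:5, size = 20, prob = .51)

## adding them up gives the probability of 5 or fewer successes ...
sum_of_exact <- base::sum(exact_probs)

## ... which is what the cumulative distribution function returns
cumulative <- stats::pbinom(q = 5, size = 20, prob = .51)

base::all.equal(sum_of_exact, cumulative)
```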


###### `pbinom()` with range

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-comp-dbinomial-range}
: Compute binomial probability of getting some range of values
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: comp-pbinomial-range
#| results: hold
#| cache: true

base::options(scipen = 999)


## 5 or fewer successes from 20 selections 
## with 51% probability of success 
pbinom(q = 5, size = 20, prob = .51) * 100

## 6 or more successes (more than 5) from 20 selections 
## with 51% probability of success 
pbinom(q = 5, size = 20, prob = .51, lower.tail = FALSE) * 100


base::options(scipen = 0)
```

- **Exactly 5** successes with a success probability of 51% = `base::round(dbinom(x = 5, size = 20, prob = .51) * 100, 3)` : `r base::round(dbinom(x = 5, size = 20, prob = .51) * 100, 3)`%.
- **5 or fewer** successes with a success probability of 51% = `base::round(pbinom(q = 5, size = 20, prob = .51) * 100, 3)`: `r base::round(pbinom(q = 5, size = 20, prob = .51) * 100, 3)`%.
- **6 or more** successes with a success probability of 51% = `base::round(pbinom(q = 5, size = 20, prob = .51, lower.tail = FALSE) * 100, 3)`: 
`r base::round(pbinom(q = 5, size = 20, prob = .51, lower.tail = FALSE) * 100, 3)`%.

:::::{.my-important}
:::{.my-important-header}
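For the probability of `q` or more successes, pass `q - 1` to `pbinom()` and add `lower.tail = FALSE`, because `lower.tail = FALSE` returns the probability of *more than* the given value.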
:::
:::::
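
The rule can be verified in three equivalent ways. This is a small sketch; the threshold `k = 6` and the object names are assumptions chosen to match the example above.

```{r}
#| label: sketch-upper-tail-binomial

k <- 6  # we want the probability of 6 or more successes

## direct way: add up the exact probabilities of 6, 7, ..., 20 successes
direct <- base::sum(stats::dbinom(x = k:20, size = 20, prob = .51))

## pbinom() way: lower.tail = FALSE returns P(X > k - 1) = P(X >= k)
upper_tail <- stats::pbinom(q = k - 1, size = 20, prob = .51,
                            lower.tail = FALSE)

## complement way: 1 - P(X <= k - 1)
complement <- 1 - stats::pbinom(q = k - 1, size = 20, prob = .51)

c(direct = direct, upper_tail = upper_tail, complement = complement)
```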

::::
:::::

###### Sample PDMPs from data

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-sample-pdmp}
: Sample 25 states from population data (n = 51)
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: sample-25-pmpd

pdmp_2017_book <- base::readRDS("data/chap04/pdmp_2017_book.rds")

## set a starting value for sampling 
set.seed(seed = 10) 

## sample 25 states and check 
pdmp_2017_book |>  
    dplyr::select(`Required Use of Prescription Drug Monitoring Programs`) |> 
    dplyr::mutate(`Required Use of Prescription Drug Monitoring Programs` =
          forcats::as_factor(`Required Use of Prescription Drug Monitoring Programs`)) |> 
    dplyr::slice_sample(n = 25) |> 
    base::summary()
```
***

The book features a lengthy explanation of the `set.seed()` function and how its behavior changed with R version 3.6.0. Several years after 3.6.0 appeared in April 2019, this detail is no longer very relevant.

I had to recode the character variable to a factor and I used `dplyr::slice_sample()` instead of the superseded `dplyr::sample_n()` function.
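
What `set.seed()` does for sampling can be seen without the PDMP data. The following sketch draws 25 of 51 row numbers with base R; the object names are my own.

```{r}
#| label: sketch-set-seed

## same seed, same sample of 25 out of 51 row numbers
base::set.seed(seed = 10)
first_draw <- base::sample(x = 51, size = 25)

base::set.seed(seed = 10)
second_draw <- base::sample(x = 51, size = 25)

base::identical(first_draw, second_draw)

## without resetting the seed the next draw will almost certainly differ
third_draw <- base::sample(x = 51, size = 25)
base::identical(first_draw, third_draw)
```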
::::
:::::


:::

::::
:::::

### Visualizing the binomial distribution

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-visualize-binomial-dist}
: Visualizing the binomial distribution
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Distribution only

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-binomial-dist-only}
: Binomial distribution of 20 selected states when 51% have PDMPs
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-binomial-dist-only
#| fig-cap: "Probability mass function plot showing probability of number of selected states \nwith PDMPs out of 20 total selected when 51% have PDMPs overall"


base::set.seed(42)

binomial_data <- tibble::tibble(stats::rbinom(1000, 20, .51)) |> 
    dplyr::rename(data = 1) |> 
    dplyr::mutate(my_color = 
                dplyr::if_else(data <= 5, "purple", "grey")
    )


binomial_data |> 
    ggplot2::ggplot() +
    ggplot2::aes(x = data,
                 y = ggplot2::after_stat(count) / 
                     base::sum(count)
                 ) +
    ggplot2::geom_histogram(
        color = "black", 
        fill = "grey",
        binwidth = 1
        ) +
    ggplot2::theme_bw() +
    ggplot2::scale_x_continuous(
        breaks = base::seq(0, 20, 2)) +
    ggplot2::labs(x = 'States with monitoring programs',
       y = 'Probability exactly this many selected')
```


::::
:::::

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap04-econ41-lab}
Helpful code snippet at "ECON 41 Lab"
::::::
:::
::::{.my-resource-container}
I got help for the code from [15 Tutorial 4: The Binomial Distribution](https://bookdown.org/gabriel_butler/ECON41Labs/tutorial-4-the-binomial-distribution.html) [@butler2019].

::::
:::::
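
The histogram above is built from 1,000 simulated draws, so its bar heights only approximate the binomial probabilities. As a rough sketch (the object names are my own), the exact probabilities from `dbinom()` can be set next to the simulated relative frequencies:

```{r}
#| label: sketch-exact-vs-simulated

base::set.seed(42)
simulated_draws <- stats::rbinom(n = 1000, size = 20, prob = .51)

comparison <- base::data.frame(
    successes = 0:20,
    ## exact probability mass function
    exact = stats::dbinom(x = 0:20, size = 20, prob = .51),
    ## share of simulated draws with this many successes
    simulated = base::as.numeric(
        base::table(base::factor(simulated_draws, levels = 0:20))
        ) / 1000
)

comparison
```

Plotting the `exact` column with `ggplot2::geom_col()` would presumably give a probability mass function that does not change with the seed.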


###### Distribution with marker

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-binomial-dist-marker}
: Probability of 5 or fewer selected states with PDMPs out of 20 total selected when 51% have PDMPs overall
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-binomial-dist-marker
#| fig-cap: "Probability of 5 or fewer selected states with PDMPs out of 20 total selected when 51% have PDMPs overall"

base::set.seed(42)

colors <- c(rep("purple", 2), rep("grey", 13))

binomial_data <- tibble::tibble(stats::rbinom(1000, 20, .51)) |> 
    dplyr::rename(data = 1) |> 
    dplyr::mutate(my_color = 
                dplyr::if_else(data <= 5, "purple", "grey")
    )


binomial_data |> 
    ggplot2::ggplot() +
    ggplot2::aes(x = data,
                 y = ggplot2::after_stat(count) / 
                     base::sum(count)
                 ) +
    ggplot2::geom_histogram(
        color = "black", 
        fill = colors,
        binwidth = 1
        ) +
    ggplot2::geom_vline(xintercept = 5, 
             linewidth = 1, 
             linetype = 'dashed',
             color = 'red') +
    ggplot2::theme_bw() +
    ggplot2::scale_x_continuous(breaks = base::seq(0, 20, 2)) +
    ggplot2::labs(x = 'States with monitoring programs',
       y = 'Probability exactly this many selected') 
```

::::
:::::

###### Histogram colorized

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-binomial-dist-color}
: Probability of 5 or fewer selected states with PDMPs out of 20 total selected when 51% have PDMPs overall
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-binomial-dist-color
#| fig-cap: "Probability of 5 or fewer selected states with PDMPs out of 20 total selected when 51% have PDMPs overall"

base::set.seed(42)

binomial_data <- tibble::tibble(stats::rbinom(1000, 20, .51)) |> 
    dplyr::rename(data = 1) |> 
    dplyr::mutate(my_color = 
            dplyr::if_else(data <= 5, "purple", "grey")
    ) |> 
    dplyr::mutate(my_color =
            forcats::as_factor(my_color))


binomial_data |> 
    ggplot2::ggplot() +
    ggplot2::aes(x = data,
                 y = ggplot2::after_stat(count) / 
                     base::sum(count),
                 fill = my_color
                 ) +
    ggplot2::geom_histogram(
        binwidth = 1,
        color = "black"
    ) +
    ggplot2::geom_vline(xintercept = 5, 
             linewidth = 1, 
             linetype = 'dashed',
             color = 'red') +
    ggplot2::theme_bw() +
    ggplot2::scale_x_continuous(breaks = base::seq(0, 20, 2)) +
    ggplot2::scale_fill_manual(name = "Number of states\nwith PDMP",
                               values = c("grey" = "grey",
                               "purple" = "purple"),
                               breaks = c("grey", "purple"),
                               labels = c("> 5", "5 or fewer")) +
    ggplot2::labs(x = 'States with monitoring programs',
       y = 'Probability exactly this many selected') 
```

::::
:::::

:::

::::
:::::

***

## Achievement 3: Normal distribution of a continuous variable {#sec-chap04-achievement3}

### Working with normal distributions

Binomial data are only one type of data in the social sciences; many variables are continuous. Just as the shape of the binomial distribution is determined by `n` and `p`, the shape of the normal distribution for a variable in a sample is determined by the mean $\mu$ and the standard deviation $\sigma$.

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-dist-mat-dist}
: Distribution of the distances to nearest facility providing MAT
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Distances

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-dist-mat-normal}
: Distribution of the original distance variable
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-dist-mat-normal
#| fig-cap: "Distribution of the distance to the nearest facility with MAT"

dist_mat_clean <- base::readRDS("data/chap04/dist_mat_clean.rds")

dist_mat_clean |> 
    ggplot2::ggplot(
        ggplot2::aes(x = VALUE) 
    ) +
    ggplot2::geom_histogram(
        bins = 30,
        fill = "grey",
        color = "black"
        ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Distance in miles",
        y = "Number of counties"
    )
```

::::
:::::


###### Distances transformed

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-dist-mat-transformed}
: Distribution of the distance variable transformed by various factors
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-dist-mat-transformed
#| fig-cap: "Distribution of the distance variable transformed by various factors"
#| warning: false
#| cache: true

## using the extended data frame 
## with square root, cube root, inverse & log values

p_cube_root <- dist_mat_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = cube_root)
    ) +
    ggplot2::geom_density(
        color = "black",
        fill = "grey"
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Cube root of miles to nearest facility",
        y = "Density"
    )


p_square_root <- dist_mat_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = square_root)
    ) +
    ggplot2::geom_density(
        color = "black",
        fill = "grey"
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Distance in square root of miles",
        y = "Density"
    )

p_inverse <- dist_mat_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = inverse)
    ) +
    ggplot2::geom_density(
        color = "black",
        fill = "grey"
    ) +
    ggplot2::theme_bw() +
    ggplot2::xlim(0, 1) +
    ggplot2::labs(
        x = "Inverse of miles to nearest facility",
        y = "Density"
    )

p_log <- dist_mat_clean |>
    ggplot2::ggplot(
        ggplot2::aes(x = log)
    ) +
    ggplot2::geom_density(
        color = "black",
        fill = "grey"
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Log of miles to nearest facility",
        y = "Density"
    )
gridExtra::grid.arrange(grobs = list(p_cube_root,
                                     p_square_root,
                                     p_inverse,
                                     p_log),
                        ncol = 2)
```

***

Of these transformations, the cube root gave the best result.

I tried to write a function for these four graphs, but it was not easy to pass the data frame and the column to the function. I finally succeeded by passing the column name as a character string and using `[[` inside the function to select the column (see [StackOverflow](https://stackoverflow.com/a/36015931/7322615)). But I gave up on the `xlim()` parameter for the inverse transformation.
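
One way such a helper function could look is sketched below. It is not the function I originally wrote: the name `plot_density()` and its arguments are hypothetical, and instead of `[[` on the data frame it uses the `.data[[col]]` pronoun inside `ggplot2::aes()`. An optional `x_limits` argument covers the `xlim()` case of the inverse transformation.

```{r}
#| label: sketch-plot-density-helper

plot_density <- function(df, col, x_label, x_limits = NULL) {
    p <- df |>
        ggplot2::ggplot(
            ## select the column by its name given as character string
            ggplot2::aes(x = .data[[col]])
        ) +
        ggplot2::geom_density(
            color = "black",
            fill = "grey"
        ) +
        ggplot2::theme_bw() +
        ggplot2::labs(
            x = x_label,
            y = "Density"
        )

    ## restrict the x-axis only when limits are supplied
    if (!base::is.null(x_limits)) {
        p <- p + ggplot2::xlim(x_limits[1], x_limits[2])
    }
    p
}

## toy data just to show the call (not the amfAR data)
toy <- base::data.frame(miles = c(1, 2, 4, 8, 16, 32, 64))
toy$inverse <- 1 / toy$miles

p_toy <- plot_density(toy, "inverse", "Inverse of miles", x_limits = c(0, 1))
```

With the chapter's data a call would look like `plot_density(dist_mat_clean, "cube_root", "Cube root of miles to nearest facility")`.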
::::
:::::

###### Mean and sd

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-mean-sd-distance-transformed}
: Mean and standard deviation for cube root of mile transformation
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: mean-sd-distance-transformed

dist_mat_clean |> 
    dplyr::summarize(mean = mean(cube_root),
                  sd = sd(cube_root))

```

::::
:::::

###### Probability distribution

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-prob-distance-dist}
: Probability density function for a variable with a mean of 2.66 and a standard deviation of .79
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-prob-distance-dist
#| fig-cap: "Probability density function for a variable with a mean of 2.66 and a standard deviation of .79"
#| cache: true

base::set.seed(42)
normal_data <- tibble::tibble(stats::rnorm(
    n = 1e3, 
    mean = 2.66, 
    sd = .79)) |> 
    dplyr::rename(data = 1) 

normal_data |> 
    ggplot2::ggplot() +
    ggplot2::aes(x = data,
                 y = ggplot2::after_stat(count) / 
                     base::sum(count)
                 ) +
    ggplot2::geom_density() +
    ggplot2::theme_bw() +
    ggplot2::labs(x = 'Cube root of miles to the nearest facility with MAT',
       y = 'Probability density')
```
***

In this plot I draw the probability density function from randomly generated data. The curve becomes smoother with a bigger sample (for instance 1e5 instead of 1e3).
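
The effect of the sample size can also be put into numbers. The sketch below is my own addition: the hypothetical helper `max_gap()` measures the largest gap between the density estimated from `n` random draws and the theoretical normal density with the same mean and standard deviation.

```{r}
#| label: sketch-density-sample-size

## largest gap between the estimated density of n random draws
## and the theoretical normal density
max_gap <- function(n) {
    draws <- stats::rnorm(n = n, mean = 2.66, sd = .79)
    estimate <- stats::density(draws)
    base::max(base::abs(
        estimate$y - stats::dnorm(estimate$x, mean = 2.66, sd = .79)
    ))
}

base::set.seed(42)
gap_small <- max_gap(1e3)
gap_large <- max_gap(1e5)

c(n_1e3 = gap_small, n_1e5 = gap_large)
```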
::::
:::::

###### with shaded area

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-prob-shaded}
: Probability density function of the cube root transformation for 64 miles distance to a treatment facility
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-chap04-prob-shaded
#| fig-cap: "Probability density function of the cube root transformation for 64 miles distance to a treatment facility"
#| cache: true

normal_data |> 
    ggplot2::ggplot(
        ggplot2::aes(x = data)
    ) +
    ggplot2::stat_function(
        fun = dnorm, 
        n = 1e3, 
        args = list(mean = 2.66, 
                    sd = .79),
        linewidth = .5) +
    ggplot2::geom_area(stat = 'function',
            fun = dnorm,
            fill = 'blue',
            args = list(mean = 2.66, 
                    sd = .79),
            xlim = c(4, 6),
            alpha = 0.3) +
    ggplot2::theme_bw() +
    ggplot2::labs(x = 'Cube root of miles to the nearest facility with MAT',
       y = 'Probability density')

```
***

For this plot I have used the `dnorm()` function. Therefore this normal distribution curve is smooth.

The shaded area is the probability for counties that are $4^3 = 64$ miles or more from a facility that provides medication-assisted treatment (MAT).
::::
:::::

###### Compute shaded area

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-comp-shaded-area}
: Compute shaded area: Percentage of counties where the nearest facility with MAT is 64 miles or more away
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: comp-shaded-area

stats::pnorm(4, 2.66, .79, lower.tail = FALSE)

```
***

If you want to calculate the right part of the distribution then you need to change the default value from `lower.tail = TRUE` to `lower.tail = FALSE`.
::::
:::::

4.49% of observations were in the shaded part of this distribution and therefore had a value for the transformed distance variable of 4 or greater. Reversing the transformation, this indicated that residents of 4.49% of counties have to travel $4^3 = 64$ miles or more to get to the nearest substance abuse facility providing medication-assisted treatment.
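
The computation and the reversal of the transformation can be retraced step by step. A minimal sketch, assuming the rounded mean (2.66) and standard deviation (.79) used above; the object names are my own.

```{r}
#| label: sketch-upper-tail-normal

cut_off <- 4  # value on the cube root scale

## upper tail directly
upper_tail <- stats::pnorm(q = cut_off, mean = 2.66, sd = .79,
                           lower.tail = FALSE)

## the same area as complement of the lower tail
complement <- 1 - stats::pnorm(q = cut_off, mean = 2.66, sd = .79)

c(upper_tail = upper_tail, complement = complement)

## reversing the cube root transformation gives the distance in miles
cut_off^3
```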

:::

::::
:::::

### Check understanding

The following graph shades the part of the distribution that is less than 2. Estimate (without computing the answer) the percentage of counties in the shaded area.

:::::{.my-exercise}
:::{.my-exercise-header}
:::::: {#exr-chap04-check-achievement3}
: Achievement 3: Check understanding
::::::
:::
::::{.my-exercise-container}

::: {.panel-tabset}

###### Show shaded graph

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-achievement3-graph}
: Shows shading for the part of the distribution that is less than 2
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-chap04-achievement3-graph
#| fig-cap: "Probability density function for a variable with a mean of 2.66 and a standard deviation of .79 with the shaded area for counties that are 8 miles or less from the nearest facility with MAT"


dist_mat_clean |> 
    ggplot2::ggplot(
        ggplot2::aes(x = cube_root)
    ) +
    ggplot2::stat_function(
            fun = dnorm, 
            n = 1e3, 
            args = list(mean = 2.66,
                        sd = .79),
            linewidth = .5) +
    ggplot2::geom_area(stat = 'function',
            fun = dnorm,
            fill = 'blue',
            args = list(mean = 2.66, 
                    sd = .79),
            xlim = c(0, 2),
            alpha = 0.3) +
    ggplot2::theme_bw() +
    ggplot2::labs(x = 'Cube root of miles to the nearest facility with MAT',
       y = 'Probability density')


```

::::
:::::

###### Computation shaded area

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-achievement3-computation}
: Compute area of the shading for the part of the distribution that is 8 miles or less from the nearest facility with MAT
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: achievement3-computation

stats::pnorm(2, mean = 2.66, sd = .79)

```
***

About 20% of the counties are 8 miles or less from the nearest facility with MAT. Without computing it, I would have estimated a much larger shaded area (about 30-35%).
::::
:::::


:::

::::
:::::


## Achievement 4: z-scores {#sec-chap04-achievement4}

:::::{.my-important}
:::{.my-important-header}
Values of normally distributed variables
:::
::::{.my-important-container}
Regardless of what the mean and standard deviation are, a normally distributed variable has approximately 

- 68% of values within one standard deviation of the mean 
- 95% of values within two standard deviations of the mean 
- 99.7% of values within three standard deviations of the mean

These characteristics of the normal distribution can be used to describe and compare how far individual observations are from a mean value.

::::
:::::
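
These percentages follow from the normal cumulative distribution function and do not depend on the mean and the standard deviation. A short sketch; the helper `share_within()` is hypothetical.

```{r}
#| label: sketch-empirical-rule

## share of values within k standard deviations (s) of the mean (m)
share_within <- function(k, m, s) {
    stats::pnorm(m + k * s, mean = m, sd = s) -
        stats::pnorm(m - k * s, mean = m, sd = s)
}

## standard normal distribution
standard <- base::sapply(1:3, share_within, m = 0, s = 1)

## mean and sd of the cube root distances used in this chapter
cube_root <- base::sapply(1:3, share_within, m = 2.66, s = .79)

base::round(base::rbind(standard, cube_root), 4)
```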

### Defining z-score

:::::{.my-theorem}
:::{.my-theorem-header}
:::::: {#thm-chap04-z-score}
: Z-Score formula
::::::
:::
::::{.my-theorem-container}
$$
z_{i} = \frac{x_{i} - m_{x}}{s_{x}}
$$ {#eq-chap04-z-score}

The `r glossary("z-score")` for an observation is the number of standard deviations from the mean.

::::
:::::
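
The formula translates directly into a small function. This is a sketch with a hypothetical helper `z_score()`; the mean of 2.66 and standard deviation of .79 are the rounded values of the cube root distances from Achievement 3, and `toy_sample` is made up.

```{r}
#| label: sketch-z-score-function

## z-score: distance of an observation from the mean in standard deviations
z_score <- function(x, m, s) {
    (x - m) / s
}

## distances in miles, transformed to cube roots as in this chapter
miles <- c(10, 15, 50)
z <- z_score(x = miles^(1/3), m = 2.66, s = .79)
base::round(z, 2)

## for a whole sample, base::scale() computes the z-scores with the
## sample mean and sample standard deviation
toy_sample <- c(2, 4, 4, 6, 9)
base::as.numeric(base::scale(toy_sample))
z_score(toy_sample, base::mean(toy_sample), stats::sd(toy_sample))
```

A negative z-score marks an observation below the mean, a positive z-score an observation above it.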

### z-score calculation & interpretation

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-calc-z-scores}
: Calculation and interpretation of z-scores
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Example 1

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-z-score1}
: Z-score for a county with residents who have to travel 50 miles to the nearest facility
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: z-score1
#| results: hold

cube_miles <- 50^(1/3)
## avoid the names `mean` and `sd`, which would mask the base functions
mean_cube <- 2.66
sd_cube <- .79

(cube_miles - mean_cube) / sd_cube
```
***

The z-score is positive (about 1.30): this example county is about 1.3 standard deviations farther than the mean from the nearest facility with MAT.
::::
:::::


###### Example 2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-code-name-b}
: Z-score for a county with residents who have to travel 10 miles to the nearest facility
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: z-score2

cube_miles <- 10^(1/3)
## avoid the names `mean` and `sd`, which would mask the base functions
mean_cube <- 2.66
sd_cube <- .79

(cube_miles - mean_cube) / sd_cube
```
***

The z-score is negative (about -0.64): this example county is about 0.64 standard deviations closer than the mean to the nearest facility with MAT.

::::
:::::

###### Achievement 4

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-z-score3}
: Z-score for a county where you have to drive 15 miles to the nearest facility with MAT.
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: z-score3

cube_miles <- 15^(1/3)
## avoid the names `mean` and `sd`, which would mask the base functions
mean_cube <- 2.66
sd_cube <- .79

(cube_miles - mean_cube) / sd_cube

```

The z-score is negative (about -0.25): this example county is closer than the mean to the nearest facility with MAT. (Back-transformed, the mean of the cube root data corresponds to $2.66^3$ = `r 2.66^3` miles.)
::::
:::::


:::

::::
:::::

## Achievement 5: Estimating population means {#sec-chap04-achievement5}

### Samples and populations

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-sample-and-population}
: Estimating population means
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Summarize all

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-summarize-all-distances}
: Summarize distances from the `r glossary("amfAR")` database 
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: summarize-all-distances

## load data with renamed `VALUE` column to `distance`
dist_mat_clean2 <- base::readRDS("data/chap04/dist_mat_clean2.rds")

dist_mat_clean2 |> 
    dplyr::summarize(
        mean_distance = base::mean(distance),
        sd_distance = stats::sd(distance),
        n = dplyr::n()
    )
```

***

These are the values for the population of (almost all) counties of the US (n = 3214). Next we draw a sample of 500 counties to see how close the sample summaries come to the mean and standard deviation of the population.
::::
:::::


###### Summarize sample

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-summarize-sample-distances}
: Draw a sample of 500 counties and compute the summaries
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: summarize-sample-distances
#| results: hold
#| cache: true

set.seed(seed = 1945)
dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE) |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    sd_distance = stats::sd(distance),
    n = dplyr::n()
    )

set.seed(seed = 48)
dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE) |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    sd_distance = stats::sd(distance),
    n = dplyr::n()
    )
```

***

One sample mean is somewhat higher, the other a little lower than the population mean.
::::
:::::

###### Sample of 20 samples

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-sample-of-20-samples}
: Examining a sample of 20 samples from a population
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: sample-of-20-samples
#| results: hold
#| cache: true

## get 20 samples 
## each sample has 500 counties 
## put samples in a data frame with each sample 
## having a unique id called "sample_num"

base::set.seed(111)
dist_mat_sample_20 <- 
    dplyr::bind_rows(
        base::replicate(n = 20, dist_mat_clean2 |> 
                        dplyr::slice_sample(n = 500, replace = TRUE),
                        simplify = FALSE),
        .id = "sample_num")

## find the mean for each sample 
dist_mat_sample_20_means <- dist_mat_sample_20 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(
        mean_distance = mean(x = distance, na.rm = TRUE))

dist_mat_sample_20_means

## find the mean of the 20 sample means
dist_mat_sample_20_means |> 
    dplyr::summarize(mean_20_means = mean(mean_distance))
```

::::
:::::

###### Sample 100 samples

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-sample-100-samples}
: Examining a sample of 100 samples from a population
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: sample-100-samples
#| results: hold
#| cache: true

## get 100 samples 
## each sample has 500 counties 
## put samples in a data frame with each sample 
## having a unique id called "sample_num"

base::set.seed(143)
dist_mat_sample_100 <- 
    dplyr::bind_rows(
        base::replicate(n = 100, dist_mat_clean2 |> 
                        dplyr::slice_sample(n = 500, replace = TRUE),
                        simplify = FALSE),
        .id = "sample_num")

## find the mean for each sample 
dist_mat_sample_100_means <- dist_mat_sample_100 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(
        mean_distance = mean(x = distance, na.rm = TRUE))

dist_mat_sample_100_means

## find the mean of the 100 sample means
dist_mat_sample_100_means |> 
    dplyr::summarize(mean_100_means = mean(mean_distance))

dist_mat_sample_100_means |> 
    ggplot2::ggplot(
        ggplot2::aes(x = mean_distance)
    ) +
    ggplot2::geom_histogram(
        bins = 30,
        color = "black",
        fill = "grey") +
    ggplot2::theme_bw()
```


***

Although the mean of the 100 sample means is already very close to the population value (`r dist_mat_sample_100_means |> dplyr::summarize(mean_100_means = mean(mean_distance)) |> dplyr::pull()` versus `r dist_mat_clean2 |> dplyr::summarize(mean_distance = base::mean(distance)) |> dplyr::pull()`, difference = `r dist_mat_sample_100_means |> dplyr::summarize(mean_100_means = mean(mean_distance)) |> dplyr::pull() - dist_mat_clean2 |> dplyr::summarize(mean_distance = base::mean(distance)) |> dplyr::pull()`), the sampling distribution is still far from a nice normal distribution. This will change when we generate 1000 sample means.
::::
:::::

###### Sample 1000 samples

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-sample-1000-distribution}
: Plot the sample distribution of 1000 samples
::::::
:::
::::{.my-r-code-container}

::: {#lst-sample-distributions}
```{r}
#| label: fig-sample-1000-distribution
#| fig-cap: "Sample distribution of 1000 samples"
#| results: hold
#| cache: true

## get 1000 samples 
## each sample has 500 counties 
## put samples in a data frame with each sample 
## having a unique id called "sample_num"

base::set.seed(159)
dist_mat_sample_1000 <- 
    dplyr::bind_rows(
        base::replicate(n = 1000, dist_mat_clean2 |> 
                        dplyr::slice_sample(n = 500, replace = TRUE),
                        simplify = FALSE),
        .id = "sample_num")

## find the mean for each sample 
dist_mat_sample_1000_means <- dist_mat_sample_1000 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(
        mean_distance = mean(x = distance, na.rm = TRUE))

dist_mat_sample_1000_means

## find the mean of the 1000 sample means
dist_mat_sample_1000_means |> 
    dplyr::summarize(mean_1000_means = mean(mean_distance))

dist_mat_sample_1000_means |> 
    ggplot2::ggplot(
        ggplot2::aes(x = mean_distance)
    ) +
    ggplot2::geom_histogram(
        bins = 30,
        color = "black",
        fill = "grey") +
    ggplot2::theme_bw()


```

Sample distribution of 1000 samples
:::

***

Taking a lot of large samples and graphing their means results in a `r glossary("sampling distribution")` that looks like a normal distribution, and, more importantly, the mean of the sample means is nearly the same as the population mean (`r dist_mat_sample_1000_means |> dplyr::summarize(mean_1000_means = mean(mean_distance)) |> dplyr::pull()` versus `r dist_mat_clean2 |> dplyr::summarize(mean_distance = base::mean(distance)) |> dplyr::pull()`, difference = `r dist_mat_sample_1000_means |> dplyr::summarize(mean_1000_means = mean(mean_distance)) |> dplyr::pull() - dist_mat_clean2 |> dplyr::summarize(mean_distance = base::mean(distance)) |> dplyr::pull()`).

::::
:::::


:::

::::
:::::

### Central Limit Theorem {#sec-chap04-clt}

The fact that the means of many large samples are approximately normally distributed around the population mean is called the `r glossary("Central Limit Theorem")`. It holds for continuous variables whether or not they are themselves normally distributed.

Another characteristic of the Central Limit Theorem is that the standard deviation of the sample means can be estimated using the population standard deviation and the size of the samples that make up the distribution:

$$
s_{\text{sampling distribution}} = \frac{\sigma}{\sqrt{n}}
$$ {#eq-chap04-sd-sample-means}

To calculate the standard deviation of the population we cannot use `stats::sd()`. The reason is that `stats::sd()` applies `r glossary("Bessel’s correction")`, i.e. it divides by $n - 1$ instead of $n$, which is appropriate for samples but not for the standard deviation of a whole population.

Instead of applying the rather complex procedure in the book, I recommend using the `sd_pop()` function from the {**sjstats**} package (see @sec-sjstats).
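
To see the effect of Bessel's correction, here is a minimal sketch on a small made-up vector (all `demo_` names are hypothetical and not part of the book's data). It computes the population standard deviation by hand and compares it with `stats::sd()`. I assume that `sjstats::sd_pop()` returns the same value as the hand-computed version.

```{r}
#| label: demo-sd-pop-vs-sample

## small made-up vector, treated as a complete population
demo_x <- c(2, 4, 4, 4, 5, 5, 7, 9)
demo_n <- base::length(demo_x)

## sample standard deviation: divides by n - 1 (Bessel's correction)
demo_sd_sample <- stats::sd(demo_x)

## population standard deviation: divides by n
demo_sd_pop <- base::sqrt(
    base::sum((demo_x - base::mean(demo_x))^2) / demo_n
)

## the two differ only by the factor sqrt((n - 1) / n)
demo_sd_rescaled <- demo_sd_sample * base::sqrt((demo_n - 1) / demo_n)

c(sample = demo_sd_sample,
  population = demo_sd_pop,
  rescaled = demo_sd_rescaled)
```

For this vector the population standard deviation is exactly 2, while `stats::sd()` gives about 2.14. With thousands of counties the difference becomes negligible, but it is not zero.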

### Standard deviation (sd)

The standard deviation of the sampling distribution shows how much we expect sample means to vary from the population mean.


:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-estimated-sd}
: Compute estimated standard deviation of the sampling distributions
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: estimated-sd
#| results: hold

## compute parameters for population
dist_mat_clean2  |>  
    tidyr::drop_na(distance) |> # not necessary - no NAs
    dplyr::summarize(n = dplyr::n(), 
                     pop.var = sjstats::var_pop(distance),
                     pop.sd = sjstats::sd_pop(distance),
                     samp_dist_est = pop.sd / base::sqrt(x = 500)
    )

## computing the sample dist standard deviation 
## directly from the 1000 sample means

stats::sd(x = dist_mat_sample_1000_means$mean_distance, 
          na.rm = TRUE)
```

::::
:::::

### Standard error (se)

It is unusual to have the entire population available for computing the population `r glossary("standard deviation")`, and it is equally unusual to have a large number of samples from one population. In practice the standard deviation of the sampling distribution is therefore approximated from a single sample; this approximation is called the `r glossary("standard error")` of the mean (often referred to simply as the “standard error”). The standard error is computed by dividing the sample standard deviation of a variable by the square root of the sample size.

$$
se = \frac{s}{\sqrt{n}}
$$ {#eq-chap04-se}


:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-compute-se}
: Compute standard error of the mean
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: compute-se
#| results: hold

## mean, sd, se for first sample of 500 counties
set.seed(seed = 1945)
dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE) |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    sd_distance = stats::sd(distance),
    se_distance = stats::sd(x = distance) /
        base::sqrt(x = base::length(x = distance)),
    n = dplyr::n()
    )

set.seed(seed = 48)
dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE) |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    se_distance = stats::sd(x = distance) /
        base::sqrt(x = base::length(x = distance)),
    sd_distance = stats::sd(distance),
    n = dplyr::n()
    )
```
***

Both of the standard error (se) values are close to the sampling distribution standard deviation of 1.05, but they are not exactly the same. The first sample standard error of 1.06 was a little above and the second sample standard error of .90 was a little below.

::::
:::::

::: {.callout-note #nte-chap04-se}
**Summary**

- The standard deviation of the sampling distribution is 1.05. 
- The standard error from the first sample is 1.06. 
- The standard error from the second sample is 0.90.

Most of the time researchers have a single sample and so the only feasible way to determine the `r glossary("standard deviation")` of the `r glossary("sampling distribution")` is by computing the `r glossary("standard error")` of the single sample. This value tends to be a good estimate of the standard deviation of sample means.

**The sample standard error is a good estimate of the sampling distribution standard deviation!**
:::

:::::{.my-important}
:::{.my-important-header}
Difference between standard deviation and standard error
:::
::::{.my-important-container}

The `r glossary("standard deviation")` is a measure of the variability in the sample, while the `r glossary("standard error")` is an estimate of how closely the sample represents the population.
::::
:::::
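
The difference becomes visible when the sample size changes. The following sketch uses simulated, normally distributed values (not the amfAR data; the mean of 24 and sd of 22 are arbitrary assumptions, and all `demo_` names are hypothetical): the standard deviation stays roughly the same for every sample size, while the standard error shrinks as $n$ grows.

```{r}
#| label: demo-sd-vs-se

base::set.seed(2024)
demo_sizes <- c(50, 500, 5000)

## for each sample size draw one simulated sample
## and compute its standard deviation and standard error
demo_sd_se <- base::do.call(
    base::rbind,
    base::lapply(demo_sizes, function(n_obs) {
        demo_sample <- stats::rnorm(n = n_obs, mean = 24, sd = 22)
        base::data.frame(
            n = n_obs,
            sd = stats::sd(demo_sample),
            se = stats::sd(demo_sample) / base::sqrt(n_obs)
        )
    })
)

demo_sd_se
```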

## Achievement 6: Confidence intervals {#sec-chap04-achievement6}

### Introduction

95% `r glossary("confidence interval", "confidence intervals")` (CIs) show a range that would contain the population value about 95 times if the study were conducted 100 times.

**The 95% interval idea summarized:**

- About 95% of values lie within two `r glossary("standard deviation", "standard deviations")` of the mean for a variable that is normally distributed. 
- The `r glossary("standard error")` of a sample is a good estimate of the standard deviation of the `r glossary("sampling distribution")`, which is normally distributed. 
- The mean of the sampling distribution is a good estimate of the population mean. 
- So, most sample means will be within two standard errors (more exactly, 1.96) of the population mean.
- The number of standard deviations some observation is away from the mean is called a `r glossary("z-score")`.
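
The value 1.96 in the list above is not arbitrary. As a quick check (a sketch; `demo_z_95` is just a hypothetical name), `stats::qnorm()` returns the z-score that leaves 2.5% in each tail of the standard normal distribution, and `stats::pnorm()` confirms how much of the distribution lies within 1.96 and within 2 standard deviations.

```{r}
#| label: demo-z-critical

## z value that leaves 2.5% in each tail of the standard normal
demo_z_95 <- stats::qnorm(p = 0.975)
demo_z_95

## share of a normal distribution within 1.96 sd of the mean
stats::pnorm(q = demo_z_95) - stats::pnorm(q = -demo_z_95)

## share within 2 sd of the mean (slightly more than 95%)
stats::pnorm(q = 2) - stats::pnorm(q = -2)
```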

In the following example I am going to replicate Figures 4.20 and 4.21 of the book. The book does not show the R code that produces these two graphs.

### Working with 95% CIs

#### Compute and plot stats: mean, sd, se and CIs

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-working-with.ci}
: Working with 95% confidence intervals
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Compute 1

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-compute-ci1}
: Compute mean, sd, se, and 95% CI for a sample of 500 counties
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: compute-ci1
#| results: hold
#| cache: true


## get the population mean
dist_mat_clean2 <- base::readRDS("data/chap04/dist_mat_clean2.rds")
mean_dist_pop <- dist_mat_clean2 |> 
    dplyr::summarize(mean_pop = mean(distance)) |> 
    dplyr::pull()

## mean, sd, se and 95% ci for first sample of 500 counties
set.seed(seed = 1945)
dist_mat_sample1 <- dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE)

dist_mat_param_sample1 <-  dist_mat_sample1 |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    sd_distance = stats::sd(distance),
    se_distance = stats::sd(distance) /
        base::sqrt(x = base::length(distance)),
    lower_ci_distance = mean_distance - 1.96 * se_distance, 
    upper_ci_distance = mean_distance + 1.96 * se_distance
    )
```
***

::: {.callout-tip}
The mean distance in miles to the nearest substance abuse treatment facility with MAT in a sample of 500 counties is `r round(dist_mat_param_sample1$mean_distance, 2)`; the true or population mean distance in miles to a facility likely lies between `r round(dist_mat_param_sample1$lower_ci_distance, 2)` and `r round(dist_mat_param_sample1$upper_ci_distance, 2)` (m = `r round(dist_mat_param_sample1$mean_distance, 2)`; 95% CI = `r round(dist_mat_param_sample1$lower_ci_distance, 2)` – `r round(dist_mat_param_sample1$upper_ci_distance, 2)`).
:::

In this special case the population mean is also available, so we can compare: the population mean = `r round(mean_dist_pop, 2)`, i.e., it lies within the 95% CI!

::::
:::::

###### Compute 2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-compute-ci2}
: Compute mean, sd, se, and 95% CI for another sample of 500 counties
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: compute-ci2
#| results: hold

dist_mat_clean2 <-  base::readRDS("data/chap04/dist_mat_clean2.rds")

## mean, sd, se and 95% ci for second sample of 500 counties
set.seed(seed = 48)
dist_mat_sample2 <- dist_mat_clean2 |> 
    dplyr::slice_sample(n = 500, replace = TRUE)

dist_mat_param_sample2 <-  dist_mat_sample2 |> 
    dplyr::summarize(
    mean_distance = base::mean(distance),
    sd_distance = stats::sd(distance),
    se_distance = stats::sd(distance) /
        base::sqrt(x = base::length(distance)),
    lower_ci_distance = mean_distance - 1.96 * se_distance, 
    upper_ci_distance = mean_distance + 1.96 * se_distance
    )
```

***

::: {.callout-tip}
The mean distance in miles to the nearest substance abuse treatment facility with MAT in a sample of 500 counties is `r round(dist_mat_param_sample2$mean_distance, 2)`; the true or population mean distance in miles to a facility likely lies between `r round(dist_mat_param_sample2$lower_ci_distance, 2)` and `r round(dist_mat_param_sample2$upper_ci_distance, 2)` (m = `r round(dist_mat_param_sample2$mean_distance, 2)`; 95% CI = `r round(dist_mat_param_sample2$lower_ci_distance, 2)` – `r round(dist_mat_param_sample2$upper_ci_distance, 2)`).

:::

In this special case the population mean is also available, so we can compare: the population mean = `r round(mean_dist_pop, 2)`, i.e., it lies within the 95% CI!

::::
:::::


###### Plot 1

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-plot-ci1}
: Plot CI of a sample of 500 counties compared with population mean
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-plot-ci1
#| fig-cap: "Distribution of the distance to the nearest facility with MAT with a 95% confidence interval and compared to the population mean"
#| cache: true

## create data frame for the 4 different vertical lines
vlines <- tibble::tibble(labels = c(
        "Lower CI", "Upper CI", 
        "Sample Mean", "Population Mean"
        ),
    xintercepts = c(
        dist_mat_param_sample1$lower_ci_distance,
        dist_mat_param_sample1$upper_ci_distance,
        dist_mat_param_sample1$mean_distance,
        mean_dist_pop
        ),
    colors = c("coral", "blue4", "seagreen", "yellow"),
    linetypes = c("solid", "solid", "dotted", "dashed" )
     )

## plot with scale & legend ##############
dist_mat_sample1 |> 
    ggplot2::ggplot(
        ggplot2::aes(x = distance)
        ) +
    ggplot2::geom_histogram(
        bins = 30,
        fill = "grey",
        color = "black"
        ) + 
    
    ## add all vertical lines via data frame
    ggplot2::geom_vline(
        data = vlines,
        ggplot2::aes(
            xintercept = xintercepts,
            color = colors, # default legend order is alphabetical
            linetype = linetypes)
    ) +
    
    ## change / prepare legend
    ggplot2::scale_color_identity(        
        name = "Parameter",
        labels = vlines$labels, 
        guide = "legend",
        breaks = c("coral", "blue4", "seagreen", "yellow")
        ) +
    
    ## prevent second legend for line type
    ## `guide = "none"` as default not necessary
    ggplot2::scale_linetype_identity(guide = "none") + 
    ggplot2::theme_bw() +
    
    ## make legend bigger so that the lines are better visible
    ## and position legend on top with a border around
    ggplot2::guides(color = 
        ggplot2::guide_legend(override.aes = base::list(size = 8))) +
    ggplot2::theme(
        legend.position = "top",
        legend.background = 
           ## color = legend border,
           ## fill would be background, here not used
           ggplot2::element_rect(color = "black")
        ) +       
    ggplot2::labs(
        x = "Distance in miles",
        y = "Number of counties"
    )
```
***
The 95% interval is very narrow. The population mean lies inside the sample CI, so close to the sample mean that the two lines almost overlap.

- The population mean = `r mean_dist_pop`
- The sample mean = `r dist_mat_param_sample1$mean_distance`
- The difference = `r mean_dist_pop - dist_mat_param_sample1$mean_distance`

::::
:::::

:::::{.my-watch-out}
:::{.my-watch-out-header}
Order of legend labels
:::
::::{.my-watch-out-container}
By default, the legend labels are ordered alphabetically. For instance, I could get my desired order with `colors = c("blue4", "coral", "seagreen", "yellow")` in the `vlines` data frame.

But I wanted to learn how to re-order the colors if they are not sorted. So I changed my color code to `colors = c("coral", "blue4", "seagreen", "yellow")` in the `vlines` data frame. The first two colors are swapped and no longer sorted alphabetically.

It turned out that in this case I need to add the `breaks` argument with the correct order of colors to my scale specification. So I added `breaks = c("coral", "blue4", "seagreen", "yellow")` to `ggplot2::scale_color_identity()`.

BTW: `ggplot2::scale_color_identity()` was necessary because I passed the color names for the vertical lines directly from the data frame (and not via a color palette). Calling `scale_color_identity()` tells ggplot2 that it doesn’t need to create a new color scale in that situation.


::::
:::::


###### Plot 2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-plot-ci2}
: Plot CI of another sample of 500 counties compared with population mean
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-plot-ci2
#| fig-cap: "Distribution of the distance to the nearest facility with MAT with a 95% confidence interval and compared to the population mean"
#| cache: true

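## create data frame for the 4 vertical lines of the second sample
## (`vlines` from the previous tab holds the values of the first sample)
vlines2 <- tibble::tibble(labels = c(
        "Lower CI", "Upper CI", 
        "Sample Mean", "Population Mean"
        ),
    xintercepts = c(
        dist_mat_param_sample2$lower_ci_distance,
        dist_mat_param_sample2$upper_ci_distance,
        dist_mat_param_sample2$mean_distance,
        mean_dist_pop
        ),
    colors = c("coral", "blue4", "seagreen", "yellow"),
    linetypes = c("solid", "solid", "dotted", "dashed" )
     )

## plot with scale & legend ##############
dist_mat_sample2 |> 
    ggplot2::ggplot(
        ggplot2::aes(x = distance)
        ) +
    ggplot2::geom_histogram(
        bins = 30,
        fill = "grey",
        color = "black"
        ) + 
    
    ## add all vertical lines via data frame
    ggplot2::geom_vline(
        data = vlines2,
        ggplot2::aes(
            xintercept = xintercepts,
            color = colors, # default legend order is alphabetical
            linetype = linetypes)
    ) +
    
    ## change / prepare legend
    ggplot2::scale_color_identity(        
        name = "Parameter",
        labels = vlines2$labels, 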
        guide = "legend",
        breaks = c("coral", "blue4", "seagreen", "yellow")
        ) +
    
    ## prevent second legend for line type
    ## `guide = "none"` as default not necessary
    ggplot2::scale_linetype_identity(guide = "none") + 
    ggplot2::theme_bw() +
    
    ## make legend bigger so that the lines are better visible
    ## and position legend on top with a border around
    ggplot2::guides(color = 
        ggplot2::guide_legend(override.aes = base::list(size = 8))) +
    ggplot2::theme(
        legend.position = "top",
        legend.background = 
           ## color = legend border,
           ## fill would be background, here not used
           ggplot2::element_rect(color = "black")
        ) +       
    ggplot2::labs(
        x = "Distance in miles",
        y = "Number of counties"
    )
```
***


The 95% interval is very narrow. The population mean lies inside the sample CI, so close to the sample mean that the two lines almost overlap.

- The population mean = `r mean_dist_pop`
- The sample mean = `r dist_mat_param_sample2$mean_distance`
- The difference = `r mean_dist_pop - dist_mat_param_sample2$mean_distance`

::::
:::::


:::::{.my-watch-out}
:::{.my-watch-out-header}
Order of legend labels
:::
::::{.my-watch-out-container}
By default, the legend labels are ordered alphabetically. For instance, I could get my desired order with `colors = c("blue4", "coral", "seagreen", "yellow")` in the `vlines` data frame.

But I wanted to learn how to re-order the colors if they are not sorted. So I changed my color code to `colors = c("coral", "blue4", "seagreen", "yellow")` in the `vlines` data frame. The first two colors are swapped and no longer sorted alphabetically.

It turned out that in this case I need to add the `breaks` argument with the correct order of colors to my scale specification. So I added `breaks = c("coral", "blue4", "seagreen", "yellow")` to `ggplot2::scale_color_identity()`.

BTW: `ggplot2::scale_color_identity()` was necessary because I passed the color names for the vertical lines directly from the data frame (and not via a color palette). Calling `scale_color_identity()` tells ggplot2 that it doesn’t need to create a new color scale in that situation.

::::
:::::


:::

::::
:::::

#### Population mean & sample CIs for continuous variable

We have seen that the population mean lies inside the 95% confidence intervals of both samples. But let's compute more intervals and see whether about 5% of the sample CIs really miss the population mean, as the definition suggests.

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-ci-samples-with-mean}
: Check how often the population mean falls outside the sample CIs
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### 20 sample stats

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-sample-stats-20}
: Means and 95% confidence intervals of 20 samples
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: sample-stats-20
#| cache: true

sample_stats_20 <- dist_mat_sample_20 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(mean_20 = mean(distance),
                     sd_20 = sd(distance),
                     se_20 = sd_20 / 
                         base::sqrt(dplyr::n()),
                     ci_lower_20 = mean_20 - 2 * se_20,
                     ci_upper_20 = mean_20 + 2 * se_20
                     )
sample_stats_20
```

::::
:::::

###### 20 samples graph

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-ci-sample-stats-20}
: Visualizing position of population mean in relation to 95% confidence intervals of 20 samples
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-ci-sample-stats-20
#| fig-cap: "Means and 95% confidence intervals of miles to the nearest substance abuse treatment facility with MAT from 20 samples of counties in the United States"
#| cache: true

sample_stats_20 |> 
    dplyr::group_by(sample_num) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sample_num,
                     y = mean_20
        )
    ) +
    ggplot2::geom_errorbar(
        ggplot2::aes(
            ymin = ci_lower_20, 
            ymax = ci_upper_20,
            linetype = "95% CI\nof sample mean"
        )
    ) +
    ggplot2::geom_point(
        ggplot2::aes(
            x = sample_num,
            y = mean_20,
            size = "Sample mean"
        )
    ) +
    ggplot2::geom_hline(
        ggplot2::aes(
            yintercept = mean_dist_pop,
            color = "darkred"
        ),
        linewidth = 1.5
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Sample",
        y = "Mean distance to treatment facility (95% CI)"
    ) +
    ggplot2::scale_color_discrete(
        name = "",
        labels = "Population mean"
    ) +
    ggplot2::scale_linetype_manual(
        name = "",
        values = c(1, NULL) 
    ) +
    ggplot2::scale_size_manual(
        name = "",
        values = 4
    ) +
    ggplot2::theme(
        legend.position = "top"
    ) 
```
***

One confidence interval did not contain the population mean. This is 5% of the 20 samples, which corresponds exactly to the definition of the 95% CI!
::::
:::::


###### 100 samples graph

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-ci-sample-stats-100}
: Visualizing position of population mean in relation to 95% confidence intervals of 100 samples
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-ci-sample-stats-100
#| fig-cap: "Means and 95% confidence intervals of miles to the nearest substance abuse treatment facility with MAT from 100 samples of counties in the United States"
#| cache: true
#| results: hold

sample_stats_100 <- dist_mat_sample_100 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(mean_100 = mean(distance),
                     sd_100 = sd(distance),
                     se_100 = sd_100 / 
                         base::sqrt(dplyr::n()),
                     ci_lower_100 = mean_100 - 2 * se_100,
                     ci_upper_100 = mean_100 + 2 * se_100
                     )

sample_stats_100 |> 
    dplyr::group_by(sample_num) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sample_num,
                     y = mean_100
        )
    ) +
    ggplot2::geom_errorbar(
        ggplot2::aes(
            ymin = ci_lower_100, 
            ymax = ci_upper_100,
            linetype = "95% CI\nof sample mean"
        )
    ) +
    ggplot2::geom_point(
        ggplot2::aes(
            x = sample_num,
            y = mean_100,
            size = "Sample mean"
        )
    ) +
    ggplot2::geom_hline(
        ggplot2::aes(
            yintercept = mean_dist_pop,
            color = "darkred"
        ),
        linewidth = .5
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Sample",
        y = "Mean distance to treatment facility (95% CI)"
    ) +
    ggplot2::scale_color_discrete(
        name = "",
        labels = "Population mean"
    ) +
    ggplot2::scale_linetype_manual(
        name = "",
        values = c(1, NULL) 
    ) +
    ggplot2::scale_size_manual(
        name = "",
        labels = "Sample mean",
        values = 1
    ) +
    ggplot2::scale_x_discrete(
        breaks = NULL
    ) +
    ggplot2::theme(
        legend.position = "top"
    ) 
```
***

This time four confidence intervals did not contain the population mean. This is in line with the definition of the 95% CI: with 100 samples we would expect about 5 intervals to miss the population mean.

From the graph it is difficult to find those CIs that do not contain the population mean. In the next tab I am trying to colorize those intervals. 

::::
:::::

::: {.callout-note #nte-chap04-x-axis-removed}
In the book the scale of the x-axis was removed with `ggplot2::theme(axis.text.x = ggplot2::element_blank())`. I have used `ggplot2::scale_x_discrete(breaks = NULL)` with the same effect. 
:::

###### 100 samples graph colored

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-ci-sample-stats-100-colored}
: Visualizing position of population mean in relation to 95% confidence intervals of 100 samples, colorizing those CIs that do not include the population mean
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-ci-sample-stats-100-colored
#| fig-cap: "Means and 95% confidence intervals of miles to the nearest substance abuse treatment facility with MAT from 100 samples of counties in the United States"
#| cache: true
#| results: hold

sample_stats_100 <- dist_mat_sample_100 |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(mean_100 = mean(distance),
                     sd_100 = sd(distance),
                     se_100 = sd_100 / 
                         base::sqrt(dplyr::n()),
                     ci_lower_100 = mean_100 - 2 * se_100,
                     ci_upper_100 = mean_100 + 2 * se_100
                     )

sample_stats_100 |> 
    dplyr::group_by(sample_num) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sample_num,
                     y = mean_100
        )
    ) +
    ggplot2::geom_errorbar(
        ggplot2::aes(
            ymin = ci_lower_100, 
            ymax = ci_upper_100,
            linetype = "95% CI\nof sample mean"
        ),
        color = dplyr::if_else(
            mean_dist_pop >= sample_stats_100$ci_lower_100 & 
            mean_dist_pop <= sample_stats_100$ci_upper_100,
            "black",
            "orange"
        )
    ) +
    ggplot2::geom_point(
        ggplot2::aes(
            x = sample_num,
            y = mean_100,
            size = "Sample mean"
        )
    ) +
    ggplot2::geom_hline(
        ggplot2::aes(
            yintercept = mean_dist_pop,
            color = "darkred"
        ),
        linewidth = .5
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Sample",
        y = "Mean distance to treatment facility (95% CI)"
    ) +
    ggplot2::scale_color_discrete(
        name = "",
        labels = "Population mean"
    ) +
    ggplot2::scale_linetype_manual(
        name = "",
        values = c(1, NULL) 
    ) +
    ggplot2::scale_size_manual(
        name = "",
        labels = "Sample mean",
        values = 1
    ) +
    ggplot2::scale_x_discrete(
        breaks = NULL
    ) +
    ggplot2::theme(
        legend.position = "top"
    ) 
```
***

Now it is easier to see that 4 CIs do not include the population mean. The graph reproduces the book’s Figure 4.24, for which the book does not provide the R code.

::::
:::::

::: {.callout-note #nte-chap04-x-axis-removed2}
In the book the scale of the x-axis was removed with `ggplot2::theme(axis.text.x = ggplot2::element_blank())`. I have used `ggplot2::scale_x_discrete(breaks = NULL)` with the same effect. 
:::


:::

::::
:::::
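
Counting the missed intervals by eye gets harder as the number of samples grows, so the check can also be done numerically. The following is a self-contained sketch on a simulated right-skewed population (an exponential distribution with a mean of about 24 is my assumption, it is not the amfAR data; all `demo_` names are hypothetical). It draws 1000 samples of 500 values and records whether each 95% CI contains the population mean.

```{r}
#| label: demo-ci-coverage
#| cache: true

## hypothetical right-skewed population, NOT the amfAR data
base::set.seed(42)
demo_pop <- stats::rexp(n = 3000, rate = 1 / 24)
demo_pop_mean <- base::mean(demo_pop)

## draw 1000 samples of 500 and record for each sample
## whether its 95% CI contains the population mean
demo_covered <- base::replicate(n = 1000, expr = {
    demo_s <- base::sample(x = demo_pop, size = 500, replace = TRUE)
    demo_m <- base::mean(demo_s)
    demo_se <- stats::sd(demo_s) / base::sqrt(base::length(demo_s))
    (demo_m - 1.96 * demo_se <= demo_pop_mean) &
        (demo_pop_mean <= demo_m + 1.96 * demo_se)
})

## share of intervals that contain the population mean
base::mean(demo_covered)
```

The share should be close to 0.95, but it will rarely be exactly 0.95, just as the 100 samples above produced four misses rather than five.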

#### Population mean & sample CIs for binomial variable

Given that the sampling distribution is approximately normal, about 95% of sample means lie within two standard deviations of the mean of the means. This also holds for binary variables, i.e., there are also confidence intervals around the proportion of successes for a binary variable.

But there are two points to observe:

1. For variables that have only two values (e.g., Yes and No, success and failure, 1 and 0), the mean of the variable is the same as the percentage of the group of interest. (The mean of a binary variable is typically abbreviated as `p` for proportion rather than `m` for mean.)
2. For any given sample, then, the 95% confidence interval for the mean (which is the percentage in the category of interest) can be computed using the same formula of $m - (1.96 \times se)$ and $m + (1.96 \times se)$. **But there is a difference in the calculation of the standard error!**


:::::{.my-remark}
:::{.my-remark-header}
How to calculate the standard error for binomial distributions?
:::
::::{.my-remark-container}
Instead of the formula for standard error for continuous variables (see @eq-chap04-se) the standard error for binomial distribution is:

$$
se = \sqrt{\frac{p (1 - p)}{n}}
$$ {#eq-chap04-se-binomial}
::::
:::::

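The essence of this difference: instead of first computing the sample `r glossary("standard deviation")` with `stats::sd()`, you compute the `r glossary("standard error")` directly from the sample proportion $p$ and the sample size $n$, and then build the CI from $p$ and this standard error.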
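
As a minimal sketch of this calculation, assume a made-up sample of 30 states of which 18 have a PDMP (these counts and all `demo_` names are hypothetical and not taken from the book's data):

```{r}
#| label: demo-ci-proportion

## hypothetical sample: 18 of 30 sampled states have a PDMP
demo_pdmp <- base::rep(x = c(1, 0), times = c(18, 12))

## the mean of a 0/1 variable is the proportion of 1s
demo_p <- base::mean(demo_pdmp)
demo_n_pdmp <- base::length(demo_pdmp)

## standard error of a proportion
demo_se_p <- base::sqrt(demo_p * (1 - demo_p) / demo_n_pdmp)

## 95% confidence interval around the proportion
demo_ci_p <- c(
    lower = demo_p - 1.96 * demo_se_p,
    upper = demo_p + 1.96 * demo_se_p
)
demo_ci_p
```

With only 30 states per sample the interval is wide, roughly 0.42 to 0.78 around the sample proportion of 0.6. Computing `stats::sd(demo_pdmp) / base::sqrt(demo_n_pdmp)` instead would give a slightly larger standard error because `stats::sd()` divides by $n - 1$.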

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap04-ci-binomial}
: Population mean & sample CIs for binomial variable
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Sample PDMP

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-prop-sample-pdmp}
: Get 100 samples of 30 states for PDMPs
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: prop-ci-pdmp
#| results: hold
#| cache: true


pdmp_2017_book_clean <- base::readRDS("data/chap04/pdmp_2017_book_clean.rds")

## find the mean of pdmp
pdmp_mean_2017 <- pdmp_2017_book_clean |> 
    dplyr::summarize(p = base::mean(PDMP)) |> 
    dplyr::pull()


## get 100 samples: each sample has 30 states 
## put samples in a data frame with each sample having 
## a unique id called sample_num

base::set.seed(143)
pdmp_2017_book_samples <- 
    dplyr::bind_rows(
        base::replicate(n = 100, 
            pdmp_2017_book_clean |> 
                        dplyr::slice_sample(n = 30, replace = TRUE),
                        simplify = FALSE),
        .id = "sample_num")  

## find the mean for each sample
pdmp_2017_book_samples_states <-  pdmp_2017_book_samples |> 
    dplyr::group_by(sample_num) |> 
    dplyr::summarize(pdmp_p = base::mean(PDMP))
    
pdmp_mean_2017 
pdmp_2017_book_samples_states
```

::::
:::::

###### Histogram samples PDMP 

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-graph-ci-pdmp}
: Histogram of 100 samples of states with PDMPs (2017)
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-histo-bdmp-100-samples
#| fig-cap: "Histogram of 100 samples of states with PDMPs (2017)"
#| results: hold
#| cache: true

pdmp_2017_book_samples_states |> 
    ggplot2::ggplot(
        ggplot2::aes(x = pdmp_p)
    ) +
    ggplot2::geom_histogram(
        bins = 10,
        color = "black",
        fill = "grey") +
    ggplot2::theme_bw()
```
***

The distribution of the sample proportions looks roughly normal, and it would look even more so with more samples. Given that the sampling distribution is normally distributed, 95% of sample means would be within 1.96 standard deviations of the mean of the means. 

::::
:::::

###### 100 samples binomial

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-ci-sample-binomial-100-colored}
: Visualizing the position of the population mean in relation to the 95% confidence intervals of 100 samples of the binomial distribution of the PDMPs, coloring those CIs that do not include the population mean
::::::
:::
::::{.my-r-code-container}

```{r}
#| label: fig-ci-sample-binomial-100-colored
#| fig-cap: "Mean and 95% CI for proportion of states with PDMPs from 100 samples of 30 states from a population where 62.75% of states have PDMPs"
#| results: hold
#| cache: true


sample_binomial_100 <- pdmp_2017_book_samples |> 
    dplyr::group_by(sample_num) |>
    dplyr::summarize(mean_100 = base::mean(PDMP),
                     se_100 = base::sqrt(mean_100 * (1 - mean_100) / 
                              dplyr::n()),
                     ci_lower_100 = mean_100 - 2 * se_100,
                     ci_upper_100 = mean_100 + 2 * se_100
                     )

sample_binomial_100 |> 
    dplyr::group_by(sample_num) |> 
    ggplot2::ggplot(
        ggplot2::aes(x = sample_num,
                     y = mean_100
        )
    ) +
    ggplot2::geom_errorbar(
        ggplot2::aes(
            ymin = ci_lower_100, 
            ymax = ci_upper_100,
            linetype = "95% CI\nof sample mean"
        ),
        color = dplyr::if_else(
            pdmp_mean_2017 >= sample_binomial_100$ci_lower_100 & 
            pdmp_mean_2017 <= sample_binomial_100$ci_upper_100,
            "black",
            "orange"
        )
    ) +
    ggplot2::geom_point(
        ggplot2::aes(
            x = sample_num,
            y = mean_100,
            size = "Sample mean"
        )
    ) +
    ggplot2::geom_hline(
        ggplot2::aes(
            yintercept = pdmp_mean_2017,
            color = "darkred"
        ),
        linewidth = .5
    ) +
    ggplot2::theme_bw() +
    ggplot2::labs(
        x = "Sample",
        y = "Proportion of states with PDMPs (95% CI)"
    ) +
    ggplot2::scale_color_discrete(
        name = "",
        labels = "Population mean"
    ) +
    ggplot2::scale_linetype_manual(
        name = "",
        values = c(1, NULL) 
    ) +
    ggplot2::scale_size_manual(
        name = "",
        labels = "Sample mean",
        values = 1
    ) +
    ggplot2::scale_x_discrete(
        breaks = NULL
    ) +
    ggplot2::theme(
        legend.position = "top"
    ) 
```
***

We see that 2 CIs do not include the population mean. The graph is a reproduction of the book's Figure 4.26, which is not accompanied by the corresponding R code.
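
Instead of reading the number of missing CIs off the graph, it can also be counted. The following is only a sketch under assumptions: it does not use the resampled book data but simulates 100 samples of 30 binary observations with `stats::rbinom()`, taking the population proportion of 0.6275 from the figure caption. All object names are made up for this illustration, and the count depends on the seed.

```r
base::set.seed(143)
pop_p_demo <- 0.6275   # population proportion taken from the figure caption
n_states   <- 30       # observations per sample
n_samples  <- 100      # number of samples

## simulate the proportion of "successes" in each of the 100 samples
## (assumption: draws from a binomial model, not from the 51 real states)
p_hat <- stats::rbinom(n = n_samples, size = n_states, prob = pop_p_demo) / n_states

## standard error and 95% CI for each sample proportion
se_hat   <- base::sqrt(p_hat * (1 - p_hat) / n_states)
ci_lower <- p_hat - 1.96 * se_hat
ci_upper <- p_hat + 1.96 * se_hat

## TRUE if the interval contains the population proportion
covers <- ci_lower <= pop_p_demo & pop_p_demo <= ci_upper

base::sum(!covers)   # number of CIs that miss the population proportion
```

With 95% confidence intervals one would expect roughly 5 of 100 intervals to miss the population proportion, although the number varies from one set of samples to the next.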

::::
:::::

:::

::::
:::::

#### Other confidence intervals

The three most common intervals have the following z-scores: 

***
:::{#bul-three-cis}
:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}

- 90% confidence interval z-score = 1.645 
- 95% confidence interval z-score = 1.96 
- 99% confidence interval z-score = 2.576

::::
:::::
The three most common confidence intervals with their z-scores
:::
***
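
These z-scores do not have to be memorized. As a short sketch, they can be recovered from the standard normal distribution with `stats::qnorm()`: for a two-sided interval with confidence level $c$, the z-score is the quantile at $1 - (1 - c) / 2$. The object names are chosen only for this illustration.

```r
## confidence levels of the three most common intervals
conf_levels <- base::c(0.90, 0.95, 0.99)

## two-sided interval: half of the remaining probability lies in each tail
z_scores <- stats::qnorm(1 - (1 - conf_levels) / 2)

base::round(z_scores, 3)
# → 1.645 1.960 2.576
```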

Confidence intervals for small samples, usually defined as samples with fewer than 30 observations [@field2012], use a `r glossary("t-statistic")` instead of a `r glossary("z-score")` in computing a `r glossary("confidence interval")` for a mean and in other types of analyses.

> The t-statistic is from the t-distribution and, like the z-score, it measures the distance from the mean. However, the t-statistic does this using the standard deviation of the sampling distribution, also known as the standard error, rather than the standard deviation of the sample.

$$
\begin{aligned}
t &= \frac{m}{\frac{s}{\sqrt{n}}} \\
m &= \text{sample mean for a variable} \\
s &= \text{sample standard deviation for the same variable} \\
n &= \text{sample size} \\
\text{Note: } & \text{the denominator for } t \text{ is } \frac{s}{\sqrt{n}} \text{, which is the standard error.}
\end{aligned}
$$ {#eq-chap04-t-statistic}

The main practical difference between the two is that the t-statistic works better when samples are small; once samples are very large (n > 1,000), the two values will be virtually identical. (See @sec-chap06 for more about the t-statistic.)
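
How quickly the two values converge can be sketched by comparing the critical values for a 95% confidence interval. The code assumes $n - 1$ degrees of freedom for the t-distribution, as is usual for a single sample mean; the sample sizes and object names are arbitrary choices for this illustration.

```r
## arbitrary sample sizes for the comparison
sample_sizes <- base::c(10, 30, 100, 1000)

## critical values for a two-sided 95% interval
t_crit <- stats::qt(0.975, df = sample_sizes - 1)   # depends on the sample size
z_crit <- stats::qnorm(0.975)                       # always about 1.96

base::data.frame(
    n      = sample_sizes,
    t_crit = base::round(t_crit, 3),
    z_crit = base::round(z_crit, 3)
)
```

The t-based value is always somewhat larger than 1.96, so the intervals are wider for small samples, and it shrinks toward the z-score as the sample size grows.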

## Experiments

### Get PDMP data


:::::{.my-experiment}
:::{.my-experiment-header}
:::::: {#def-chap04-get-pdmp-data}
: Get Prescription Drug Monitoring Program (PDMP) data 
::::::
:::
::::{.my-experiment-container}

::: {.panel-tabset}

###### book

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-pdmp-book}
: Get the cleaned PDMP data from the book `.csv` file
::::::
:::
::::{.my-r-code-container}


```{r}
#| label: pdmp-book
#| lst-label: lst-chap04-pdmp-book
#| lst-cap: "Get the cleaned PDMP data from the book `.csv` file"
#| results: hold
#| eval: false

## run code only once manually ##########

## get pdmp data from the book's .csv file
pdmp_2017_book <- readr::read_csv("data/chap04/pdmp_2017_kff_ch4.csv")
save_data_file("chap04", pdmp_2017_book, "pdmp_2017_book.rds")

```


***

(*No output is available for this R code chunk*)

::::
:::::


###### tabulizer

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-pdmp-tabulizer}
: Get PDMP data with {**tabulizer**}
::::::
:::
::::{.my-r-code-container}


```{r}
#| label: pdmp-tabulizer
#| lst-label: lst-chap04-pdmp-tabulizer
#| lst-cap: "Get PDMP data with {**tabulizer**}"
#| results: hold
#| eval: false

## run only once (manually) ##########

## get pdmp table via tabulizer
pdmp_2017_temp <- tabulizer::extract_tables(
    "data/chap04/PDMPs-2017.pdf")
pdmp_2017_tabulizer <- pdmp_2017_temp[[1]]

save_data_file("chap04", pdmp_2017_tabulizer, "pdmp_2017_tabulizer.rds")
```

***
(*No output is available for this R code chunk*)


::::
:::::

###### Clipboard

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-pdmp-clipboard}
: Get PDMP data via clipboard
::::::
:::
::::{.my-r-code-container}


```{r}
#| label: pdmp-clipboard
#| lst-label: lst-chap04-pdmp-clipboard
#| lst-cap: "Get PDMP data via clipboard"
#| results: hold
#| eval: false

## run code only once manually ###########

## readr::read_delim("clipboard") # Windows

pdmp_2017_clipboard1 <- readr::read_table(pipe("pbpaste")) # normal copy & paste
pdmp_2017_clipboard2 <- readr::read_table(pipe("pbpaste")) # TextSniper
save_data_file("chap04", pdmp_2017_clipboard1, "pdmp_2017_clipboard1.rds")
save_data_file("chap04", pdmp_2017_clipboard2, "pdmp_2017_clipboard2.rds")
```

***

(*No output is available for this R code chunk*)

With this approach I selected the table data and copied it to the clipboard. Be aware that there are two different functions for Windows and macOS.

::::
:::::


###### rvest 

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap04-pdmp-rvest}
: Get the PDMP data with {**rvest**}
::::::
:::
::::{.my-r-code-container}


```{r}
#| label: pdmp-rvest
#| lst-label: lst-chap04-pdmp-rvest
#| lst-cap: "Get the PDMP data with {**rvest**}"
#| results: hold
#| eval: false

## run code only once manually

## 1. check if web scraping is allowed
url <- paste0("https://www.kff.org/report-section/",
  "implementing-coverage-and-payment-initiatives-benefits-and-pharmacy/")
robotstxt::paths_allowed(url)

## 2. get the whole KFF page
pdmp_2017_page <- rvest::read_html(url)
save_data_file("chap04", pdmp_2017_page, "pdmp_2017_page.rds")

## 3. extract PDMP table
pdmp_2017_rvest <- pdmp_2017_page |> 
    rvest::html_elements("table") |> 
    purrr::pluck(10) |> 
    rvest::html_table()

save_data_file("chap04", pdmp_2017_rvest, "pdmp_2017_rvest.rds")

```


***

(*No output is available for this R code chunk*)

::::
:::::


:::

::::
:::::

In @def-chap04-get-pdmp-data I have imported the data in four different ways: 

1. @lst-chap04-pdmp-book: This is the reference data frame, imported from the `.csv` file of the book.
2. @lst-chap04-pdmp-tabulizer: The package {**tabulizer**} worked fine, but table 19 in the PDF (a) did not separate several columns with a vertical line and (b) left the cell empty where no PDMP was in place. {**tabulizer**} could therefore not detect which entries belong to which column.
3. @lst-chap04-pdmp-clipboard: Even though the web page has dividing vertical lines for all columns, the same problem (empty cells) prevents a correct data transfer.
4. @lst-chap04-pdmp-rvest: This was the best option in my experiment: After confirming that web scraping is allowed, I scraped all tables from the `r glossary("KFF")` web page, because there was no unique selector for table 19 available. I received 10 tables. That was strange, because I could visibly detect only seven tables on the web page. But luckily the last one was the table I was interested in.

Conclusion: If I did not already have the book's data to work with, I would use the data scraped with {**rvest**} for further recoding of the table.


## Exercises (empty)


## Glossary

```{r}
#| label: glossary-table
#| echo: false

glossary_table()
```

------------------------------------------------------------------------


## Session Info {.unnumbered}

:::::{.my-r-code}
:::{.my-r-code-header}
Session Info
:::
::::{.my-r-code-container}

```{r}
#| label: session-info

sessioninfo::session_info()
```


::::
:::::