Commit

Merge pull request #3 from ngreifer/main
Updates
ngreifer authored Nov 4, 2024
2 parents f8a4a96 + 6c1ddc6 commit f5e1a8c
Showing 2 changed files with 8 additions and 3 deletions.
9 changes: 6 additions & 3 deletions _freeze/example/execute-results/html.json
@@ -1,8 +1,11 @@
{
"hash": "328e6606120e7ddb69d054eaafbd270c",
"hash": "2d3a2efc8eba838f475c579a52817e77",
"result": {
"markdown": "# Example Data {#sec-example}\n\nBelow, we'll demonstrate how to perform matching and weighting in R. We'll use the famous right-heart catheterization (RHC) dataset analyzed in @connorsEffectivenessRightHeart1996a, which examines the effect of RHC on death by 60 days. @connorsEffectivenessRightHeart1996a used 1:1 matching with a caliper to estimate the effect, which corresponds to an ATO (though they provided no justification for this choice of estimand). It turns out this matters quite a bit; the ATT, ATC, and ATE differ from each other and lead to different conclusions about the risk of RHC.\n\nThe choice of estimand depends on the policy implied by the analysis. Are we interested in examining whether RHC is harmful and should be withheld from patients receiving it? If so, we are interested in the ATT of RHC. Are we interested in examining whether RHC would benefit patients not receiving it? If so, we are interested in the ATC of RHC. Are we interested in the average effect of RHC for the whole study population? If so, we are interested in the ATE of RHC.\n\nWe'll assume that if we are making a causal inference about the effect of RHC, we have collected a sufficient set of variables to remove confounding. This may be a long list, but to keep the example short, we'll use a list of 13 covariates thought to be related to receipt of RHC and death at 60 days, all measured prior to receipt of RHC.\n\nLet's take a look at our dataset:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(rhc)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n aps1 meanbp1 pafi1 crea1 \n Min. : 3.00 Min. : 0.00 Min. : 11.6 Min. : 0.09999 \n 1st Qu.: 41.00 1st Qu.: 50.00 1st Qu.:133.3 1st Qu.: 1.00000 \n Median : 54.00 Median : 63.00 Median :202.5 Median : 1.50000 \n Mean : 54.67 Mean : 78.52 Mean :222.3 Mean : 2.13302 \n 3rd Qu.: 67.00 3rd Qu.:115.00 3rd Qu.:316.6 3rd Qu.: 2.39990 \n Max. :147.00 Max. :259.00 Max. :937.5 Max. 
:25.09766 \n hema1 paco21 surv2md1 resp1 card \n Min. : 2.00 Min. : 1.00 Min. :0.0000 Min. : 0.00 No :3804 \n 1st Qu.:26.10 1st Qu.: 31.00 1st Qu.:0.4709 1st Qu.: 14.00 Yes:1931 \n Median :30.00 Median : 37.00 Median :0.6280 Median : 30.00 \n Mean :31.87 Mean : 38.75 Mean :0.5925 Mean : 28.09 \n 3rd Qu.:36.30 3rd Qu.: 42.00 3rd Qu.:0.7430 3rd Qu.: 38.00 \n Max. :66.19 Max. :156.00 Max. :0.9620 Max. :100.00 \n edu age race sex RHC \n Min. : 0.00 Min. : 18.04 white:4460 Female:2543 Min. :0.0000 \n 1st Qu.:10.00 1st Qu.: 50.15 black: 920 Male :3192 1st Qu.:0.0000 \n Median :12.00 Median : 64.05 other: 355 Median :0.0000 \n Mean :11.68 Mean : 61.38 Mean :0.3808 \n 3rd Qu.:13.00 3rd Qu.: 73.93 3rd Qu.:1.0000 \n Max. :30.00 Max. :101.85 Max. :1.0000 \n death \n Min. :0.000 \n 1st Qu.:0.000 \n Median :1.000 \n Mean :0.649 \n 3rd Qu.:1.000 \n Max. :1.000 \n```\n:::\n:::\n\n\nOur treatment variable is `RHC` (1 for receipt, 0 for non-receipt), our outcome is `death` (1 for died at 60 days, 0 otherwise), and the other variables are covariates thought to remove confounding, which include a mix of continuous and categorical variables.\n\nLet's examine balance on the variables between the treatment groups using `cobalt`, which provides the function `bal.tab()` for creating a balance table containing balance statistics for each variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"cobalt\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n cobalt (Version 4.5.1.9001, Build Date: 2023-08-01)\n```\n:::\n:::\n\n\nWe'll request the standardized mean difference by including `\"m\"` in the `stats` argument and setting `binary = \"std\"` (by default binary variables are not standardized) and we'll request KS statistics by including `\"ks\"` in `stats`. 
Supplying the treatment and covariates in the first argument using a formula and supplying the data set gives us the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +\n paco21 + surv2md1 + resp1 + card + edu +\n age + race + sex, data = rhc,\n stats = c(\"m\", \"ks\"), binary = \"std\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nNote: `s.d.denom` not specified; assuming pooled.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nBalance Measures\n Type Diff.Un KS.Un\naps1 Contin. 0.5014 0.2127\nmeanbp1 Contin. -0.4551 0.2117\npafi1 Contin. -0.4332 0.1816\ncrea1 Contin. 0.2696 0.2011\nhema1 Contin. -0.2693 0.1479\npaco21 Contin. -0.2486 0.1081\nsurv2md1 Contin. -0.1985 0.0957\nresp1 Contin. -0.1655 0.0910\ncard_Yes Binary 0.2950 0.1395\nedu Contin. 0.0914 0.0511\nage Contin. -0.0614 0.0703\nrace_white Binary 0.0152 0.0063\nrace_black Binary -0.0310 0.0114\nrace_other Binary 0.0208 0.0050\nsex_Male Binary 0.0931 0.0462\n\nSample sizes\n Control Treated\nAll 3551 2184\n```\n:::\n:::\n\n\nWe can see significant imbalances in many of the covariates, with high SMDs (greater than .1) and KS statistics (greater than .1, but there is no accepted threshold for these). We can also see the sample sizes for each treatment group. Note that because they are somewhat close in size (the control group is not even twice the size of the treatment group), this will limit the available matching options available and might affect our ability to achieve balance using methods that require a large pool of controls relative to the treated group.\n\nOther balance statistics can be requested, too, using the `stats` argument. It is straightforward to assess balance on particular transformations of covariates using the `addl` argument, e.g., `addl = ~age:educ` to assess balance on the interaction (i.e., product) of `age` and `educ`. 
We can also supply `int = TRUE` and `poly = 3`, for example, to assess balance on all pairwise interactions of covariates and all squares and cubes of the continuous covariates. This can make for large tables, but there are ways to keep them short and summarize them. For example, we can hide the balance table and request the number of covariates that fail to satisfy balance criteria and the covariates with the worst imbalance using code below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +\n paco21 + surv2md1 + resp1 + card + edu +\n age + race + sex, data = rhc,\n int = TRUE, poly = 3,\n stats = c(\"m\", \"ks\"), binary = \"std\",\n thresholds = c(m = .1, ks = .1),\n disp.bal.tab = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nNote: `s.d.denom` not specified; assuming pooled.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nBalance tally for mean differences\n count\nBalanced, <0.1 61\nNot Balanced, >0.1 105\n\nVariable with the greatest mean difference\n Variable Diff.Un M.Threshold.Un\n meanbp1 * pafi1 -0.5965 Not Balanced, >0.1\n\nBalance tally for KS statistics\n count\nBalanced, <0.1 80\nNot Balanced, >0.1 86\n\nVariable with the greatest KS statistic\n Variable KS.Un KS.Threshold.Un\n meanbp1 * pafi1 0.2562 Not Balanced, >0.1\n\nSample sizes\n Control Treated\nAll 3551 2184\n```\n:::\n:::\n\n\nWe can see that many covariates and their transformations (interactions, squares, and cubes) are not balanced based on our criteria for SMDs or KS statistics. We'll use matching and weighting in the next sections to attempt to achieve balance on the covariates.\n",
"supporting": [],
"engine": "knitr",
"markdown": "# Example Data {#sec-example}\n\nBelow, we'll demonstrate how to perform matching and weighting in R. We'll use the famous right-heart catheterization (RHC) dataset analyzed in @connorsEffectivenessRightHeart1996a, which examines the effect of RHC on death by 60 days. This dataset can be downloaded [here](https://hbiostat.org/data/) or using `Hmisc::getHdata(\"rhc\")`[^example-1]. @connorsEffectivenessRightHeart1996a used 1:1 matching with a caliper to estimate the effect, which corresponds to an ATO (though they provided no justification for this choice of estimand). It turns out this matters quite a bit; the ATT, ATC, and ATE differ from each other and lead to different conclusions about the risk of RHC.\n\n[^example-1]: The version we use here has slight modifications and can be downloaded [here](https://github.com/IQSS/dss-ps/blob/main/rhc.rds) or brought into R using `rhc <- readRDS(url(\"https://github.com/IQSS/dss-ps/raw/refs/heads/main/rhc.rds\"))`\n\nThe choice of estimand depends on the policy implied by the analysis. Are we interested in examining whether RHC is harmful and should be withheld from patients receiving it? If so, we are interested in the ATT of RHC. Are we interested in examining whether RHC would benefit patients not receiving it? If so, we are interested in the ATC of RHC. Are we interested in the average effect of RHC for the whole study population? If so, we are interested in the ATE of RHC.\n\nWe'll assume that if we are making a causal inference about the effect of RHC, we have collected a sufficient set of variables to remove confounding. This may be a long list, but to keep the example short, we'll use a list of 13 covariates thought to be related to receipt of RHC and death at 60 days, all measured prior to receipt of RHC.\n\nLet's take a look at our dataset:\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(rhc)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n aps1 meanbp1 pafi1 crea1 \n Min. : 3.00 Min. 
: 0.00 Min. : 11.6 Min. : 0.09999 \n 1st Qu.: 41.00 1st Qu.: 50.00 1st Qu.:133.3 1st Qu.: 1.00000 \n Median : 54.00 Median : 63.00 Median :202.5 Median : 1.50000 \n Mean : 54.67 Mean : 78.52 Mean :222.3 Mean : 2.13302 \n 3rd Qu.: 67.00 3rd Qu.:115.00 3rd Qu.:316.6 3rd Qu.: 2.39990 \n Max. :147.00 Max. :259.00 Max. :937.5 Max. :25.09766 \n hema1 paco21 surv2md1 resp1 card \n Min. : 2.00 Min. : 1.00 Min. :0.0000 Min. : 0.00 No :3804 \n 1st Qu.:26.10 1st Qu.: 31.00 1st Qu.:0.4709 1st Qu.: 14.00 Yes:1931 \n Median :30.00 Median : 37.00 Median :0.6280 Median : 30.00 \n Mean :31.87 Mean : 38.75 Mean :0.5925 Mean : 28.09 \n 3rd Qu.:36.30 3rd Qu.: 42.00 3rd Qu.:0.7430 3rd Qu.: 38.00 \n Max. :66.19 Max. :156.00 Max. :0.9620 Max. :100.00 \n edu age race sex RHC \n Min. : 0.00 Min. : 18.04 white:4460 Female:2543 Min. :0.0000 \n 1st Qu.:10.00 1st Qu.: 50.15 black: 920 Male :3192 1st Qu.:0.0000 \n Median :12.00 Median : 64.05 other: 355 Median :0.0000 \n Mean :11.68 Mean : 61.38 Mean :0.3808 \n 3rd Qu.:13.00 3rd Qu.: 73.93 3rd Qu.:1.0000 \n Max. :30.00 Max. :101.85 Max. :1.0000 \n death \n Min. :0.000 \n 1st Qu.:0.000 \n Median :1.000 \n Mean :0.649 \n 3rd Qu.:1.000 \n Max. 
:1.000 \n```\n\n\n:::\n:::\n\n\n\n\nOur treatment variable is `RHC` (1 for receipt, 0 for non-receipt), our outcome is `death` (1 for died at 60 days, 0 otherwise), and the other variables are covariates thought to remove confounding, which include a mix of continuous and categorical variables.\n\nLet's examine balance on the variables between the treatment groups using `cobalt`, which provides the function `bal.tab()` for creating a balance table containing balance statistics for each variables.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"cobalt\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n cobalt (Version 4.5.5, Build Date: 2024-04-02)\n```\n\n\n:::\n:::\n\n\n\n\nWe'll request the standardized mean difference by including `\"m\"` in the `stats` argument and setting `binary = \"std\"` (by default binary variables are not standardized) and we'll request KS statistics by including `\"ks\"` in `stats`. Supplying the treatment and covariates in the first argument using a formula and supplying the data set gives us the following:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +\n paco21 + surv2md1 + resp1 + card + edu +\n age + race + sex, data = rhc,\n stats = c(\"m\", \"ks\"), binary = \"std\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nNote: `s.d.denom` not specified; assuming \"pooled\".\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBalance Measures\n Type Diff.Un KS.Un\naps1 Contin. 0.5014 0.2127\nmeanbp1 Contin. -0.4551 0.2117\npafi1 Contin. -0.4332 0.1816\ncrea1 Contin. 0.2696 0.2011\nhema1 Contin. -0.2693 0.1479\npaco21 Contin. -0.2486 0.1081\nsurv2md1 Contin. -0.1985 0.0957\nresp1 Contin. -0.1655 0.0910\ncard_Yes Binary 0.2950 0.1395\nedu Contin. 0.0914 0.0511\nage Contin. 
-0.0614 0.0703\nrace_white Binary 0.0152 0.0063\nrace_black Binary -0.0310 0.0114\nrace_other Binary 0.0208 0.0050\nsex_Male Binary 0.0931 0.0462\n\nSample sizes\n Control Treated\nAll 3551 2184\n```\n\n\n:::\n:::\n\n\n\n\nWe can see significant imbalances in many of the covariates, with high SMDs (greater than .1) and KS statistics (greater than .1, but there is no accepted threshold for these). We can also see the sample sizes for each treatment group. Note that because they are somewhat close in size (the control group is not even twice the size of the treatment group), this will limit the available matching options available and might affect our ability to achieve balance using methods that require a large pool of controls relative to the treated group.\n\nOther balance statistics can be requested, too, using the `stats` argument. It is straightforward to assess balance on particular transformations of covariates using the `addl` argument, e.g., `addl = ~age:educ` to assess balance on the interaction (i.e., product) of `age` and `educ`. We can also supply `int = TRUE` and `poly = 3`, for example, to assess balance on all pairwise interactions of covariates and all squares and cubes of the continuous covariates. This can make for large tables, but there are ways to keep them short and summarize them. 
For example, we can hide the balance table and request the number of covariates that fail to satisfy balance criteria and the covariates with the worst imbalance using code below:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +\n paco21 + surv2md1 + resp1 + card + edu +\n age + race + sex, data = rhc,\n int = TRUE, poly = 3,\n stats = c(\"m\", \"ks\"), binary = \"std\",\n thresholds = c(m = .1, ks = .1),\n disp.bal.tab = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nNote: `s.d.denom` not specified; assuming \"pooled\".\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBalance tally for mean differences\n count\nBalanced, <0.1 61\nNot Balanced, >0.1 105\n\nVariable with the greatest mean difference\n Variable Diff.Un M.Threshold.Un\n meanbp1 * pafi1 -0.5965 Not Balanced, >0.1\n\nBalance tally for KS statistics\n count\nBalanced, <0.1 80\nNot Balanced, >0.1 86\n\nVariable with the greatest KS statistic\n Variable KS.Un KS.Threshold.Un\n meanbp1 * pafi1 0.2562 Not Balanced, >0.1\n\nSample sizes\n Control Treated\nAll 3551 2184\n```\n\n\n:::\n:::\n\n\n\n\nWe can see that many covariates and their transformations (interactions, squares, and cubes) are not balanced based on our criteria for SMDs or KS statistics. We'll use matching and weighting in the next sections to attempt to achieve balance on the covariates.\n",
"supporting": [
"example_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
2 changes: 2 additions & 0 deletions example.qmd
@@ -27,6 +27,8 @@ if (!file.exists("rhc.rds")) {
rhc <- rhc[c(covs, treat, outcome)]
saveRDS(rhc, "rhc.rds")
}
rhc <- readRDS("rhc.rds")
```

```{r}
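The `example.qmd` hunk above follows a cache-then-load pattern: build and save `rhc.rds` only when the file is missing, then always read the data back from that file. The step that actually builds the data sits in the collapsed part of the diff, so the sketch below is only an illustration under stated assumptions: `build_data()` and the `rhc-demo.rds` path are hypothetical stand-ins, and the toy data frame is not the real RHC data.

```r
# Hypothetical stand-in for the data-preparation step that is collapsed
# in the diff; the values are made up for illustration only.
build_data <- function() {
  data.frame(
    RHC   = c(0, 1, 1),
    death = c(1, 0, 1),
    age   = c(64.1, 50.2, 73.9)
  )
}

# Hypothetical cache location; the commit itself uses "rhc.rds" in the
# project directory.
cache_file <- file.path(tempdir(), "rhc-demo.rds")

# Build and save the data only if the cached file does not exist yet.
if (!file.exists(cache_file)) {
  demo <- build_data()
  saveRDS(demo, cache_file)
}

# Always load from the cache, so the object is defined whether or not
# the block above ran.
demo <- readRDS(cache_file)
```

Reading from the file unconditionally, as the added `rhc <- readRDS("rhc.rds")` line appears to do, means the object is available on later renders, when the `if` block is skipped because the file already exists.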

0 comments on commit f5e1a8c
