# Do-it-yourself
Most of the methods presented in this book are available in both R and Python and can be used in a uniform way. However, each of these languages also has many other tools for Explanatory Model Analysis.
In this book, we introduce various methods for instance-level and dataset-level exploration and explanation of predictive models. In each chapter, there is a section with code snippets for R and Python that show how to use a particular method.
## Do-it-yourself with R {#doItYourselfWithR}
In this section, we provide a short description of the steps that are needed to set up the R environment with the required libraries.
### What to install?
The R software [@RcoreT] is needed. It is always a good idea to use the newest version; R version 3.6 or newer is recommended. It can be downloaded from the CRAN website [https://cran.r-project.org/](https://cran.r-project.org/).
A good editor makes working with R much easier. There are plenty of choices, but, especially for beginners, it is worth considering the RStudio editor, an open-source and enterprise-ready tool for R. It can be downloaded from https://www.rstudio.com/.
Once R and the editor are available, the required packages should be installed.
The most important one is the `DALEX` package in version 1.0 or newer. It is the entry point to solutions introduced in this book. The package can be installed by executing the following command from the R command line:
```{r, eval=FALSE}
install.packages("DALEX")
```
Installation of `DALEX` will automatically take care of installing the other packages that it requires, such as the `ggplot2` package for data visualization, or `ingredients` and `iBreakDown` with specific methods for model exploration.
### How to work with `DALEX`? {#infoDALEX}
To conduct model exploration with `DALEX`, first, a model has to be created. Then the model has to be prepared for exploration.
There are many packages in R that can be used to construct a model. Some packages are algorithm-specific, like `randomForest` for random forest classification and regression models [@randomForest], `gbm` for generalized boosted regression models [@gbm], `rms` with extensions for generalized linear models [@rms], and many others. There are also packages that can be used for constructing models with different algorithms; these include the `h2o` package [@h2oPackage], `caret` [@caret] and its successor `parsnip` [@parsnipPackage], the very powerful and extensible framework `mlr` [@mlr], and `keras`, which is a wrapper for the Python library of the same name [@kerasPackage].
While it is great to have such a large choice of tools for constructing models, the disadvantage is that different packages have different interfaces and different arguments. Moreover, model-objects created with different packages may have different internal structures. The main goal of the `DALEX` package is to create a level of abstraction around a model that makes it easier to explore and explain the model. Figure \@ref(fig:DALEXarchitecture) illustrates the contents of the package. In particular, function `DALEX::explain` is THE function for model wrapping. There is only one argument that is required by the function; it is `model`, which is used to specify the model-object with the fitted form of the model. However, the function allows additional arguments that extend its functionalities. They are discussed in Section \@ref(ExplainersTitanicRCode).
<!---
* `data`, a data frame to which the model is to be applied;
* `y`, observed values of the dependent variable for the validation data; it is an optional argument, required for explainers focused on model validation and benchmarking.
* `predict_function`, a function that returns prediction scores; if not specified, then a default `predict()` function is used. Note that, for some models, the default `predict()` function returns classes; in such cases, you should provide a function that will return numerical scores.
* `label`, a name of a model; if not specified, then it is extracted from the `class(model)`. This name will be presented in figures, so it is recommended to make the name informative.
--->
(ref:DALEXarchitecture) The `DALEX` package creates a layer of abstraction around models, allowing you to work with different models in a uniform way. The key function is the `explain()` function, which wraps any model into a uniform interface. Then other functions from the `DALEX` package can be applied to the resulting object to explore the model.
```{r DALEXarchitecture, echo=FALSE, fig.cap='(ref:DALEXarchitecture)', out.width = '99%', fig.align='center'}
knitr::include_graphics("figure/DALEX_architecture.png")
```
### How to work with `archivist`?
As we will focus on the exploration of predictive models, we prefer not to spend space or time on replicating the code necessary for model development. This is where the `archivist` package helps.
The `archivist` package [@archivist] is designed to store, share, and manage R objects. We will use it to easily access R objects for pre-constructed models and pre-calculated explainers. To install the package, the following command should be executed in the R command line:
```{r, eval=FALSE}
install.packages("archivist")
```
Once the package has been installed, function `aread()` can be used to retrieve R objects from any remote repository. For this book, we use a GitHub repository `models` hosted at https://github.com/pbiecek/models. For instance, to download a model with the md5 hash `ceb40`, the following command has to be executed:
```{r, eval=FALSE}
archivist::aread("pbiecek/models/ceb40")
```
Since the md5 hash `ceb40` uniquely defines the model, referring to the repository object results in using exactly the same model and the same explanations. Thus, in the subsequent chapters, pre-constructed models will be accessed with `archivist` hooks. In the following sections, we will also use `archivist` hooks when referring to datasets.
## Do-it-yourself with Python {#doItYourselfWithPython}
In this section, we provide a short description of the steps that are needed to set up the Python environment with the required libraries.
### What to install?
The Python interpreter [@python3] is needed. It is always a good idea to use the newest version; Python version 3.6 or newer is recommended. It can be downloaded from the Python website [https://python.org/](https://python.org/).
A popular environment for a simple Python installation and configuration is Anaconda, which can be downloaded from the website [https://www.anaconda.com/](https://www.anaconda.com/).
There are many editors available for Python that allow editing the code in a convenient way. In the data science community, a very popular solution is Jupyter Notebook. It is a web application that allows creating and sharing documents that contain live code, visualizations, and descriptions. Jupyter Notebook can be installed from the website [https://jupyter.org/](https://jupyter.org/).
Once Python and the editor are available, the required libraries should be installed. The most important one is the `dalex` library, currently in version `0.2.0`. The library can be installed with `pip` by executing the following instruction from the command line:
```
pip install dalex
```
Installation of `dalex` will automatically take care of installing the other required libraries.
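To confirm that the installation succeeded, the installed version can be checked from Python. The snippet below is a minimal sketch that relies only on the standard library; it assumes Python 3.8 or newer (where `importlib.metadata` is available), and the helper name `installed_version()` is our own, not part of `dalex`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if dalex is installed, and None otherwise.
print("dalex version:", installed_version("dalex"))
```

If the printed value is `None`, the library is not visible to the interpreter that is being used, which often means that `pip` installed it into a different environment.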
### How to work with `dalex`? {#infoDALEXpy}
There are many libraries in Python that can be used to construct a predictive model. Among the most popular ones are algorithm-specific libraries like `catboost` [@catbooost], `xgboost` [@xgboost], and `keras` [@chollet2015keras], or libraries with multiple ML algorithms like `scikit-learn` [@scikitlearn].
While it is great to have such a large choice of tools for constructing models, the disadvantage is that different libraries have different interfaces and different arguments. Moreover, model-objects created with different libraries may have different internal structures. The main goal of the `dalex` library is to create a level of abstraction around a model that makes it easier to explore and explain the model.
The `Explainer()` constructor is THE method for model wrapping. Only one argument is required by the constructor: `model`, which is used to specify the model-object with the fitted form of the model. However, the constructor also takes additional arguments that extend its functionality. They are discussed in Section \@ref(ExplainersTitanicPythonCode). If these additional arguments are not provided by the user, the `dalex` library will try to extract them from the model. It is a good idea to specify them directly to avoid surprises.
As soon as the model is wrapped by using the `Explainer()` constructor, all further operations can be performed on the resulting object. They will be presented in subsequent chapters in subsections *Code snippets for Python*.
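The idea of a uniform layer around models with different interfaces can be illustrated with a toy example. The sketch below is not how `dalex` is implemented; all class names (`MeanModel`, `MaxModel`, `UniformWrapper`) are hypothetical and serve only to show why a wrapper with sensible defaults is convenient.

```python
# A toy illustration of the wrapper idea; this is NOT the dalex implementation.


class MeanModel:
    """Hypothetical model exposing a predict() method."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


class MaxModel:
    """Hypothetical model with a different interface: score_rows()."""

    def score_rows(self, rows):
        return [max(row) for row in rows]


class UniformWrapper:
    """Wraps a model so that every wrapped model answers to predict()."""

    def __init__(self, model, predict_function=None, label=None):
        self.model = model
        # Fall back to the model's own predict() when no function is given.
        self.predict_function = predict_function or (lambda m, rows: m.predict(rows))
        # Fall back to the class name when no label is given.
        self.label = label or type(model).__name__

    def predict(self, rows):
        return self.predict_function(self.model, rows)


rows = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
wrapped = [
    UniformWrapper(MeanModel()),
    UniformWrapper(
        MaxModel(),
        predict_function=lambda m, r: m.score_rows(r),
        label="max model",
    ),
]

# Both models can now be used in exactly the same way.
for w in wrapped:
    print(w.label, w.predict(rows))
```

The second wrapper shows why specifying the arguments directly can be a good idea: the default would call `predict()`, which `MaxModel` does not have.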
### Code snippets for Python
```{r python_setup, include=FALSE, eval=FALSE}
library(reticulate)
use_python("/Library/Frameworks/Python.framework/Versions/3.6/bin/python3")
```
A detailed description of model exploration will be presented in the next chapters. In general, however, the way of working with the `dalex` library can be described in the following steps:
1. Import the `dalex` library.
```{python, eval=FALSE, highlight=TRUE}
import dalex as dx
```
2. Create an `Explainer` object. This serves as a wrapper around the model.
```{python, eval=FALSE, highlight=TRUE}
exp = dx.Explainer(model, X, y)
```
3. Calculate predictions for the model.
```{python, eval=FALSE, highlight=TRUE}
exp.predict(obs)  # obs: the observation(s) of interest, e.g. a data frame
```
4. Calculate specific explanations.
```{python, eval=FALSE, highlight=TRUE}
obs_bd = exp.predict_parts(obs, type='break_down')
```
5. Print calculated explanations.
```{python, eval=FALSE, highlight=TRUE}
obs_bd.result
```
6. Plot calculated explanations.
```{python, eval=FALSE, highlight=TRUE}
obs_bd.plot()
```