03a-Do-it-with-R.Rmd

# Do-it-yourself

Most of the methods presented in this book are available in both R and Python and can be used in a uniform way. But each of these languages has also many other tools for Explanatory Model Analysis.  

In this book, we introduce various methods for instance-level and dataset-level exploration and explanation of predictive models. In each chapter, there is a section with code snippets for R and Python that shows how to use a particular method. 

## Do-it-yourself with R  {#doItYourselfWithR}

In this section, we provide a short description of the steps that are needed to set-up the R environment with the required libraries.

### What to install?

Obviously, the R software [@RcoreT] is needed. It is always a good idea to use the newest version. At least R in version 3.6 is recommended. It can be downloaded from the CRAN website [https://cran.r-project.org/](https://cran.r-project.org/).

A good editor makes working with R much easier. There is plenty of choices, but, especially for beginners, it is worth considering the RStudio editor, an open-source and enterprise-ready tool for R. It can be downloaded from  https://www.rstudio.com/.

Once R and the editor are available, the required packages should be installed.

The most important one is the `DALEX` package in version 1.0 or newer. It is the entry point to solutions introduced in this book. The  package can be installed by executing the following command from the R command line:

```{r, eval=FALSE}
install.packages("DALEX")
```

Installation of `DALEX` will automatically take care about installation of other requirements (packages required by it), like the `ggplot2` package for data visualization, or `ingredients` and `iBreakDown` with specific methods for model exploration.

### How to work with `DALEX`? {#infoDALEX}

To conduct model exploration with `DALEX`, first, a model has to be created. Then the model has got to be prepared for exploration.

There are many packages in R that can be used to construct a model. Some packages are algorithm-specific, like `randomForest` for random forest classification and regression models [@randomForest], `gbm` for generalized boosted regression models [@gbm], `rms` with extensions for generalized linear models [@rms], and many others. There are also packages that can be used for constructing models with different algorithms, these include the `h2o` package [@h2oPackage], `caret` [@caret] and its successor `parsnip` [@parsnipPackage], a very powerful and extensible framework `mlr` [@mlr], or `keras` that is a wrapper to Python library with the same name [@kerasPackage].

While it is great to have such a large choice of tools for constructing models, the disadvantage is that different packages have different interfaces and different arguments. Moreover, model-objects created with different packages may have different internal structures. The main goal of the `DALEX` package is to create a level of abstraction around a model that makes it easier to explore and explain the model. Figure \@ref(fig:DALEXarchitecture) illustrates the contents of the package. In particular, function `DALEX::explain` is THE function for model wrapping. There is only one argument that is required by the function; it is `model`, which is used to specify the model-object with the fitted form of the model. However, the function allows additional arguments that extend its functionalities. They are discussed in Section \@ref(ExplainersTitanicRCode).

<!---
* `data`, a data frame to which the model is to be applied; 
* `y`, observed values of the dependent variable for the validation data; it is an optional argument, required for explainers focused on model validation and benchmarking.
* `predict_function`, a function that returns prediction scores; if not specified, then a default `predict()` function is used. Note that, for some models, the default `predict()` function returns classes; in such cases, you should provide a function that will return numerical scores. 
* `label`, a name of a model; if not specified, then it is extracted from the `class(model)`. This name will be presented in figures, so it is recommended to make the name informative.
--->

(ref:DALEXarchitecture) The `DALEX` package creates a layer of abstraction around models, allowing you to work with different models in a uniform way. The key function is the `explain()` function, which wraps any model into a uniform interface. Then other functions from the `DALEX` package can be applied to the resulting object to explore the model.

```{r DALEXarchitecture, echo=FALSE, fig.cap='(ref:DALEXarchitecture)', out.width = '99%', fig.align='center'}
knitr::include_graphics("figure/DALEX_architecture.png")
```


### How to work with `archivist`?

As we will focus on the exploration of predictive models, we prefer not to waste space nor time on replication of the code necessary for model development. This is where the `archivist` packages help.

The `archivist` package [@archivist] is designed to store, share, and manage R objects. We will use it to easily access R objects for pre-constructed models and pre-calculated explainers. To install the package, the following command should be executed in the R command line:

```{r, eval=FALSE}
install.packages("archivist")
```

Once the package has been installed, function `aread()` can be used to retrieve R objects from any remote repository. For this book, we use a GitHub repository `models` hosted at https://github.com/pbiecek/models. For instance, to download a model with the md5 hash `ceb40`, the following command has to be executed:

```{r, eval=FALSE}
archivist::aread("pbiecek/models/ceb40")
```

Since the md5 hash `ceb40` uniquely defines the model, referring to the repository object results in using exactly the same model and the same explanations. Thus, in the subsequent chapters, pre-constructed models will be accessed with `archivist` hooks. In the following sections, we will also use `archivist` hooks when referring to datasets.


## Do-it-yourself with Python  {#doItYourselfWithPython}

In this section, we provide a short description of steps that are needed to set-up the Python environment with the required libraries.

### What to install?

The Python interpreter [@python3] is needed. It is always a good idea to use the newest version. At least Python in version 3.6 is recommended. It can be downloaded from the Python website [https://python.org/](https://python.org/).
A popular environment for a simple Python installation and configuration is Anaconda, which can be downloaded from  website [https://www.anaconda.com/](https://www.anaconda.com/).

There are many editors available for Python that allow editing the code in a convenient way. In the data science community a very popular solution is Jupyter Notebook. It is a web application that allows creating and sharing documents that contain live code, visualizations, and descriptions. Jupyter Notebook can be installed from the website [https://jupyter.org/](https://jupyter.org/).

Once Python and the editor are available, the required libraries should be installed. The most important one is the `dalex` library, currently in version `0.2.0`. The  library can be installed with `pip` by executing the following instruction from the command line:

```
pip install dalex
```

Installation of `dalex` will automatically take care about other required libraries.

### How to work with `dalex`? {#infoDALEXpy}

There are many libraries in Python that can be used to construct a predictive model. Among the most popular ones are  algorithm-specific libraries like `catboost` [@catbooost], `xgboost` [@xgboost], and `keras` [@chollet2015keras], or libraries with multiple ML algorithms like `scikit-learn` [@scikitlearn].

While it is great to have such a large choice of tools for constructing models, the disadvantage is that different libraries have different interfaces and different arguments. Moreover, model-objects created with different library  may have different internal structures. The main goal of the `dalex` library is to create a level of abstraction around a model that makes it easier to explore and explain the model.

Constructor `Explainer()` is THE method for model wrapping. There is only one argument that is required by the function; it is `model`, which is used to specify the model-object with the fitted form of the model. However, the function takes also additional arguments that extend its functionalities. They are discussed in Section \@ref(ExplainersTitanicPythonCode). If these additional arguments are not provided by the user, the `dalex` library will try to extract them from the model. It is a good idea to specify them directly to avoid surprises. 

As soon as the model is wrapped by using the `Explainer()` function, all further functionalities can be performed on the resulting object. They will be presented in subsequent chapters in subsections *Code snippets for Python*.

### Code snippets for Python

```{r python_setup, include=FALSE, eval=FALSE}
library(reticulate)
use_python("/Library/Frameworks/Python.framework/Versions/3.6/bin/python3")
```

A detailed description of model exploration will be presented in the next chapters. In general, however, the way of working with the `dalex` library can be described in the following steps: 

1. Import the `dalex` library.

```{python, eval=FALSE, highlight=TRUE}
import dalex as dx 
```

2. Create an `Explainer` object. This serves as a wrapper around the model.

```{python, eval=FALSE, highlight=TRUE}
exp = dx.Explainer(model, X, y)
```

3. Calculate predictions for the model.

```{python, eval=FALSE, highlight=TRUE}
exp.predict(henry)
```

4. Calculate specific explanations.

```{python, eval=FALSE, highlight=TRUE}
obs_bd = exp.predict_parts(obs, type='break_down')
```

5. Print calculated explanations.

```{python, eval=FALSE, highlight=TRUE}
obs_bd.result
```

6. Plot calculated explanations.

```{python, eval=FALSE, highlight=TRUE}
obs_bd.plot()
```