diff --git a/README.md b/README.md
index b0fac467..1298d8a0 100644
--- a/README.md
+++ b/README.md
@@ -50,23 +50,22 @@ The following figure presents examples of displayed fashion items as actions.

-We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively.
-Each campaign randomly used either the Uniform Random algorithm or the Bernoulli Thompson Sampling (Bernoulli TS) algorithm, which was pre-trained for about a month before the data collection period.
+We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively.
+Each campaign randomly used either the Uniform Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy, which was pre-trained for about a month before the data collection period.

 The small size version of our data is available at [./obd](https://github.com/st-tech/zr-obp/tree/master/obd).
-This can be used for running [examples](https://github.com/st-tech/zr-obp/tree/master/examples).
+This can be used for running some [examples](https://github.com/st-tech/zr-obp/tree/master/examples).
 We release the full size version of our data at [https://research.zozo.com/data.html](https://research.zozo.com/data.html).
 Please download the full size version for research uses.
 Please see [./obd/README.md](https://github.com/st-tech/zr-obp/blob/master/obd/README.md) for the description of the dataset.
 
 
 ## Open Bandit Pipeline (OBP)
-
-*Open Bandit Pipeline* is a series of implementations of dataset preprocessing, OPE estimators, and the evaluation of OPE estimators.
+*Open Bandit Pipeline* is a series of implementations of dataset preprocessing, policy learning methods, OPE estimators, and the evaluation of OPE protocols.
 This pipeline allows researchers to focus on building their own OPE estimator and easily compare it with others’ methods in realistic and reproducible ways.
 Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.
@@ -82,7 +81,7 @@ Thus, it facilitates reproducible research on bandit algorithms and off-policy e
 Open Bandit Pipeline consists of the following main modules.
 
 - **dataset module**: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.
-- **policy module**: This module provides interfaces for online and offline bandit algorithms. It also implements several standard policy learning methods.
+- **policy module**: This module provides interfaces for training online and offline bandit policies. It also implements several standard policy learning methods.
 - **simulator module**: This module provides functions for conducting offline bandit simulation.
 - **ope module**: This module provides interfaces for OPE estimators. It also implements several standard and advanced OPE estimators.
 
@@ -131,6 +130,8 @@ Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison r
 
 - **Off-Policy Evaluation**: We present implementations of behavior policies used when collecting datasets as a part of our pipeline. Our open data also contains logged bandit feedback data generated by *multiple* different bandit policies. Therefore, it enables the evaluation of off-policy evaluation with ground-truth for the performance of evaluation policies.
 
+Please refer to our [documentation](https://zr-obp.readthedocs.io/en/latest/ope.html) for the basic formulation of OPE.
+
 
 # Installation
@@ -162,7 +163,7 @@ python setup.py install
 
 # Usage
 
-We show an example of conducting offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy using the *Inverse Probability Weighting (IPW)* and logged bandit feedback generated by the Random policy (behavior policy).
+We show an example of conducting offline evaluation of the performance of BernoulliTS as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit feedback generated by the Random policy (behavior policy).
 We see that only ten lines of code are sufficient to complete OPE from scratch.
 
 ```python
@@ -206,9 +207,9 @@ Below, we explain some important features in the example.
 We prepare an easy-to-use data loader for Open Bandit Dataset.
 
 ```python
-# load and preprocess raw data in "ALL" campaign collected by the Random policy
+# load and preprocess raw data in "All" campaign collected by the Random policy
 dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
-# obtain logged bandit feedback generated by the behavior policy
+# obtain logged bandit feedback
 bandit_feedback = dataset.obtain_batch_bandit_feedback()
 
 print(bandit_feedback.keys())
@@ -216,7 +217,7 @@ dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'c
 ```
 
 Users can implement their own feature engineering in the `pre_process` method of `obp.dataset.OpenBanditDataset` class.
-We show an example of implementing some new feature engineering processes in [`./examples/examples_with_obd/custom_dataset.py`](https://github.com/st-tech/zr-obp/blob/master/benchmark/cf_policy_search/custom_dataset.py).
+We show an example of implementing some new feature engineering processes in [`custom_dataset.py`](https://github.com/st-tech/zr-obp/blob/master/benchmark/cf_policy_search/custom_dataset.py).
 Moreover, by following the interface of `obp.dataset.BaseBanditDataset` class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset.
 
 `dataset` module also provide a class to generate synthetic bandit datasets.
@@ -236,7 +237,7 @@ evaluation_policy = BernoulliTS(
     campaign="all",
     random_state=12345
 )
-# compute the distribution over actions by the evaluation policy using Monte Carlo simulation
+# compute the action choice probabilities by the evaluation policy using Monte Carlo simulation
 # action_dist is an array of shape (n_rounds, n_actions, len_list)
 # representing the distribution over actions made by the evaluation policy
 action_dist = evaluation_policy.compute_batch_action_dist(
@@ -244,8 +245,10 @@
 )
 ```
 
-When `is_zozotown_prior=False`, non-informative prior distribution is used.
-The `compute_batch_action_dist` method of `BernoulliTS` computes the action choice probabilities based on given hyperparameters of the beta distribution. `action_dist` is an array representing the distribution over actions made by the evaluation policy.
+The `compute_batch_action_dist` method of `BernoulliTS` computes the action choice probabilities based on given hyperparameters of the beta distribution.
+When `is_zozotown_prior=True`, hyperparameters used during the data collection process on the ZOZOTOWN platform are set.
+Otherwise, non-informative prior hyperparameters are used.
+`action_dist` is an array representing the action choice probabilities made by the evaluation policy.
 
 Users can implement their own bandit algorithms by following the interfaces implemented in [`./obp/policy/base.py`](https://github.com/st-tech/zr-obp/blob/master/obp/policy/base.py).
 
@@ -255,21 +258,22 @@ Our final step is **off-policy evaluation** (OPE), which attempts to estimate th
 Our pipeline also provides an easy procedure for doing OPE as follows.
 
 ```python
-# estimate the policy value of BernoulliTS based on the distribution over actions by that policy
+# estimate the policy value of BernoulliTS based on its action choice probabilities
 # it is possible to set multiple OPE estimators to the `ope_estimators` argument
 ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
 estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
 print(estimated_policy_value)
-{'ipw': 0.004553...} # dictionary containing estimated policy values by each OPE estimator.
+{'ipw': 0.004553...} # dictionary containing policy values estimated by each OPE estimator.
 
 # compare the estimated performance of BernoulliTS (evaluation policy)
-# with the ground-truth performance of Random (behavior policy)
-relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
+# with the ground-truth performance of the Random policy (behavior policy)
+policy_value_improvement = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
 # our OPE procedure suggests that BernoulliTS improves Random by 19.81%
-print(relative_policy_value_of_bernoulli_ts)
+print(policy_value_improvement)
 1.198126...
 ```
 
-Users can implement their own OPE estimator by following the interface of `obp.ope.BaseOffPolicyEstimator` class. `obp.ope.OffPolicyEvaluation` class summarizes and compares the estimated policy values by several off-policy estimators.
+Users can implement their own OPE estimator by following the interface of `obp.ope.BaseOffPolicyEstimator` class.
+`obp.ope.OffPolicyEvaluation` class summarizes and compares the policy values estimated by several different estimators.
 A detailed usage of this class can be found at [quickstart](https://github.com/st-tech/zr-obp/tree/master/examples/quickstart).
 
 `bandit_feedback['reward'].mean()` is the empirical mean of factual rewards (on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example.).
 
diff --git a/benchmark/ope/README.md b/benchmark/ope/README.md
index e81fae6c..8886da0b 100644
--- a/benchmark/ope/README.md
+++ b/benchmark/ope/README.md
@@ -11,8 +11,7 @@ Please download the full [open bandit dataset](https://research.zozo.com/data.ht
 
 Model-dependent estimators such as DM and DR need a pre-trained regression model.
 Here, we train a regression model with some machine learning methods.
-We define hyperparameters for the machine learning methods in [`conf/hyperparams.yaml`](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/conf/hyperparams.yaml).
-[train_regression_model.py](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/train_regression_model.py) implements the training process of the regression model.
+[train_regression_model.py](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/train_regression_model.py) implements the training process of the regression model. ([`conf/hyperparams.yaml`](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/conf/hyperparams.yaml) defines hyperparameters for the machine learning methods.)
 
 ```
 python train_regression_model.py\
@@ -34,8 +33,8 @@ where
 - `$campaign` specifies the campaign considered in ZOZOTOWN and should be one of "all", "men", or "women".
 - `$n_sim_to_compute_action_dist` is the number of monte carlo simulation to compute the action choice probabilities by a given evaluation policy.
 - `$is_timeseries_split` is whether the data is split based on timestamp or not.
 If true, the out-sample performance of OPE is tested. See the relevant paper for details.
-- - `$test_size` specifies the proportion of the dataset to include in the test split when `$is_timeseries_split=True`.
-- `$is_mrdr` is whether the regression model is trained by the more robust doubly robust way or not. See the relevant paper for details.
+- `$test_size` specifies the proportion of the dataset to include in the test split when `$is_timeseries_split=True`.
+- `$is_mrdr` is whether the regression model is trained by the more robust doubly robust way. See the relevant paper for details.
 - `$n_jobs` is the maximum number of concurrently running jobs.
 
 For example, the following command trains the regression model based on logistic regression on the logged bandit feedback data collected by the Random policy (as a behavior policy) in "All" campaign.
@@ -158,9 +157,3 @@ do
 done
 ```
 -->
-
-
diff --git a/obp/policy/offline.py b/obp/policy/offline.py
index 96ca5ee6..49de9ee7 100644
--- a/obp/policy/offline.py
+++ b/obp/policy/offline.py
@@ -245,8 +245,8 @@ def sample_action(
 
         .. math::
 
-            & P (A_1 = a_1 | x) = \\frac{e^{f(x,a_1,1) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A}} e^{f(x,a^{\\prime},1) / \\tau}} , \\\\
-            & P (A_2 = a_2 | A_1 = a_1, x) = \\frac{e^{f(x,a_2,2) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A} \\backslash \\{a_1\\}} e^{f(x,a^{\\prime},2) / \\tau}} ,
+            & P (A_1 = a_1 | x) = \\frac{\\mathrm{exp}(f(x,a_1,1) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A}} \\mathrm{exp}( f(x,a^{\\prime},1) / \\tau)} , \\\\
+            & P (A_2 = a_2 | A_1 = a_1, x) = \\frac{\\mathrm{exp}(f(x,a_2,2) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A} \\backslash \\{a_1\\}} \\mathrm{exp}(f(x,a^{\\prime},2) / \\tau )} ,
             \\ldots
 
         where :math:`A_k` is a random variable representing an action at a position :math:`k`.
@@ -304,7 +304,7 @@ def predict_proba(
 
         .. math::
 
-            P (A = a | x) = \\frac{e^{f(x,a) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A}} e^{f(x,a^{\\prime}) / \\tau}},
+            P (A = a | x) = \\frac{\\mathrm{exp}(f(x,a) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A}} \\mathrm{exp}(f(x,a^{\\prime}) / \\tau)},
 
         where :math:`A` is a random variable representing an action, and :math:`\\tau` is a temperature hyperparameter.
         :math:`f: \\mathcal{X} \\times \\mathcal{A} \\rightarrow \\mathbb{R}_{+}`
diff --git a/obp/version.py b/obp/version.py
index f9aa3e11..e19434e2 100644
--- a/obp/version.py
+++ b/obp/version.py
@@ -1 +1 @@
-__version__ = "0.3.2"
+__version__ = "0.3.3"
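
For reference, the ten-line OPE example that the README changes above walk through can be assembled into a single script. The sketch below is based only on the snippets shown in the diff and is not part of the patch itself; the `n_actions`, `len_list`, and `n_sim`/`n_rounds` arguments, and the assumption that the small-sized `./obd` data is available to the loader, reflect the obp 0.3.x API as I understand it rather than values taken from the diff.

```python
# Minimal sketch of the OPE workflow described in the README above:
# estimate the policy value of BernoulliTS from logged bandit feedback
# collected by the Random policy, using Inverse Probability Weighting (IPW).
from obp.dataset import OpenBanditDataset
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW
from obp.policy import BernoulliTS

# (1) load and preprocess the logged bandit feedback
# (assumes the small-sized Open Bandit Dataset is reachable at the loader's default path)
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) define BernoulliTS as the evaluation policy
# NOTE: n_actions, len_list, and n_sim below are assumed values, not taken from the diff
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,  # use the priors from the ZOZOTOWN data collection
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000,
    n_rounds=bandit_feedback["n_rounds"],
)

# (3) estimate the policy value of BernoulliTS with IPW
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# relative improvement over the on-policy estimate of the Random policy's value
print(estimated_policy_value["ipw"] / bandit_feedback["reward"].mean())
```

Under these assumptions, the printed ratio should come out close to the 1.198126... reported in the README example, i.e., BernoulliTS improving on the Random policy by roughly 19.81%.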