The OmniSafe Safety-Gymnasium Benchmark for offline algorithms evaluates the effectiveness of OmniSafe's offline algorithms across multiple environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we provide:
- Default hyperparameters used for the benchmark and scripts to reproduce the results.
- Graphs and raw data that can be used for research purposes.
- Log details obtained during training.
- Some hints on how to fine-tune the algorithm for optimal results.
Supported algorithms are listed below:
- [ICML 2019] Batch-Constrained deep Q-learning (BCQ)
- The Lagrange version of BCQ (BCQ-Lag)
- [NeurIPS 2020] Critic Regularized Regression (CRR)
- The Constrained version of CRR (C-CRR)
- [ICLR 2022 (Spotlight)] COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation
We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, type:

```bash
pip install safety_gymnasium
```
To collect offline data, you first need a trained agent. For example, train a PPOLag expert:

```bash
omnisafe train --env-id SafetyAntVelocity-v1 --algo PPOLag
```
The `PATH_TO_AGENT` is the path of the directory containing the `torch_save` folder.
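For instance, after the training command above, OmniSafe typically saves checkpoints under a run directory such as `./runs/PPOLag-{SafetyAntVelocity-v1}/seed-000-<timestamp>/torch_save/epoch-500.pt`; the exact directory name depends on your OmniSafe version and the run timestamp, so treat this path as illustrative. Then run the following script to collect offline data: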
```python
from omnisafe.common.offline.data_collector import OfflineDataCollector

# please change the agent path to your own trained agent.
env_name = 'SafetyAntVelocity-v1'
size = 1_000_000
agents = [
    ('PATH_TO_AGENT', 'epoch-500.pt', 1_000_000),
]
save_dir = './data'

if __name__ == '__main__':
    col = OfflineDataCollector(size, env_name)
    for agent, model_name, num in agents:
        col.register_agent(agent, model_name, num)
    col.collect(save_dir)
```
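Once collection finishes, it can be worth sanity-checking the saved dataset. The sketch below assumes the collector writes a NumPy `.npz` archive named after the environment; the file name and field names are assumptions, so adjust them to whatever your OmniSafe version actually produces:

```python
import numpy as np

# assumption: collect() wrote ./data/SafetyAntVelocity-v1.npz
data = np.load('./data/SafetyAntVelocity-v1.npz')
print(data.files)           # e.g. ['obs', 'action', 'reward', 'cost', 'done', ...]
print(data['obs'].shape)    # expected: (1_000_000, obs_dim)
print(data['cost'].mean())  # average per-step cost of the behavior policy
```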
You can set the main function of `examples/benchmarks/experiment_grid.py` as:
```python
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Offline-Benchmarks')

    # set up the algorithms.
    offline_policy = ['VAEBC', 'BCQ', 'BCQLag', 'CRR', 'CCRR', 'COptiDICE']
    eg.add('algo', offline_policy)

    # you can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # you can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])

    # add the path of the dataset collected above.
    eg.add('train_cfgs:dataset', [dataset_path])

    # set up the environment.
    eg.add('env_id', [
        'SafetyAntVelocity-v1',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # the total number of experiments must be divisible by num_pool;
    # choose num_pool according to your machine's resources.
    eg.run(train, num_pool=5)
```
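This snippet assumes `experiment_grid.py` keeps its existing imports and `train` entry point as shipped in OmniSafe's examples, and that `dataset_path` is set to the dataset file collected earlier (e.g. the `.npz` produced in `./data`).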
After that, you can run the following commands to run the benchmark:

```bash
cd examples/benchmarks
python run_experiment_grid.py
```
You can also plot the results by running the following commands:

```bash
cd examples
python analyze_experiment_results.py
```
For detailed usage of the OmniSafe statistics tool, please refer to this tutorial.
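As a minimal sketch of that tool (assuming the `StatisticsTools` class from `omnisafe.common.statistics_tools` and the grid output directory from the run above; the path below is a placeholder):

```python
from omnisafe.common.statistics_tools import StatisticsTools

# placeholder: the directory produced by the experiment grid run
PATH = './examples/benchmarks/exp-x/Offline-Benchmarks'

st = StatisticsTools()
st.load_source(PATH)
# compare up to `compare_num` values of `parameter` in a single graph;
# `values` and `compare_num` cannot be specified at the same time.
st.draw_graph(parameter='algo', values=None, compare_num=6, cost_limit=None)
```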
Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 43.66±0.90 | 109.86±13.24 |
C-CRR | 45.48±0.87 | 127.30±12.60 |
BCQLag | 43.31±0.76 | 113.39±12.81 |
COptiDICE | 40.68±0.93 | 67.11±13.15 |

Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 42.84±1.36 | 62.34±14.84 |
C-CRR | 45.99±1.36 | 97.20±13.57 |
BCQLag | 44.68±1.97 | 95.06±33.07 |
COptiDICE | 39.55±1.39 | 53.87±13.27 |

Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 40.23±0.75 | 41.25±10.12 |
C-CRR | 40.66±0.88 | 49.90±10.81 |
BCQLag | 42.94±1.04 | 85.37±23.41 |
COptiDICE | 40.98±0.89 | 70.40±12.14 |

Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 19.62±0.28 | 150.54±7.63 |
C-CRR | 18.53±0.45 | 122.63±13.14 |
BCQLag | 18.88±0.61 | 125.44±15.68 |
COptiDICE | 17.25±0.37 | 90.86±10.75 |

Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 18.69±0.33 | 125.97±10.36 |
C-CRR | 17.24±0.43 | 89.47±11.55 |
BCQLag | 18.14±0.96 | 108.07±20.70 |
COptiDICE | 16.38±0.43 | 70.54±12.36 |

Algorithms | Reward (OmniSafe) | Cost (OmniSafe) |
---|---|---|
VAE-BC | 17.31±0.33 | 85.53±11.33 |
C-CRR | 15.74±0.42 | 48.38±10.31 |
BCQLag | 17.10±0.84 | 77.54±14.07 |
COptiDICE | 15.58±0.37 | 49.42±8.699 |
- The Lagrange method is widely recognized as the simplest approach for building safe algorithms, but its efficacy is limited in offline settings. The reason is that the Q-critic, trained by TD-learning, cannot guarantee absolutely accurate Q-values. For the policy this is tolerable: since the policy only compares Q-values between actions, relatively accurate Q-values are sufficient for learning good behavior. For a Lagrange-based algorithm to learn meaningful Lagrange multipliers, however, the Q-values must be absolutely accurate, because the multiplier update measures them against a fixed cost threshold (see the first sketch after this list).
- Currently, there is no standardized dataset for evaluating safe offline algorithms, and this lack of uniformity makes existing algorithms sensitive to the dataset employed. When the dataset contains a large number of unsafe trajectories, the algorithms struggle to learn safe behavior. We advocate a construction method that trains two online policies: one achieving high reward without safety guarantees, and one achieving slightly lower reward while remaining safe. The two policies are then used to collect the dataset in a 1:1 ratio (see the second sketch below).
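To make the Lagrange point concrete, here is a minimal sketch (with illustrative variable names, not OmniSafe's internal implementation) of a typical multiplier update: the multiplier moves with the absolute magnitude of the estimated cost Q-value, while the policy's action choice is invariant to any constant bias in the critic.

```python
import torch

cost_limit = 25.0  # illustrative cost threshold d
lambda_lr = 0.01   # multiplier step size

def update_lagrange(qc_values: torch.Tensor, lagrange: torch.Tensor) -> torch.Tensor:
    """One gradient-ascent step: lambda <- max(0, lambda + lr * (E[Qc] - d))."""
    # The update depends on the *absolute* scale of E[Qc]: a constant
    # over-estimation bias in the cost critic inflates lambda directly.
    violation = qc_values.mean() - cost_limit
    return torch.clamp(lagrange + lambda_lr * violation, min=0.0)

# The greedy policy, by contrast, only needs *relatively* accurate Q-values:
# adding a constant bias to every Q-value leaves the argmax unchanged.
q = torch.tensor([1.0, 3.0, 2.0])
assert torch.argmax(q) == torch.argmax(q + 100.0)
```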
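The 1:1 construction can be implemented with the same `OfflineDataCollector` used above by registering both policies with equal transition budgets. The agent paths and checkpoint names below are placeholders for your own runs:

```python
from omnisafe.common.offline.data_collector import OfflineDataCollector

env_name = 'SafetyAntVelocity-v1'
size = 1_000_000

# placeholder paths: a high-reward unconstrained run (e.g. PPO) and a
# constraint-satisfying run (e.g. PPOLag), mixed in a 1:1 ratio.
agents = [
    ('PATH_TO_UNSAFE_AGENT', 'epoch-500.pt', 500_000),
    ('PATH_TO_SAFE_AGENT', 'epoch-500.pt', 500_000),
]

if __name__ == '__main__':
    col = OfflineDataCollector(size, env_name)
    for agent, model_name, num in agents:
        col.register_agent(agent, model_name, num)
    col.collect('./data')
```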