Commit

Refine documents of PARL (#43)
* remove not used files, add benchmark for DQN and DDPG, add Parameters management Readme

* Update README.md

* Update README.md

* add parl dependency in examples, use np shuffle instead of sklearn

* fix codestyle

* refine readme of nips example

* fix bug

* fix code style

* Update README.md

* Update README.md

* Update README.md

* refine document and remove outdated design doc

* Update README.md

* Update README.md

* refine comment

* release version 1.0

* gif of examples

* Update README.md

* update Readme
Hongsheng Zeng authored and Bo Zhou committed Jan 18, 2019
1 parent 4163d73 commit 7a7583a
Showing 35 changed files with 80 additions and 1,521 deletions.
Binary file added .github/Aircraft.gif
Binary file added .github/Breakout.gif
Binary file added .github/Half-Cheetah.gif
Binary file added .github/NeurlIPS2018.gif
6 changes: 5 additions & 1 deletion README.md
@@ -67,7 +67,7 @@ agent = AtariAgent(algorithm)
# Install:
### Dependencies
- Python 2.7 or 3.5+.
- PaddlePaddle >=1.2.1 (We try to make our repository always compatible with newest version PaddlePaddle)
- PaddlePaddle >=1.2.1 (We try to make our repository always compatible with latest version PaddlePaddle)


```
@@ -80,3 +80,7 @@ pip install --upgrade git+https://github.com/PaddlePaddle/PARL.git
- [DDPG](examples/DDPG/)
- [PPO](examples/PPO/)
- [Winning Solution for NIPS2018: AI for Prosthetics Challenge](examples/NeurIPS2018-AI-for-Prosthetics-Challenge/)

<img src=".github/NeurlIPS2018.gif" width = "300" height ="200" alt="NeurlIPS2018"/> <img src=".github/Half-Cheetah.gif" width = "300" height ="200" alt="Half-Cheetah"/> <img src=".github/Breakout.gif" width = "200" height ="200" alt="Breakout"/>
<br>
<img src=".github/Aircraft.gif" width = "808" height ="300" alt="NeurlIPS2018"/>
Binary file removed docs/ct.png
Binary file not shown.
301 changes: 0 additions & 301 deletions docs/design_doc.md

This file was deleted.

Binary file removed docs/framework.png
Binary file not shown.
Binary file removed docs/model.png
Binary file not shown.
Binary file removed docs/relation.png
Binary file not shown.
Binary file removed docs/step.png
Binary file not shown.
Binary file added examples/DDPG/.benchmark/DDPG_HalfCheetah-v2.png
4 changes: 4 additions & 0 deletions examples/DDPG/README.md
@@ -7,11 +7,15 @@ Based on PARL, the DDPG model of deep reinforcement learning is reproduced, and
### Mujoco games introduction
Please see [here](https://github.com/openai/mujoco-py) to know more about Mujoco game.

### Benchmark result
- HalfCheetah-v2
<img src=".benchmark/DDPG_HalfCheetah-v2.png"/>

## How to use
### Dependencies:
+ python2.7 or python3.5+
+ [paddlepaddle>=1.0.0](https://github.com/PaddlePaddle/Paddle)
+ [parl](https://github.com/PaddlePaddle/PARL)
+ gym
+ tqdm
+ mujoco-py>=1.50.1.0
Binary file added examples/DQN/.benchmark/DQN_Pong.png
7 changes: 6 additions & 1 deletion examples/DQN/README.md
@@ -7,15 +7,20 @@ Based on PARL, the DQN model of deep reinforcement learning is reproduced, and t
### Atari games introduction
Please see [here](https://gym.openai.com/envs/#atari) to know more about Atari game.

### Benchmark result
- Pong
<img src=".benchmark/DQN_Pong.png"/>

## How to use
### Dependencies:
+ python2.7 or python3.5+
+ [paddlepaddle>=1.0.0](https://github.com/PaddlePaddle/Paddle)
+ [parl](https://github.com/PaddlePaddle/PARL)
+ gym
+ tqdm
+ opencv-python
+ ale_python_interface
+ atari_py
+ [ale_python_interface](https://github.com/mgbellemare/Arcade-Learning-Environment)


### Start Training:
14 changes: 7 additions & 7 deletions examples/NeurIPS2018-AI-for-Prosthetics-Challenge/README.md
@@ -11,7 +11,7 @@ For more technical details about our solution, we provide:
3. [[Link]](https://drive.google.com/file/d/1W-FmbJu4_8KmwMIzH0GwaFKZ0z1jg_u0/view?usp=sharing) A poster briefly introducing our solution in NeurIPS2018 competition workshop.
3. (coming soon) A full academic paper detailing our solution, including the entire training pipeline, related work, and experiments that analyze the importance of each key ingredient.

**Note**: Reproducibility is a long-standing issue in reinforcement learning field. We have tried to guarantee that our code is reproducible, testing each training sub-task three times. However, there are still some factors that prevent us from achieving the same performance. One problem is the choice time of a convergence model during curriculum learning. Choosing a sensible and natural gait visually is crucial for subsequent training, but the definition of what is a good gait varies from different people.
**Note**: Reproducibility is a long-standing issue in reinforcement learning field. We have tried to guarantee that our code is reproducible, testing each training sub-task three times. However, there are still some factors that prevent us from achieving the same performance. One problem is the choice time of a convergence model during curriculum learning. Choosing a sensible and natural gait visually is crucial for subsequent training, but the definition of what is a good gait varies from person to person.

<p align="center">
<img src="image/demo.gif" alt="PARL" width="500"/>
@@ -60,7 +60,7 @@ For final submission, we test our model in 500 CPUs, running 10 episodes per CPU
python simulator_server.py --port [PORT] --ensemble_num 1

# client (Suggest: 200+ clients)
python simulator_client.py --port [PORT] --ip [IP] --reward_type RunFastest
python simulator_client.py --port [PORT] --ip [SERVER_IP] --reward_type RunFastest
```

#### 2. Target: run at 3.0 m/s
@@ -71,7 +71,7 @@ python simulator_server.py --port [PORT] --ensemble_num 1 --warm_start_batchs 10
--restore_model_path [RunFastest model]

# client (Suggest: 200+ clients)
python simulator_client.py --port [PORT] --ip [IP] --reward_type FixedTargetSpeed --target_v 3.0 \
python simulator_client.py --port [PORT] --ip [SERVER_IP] --reward_type FixedTargetSpeed --target_v 3.0 \
--act_penalty_lowerbound 1.5
```

@@ -83,7 +83,7 @@ python simulator_server.py --port [PORT] --ensemble_num 1 --warm_start_batchs 10
--restore_model_path [FixedTargetSpeed 3.0m/s model]

# client (Suggest: 200+ clients)
python simulator_client.py --port [PORT] --ip [IP] --reward_type FixedTargetSpeed --target_v 2.0 \
python simulator_client.py --port [PORT] --ip [SERVER_IP] --reward_type FixedTargetSpeed --target_v 2.0 \
--act_penalty_lowerbound 0.75
```

@@ -99,7 +99,7 @@ python simulator_server.py --port [PORT] --ensemble_num 1 --warm_start_batchs 10
--restore_model_path [FixedTargetSpeed 2.0m/s model]

# client (Suggest: 200+ clients)
python simulator_client.py --port [PORT] --ip [IP] --reward_type FixedTargetSpeed --target_v 1.25 \
python simulator_client.py --port [PORT] --ip [SERVER_IP] --reward_type FixedTargetSpeed --target_v 1.25 \
--act_penalty_lowerbound 0.6
```

@@ -109,10 +109,10 @@ As mentioned before, the selection of model that used to fine-tune influence lat
```bash
# server
python simulator_server.py --port [PORT] --ensemble_num 12 --warm_start_batchs 1000 \
--restore_model_path [FixedTargetSpeed 1.25m/s] --restore_from_one_head
--restore_model_path [FixedTargetSpeed 1.25m/s model] --restore_from_one_head

# client (Suggest: 100+ clients)
python simulator_client.py --port [PORT] --ip [IP] --reward_type Round2 --act_penalty_lowerbound 0.75 \
python simulator_client.py --port [PORT] --ip [SERVER_IP] --reward_type Round2 --act_penalty_lowerbound 0.75 \
--act_penalty_coeff 7.0 --vel_penalty_coeff 20.0 --discrete_data --stage 3
```
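
The client commands in this README differ only in their reward-related flags. As a rough sketch, the flags shown above can be collected into one table and turned into argument lists. The helper name `build_client_cmd`, the stage labels, and the port and IP values are illustrative assumptions, not part of the repository; the code only builds the command lists and does not launch any process.

```python
# Client-side flags per curriculum stage, copied from the commands in this README.
# The stage labels (first tuple element) are illustrative names only.
CLIENT_STAGE_FLAGS = [
    ("RunFastest", ["--reward_type", "RunFastest"]),
    ("FixedTargetSpeed 3.0", ["--reward_type", "FixedTargetSpeed", "--target_v", "3.0",
                              "--act_penalty_lowerbound", "1.5"]),
    ("FixedTargetSpeed 2.0", ["--reward_type", "FixedTargetSpeed", "--target_v", "2.0",
                              "--act_penalty_lowerbound", "0.75"]),
    ("FixedTargetSpeed 1.25", ["--reward_type", "FixedTargetSpeed", "--target_v", "1.25",
                               "--act_penalty_lowerbound", "0.6"]),
    ("Round2", ["--reward_type", "Round2", "--act_penalty_lowerbound", "0.75",
                "--act_penalty_coeff", "7.0", "--vel_penalty_coeff", "20.0",
                "--discrete_data", "--stage", "3"]),
]


def build_client_cmd(port, server_ip, stage_flags):
    """Hypothetical helper: return the argv list for one simulator client."""
    base = ["python", "simulator_client.py", "--port", str(port), "--ip", server_ip]
    return base + list(stage_flags)


if __name__ == "__main__":
    # Port and IP below are placeholders for [PORT] and [SERVER_IP].
    for label, flags in CLIENT_STAGE_FLAGS:
        print(label, "->", " ".join(build_client_cmd(8037, "127.0.0.1", flags)))
```

Each stage's server is started separately with the `--restore_model_path` of the previous stage, as in the commands above.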

3 changes: 2 additions & 1 deletion examples/PPO/README.md
@@ -16,8 +16,9 @@ Please see [here](https://github.com/openai/mujoco-py) to know more about Mujoco

## How to use
### Dependencies:
+ python2.7 or python3.5+
+ python3.5+
+ [paddlepaddle>=1.0.0](https://github.com/PaddlePaddle/Paddle)
+ [parl](https://github.com/PaddlePaddle/PARL)
+ gym
+ tqdm
+ mujoco-py>=1.50.1.0
11 changes: 7 additions & 4 deletions examples/PPO/mujoco_agent.py
@@ -15,7 +15,6 @@
import numpy as np
import parl.layers as layers
from paddle import fluid
-from sklearn.utils import shuffle
from parl.framework.agent_base import Agent
from parl.utils import logger

@@ -183,12 +182,16 @@ def value_learn(self, obs, value):

         all_loss = []
         for _ in range(self.value_learn_times):
-            obs_train, value_train = shuffle(obs_train, value_train)
+            random_ids = np.arange(obs_train.shape[0])
+            np.random.shuffle(random_ids)
+            shuffle_obs_train = obs_train[random_ids]
+            shuffle_value_train = value_train[random_ids]
             start = 0
             while start < data_size:
                 end = start + self.value_batch_size
-                value_loss = self._batch_value_learn(obs_train[start:end, :],
-                                                     value_train[start:end])
+                value_loss = self._batch_value_learn(
+                    shuffle_obs_train[start:end, :],
+                    shuffle_value_train[start:end])
                 all_loss.append(value_loss)
                 start += self.value_batch_size
         return np.mean(all_loss)
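
The hunk above replaces `sklearn.utils.shuffle` with a shuffled index array that is applied to both arrays, so each observation stays paired with its value target. A minimal standalone sketch of that pattern follows; the helper names `shuffle_in_unison` and `iterate_minibatches` and the toy data are assumptions for illustration and do not exist in PARL.

```python
import numpy as np


def shuffle_in_unison(obs, values):
    """Return row-shuffled copies of obs and values that stay aligned."""
    # Build an index array and shuffle it in place, as the commit does.
    random_ids = np.arange(obs.shape[0])
    np.random.shuffle(random_ids)
    # Indexing both arrays with the same ids keeps obs[i] paired with values[i].
    return obs[random_ids], values[random_ids]


def iterate_minibatches(obs, values, batch_size):
    """Yield consecutive (obs, values) slices; the last one may be shorter."""
    start = 0
    while start < obs.shape[0]:
        end = start + batch_size
        yield obs[start:end, :], values[start:end]
        start += batch_size


if __name__ == "__main__":
    # Toy data: 5 observations with 2 features each, and one target per row.
    obs = np.arange(10, dtype=np.float32).reshape(5, 2)
    values = obs[:, 0] * 10.0
    s_obs, s_values = shuffle_in_unison(obs, values)
    for batch_obs, batch_values in iterate_minibatches(s_obs, s_values, 2):
        print(batch_obs.shape, batch_values.shape)
```

Because the originals are indexed rather than shuffled in place, `obs` and `values` keep their order, and a fresh permutation can be drawn on every pass.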
1 change: 1 addition & 0 deletions examples/QuickStart/README.md
@@ -6,6 +6,7 @@ Based on PARL, train a agent to play CartPole game with policy gradient algorith

+ python2.7 or python3.5+
+ [paddlepaddle>=1.0.0](https://github.com/PaddlePaddle/Paddle)
+ [parl](https://github.com/PaddlePaddle/PARL)
+ gym

### Start Training:
13 changes: 0 additions & 13 deletions parl/algorithm_zoo/__init__.py

This file was deleted.

174 changes: 0 additions & 174 deletions parl/algorithm_zoo/simple_algorithms.py

This file was deleted.

13 changes: 0 additions & 13 deletions parl/common/__init__.py

This file was deleted.
