This repo provides implementations of Multiplicative Compositional Policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. All code is written in Python 3, using PyTorch, NumPy, and Stable-Baselines3. Experiments are simulated with the MuJoCo physics engine. The project is built on DRLoco, an implementation of the DeepMimic framework with Stable-Baselines3.
MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies. NeurIPS 2019.
[Paper] [Our Slides]
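As a quick orientation to the method: a gating network outputs a non-negative weight per primitive, and the composite policy is the normalized product of the Gaussian primitives raised to those weights, which for Gaussians is again a Gaussian with a closed-form precision-weighted mean. The snippet below is only a minimal PyTorch sketch of that composition step; the module name, layer sizes, and dimensions are illustrative placeholders, not this repo's actual API.

```python
import torch
import torch.nn as nn

class MCPHead(nn.Module):
    """Sketch of MCP's multiplicative composition of Gaussian primitives (illustrative only)."""

    def __init__(self, state_dim, goal_dim, action_dim, num_primitives=4, hidden=64):
        super().__init__()
        self.num_primitives = num_primitives
        self.action_dim = action_dim
        # Primitives see only the state; the gate also sees the goal.
        self.primitive_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_primitives * 2 * action_dim),
        )
        self.gate_net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_primitives),
        )

    def forward(self, state, goal):
        batch = state.shape[0]
        out = self.primitive_net(state).view(batch, self.num_primitives, 2, self.action_dim)
        mu, log_std = out[:, :, 0, :], out[:, :, 1, :]                      # [B, K, A]
        sigma = log_std.clamp(-5, 2).exp()
        # Non-negative gate weights (here via a sigmoid).
        w = torch.sigmoid(self.gate_net(torch.cat([state, goal], dim=-1))).unsqueeze(-1)  # [B, K, 1]
        # Weighted product of Gaussians = precision-weighted combination.
        precision = w / sigma.pow(2)                                        # [B, K, A]
        comp_var = 1.0 / precision.sum(dim=1)                               # [B, A]
        comp_mu = comp_var * (precision * mu).sum(dim=1)                    # [B, A]
        return torch.distributions.Normal(comp_mu, comp_var.sqrt())

# Usage: sample a composite action for a batch of states and goals (placeholder dims).
head = MCPHead(state_dim=27, goal_dim=2, action_dim=8)
dist = head(torch.randn(32, 27), torch.randn(32, 2))
action = dist.sample()                                                      # [32, 8]
```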
The character we decided to work with is a simple ant with few degrees of freedom (DoFs). Although the paper employs imitation rewards for its other characters, it trains the ant with a standard RL objective (no imitation), and we trained the ant in the same manner. Additionally, we devised several training variants:
| Model Name | Description |
| --- | --- |
| MCPPO | The paper trains the primitives jointly, end-to-end, which is what drives their specialization. In MCPPO, we instead trained each primitive separately on an individual task. |
| MCP_I | As with the paper's other characters, we incorporate expert demonstrations in the pre-training phase. |
In our experiments, the pre-training phase consists of four heading tasks: heading north, south, east, and west. For MCP Naive, we provided a corpus of reference motions and followed the approach used to pre-train the humanoid in the paper. For the remaining variants, we specified goals and reward functions that encourage the agent to navigate in the desired direction.
To evaluate the agents, we considered four new heading tasks: north-west, north-east, south-west, and south-east. The goal and the reward function are defined in the same way as in pre-training.
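For concreteness, a heading reward of this kind can be written as the agent's planar velocity projected onto the desired direction, optionally minus a control cost. The sketch below assumes that form; the function name, the coefficient, and the exact terms are placeholders rather than the reward we necessarily used.

```python
import numpy as np

def heading_reward(xy_velocity, action, heading_deg, ctrl_cost_weight=0.5):
    """Sketch of a heading-task reward: reward speed along the desired direction.

    `heading_deg` selects the task (0 = east, 90 = north, 180 = west, 270 = south);
    the control-cost term and its weight are illustrative placeholders.
    """
    theta = np.deg2rad(heading_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    forward_reward = float(np.dot(xy_velocity, direction))   # speed toward the goal heading
    ctrl_cost = ctrl_cost_weight * float(np.sum(np.square(action)))
    return forward_reward - ctrl_cost

# Example: the north-east evaluation task is just heading_deg=45 with the same reward form.
r = heading_reward(np.array([0.8, 0.6]), np.zeros(8), heading_deg=45)
```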
Since there is no mocap data for the ant, we needed to train experts to generate reference data. Accordingly, we trained four separate MLP policies with PPO, each of which learned to navigate north, south, east, or west. We treat each policy as an expert in a particular direction and use it to produce actions that can be regarded as reference trajectories.
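The directional experts are ordinary Stable-Baselines3 PPO agents. One possible setup is sketched below: a small Gymnasium wrapper that swaps the Ant forward reward for progress along a fixed heading, one PPO run per direction, and a deterministic rollout to dump a reference trajectory. The environment id, wrapper, file names, and hyperparameters are assumptions for illustration, not necessarily what this repo uses.

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class HeadingAnt(gym.Wrapper):
    """Replace the Ant forward reward with progress along a fixed heading (sketch)."""

    def __init__(self, env, heading_deg):
        super().__init__(env)
        theta = np.deg2rad(heading_deg)
        self.direction = np.array([np.cos(theta), np.sin(theta)])

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Ant-v4 reports per-step x/y velocity in `info`; project it onto the heading.
        xy_vel = np.array([info.get("x_velocity", 0.0), info.get("y_velocity", 0.0)])
        reward = float(np.dot(xy_vel, self.direction)) + info.get("reward_survive", 0.0)
        return obs, reward, terminated, truncated, info

# Train one expert per direction, then roll it out to record a reference trajectory.
for name, deg in {"east": 0, "north": 90, "west": 180, "south": 270}.items():
    env = HeadingAnt(gym.make("Ant-v4"), heading_deg=deg)
    model = PPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=1_000_000)
    model.save(f"expert_{name}")

    obs, _ = env.reset(seed=0)
    trajectory = []
    for _ in range(1000):
        action, _ = model.predict(obs, deterministic=True)
        trajectory.append((obs.copy(), action.copy()))
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break
    np.save(f"reference_{name}.npy", np.array(trajectory, dtype=object))
```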
The PPO-generated reference trajectories used for MCP Naive turned out to be very noisy, leading to poor performance. In addition, the ant's sensitivity to the reward scaling factors prevented it from learning by imitating the reference trajectories.
To install the requirements, please refer to the DRLoco installation documentation.
```bash
python mcppo.py
python mcp_naive.py
```

```bash
cd mcp
python train_mcp.py
python train_mcppo.py
python scratch_ant.py
python transfer.py
```

```bash
bash gen_plots.sh
bash make_traj.sh
```

```bash
cd mcp
bash gen_plots.sh
bash make_traj.sh
```