Major Performance Decrease in Tianshou 1.2 Compared to 0.5 on Windows and Linux #1225

Open
ULudo opened this issue Oct 30, 2024 · 5 comments

ULudo commented Oct 30, 2024

Hello,

I used Tianshou 0.5 with a custom environment on a Windows PC. I was impressed by the training speed of the PPO agent, which exceeded 2000 iterations per second.

import tianshou, gymnasium as gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

0.5.0 0.26.3 2.5.1 1.26.4 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:17:14) [MSC v.1941 64 bit (AMD64)] win32

Training using the Tianshou library, version 0.5:

Epoch #1: 10112it [00:03, 2610.28it/s, env_step=10112, len=0, loss=0.303, loss/clip=0.000, loss/ent=0.916, loss/vf=0.623, n/ep=0, n/st=128, rew=0.00]
Epoch #1: test_reward: 27.403868 ± 0.000000, best_reward: 27.403868 ± 0.000000 in #1
Epoch #2: 10112it [00:03, 2634.27it/s, env_step=20224, len=0, loss=0.500, loss/clip=0.000, loss/ent=0.915, loss/vf=1.018, n/ep=0, n/st=128, rew=0.00]
Epoch #2: test_reward: 33.482427 ± 0.000000, best_reward: 33.482427 ± 0.000000 in #2
Epoch #3: 10112it [00:04, 2442.15it/s, env_step=30336, len=0, loss=0.713, loss/clip=-0.000, loss/ent=0.913, loss/vf=1.445, n/ep=0, n/st=128, rew=0.00]
Epoch #3: test_reward: 35.236934 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3
Epoch #4: 10112it [00:04, 2508.40it/s, env_step=40448, len=0, loss=0.547, loss/clip=0.000, loss/ent=0.910, loss/vf=1.112, n/ep=0, n/st=128, rew=0.00]
Epoch #4: test_reward: 22.770667 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3
Epoch #5: 10112it [00:04, 2479.57it/s, env_step=50560, len=334, loss=0.476, loss/clip=0.000, loss/ent=0.911, loss/vf=0.970, n/ep=0, n/st=128, rew=54.33]
Epoch #5: test_reward: 29.205846 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3

Recently I upgraded to Tianshou 1.2, keeping the agent configuration the same, and observed a significant performance drop: the new version runs approximately 12 times slower, as shown below. I also tested this on Linux and observed the same results:

import tianshou, gymnasium as gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

1.2.0-dev 0.28.1 2.1.1+cu121 1.24.4 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] linux

Training using the Tianshou library, version 1.2:

Epoch #1: 10112it [00:59, 169.41it/s, env_episode=0, env_step=10112, gradient_step=158, len=0, n/ep=0, n/st=128, rew=0.00]
Epoch #2: 10112it [00:59, 171.18it/s, env_episode=0, env_step=20224, gradient_step=316, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #3: 10112it [00:59, 170.92it/s, env_episode=0, env_step=30336, gradient_step=474, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #4: 10112it [00:59, 171.19it/s, env_episode=0, env_step=40448, gradient_step=632, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #5: 10112it [00:59, 170.45it/s, env_episode=128, env_step=50560, gradient_step=790, len=1, n/ep=0, n/st=128, rew=41.83]          

Have there been changes to the library that impact execution performance, and can I restore previous performance levels through configuration adjustments?

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
    • design request (i.e. "X should be changed to Y.")
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment.
opcode81 (Collaborator) commented Nov 1, 2024

Since you are running on Windows: do you have an NVIDIA GPU that you expect to be used? If so, please check whether the GPU is indeed being used. Default CUDA support differs across torch versions (especially on Windows), so this is important to check.
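
A quick way to verify this is a check along the following lines (a minimal sketch; `policy` is just a placeholder for whatever policy object your script builds — Tianshou policies are torch.nn.Module subclasses, so their parameters carry a device):

import torch

# `policy` is a placeholder for the policy instance (e.g. the PPO policy) built in the training script.
print(next(policy.parameters()).device)  # should print "cuda:0" if the model really sits on the GPU
print(torch.cuda.is_available())         # whether this torch build can see a CUDA device at all
# While training is running, `nvidia-smi` (or the Task Manager on Windows) should additionally
# show non-trivial GPU utilization for the Python process.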

Also, are you using parallel environments? If so, which type of vectorization did you enable?

ULudo (Author) commented Nov 2, 2024

Yes, I have an NVIDIA graphics card available, and it is recognized by PyTorch on both Windows and Linux:

print(f"Device: {device}")
print(f"Tianshou version: {tianshou.__version__}")
print(f"Torch version: {torch.__version__} and Cuda available: {torch.cuda.is_available()}")

Windows Output

Device: cuda
Tianshou version: 1.2.0-dev
Torch version: 2.5.0 and Cuda available: True

Linux Output

Device: cuda
Tianshou version: 1.2.0-dev
Torch version: 2.1.1+cu121 and Cuda available: True

The GPU is actively used with Tianshou 0.5, and this should also apply to Tianshou 1.2. My training setup follows the API examples:

# Models (imports as in the Tianshou continuous-control examples)
from torch import nn
from tianshou.utils.net.common import ActorCritic, Net
from tianshou.utils.net.continuous import ActorProb, Critic

net = Net(
    state_shape,
    hidden_sizes=NETWORK_ARCHITECTURE,
    activation=nn.Tanh,
    device=device,
)
actor = ActorProb(
    net,
    action_shape,
    max_action=max_action,
    unbounded=True,
    device=device,
).to(device)
net_c = Net(
    state_shape,
    hidden_sizes=NETWORK_ARCHITECTURE,
    activation=nn.Tanh,
    device=device,
)
critic = Critic(net_c, device=device).to(device)
actor_critic = ActorCritic(actor, critic)

I tested both DummyVectorEnv and SubprocVectorEnv. Setting up training takes significantly longer with SubprocVectorEnv, which matches my experience with Tianshou 0.5. However, with Tianshou 0.5 the execution speed is very fast regardless of whether DummyVectorEnv or SubprocVectorEnv is used.

train_envs = DummyVectorEnv([make_train_env() for _ in range(NUM_TRAIN_ENVS)])
test_envs = DummyVectorEnv([make_test_env() for _ in range(NUM_TEST_ENVS)])
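
For reference, here is a minimal sketch for timing raw stepping through the vector env, which separates the cost of the environment itself from Tianshou's data handling (assumptions: Tianshou 1.x with its gymnasium-style 5-tuple step return, and "Pendulum-v1" as a stand-in for the custom environment):

import time

import gymnasium as gym
import numpy as np
from tianshou.env import DummyVectorEnv

num_envs = 8
envs = DummyVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(num_envs)])
envs.reset()
spaces = envs.get_env_attr("action_space")  # one action space per sub-environment

n_iters = 10_000
start = time.perf_counter()
for _ in range(n_iters):
    actions = np.stack([space.sample() for space in spaces])
    obs, rew, terminated, truncated, info = envs.step(actions)
    done_ids = np.where(np.logical_or(terminated, truncated))[0]
    if len(done_ids) > 0:
        envs.reset(done_ids)  # manually reset finished sub-environments
elapsed = time.perf_counter() - start
print(f"{n_iters * num_envs / elapsed:.0f} raw env-steps/s without any Tianshou data handling")

If this number is far above the it/s reported by the trainer, the bottleneck lies in the library rather than in the environment.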

MischaPanch (Collaborator) commented:

We will look into that asap, thanks for reporting!

opcode81 (Collaborator) commented Nov 7, 2024

I did a quick speed test, comparing 0.5.0 to 1.0.0 and the current development version (1.2.0-dev). I tested with the atari_ppo example, using the CPU, the Pong environment, and a single env.

While I did notice a slowdown, it is nowhere near the 12x slowdown you are describing; it is around 1.7x, which is still bad enough. We will look into the reasons for the slowdown by profiling the current implementation, but that may not explain why your task is affected so much more strongly. Perhaps your environment causes the functions that became slower to be called more frequently, but it's hard to say. We will try to restore the speed of the old implementation for the Atari case, and then you can check whether it helps for your use case as well, @ULudo.
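
For anyone who wants to reproduce the profiling locally, a minimal sketch using the standard-library profiler (here `run_training` is only a placeholder for whatever function builds the policy/collectors and runs the trainer, e.g. the main() of the atari_ppo example):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_training()  # placeholder: build the policy/collectors and run the trainer
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(30)  # show the 30 entries with the largest cumulative time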

rujialiu commented:

I'm in a similar situation. I used Tianshou 0.5 a year ago (though not for very long) and saved some logs showing that the training speed for most Atari games was >=500 it/s. Recently I upgraded my video card to a 3070 Ti and upgraded to Tianshou 1.2, but the training speed of PongNoFrameskip-v4 dropped all the way from 80 it/s to around 5 it/s. Interestingly, atari_ppo runs at a steady speed of about 200 it/s, but atari_dqn, atari_sac, and atari_rainbow are all extremely slow.

I haven't spent too much time on this yet. If I have more findings, I'll let you know.
Environment: Windows 10, torch 2.3+cu121 (I've manually upgraded gymnasium to 1.0 and ale-py to 0.10 and made a few changes).
