Major Performance Decrease in Tianshou 1.2 Compared to 0.5 on Windows and Linux #1225

Open
ULudo opened this issue Oct 30, 2024 · 5 comments

ULudo commented Oct 30, 2024

Hello,

I used Tianshou 0.5 with a custom environment on a Windows PC. I was impressed by the training speed of the PPO agent, which exceeded 2000 iterations per second.

import tianshou, gymnasium as gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

0.5.0 0.26.3 2.5.1 1.26.4 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:17:14) [MSC v.1941 64 bit (AMD64)] win32

Training using the Tianshou library, version 0.5:

Epoch #1: 10112it [00:03, 2610.28it/s, env_step=10112, len=0, loss=0.303, loss/clip=0.000, loss/ent=0.916, loss/vf=0.623, n/ep=0, n/st=128, rew=0.00]
Epoch #1: test_reward: 27.403868 ± 0.000000, best_reward: 27.403868 ± 0.000000 in #1
Epoch #2: 10112it [00:03, 2634.27it/s, env_step=20224, len=0, loss=0.500, loss/clip=0.000, loss/ent=0.915, loss/vf=1.018, n/ep=0, n/st=128, rew=0.00]
Epoch #2: test_reward: 33.482427 ± 0.000000, best_reward: 33.482427 ± 0.000000 in #2
Epoch #3: 10112it [00:04, 2442.15it/s, env_step=30336, len=0, loss=0.713, loss/clip=-0.000, loss/ent=0.913, loss/vf=1.445, n/ep=0, n/st=128, rew=0.00]
Epoch #3: test_reward: 35.236934 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3
Epoch #4: 10112it [00:04, 2508.40it/s, env_step=40448, len=0, loss=0.547, loss/clip=0.000, loss/ent=0.910, loss/vf=1.112, n/ep=0, n/st=128, rew=0.00]
Epoch #4: test_reward: 22.770667 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3
Epoch #5: 10112it [00:04, 2479.57it/s, env_step=50560, len=334, loss=0.476, loss/clip=0.000, loss/ent=0.911, loss/vf=0.970, n/ep=0, n/st=128, rew=54.33]
Epoch #5: test_reward: 29.205846 ± 0.000000, best_reward: 35.236934 ± 0.000000 in #3

Recently I upgraded to Tianshou 1.2, keeping the agent configuration the same, and observed a significant performance drop: the new version runs approximately 12 times slower, as shown below. I also tested this on Linux and observed the same results:

import tianshou, gymnasium as gym, torch, numpy, sys
print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

1.2.0-dev 0.28.1 2.1.1+cu121 1.24.4 3.11.10 (main, Sep 7 2024, 18:35:41) [GCC 11.4.0] linux

Training using the Tianshou library, version 1.2:

Epoch #1: 10112it [00:59, 169.41it/s, env_episode=0, env_step=10112, gradient_step=158, len=0, n/ep=0, n/st=128, rew=0.00]
Epoch #2: 10112it [00:59, 171.18it/s, env_episode=0, env_step=20224, gradient_step=316, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #3: 10112it [00:59, 170.92it/s, env_episode=0, env_step=30336, gradient_step=474, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #4: 10112it [00:59, 171.19it/s, env_episode=0, env_step=40448, gradient_step=632, len=0, n/ep=0, n/st=128, rew=0.00]             
Epoch #5: 10112it [00:59, 170.45it/s, env_episode=128, env_step=50560, gradient_step=790, len=1, n/ep=0, n/st=128, rew=41.83]          

Have there been changes to the library that impact execution performance, and can I restore previous performance levels through configuration adjustments?

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
    • design request (i.e. "X should be changed to Y.")
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment.
opcode81 (Collaborator) commented Nov 1, 2024

Since you are running on Windows: do you have an NVIDIA GPU that you expect to be used? If so, please check whether the GPU is indeed being used. Default CUDA support differs across torch versions (especially on Windows), so this is important to check.
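
A quick way to verify this is a check along the following lines (a minimal sketch; `policy` is just a placeholder for whatever policy object your script builds — Tianshou policies are torch.nn.Module subclasses, so their parameters carry a device):

import torch

# `policy` is a placeholder for the policy instance (e.g. the PPO policy) built in the training script.
print(next(policy.parameters()).device)  # should print "cuda:0" if the model really sits on the GPU
print(torch.cuda.is_available())         # whether this torch build can see a CUDA device at all
# While training is running, `nvidia-smi` (or the Task Manager on Windows) should additionally
# show non-trivial GPU utilization for the Python process.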

Also, are you using parallel environments? If so, which type of vectorization did you enable?

ULudo (Author) commented Nov 2, 2024

Yes, I have an NVIDIA graphics card available, and it is recognized by PyTorch on both Windows and Linux:

print(f"Device: {device}")
print(f"Tianshou version: {tianshou.__version__}")
print(f"Torch version: {torch.__version__} and Cuda available: {torch.cuda.is_available()}")

Windows Output

Device: cuda
Tianshou version: 1.2.0-dev
Torch version: 2.5.0 and Cuda available: True

Linux Output

Device: cuda
Tianshou version: 1.2.0-dev
Torch version: 2.1.1+cu121 and Cuda available: True

The GPU is actively used with Tianshou 0.5, and this should also apply to Tianshou 1.2. My training setup follows the API examples:

# Models (imports as in the Tianshou continuous-control examples)
from torch import nn
from tianshou.utils.net.common import ActorCritic, Net
from tianshou.utils.net.continuous import ActorProb, Critic

net = Net(
    state_shape,
    hidden_sizes=NETWORK_ARCHITECTURE,
    activation=nn.Tanh,
    device=device,
)
actor = ActorProb(
    net,
    action_shape,
    max_action=max_action,
    unbounded=True,
    device=device,
).to(device)
net_c = Net(
    state_shape,
    hidden_sizes=NETWORK_ARCHITECTURE,
    activation=nn.Tanh,
    device=device,
)
critic = Critic(net_c, device=device).to(device)
actor_critic = ActorCritic(actor, critic)

I tested both DummyVectorEnv and SubprocVectorEnv. Setting up training takes significantly longer with SubprocVectorEnv, which matches my experience with Tianshou 0.5. However, with Tianshou 0.5 the execution speed is very fast regardless of whether DummyVectorEnv or SubprocVectorEnv is used.

train_envs = DummyVectorEnv([make_train_env() for _ in range(NUM_TRAIN_ENVS)])
test_envs = DummyVectorEnv([make_test_env() for _ in range(NUM_TEST_ENVS)])
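
For reference, here is a minimal sketch for timing raw stepping through the vector env, which separates the cost of the environment itself from Tianshou's data handling (assumptions: Tianshou 1.x with its gymnasium-style 5-tuple step return, and "Pendulum-v1" as a stand-in for the custom environment):

import time

import gymnasium as gym
import numpy as np
from tianshou.env import DummyVectorEnv

num_envs = 8
envs = DummyVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(num_envs)])
envs.reset()
spaces = envs.get_env_attr("action_space")  # one action space per sub-environment

n_iters = 10_000
start = time.perf_counter()
for _ in range(n_iters):
    actions = np.stack([space.sample() for space in spaces])
    obs, rew, terminated, truncated, info = envs.step(actions)
    done_ids = np.where(np.logical_or(terminated, truncated))[0]
    if len(done_ids) > 0:
        envs.reset(done_ids)  # manually reset finished sub-environments
elapsed = time.perf_counter() - start
print(f"{n_iters * num_envs / elapsed:.0f} raw env-steps/s without any Tianshou data handling")

If this number is far above the it/s reported by the trainer, the bottleneck lies in the library rather than in the environment.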

MischaPanch (Collaborator) commented:

We will look into that asap, thanks for reporting!

opcode81 (Collaborator) commented Nov 7, 2024

I did a quick speed test, comparing 0.5.0 to 1.0.0 and the current development version (1.2.0-dev). I tested with the atari_ppo example, using the CPU, the Pong environment, and a single env.

While I did notice a slowdown, it is nowhere near the 12x slowdown you are describing; it is around 1.7x, which is still bad enough. We will look into the reasons for the slowdown by profiling the current implementation, but that may not explain why your task is affected so much more strongly. Perhaps your environment causes the functions that became slower to be called more frequently, but it's hard to say. We will try to restore the speed of the old implementation for the Atari case, and then you can check whether it helps for your use case as well, @ULudo.
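
For anyone who wants to reproduce the profiling locally, a minimal sketch using the standard-library profiler (here `run_training` is only a placeholder for whatever function builds the policy/collectors and runs the trainer, e.g. the main() of the atari_ppo example):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_training()  # placeholder: build the policy/collectors and run the trainer
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(30)  # show the 30 entries with the largest cumulative time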

rujialiu commented:

I'm in a similar situation. I used Tianshou 0.5 a year ago (though not for very long) and saved some logs showing that the training speed for most Atari games was >=500 it/s. Recently I upgraded my video card to a 3070 Ti and upgraded to Tianshou 1.2, but the training speed of PongNoFrameskip-v4 dropped all the way from 80 it/s to around 5 it/s. Interestingly, atari_ppo runs at a steady speed of about 200 it/s, but atari_dqn, atari_sac, and atari_rainbow are all extremely slow.

I haven't spent too much time on this yet. If I have more findings, I'll let you know.
Environment: Windows 10, torch 2.3+cu121 (I've manually upgraded gymnasium to 1.0 and ale-py to 0.10 and made a few changes).
