The dmc vision (task=dmc_walker_walk) has very bad performance #11

Open

LYK-love opened this issue May 16, 2024 · 6 comments

@LYK-love
Hello, I ran R2I with the command:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

and got very low scores. According to the DreamerV3 paper, it achieves a score above 900 on this task:
[image]

However, R2I only achieves a score below 200:
[image]

I think the hyperparameters are the same. In R2I's config.yaml, I see:

run.train_ratio: 512
  run.steps: 1e6
  rssm.deter: 512
  .*\.cnn_depth: 32
  .*\.layers: 2
  .*\.units: 512

This aligns with DreamerV3:
[image]

Can you explain why this happens? Maybe the SSM backbone is not as good as the GRU on this task, or maybe something is wrong with my hyperparameters?

artemZholus self-assigned this May 16, 2024
@artemZholus
Collaborator

This looks like a serious issue. I will take a look.

@artemZholus
Collaborator

Hi @LYK-love,

After a quick search, I found that the most likely reason is a mismatch in the SSM hyperparameters. For example, you are probably using a hidden size of 128 in each layer of the SSM. This is too small (we used 512 in the paper, if I remember correctly). There can be other mismatches too. I will be doing some reproducibility checks in the next few days and then get back to you. You can ask me anything here in the meantime.

@artemZholus
Collaborator

Please note that fixing the hidden size alone is not guaranteed to resolve everything, but you can try it in the meantime.
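
As a minimal sketch of how one might try that, reusing the override syntax already shown in this thread (--ssm_type, --rssm.hidden); whether 512 matches the exact paper setting is assumed, not confirmed here:

# Same launch command as above, with the SSM hidden size raised to 512
# (the value mentioned for the paper runs); adjust wdb_name/logdir as needed.
current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --rssm.hidden 512 \
    --wdb_name dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}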

@LYK-love
Author

I see the hidden attribute in the config file:

rssm: {deter: 4096, units: 1024, hidden: 128, stoch: 32, classes: 32,  ... }

Here you set hidden to 128. I also see that in the mmaze env you set it to 512:

mmaze:
  task: gym_memory_maze:MemoryMaze-9x9-v0
  ... 
  rssm.deter: 2048
  rssm.units: 1024
  .*\.cnn_depth: 48
  .*\.mlp_units: 400
  .*\.layers: 4
  .*\.mlp_layers: 4
  ssm.n_layers: 5
  rssm.hidden: 512

In that sense, hidden=512 should reproduce the score for mmaze.

Now I am trying to check whether R2I can match the original DreamerV3's performance with the GRU backbone. It should work, since you didn't change the architecture except for the backbone. My command is:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type gru \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

Can you tell me what else I should do to reach the DreamerV3 score (with the GRU backbone) on your codebase? One thing worth considering, as you mentioned, is the hidden size, since in DreamerV3 we have

rssm: {deter: 4096, units: 1024, stoch: 32, classes: 32, ... }

and there is no hidden attribute here.

@LYK-love
Author

LYK-love commented May 17, 2024

> Hi @LYK-love,
>
> After a quick search, I found that the most likely reason is a mismatch in the SSM hyperparameters. For example, you are probably using a hidden size of 128 in each layer of the SSM. This is too small (we used 512 in the paper, if I remember correctly). There can be other mismatches too. I will be doing some reproducibility checks in the next few days and then get back to you. You can ask me anything here in the meantime.

When I use mimo as the backbone and set hidden=512, i.e., I use the command:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo --rssm.hidden 512 \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

I got

Wrote checkpoint: logs/dmc_original_20240517-045044/checkpoint.ckpt
Start training loop.
Tracing policy function.
Tracing policy function.
Tracing train function.
2024-05-17 04:55:19.959407: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 0: 9912 vs 8192
2024-05-17 04:55:19.959500: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 1: 9928 vs 8192
2024-05-17 04:55:19.959512: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 2: 9920 vs 8192
2024-05-17 04:55:19.959521: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 3: 9920 vs 8192
2024-05-17 04:55:19.959531: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 4: 10224 vs 8192
2024-05-17 04:55:19.959540: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 5: 10240 vs 8192
2024-05-17 04:55:19.959549: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 6: 10240 vs 8192
2024-05-17 04:55:19.959559: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 7: 10232 vs 8192
2024-05-17 04:55:19.959568: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 8: 10232 vs 8192
2024-05-17 04:55:19.959577: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 9: 10248 vs 8192
2024-05-17 04:55:19.959593: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:681] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f16[32,4,4,4]{3,2,1,0}, u8[0]{0}) custom-call(f16[4096,64,64,4]{3,2,1,0}, f16[4096,32,32,32]{3,2,1,0}), window={size=4x4 stride=2x2 pad=1_1x1_1}, dim_labels=b01f_o01i->b01f, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0} for eng39{k2=11,k6=1,k12=96,k13=1,k14=0,k15=0,k17=97,k22=3} vs eng20{k2=6,k3=0}
2024-05-17 04:55:19.959617: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:251] Device: NVIDIA RTX A6000
2024-05-17 04:55:19.959626: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:252] Platform: Compute Capability 8.6
2024-05-17 04:55:19.959639: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:253] Driver: 12040 (550.54.15)
2024-05-17 04:55:19.959649: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:254] Runtime: <undefined>
2024-05-17 04:55:19.959662: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:261] cudnn version: 8.9.6
2024-05-17 04:55:21.160854: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 0: 2324 vs 2048
2024-05-17 04:55:21.160909: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 1: 2322 vs 2048
2024-05-17 04:55:21.160923: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 2: 2324 vs 2048
2024-05-17 04:55:21.160930: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 3: 2320 vs 2048
2024-05-17 04:55:21.160936: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 4: 2328 vs 2048
2024-05-17 04:55:21.160947: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 5: 2328 vs 2048
2024-05-17 04:55:21.160954: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 6: 2322 vs 2048
2024-05-17 04:55:21.160965: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 7: 2320 vs 2048
2024-05-17 04:55:21.160984: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 8: 2320 vs 2048
2024-05-17 04:55:21.160995: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 9: 2322 vs 2048
2024-05-17 04:55:21.161013: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:681] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f16[64,4,4,32]{3,2,1,0}, u8[0]{0}) custom-call(f16[4096,32,32,32]{3,2,1,0}, f16[4096,16,16,64]{3,2,1,0}), window={size=4x4 stride=2x2 pad=1_1x1_1}, dim_labels=b01f_o01i->b01f, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0} for eng10{k2=7,k12=23,k13=0,k14=4,k15=1,k17=24,k18=1,k23=0} vs eng20{k2=6,k3=0}
2024-05-17 04:55:21.161028: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:251] Device: NVIDIA RTX A6000
2024-05-17 04:55:21.161035: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:252] Platform: Compute Capability 8.6
2024-05-17 04:55:21.161047: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:253] Driver: 12040 (550.54.15)
2024-05-17 04:55:21.161052: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:254] Runtime: <undefined>
2024-05-17 04:55:21.161067: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:261] cudnn version: 8.9.6
Tracing report function.

After that, the program is completely stuck. Is this normal?

@wnnng

wnnng commented Jun 18, 2024

Hey, I don't know if this is still a problem, but I think it is related to XLA/JAX and is also discussed in the original dreamerv3 repo.
danijar/dreamerv3#126

Maybe the solution there works here as well.
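
For reference, one workaround commonly suggested for these cuDNN/XLA "results mismatch" messages (an assumption here, not something confirmed in this thread) is to lower XLA's GPU autotuning level via the XLA_FLAGS environment variable before launching training, for example:

# Hedged sketch: turning off XLA's GPU autotuning skips the algorithm
# cross-checking that emits the buffer_comparator errors above. Whether
# it also resolves the subsequent hang is an assumption.
export XLA_FLAGS="--xla_gpu_autotune_level=0"

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --rssm.hidden 512 \
    --wdb_name dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}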
