The dmc vision (task=dmc_walker_walk) has very bad performance #11

Open

LYK-love opened this issue May 16, 2024 · 6 comments

@LYK-love
Hello, I ran R2I with the command:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

and got very low scores. According to the DreamerV3 paper, it achieves a score above 900 on this task:
[image]

However, R2I only achieves a score below 200:
[image]

I think the hyperparameters are the same. In R2I's config.yaml, I see:

run.train_ratio: 512
  run.steps: 1e6
  rssm.deter: 512
  .*\.cnn_depth: 32
  .*\.layers: 2
  .*\.units: 512

This aligns with DreamerV3:
[image]

Can you explain why this happens? Maybe the SSM backbone is not as good as the GRU on this task, or maybe something is wrong with my hyperparameters?

artemZholus self-assigned this May 16, 2024
@artemZholus
Collaborator

This looks like a serious issue. I will take a look.

@artemZholus
Collaborator

Hi @LYK-love,

After a quick search, I found that the most likely reason is a mismatch in the SSM hyperparameters. For example, you are probably using a hidden size of 128 in each layer of the SSM. This is too small (we used 512 in the paper, if I remember correctly). There can be other mismatches too. I will be doing some reproducibility checks in the next few days and then get back to you. You can ask me anything here in the meantime.

@artemZholus
Collaborator

Please note that fixing the hidden size alone is not guaranteed to resolve everything, but you can try it in the meantime.
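
As a minimal sketch of how one might try that, reusing the override syntax already shown in this thread (--ssm_type, --rssm.hidden); whether 512 matches the exact paper setting is assumed, not confirmed here:

# Same launch command as above, with the SSM hidden size raised to 512
# (the value mentioned for the paper runs); adjust wdb_name/logdir as needed.
current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --rssm.hidden 512 \
    --wdb_name dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}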

@LYK-love
Author

I see the hidden attribute in the config file:

rssm: {deter: 4096, units: 1024, hidden: 128, stoch: 32, classes: 32,  ... }

Here you set hidden to 128. I also see that in the mmaze env you set it to 512:

mmaze:
  task: gym_memory_maze:MemoryMaze-9x9-v0
  ... 
  rssm.deter: 2048
  rssm.units: 1024
  .*\.cnn_depth: 48
  .*\.mlp_units: 400
  .*\.layers: 4
  .*\.mlp_layers: 4
  ssm.n_layers: 5
  rssm.hidden: 512

In that sense, hidden=512 should reproduce the score for mmaze.

Now I am trying to check whether R2I can match the original DreamerV3's performance with the GRU backbone. It should work, since you didn't change the architecture except for the backbone. My command is:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type gru \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

Can you tell me what else I should do to reach the DreamerV3 score (with the GRU backbone) on your codebase? One thing worth considering, as you mentioned, is the hidden size, since in DreamerV3 we have

rssm: {deter: 4096, units: 1024, stoch: 32, classes: 32, ... }

and there is no hidden attribute here.

@LYK-love
Author

LYK-love commented May 17, 2024

> Hi @LYK-love,
>
> After a quick search, I found that the most likely reason is a mismatch in the SSM hyperparameters. For example, you are probably using a hidden size of 128 in each layer of the SSM. This is too small (we used 512 in the paper, if I remember correctly). There can be other mismatches too. I will be doing some reproducibility checks in the next few days and then get back to you. You can ask me anything here in the meantime.

When I use mimo as the backbone and set hidden=512, i.e., I use the command:

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo --rssm.hidden 512 \
    --wdb_name  dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}

I got

Wrote checkpoint: logs/dmc_original_20240517-045044/checkpoint.ckpt
Start training loop.
Tracing policy function.
Tracing policy function.
Tracing train function.
2024-05-17 04:55:19.959407: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 0: 9912 vs 8192
2024-05-17 04:55:19.959500: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 1: 9928 vs 8192
2024-05-17 04:55:19.959512: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 2: 9920 vs 8192
2024-05-17 04:55:19.959521: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 3: 9920 vs 8192
2024-05-17 04:55:19.959531: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 4: 10224 vs 8192
2024-05-17 04:55:19.959540: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 5: 10240 vs 8192
2024-05-17 04:55:19.959549: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 6: 10240 vs 8192
2024-05-17 04:55:19.959559: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 7: 10232 vs 8192
2024-05-17 04:55:19.959568: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 8: 10232 vs 8192
2024-05-17 04:55:19.959577: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 9: 10248 vs 8192
2024-05-17 04:55:19.959593: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:681] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f16[32,4,4,4]{3,2,1,0}, u8[0]{0}) custom-call(f16[4096,64,64,4]{3,2,1,0}, f16[4096,32,32,32]{3,2,1,0}), window={size=4x4 stride=2x2 pad=1_1x1_1}, dim_labels=b01f_o01i->b01f, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0} for eng39{k2=11,k6=1,k12=96,k13=1,k14=0,k15=0,k17=97,k22=3} vs eng20{k2=6,k3=0}
2024-05-17 04:55:19.959617: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:251] Device: NVIDIA RTX A6000
2024-05-17 04:55:19.959626: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:252] Platform: Compute Capability 8.6
2024-05-17 04:55:19.959639: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:253] Driver: 12040 (550.54.15)
2024-05-17 04:55:19.959649: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:254] Runtime: <undefined>
2024-05-17 04:55:19.959662: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:261] cudnn version: 8.9.6
2024-05-17 04:55:21.160854: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 0: 2324 vs 2048
2024-05-17 04:55:21.160909: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 1: 2322 vs 2048
2024-05-17 04:55:21.160923: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 2: 2324 vs 2048
2024-05-17 04:55:21.160930: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 3: 2320 vs 2048
2024-05-17 04:55:21.160936: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 4: 2328 vs 2048
2024-05-17 04:55:21.160947: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 5: 2328 vs 2048
2024-05-17 04:55:21.160954: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 6: 2322 vs 2048
2024-05-17 04:55:21.160965: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 7: 2320 vs 2048
2024-05-17 04:55:21.160984: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 8: 2320 vs 2048
2024-05-17 04:55:21.160995: E external/xla/xla/service/gpu/buffer_comparator.cc:731] Difference at 9: 2322 vs 2048
2024-05-17 04:55:21.161013: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:681] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f16[64,4,4,32]{3,2,1,0}, u8[0]{0}) custom-call(f16[4096,32,32,32]{3,2,1,0}, f16[4096,16,16,64]{3,2,1,0}), window={size=4x4 stride=2x2 pad=1_1x1_1}, dim_labels=b01f_o01i->b01f, custom_call_target="__cudnn$convBackwardFilter", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0} for eng10{k2=7,k12=23,k13=0,k14=4,k15=1,k17=24,k18=1,k23=0} vs eng20{k2=6,k3=0}
2024-05-17 04:55:21.161028: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:251] Device: NVIDIA RTX A6000
2024-05-17 04:55:21.161035: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:252] Platform: Compute Capability 8.6
2024-05-17 04:55:21.161047: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:253] Driver: 12040 (550.54.15)
2024-05-17 04:55:21.161052: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:254] Runtime: <undefined>
2024-05-17 04:55:21.161067: E external/xla/xla/service/gpu/gpu_conv_algorithm_picker.cc:261] cudnn version: 8.9.6
Tracing report function.

After that, the program is completely stuck. Is this normal?

@wnnng

wnnng commented Jun 18, 2024

Hey, I don't know if this is still a problem, but I think it is related to XLA/JAX and is also discussed in the original dreamerv3 repo.
danijar/dreamerv3#126

Maybe the solution there works here as well.
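
For reference, one workaround commonly suggested for these cuDNN/XLA "results mismatch" messages (an assumption here, not something confirmed in this thread) is to lower XLA's GPU autotuning level via the XLA_FLAGS environment variable before launching training, for example:

# Hedged sketch: turning off XLA's GPU autotuning skips the algorithm
# cross-checking that emits the buffer_comparator errors above. Whether
# it also resolves the subsequent hang is an assumption.
export XLA_FLAGS="--xla_gpu_autotune_level=0"

current_date=$(date "+%Y%m%d-%H%M%S")
python recall2imagine/train.py \
    --configs dmc_vision \
    --ssm_type mimo \
    --rssm.hidden 512 \
    --wdb_name dmc_original_${current_date} \
    --logdir ./logs/dmc_original_${current_date}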
