Batch evaluation script #92

jjbuschhoff · 2023-09-12T14:47:28Z

An sbatch script that performs evaluation on a given set of tasks for a given collection of model checkpoints, using the Megatron-LM client-server inference solution.

scripts/sbatch_eval/readme.md

scripts/sbatch_eval/run_megatron_server_client.sbatch

janEbert

Multiple paths in the sbatch script are not properly quoted, which can cause issues if they include spaces, for example. To be safe, all variable expansions in Bash scripts should be correctly quoted.
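To illustrate the word-splitting problem, here is a minimal sketch that uses Python's `shlex` module, which follows POSIX shell splitting rules. The path is hypothetical and does not come from the sbatch script; the comments map each case to the corresponding Bash expansion.

```python
import shlex

# Hypothetical checkpoint path containing a space (not taken from the script).
checkpoint_dir = "/data/my project/checkpoints"

# Unquoted expansion, like `ls $CHECKPOINT_DIR` in Bash:
# the path is split into two separate arguments.
unquoted = shlex.split(f"ls {checkpoint_dir}")

# Quoted expansion, like `ls "$CHECKPOINT_DIR"` in Bash:
# the path stays a single argument.
quoted = shlex.split(f"ls {shlex.quote(checkpoint_dir)}")

print(unquoted)  # ['ls', '/data/my', 'project/checkpoints']
print(quoted)    # ['ls', '/data/my project/checkpoints']
```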

Also, the whole argument handling portion of run_server_no_opt.py is way too manual. There should be more abstraction here. For example, many command line flags are already missing. Continued experimentation would imply that every single change that adds arguments would also need to manually add them here, lest they be ignored.
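One possible shape for such an abstraction, as a rough sketch only: convert every attribute of an argument namespace into command-line flags generically, so that newly added arguments are forwarded without touching the script. The helper name `namespace_to_argv` and the example values are hypothetical, and the sketch assumes that each flag is the attribute name with underscores replaced by dashes and that booleans are `store_true` flags; arguments that break these assumptions would still need special handling.

```python
import argparse


def namespace_to_argv(args, skip=()):
    """Convert an argparse.Namespace into a list of command-line flags.

    Assumes flag names are attribute names with "_" replaced by "-",
    and that boolean attributes correspond to store_true flags.
    """
    argv = []
    for name, value in sorted(vars(args).items()):
        # Unset values and disabled store_true flags produce no output.
        if name in skip or value is None or value is False:
            continue
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif isinstance(value, (list, tuple)):
            # Multi-value arguments: the flag followed by each value.
            argv.append(flag)
            argv.extend(str(v) for v in value)
        else:
            argv.extend([flag, str(value)])
    return argv


# Hypothetical checkpoint arguments, for illustration only.
example = argparse.Namespace(
    bf16=True, fp16=False, num_layers=24, load=None, seq_length=2048
)
print(namespace_to_argv(example))
# ['--bf16', '--num-layers', '24', '--seq-length', '2048']
```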

I noted lots of things that would need some love, but in general this is super useful and I'm very happy you created the script!

scripts/sbatch_eval/run_server_no_opt.py

scripts/sbatch_eval/run_megatron_server_client.sbatch

janEbert · 2023-09-18T16:10:30Z

scripts/sbatch_eval/run_server_no_opt.py

+    # Not enabling --use-flash-attn during inference as advised
+    # if args.use_flash_attn:
+    #     sys.argv.append("--use-flash-attn")


I'd remove the comment above, since it is confusing, and uncomment these lines so that FlashAttention is activated when args.bf16 or args.fp16 is set. FlashAttention is an exact algorithm (it computes the same attention output rather than an approximation), so we do not gain anything from deactivating it.
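As a sketch of what the uncommented logic could look like: the diff does not show how the surrounding code checks the precision, so combining both conditions in one helper is an assumption, and the helper name `maybe_enable_flash_attn` is hypothetical.

```python
from types import SimpleNamespace


def maybe_enable_flash_attn(args, argv):
    """Append --use-flash-attn if it was requested and half precision is used."""
    half_precision = getattr(args, "bf16", False) or getattr(args, "fp16", False)
    if getattr(args, "use_flash_attn", False) and half_precision:
        argv.append("--use-flash-attn")
    return argv


# Stand-ins for parsed arguments, for illustration only.
bf16_args = SimpleNamespace(use_flash_attn=True, bf16=True, fp16=False)
fp32_args = SimpleNamespace(use_flash_attn=True, bf16=False, fp16=False)

print(maybe_enable_flash_attn(bf16_args, []))  # ['--use-flash-attn']
print(maybe_enable_flash_attn(fp32_args, []))  # []
```

In the script itself, `argv` would be `sys.argv`, matching the `sys.argv.append` call in the diff.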

scripts/sbatch_eval/run_megatron_server_client.sbatch

jjbuschhoff · 2023-09-27T12:28:45Z

> Also, the whole argument handling portion of run_server_no_opt.py is way too manual. There should be more abstraction here. For example, many command line flags are already missing. Continued experimentation would imply that every single change that adds arguments would also need to manually add them here, lest they be ignored.

I agree. I had a look into this, and it seems that it is possible to call run_text_generation_server.py --use_checkpoint_args directly. However, this only sets some of the hyperparameters as per the checkpoint (see load_args_from_checkpoint in megatron.checkpointing), likely those that are necessary for training. The checkpoint_args are returned but unused in megatron.initialize.initialize_megatron(). I'm looking into a way to merge them that doesn't lead to validate_args failing.
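A rough sketch of what such a merge could look like, independent of Megatron-LM: plain namespaces stand in for the parsed arguments and for checkpoint_args, and the helper `merge_checkpoint_args`, the `defaults` mapping, and the `protected` names are all hypothetical. The idea is to take a checkpoint value only where the user left the default, and to never overwrite runtime settings that should not come from the checkpoint. Whether the result passes validate_args is not addressed here.

```python
import argparse


def merge_checkpoint_args(args, checkpoint_args, defaults, protected=()):
    """Return a copy of args with unset values filled in from checkpoint_args.

    A value counts as user-set if it exists in args and differs from its
    default. Names listed in `protected` are never taken from the checkpoint.
    """
    merged = argparse.Namespace(**vars(args))
    for name, ckpt_value in vars(checkpoint_args).items():
        if name in protected:
            continue
        user_set = hasattr(merged, name) and getattr(merged, name) != defaults.get(name)
        if not user_set:
            setattr(merged, name, ckpt_value)
    return merged


# Hypothetical values, for illustration only.
cli_args = argparse.Namespace(
    num_layers=None, micro_batch_size=1, tensor_model_parallel_size=2
)
ckpt_args = argparse.Namespace(
    num_layers=24, micro_batch_size=8, tensor_model_parallel_size=4, swiglu=True
)
defaults = {"num_layers": None, "micro_batch_size": None, "tensor_model_parallel_size": 1}

merged = merge_checkpoint_args(
    cli_args, ckpt_args, defaults, protected=("micro_batch_size",)
)
print(merged.num_layers)                  # 24 (filled from the checkpoint)
print(merged.micro_batch_size)            # 1 (protected runtime setting)
print(merged.tensor_model_parallel_size)  # 2 (explicitly set by the user)
print(merged.swiglu)                      # True (only present in the checkpoint)
```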

janEbert · 2023-09-27T13:50:49Z

That's a great find!

jjbuschhoff added 2 commits September 12, 2023 16:00

Base batch eval script

836cbd0

changed default partition to booster

85faf2d

janEbert requested changes Sep 15, 2023


janEbert requested changes Sep 18, 2023


Implemented wrapper and other small changes

e97ced5

fix formatting for pre-commit

c66aac5
