Batch evaluation script #92

Open · wants to merge 4 commits into master

Conversation

jjbuschhoff

Sbatch script that performs evaluation on a given set of tasks for a given collection of model checkpoints, using the Megatron-LM client-server inference solution.

Outdated review comments (resolved) on scripts/sbatch_eval/readme.md and scripts/sbatch_eval/run_megatron_server_client.sbatch.
@janEbert left a comment

Multiple paths in the sbatch script are not properly quoted and can cause issues if they include spaces, for example. For safety, it would be preferable to correctly quote all variable expansions in Bash scripts.
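
A minimal sketch of the quoting issue, using a made-up variable and path rather than anything from the actual sbatch script:

    # Hypothetical path, chosen only to demonstrate the problem.
    CHECKPOINT_DIR="/p/project/checkpoints/iter 10000"

    # Unquoted: the shell word-splits on the space and passes two arguments.
    ls $CHECKPOINT_DIR

    # Quoted: the path stays intact as a single argument.
    ls "$CHECKPOINT_DIR"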

Also, the whole argument handling portion of run_server_no_opt.py is way too manual; there should be more abstraction here. For example, many command-line flags are already missing. Continued experimentation would mean that every single change that adds arguments would also need to add them here manually, lest they be ignored.

I noted lots of things that would need some love, but in general this is super useful and I'm very happy you created the script!

Outdated review comments (resolved) on scripts/sbatch_eval/run_server_no_opt.py and scripts/sbatch_eval/run_megatron_server_client.sbatch.
Comment on lines +75 to +77
# Not enabling --use-flash-attn during inference as advised
# if args.use_flash_attn:
# sys.argv.append("--use-flash-attn")

I'd remove the comment above, since it is confusing, and uncomment these lines so that FlashAttention is activated if args.bf16 or args.fp16 is set. FlashAttention is an exact algorithm, so we do not gain anything by deactivating it.
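
A sketch of what the uncommented lines might look like inside run_server_no_opt.py, gated on the precision flags as suggested (this assumes the args namespace and the sys.argv handling already present in that script):

    # Within run_server_no_opt.py, where sys.argv is already being assembled:
    if args.bf16 or args.fp16:
        # FlashAttention is exact, so enabling it affects speed, not results.
        sys.argv.append("--use-flash-attn")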

Another outdated review comment (resolved) on scripts/sbatch_eval/run_megatron_server_client.sbatch.
@jjbuschhoff (author) commented on Sep 27, 2023

Also, the whole argument handling portion of run_server_no_opt.py is way too manual; there should be more abstraction here. For example, many command-line flags are already missing. Continued experimentation would mean that every single change that adds arguments would also need to add them here manually, lest they be ignored.

I agree. I had a look into this and it seems possible to call run_text_generation_server.py --use_checkpoint_args directly; however, this only sets some of the hyperparameters from the checkpoint (see load_args_from_checkpoint in megatron.checkpointing), likely those that are necessary for training. The checkpoint_args are returned but unused in megatron.initialize.initialize_megatron(). I'm looking into a way to merge them that doesn't lead to validate_args failing.
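
As a rough illustration of the kind of merge I have in mind (entirely hypothetical; merge_checkpoint_args and the key list below are made up for this sketch and are not part of Megatron-LM), one could copy a whitelisted set of hyperparameters from the checkpoint namespace onto the runtime namespace before validate_args runs:

    # Hypothetical helper, not Megatron-LM API: copy selected hyperparameters
    # from the checkpoint args onto the runtime args so that validate_args
    # still sees a consistent configuration afterwards.
    def merge_checkpoint_args(args, checkpoint_args, keys):
        for key in keys:
            ckpt_value = getattr(checkpoint_args, key, None)
            if ckpt_value is not None:
                setattr(args, key, ckpt_value)
        return args

    # Illustrative whitelist: architecture hyperparameters that must match
    # the checkpoint for inference to work.
    ARCH_KEYS = ["num_layers", "hidden_size", "num_attention_heads", "seq_length"]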

@janEbert

That's a great find!
