Create an eval-only script for existing ckpts #736
base: main
Conversation
olmo/train.py
Outdated
- if wandb.run is not None:
-     wandb.finish(exit_code=exit_code, quiet=True)
+ # if wandb.run is not None:
+ #     wandb.finish(exit_code=exit_code, quiet=True)
Debug code?
will fix this
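A minimal sketch of that fix, assuming it simply restores the cleanup that was commented out in olmo/train.py (`exit_code` comes from the surrounding training loop):

```python
import wandb

# Restore the run cleanup that was commented out during debugging.
if wandb.run is not None:
    wandb.finish(exit_code=exit_code, quiet=True)
```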
scripts/eval.py
Outdated
# train_loader = build_train_dataloader(cfg)
train_loader = None
Is this always going to be None? If so, we don't need it.
scripts/eval.py
Outdated
if 'step' in cfg.load_path.split('/')[-1]:
    load_paths = [cfg.load_path]
else:
    # This globbing does not work with remote paths.
How is that problem handled then?
Not handled. I will assume the checkpoints are on WEKA.
At least throw an exception then.
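For illustration, a sketch of one way to combine the single-checkpoint case, a remote-path guard, and local globbing (the step* pattern and the sorting are assumptions, not the PR's final code; cfg and OLMoConfigurationError come from the surrounding script):

```python
from pathlib import Path

if "step" in cfg.load_path.split("/")[-1]:
    # A single checkpoint directory was passed directly.
    load_paths = [cfg.load_path]
elif "://" in cfg.load_path:
    # pathlib cannot glob remote schemes such as s3:// or gs://.
    raise OLMoConfigurationError("Eval does not support remote paths.")
else:
    # Enumerate every step* checkpoint under a local or WEKA-mounted directory.
    load_paths = sorted(str(p) for p in Path(cfg.load_path).glob("step*"))
```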
scripts/eval.py
Outdated
log.info(f"Number of non-embedding parameters: {olmo_model.num_params(include_embedding=False):,d}") | ||
log.info(f"Peak GPU Memory (MB) before {cfg.distributed_strategy}: {int(peak_gpu_memory() or 0)}") | ||
|
||
olmo_model.set_activation_checkpointing(cfg.activation_checkpointing) |
If we only ever eval, we don't need this.
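One possible shape for that change, assuming a hypothetical eval-only flag; activation checkpointing only saves memory for the backward pass, so pure inference never needs it:

```python
# `cfg.eval_only` is an assumed flag, not the PR's actual API.
if not cfg.eval_only:
    olmo_model.set_activation_checkpointing(cfg.activation_checkpointing)
```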
optim = build_optimizer(cfg, dist_model)
scheduler = build_scheduler(cfg)
We don't need optimizers and schedulers if we're just evaluating.
So you're creating these only so that you can produce a Trainer object?
How hard is it to pull the stuff you need out of the Trainer object, so we don't have to do so many things we don't need? It makes me particularly uncomfortable that you're creating a trainer with a None data loader, which isn't supposed to work. It just happens to work.
The Trainer class has too many precious helper functions and it's kinda dumb to unroll them all. I do wanna keep at least a dummy Trainer object. Let me see if I can create it w/o the optim/scheduler/etc stuff.
I think you might find that you don't need most of that stuff when you're doing inference only.
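To illustrate the point, a minimal inference-only loop in plain PyTorch, a sketch of the general idea rather than OLMo's actual Trainer API (the function name, batch keys, and loss are assumptions):

```python
import torch
import torch.nn.functional as F

def evaluate(model: torch.nn.Module, loader) -> float:
    """Eval-only pass: no optimizer, scheduler, or gradient state needed."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.inference_mode():  # skips autograd bookkeeping entirely
        for batch in loader:
            logits = model(batch["input_ids"])
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), batch["labels"].view(-1)
            )
            total_loss += loss.item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```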
Let me know when this is ready for another review?
if cfg.load_path is None:
    raise OLMoConfigurationError("To run eval you must provide a load_path")
elif "://" in cfg.load_path:
    raise OLMoConfigurationError("Eval does not support remote paths. Please specify a local path or WEKA mounted path.")
Throwing an exception for remote paths here.
This PR adds scripts/eval.py, which evaluates one or more existing ckpts while bypassing the training steps. It seems impossible to backfill evals back into the original wandb run, because "step" must always increase; rewinding the run would truncate the log, which we don't want. Therefore, this script logs to a new wandb run.
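A sketch of that logging pattern using the public wandb API (the project/group names, the step parsing, and the evaluate_checkpoint helper are placeholders, not this PR's code):

```python
import wandb

run = wandb.init(project="olmo", group="XXX", name="XXX-eval")  # placeholder names
for load_path in sorted(load_paths):
    # Assumes checkpoint directories named like .../step1000; adjust parsing as needed.
    step = int(load_path.rstrip("/").split("step")[-1])
    metrics = evaluate_checkpoint(load_path)  # hypothetical helper returning a metrics dict
    wandb.log(metrics, step=step)  # steps increase monotonically within this fresh run
wandb.finish()
```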
Starting from a training setup:

1. Copy the XXX.sh file into XXX-eval.sh, point it to scripts/eval.sh, add a flag --wandb.group=XXX to ensure it logs to the same group, and specify --load_path to be either a single ckpt or all ckpts under a directory.
2. Copy the XXX-launch.sh file into XXX-eval-launch.sh, change --task-name to XXX-eval, and change the command so it runs XXX-eval.sh.

See an example in peteish1-eval.sh and peteish1-eval-launch.sh.