
eval drops when I have multi_eval dataset #2743

Closed
germanjke opened this issue Nov 28, 2023 · 5 comments · Fixed by #2746

@germanjke

Hi, I changed my eval dataset to a multi_eval dataset following this pull request.

Before:

eval_loader:
  dataset:
    max_seq_len: ${max_seq_len}
    shuffle: true
    shuffle_seed: ${global_seed}
    streams:
      a:
        local: a
        remote: s3://a
        repeat: 1.0
        split: val
  drop_last: true
  name: text
  num_workers: 8

After:

eval_loader:
- label: general
  dataset:
    max_seq_len: ${max_seq_len}
    shuffle: true
    shuffle_seed: ${global_seed}
    streams:
      a:
        local: a
        remote: s3://a
        repeat: 1.0
        split: val
  drop_last: true
  name: text
  num_workers: 8
- label: multi_eval_train_subset
  dataset:
    max_seq_len: ${max_seq_len}
    shuffle: true
    shuffle_seed: ${global_seed}
    streams:
      cc2021_39:
        local: b
        remote: s3://b
        repeat: 1.0
        split: train
  drop_last: true
  name: text
  num_workers: 8
- label: multi_eval_val_subset
  dataset:
    max_seq_len: ${max_seq_len}
    shuffle: true
    shuffle_seed: ${global_seed}
    streams:
      c:
        local: c
        remote: s3://c
        repeat: 1.0
        split: val
  drop_last: true
  name: text
  num_workers: 8

My loaders build, as these logs show:

Building train loader...
Building eval loader...
Initializing model...
Building trainer...

But I get this error in self._train_loop():

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /llm-foundry/scripts/train/train.py:665 in <module>                     │
│                                                                              │
│   662 │   cfg = om.merge(yaml_cfg, cli_cfg)                                  │
│   663 │   om.resolve(cfg)                                                    │
│   664 │   assert isinstance(cfg, DictConfig)                                 │
│ ❱ 665 │   main(cfg)                                                          │
│   666                                                                        │
│                                                                              │
│ /tgpt/llm-foundry/scripts/train/train.py:651 in main                         │
│                                                                              │
│   648 │   │   trainer.eval()                                                 │
│   649 │                                                                      │
│   650 │   print('Starting training...')                                      │
│ ❱ 651 │   trainer.fit()                                                      │
│   652 │                                                                      │
│   653 │   print('Done.')                                                     │
│   654 │   return trainer                                                     │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:1876 in fit       │
│                                                                              │
│   1873 │   │   │   self.state.scaler = ClosureGradScaler() if self._use_clos │
│   1874 │   │                                                                 │
│   1875 │   │   self.first_batch_complete = False                             │
│ ❱ 1876 │   │   self._train_loop()                                            │
│   1877 │                                                                     │
│   1878 │   def close(self):                                                  │
│   1879 │   │   """Shutdown the trainer.                                      │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:2104 in           │
│ _train_loop                                                                  │
│                                                                              │
│   2101 │   │   │   │   │   # Pause the timing during evaluation              │
│   2102 │   │   │   │   │   # Evaluation time is tracked separately in state. │
│   2103 │   │   │   │   │   duration = datetime.datetime.now() - last_wct     │
│ ❱ 2104 │   │   │   │   │   self._run_evaluators(Event.BATCH_END)             │
│   2105 │   │   │   │   │   last_wct = datetime.datetime.now() - duration     │
│   2106 │   │   │   │   │                                                     │
│   2107 │   │   │   │   │   self.engine.run_event(Event.BATCH_CHECKPOINT)     │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:2190 in           │
│ _run_evaluators                                                              │
│                                                                              │
│   2187 │   │   self.engine.run_event(Event.EVAL_BEFORE_ALL)                  │
│   2188 │   │   for index, evaluator in enumerate(self.state.evaluators):     │
│   2189 │   │   │   if evaluators_executing[index]:                           │
│ ❱ 2190 │   │   │   │   self._eval_loop(                                      │
│   2191 │   │   │   │   │   evaluator=evaluator,                              │
│   2192 │   │   │   │   │   subset_num_batches=evaluator.subset_num_batches,  │
│   2193 │   │   │   │   │   metrics=self.state.eval_metrics[evaluator.label], │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:2812 in           │
│ _eval_loop                                                                   │
│                                                                              │
│   2809 │   │   │   self.state.set_dataloader(data_spec.dataloader, evaluator │
│   2810 │   │   │   assert self.state.dataloader is not None, 'dataloader is  │
│   2811 │   │   │                                                             │
│ ❱ 2812 │   │   │   self.engine.run_event(Event.EVAL_START)                   │
│   2813 │   │   │                                                             │
│   2814 │   │   │   metrics = self._ensure_metrics_device_and_dtype(metrics)  │
│   2815                                                                       │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/core/engine.py:294 in run_event      │
│                                                                              │
│   291 │   │   │   # Run callbacks first, so any log calls from a callback th │
│   292 │   │   │   # get registered before they are flushed by the logger its │
│   293 │   │   │   self._run_nonlogger_callbacks(event)                       │
│ ❱ 294 │   │   │   self._run_loggers(event)                                   │
│   295 │   │                                                                  │
│   296 │   │   if event.is_before_event and duration_marker is not None:      │
│   297 │   │   │   duration_marker.start()                                    │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/core/engine.py:472 in _run_loggers   │
│                                                                              │
│   469 │                                                                      │
│   470 │   def _run_loggers(self, event: Union[Event, str]):                  │
│   471 │   │   loggers = [callback for callback in self.state.callbacks if is │
│ ❱ 472 │   │   self._run_callbacks(event, loggers)                            │
│   473 │                                                                      │
│   474 │   def _run_nonlogger_callbacks(self, event: Union[Event, str]):      │
│   475 │   │   callbacks = [callback for callback in self.state.callbacks if  │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/core/engine.py:468 in _run_callbacks │
│                                                                              │
│   465 │   │   │   ctx = cast(ContextManager, contextlib.nullcontext()) if ma │
│   466 │   │   │   with ctx:                                                  │
│   467 │   │   │   │   self._debug_log(event, f'Running callback {type(cb).__ │
│ ❱ 468 │   │   │   │   cb.run_event(event, self.state, self.logger)           │
│   469 │                                                                      │
│   470 │   def _run_loggers(self, event: Union[Event, str]):                  │
│   471 │   │   loggers = [callback for callback in self.state.callbacks if is │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/core/callback.py:96 in run_event     │
│                                                                              │
│    93 │   │   │   logger (Logger): The logger.                               │
│    94 │   │   """                                                            │
│    95 │   │   event_cb = getattr(self, event.value)                          │
│ ❱  96 │   │   return event_cb(state, logger)                                 │
│    97 │                                                                      │
│    98 │   def init(self, state: State, logger: Logger) -> None:              │
│    99 │   │   """Called on the :attr:`.Event.INIT` event.                    │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/loggers/console_logger.py:165 in     │
│ eval_start                                                                   │
│                                                                              │
│   162 │   │   # Remove index of last batch, so that we don't print progress  │
│   163 │   │   # at eval end.                                                 │
│   164 │   │   last_batch_idx = total_eval_batches                            │
│ ❱ 165 │   │   self.eval_batch_idxs_to_log.remove(last_batch_idx)             │
│   166 │   │   if not self.hparams_already_logged_to_console:                 │
│   167 │   │   │   self.hparams_already_logged_to_console = True              │
│   168 │   │   │   self._log_hparams_to_console()                             │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: list.remove(x): x not in list
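
For context, here is a minimal sketch (not Composer's actual code; the index spacing below is an assumption) of how the remove call shown in the traceback can hit this ValueError when an evaluator ends up with zero batches:

def eval_start(total_eval_batches: int, log_interval: int = 1):
    # Progress is printed every `log_interval` eval batches (assumed spacing).
    eval_batch_idxs_to_log = list(range(log_interval, total_eval_batches + 1, log_interval))
    # Remove index of last batch, so progress is not printed again at eval end
    # (mirrors the console_logger.py snippet in the traceback above).
    last_batch_idx = total_eval_batches
    eval_batch_idxs_to_log.remove(last_batch_idx)
    return eval_batch_idxs_to_log

eval_start(10)  # fine: 10 is in [1, 2, ..., 10]
eval_start(0)   # ValueError: list.remove(x): x not in list -- the list is empty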

@karan6181 @hanlint @eracah

@germanjke
Author

Setting drop_last: false solves this, thanks.
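
For anyone hitting the same thing, a quick standalone illustration (using a plain torch DataLoader rather than the streaming loader from the config above) of why drop_last matters for a tiny eval split:

import torch
from torch.utils.data import DataLoader, TensorDataset

tiny_eval_set = TensorDataset(torch.arange(3))  # only 3 samples

strict = DataLoader(tiny_eval_set, batch_size=8, drop_last=True)
lenient = DataLoader(tiny_eval_set, batch_size=8, drop_last=False)

print(len(strict))   # 0 -> no eval batches at all, which trips the console logger
print(len(lenient))  # 1 -> the final partial batch is kept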

@mvpatel2000
Contributor

@germanjke can you please reopen this in foundry instead? Thanks!

@mvpatel2000
Contributor

Reopening as this is a Composer issue -- I misread the issue.

@mvpatel2000
Contributor

mvpatel2000 commented Nov 28, 2023

@germanjke can you please share reproduction instructions so we can debug this? We think this may happen for very small datasets where a rank gets so few samples that drop_last discards them all, leaving 0 eval batches on that GPU.
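
Rough arithmetic for that hypothesis (the numbers below are made up, purely for illustration): with data-parallel eval, each rank only sees its shard of the eval subset, so a small subset plus drop_last: true can leave a rank with zero full batches:

def batches_per_rank(num_samples: int, world_size: int, batch_size: int, drop_last: bool) -> int:
    # Samples are sharded evenly across ranks, then grouped into batches.
    per_rank = num_samples // world_size
    # Ceiling division when the partial batch is kept.
    return per_rank // batch_size if drop_last else -(-per_rank // batch_size)

print(batches_per_rank(100, world_size=8, batch_size=16, drop_last=True))   # 0
print(batches_per_rank(100, world_size=8, batch_size=16, drop_last=False))  # 1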

@mvpatel2000
Contributor

I was able to repro and fix this in the linked PR :)
