[feat] Streamline evaluations: Add integrated Evaluator framework #65

Open
tscholak opened this issue Nov 24, 2024 · 0 comments

Labels
enhancement New feature or request
🧐 Problem Description

Fast-LLM defines the train-validation-test split through weights (e.g., [99, 1, 0]), which are primarily used to check that training is proceeding as expected (e.g., that loss decreases on the validation split). However, this approach has limitations:

  • The split is performed randomly and isn't easily reproducible outside Fast-LLM.
  • It doesn't allow for evaluation on specific, held-out datasets often used in modern pipelines (e.g., C4, The Pile, Wikitext 103).
  • Evaluations are currently handled through separate workflows depending on exported checkpoints, which involve multiple manual steps and external tools like HF Hub, leading to inefficiencies and coordination overhead.

In modern training setups (e.g., OLMo), evaluations are performed on specific datasets directly during validation, allowing for streamlined processes and quicker insights. Fast-LLM's current setup does not support this, which slows down decision-making and increases the risk of wasted compute due to delayed evaluations.

💡 Proposed Solution

Introduce an Evaluator abstraction and framework into Fast-LLM, inspired by OLMo's approach, to enable both LM-loss evaluation and downstream benchmark testing directly during training.

This framework would:

  1. Support evaluation of LM loss on specific held-out datasets during validation (e.g., C4 val split, The Pile val split, multilingual datasets like French, German, etc.).
  2. Include Evaluator modules for downstream benchmarks (e.g., MMLU, ARC, PIQA, Humaneval).
  3. Use Fast-LLM's inference implementation for generative benchmarks, ensuring compatibility without model conversion overhead.
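A minimal sketch of what such an Evaluator abstraction could look like. All names, signatures, and the model/task APIs below are illustrative assumptions, not the actual Fast-LLM or OLMo interfaces:

```python
import math
from abc import ABC, abstractmethod
from typing import Dict


class Evaluator(ABC):
    """Hypothetical base class: one evaluator per held-out dataset or benchmark."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def evaluate(self, model, step: int) -> Dict[str, float]:
        """Run the evaluation and return a flat dict of metrics to log."""
        ...


class LMLossEvaluator(Evaluator):
    """Computes LM loss and perplexity on a fixed held-out split (e.g., C4 val)."""

    def __init__(self, name: str, dataloader):
        super().__init__(name)
        self.dataloader = dataloader

    def evaluate(self, model, step: int) -> Dict[str, float]:
        total_loss, total_tokens = 0.0, 0
        for batch in self.dataloader:
            # Assumes the model exposes a forward pass returning the summed LM loss
            # and the number of predicted tokens for the batch (illustrative API).
            loss_sum, n_tokens = model.lm_loss(batch)
            total_loss += loss_sum
            total_tokens += n_tokens
        mean_loss = total_loss / max(total_tokens, 1)
        return {
            f"eval/{self.name}/lm_loss": mean_loss,
            f"eval/{self.name}/perplexity": math.exp(mean_loss),
        }


class DownstreamEvaluator(Evaluator):
    """Phase-2 idea: wraps a generative benchmark (e.g., MMLU, HumanEval) on top
    of Fast-LLM's own inference path, avoiding checkpoint export and conversion."""

    def __init__(self, name: str, task):
        super().__init__(name)
        self.task = task  # e.g., an lm-eval-harness-style task object (hypothetical)

    def evaluate(self, model, step: int) -> Dict[str, float]:
        scores = self.task.run(model)  # hypothetical task API returning {metric: value}
        return {f"eval/{self.name}/{metric}": value for metric, value in scores.items()}
```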

Key functionality:

  • Define datasets for evaluation in the training config, similar to OLMo's approach.
  • Run evaluations during validation without the need for separate workflows.
  • Extend to handle generative benchmarks alongside LM loss evaluation.
  • Automatically log results into the same tracking tool (WandB) as other training metrics, simplifying reporting and enabling near-real-time analysis.
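As a rough sketch of how these pieces could fit together (the config keys and helper names are illustrative assumptions, not Fast-LLM's actual schema), validation would iterate over the configured evaluators and log everything into the same WandB run as the training metrics:

```python
import wandb  # assumes the training run already calls wandb.init(...)

# Hypothetical config section: the evaluation suite is declared up front in the
# training config, mirroring OLMo's approach (keys are illustrative).
eval_config = {
    "evaluators": [
        {"type": "lm_loss", "name": "c4_val", "dataset": "c4", "split": "validation"},
        {"type": "lm_loss", "name": "pile_val", "dataset": "the_pile", "split": "validation"},
        {"type": "downstream", "name": "mmlu", "task": "mmlu"},
    ]
}


def run_validation(model, evaluators, step: int) -> None:
    """Run every configured evaluator at a validation step and log the results
    into the same WandB run as the training metrics (no separate workflow)."""
    metrics = {}
    for evaluator in evaluators:
        metrics.update(evaluator.evaluate(model, step))
    wandb.log(metrics, step=step)
```

Instantiating the evaluator objects from the config would be a thin factory on the Fast-LLM side; the point is that the evaluation suite lives in the training config and shares the run's existing logging.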

Implementation Milestones:

  1. Phase 1: Start with LM-loss evaluations for a core set of datasets.
  2. Phase 2: Add generative downstream benchmarks as a separate phase, contingent on phase 1 success and demand.

Example: The OLMoE team has used this approach effectively: their Evaluators allow all evaluation metrics to be logged seamlessly in a single WandB run alongside the training metrics (WandB Report).

🔄 Alternatives Considered

  1. Maintain Current Workflow: Continue evaluating exported checkpoints externally via scripts and HF Hub. While this avoids modifying Fast-LLM, it introduces inefficiencies, coordination overhead, and fragmented results, often leading to delayed or skipped evaluations.

  2. Perform Comprehensive Evaluation in Fast-LLM (this proposal): It addresses the inefficiencies above but comes with trade-offs:

    • Opportunity Cost: Adding Evaluators requires development effort.
    • Fixed Evaluations: The set of evaluations must be decided upfront before training starts. Changes or additions later will require external tools like lighteval or lm-eval-harness.

To mitigate these trade-offs, phase 1 should focus on core, stable benchmarks (e.g., LM-loss on key datasets and MMLU) that rarely change.

📈 Potential Benefits

  • Streamlined Process: Integrating evaluation into Fast-LLM removes reliance on external workflows, reducing overhead, errors, and delays.
  • Faster Decision-Making: On-the-fly evaluation provides immediate feedback on key benchmarks, preventing wasted compute on suboptimal models.
  • Reproducibility: Standardized evaluation ensures consistency and repeatability across runs.
  • Unified Metrics: Results are logged into the same tracking tool (WandB) as training metrics, eliminating the need for custom post-processing.

📝 Additional Context

  • OLMo's implementation of Evaluator modules: OLMo Evaluator Reference.
  • WandB report for OLMoE training: WandB Report.
  • Current workflow challenges:
    • Requires exporting checkpoints.
    • Involves multiple manual steps and external dependencies.
    • Coordination issues with separate compute resources for evaluation.

Adding this feature would modernize Fast-LLM and align it with industry-leading practices, improving training efficiency and reducing the risks of delayed insights.
