[feat] Streamline evaluations: Add integrated Evaluator framework #65

Open
tscholak opened this issue Nov 24, 2024 · 0 comments

Labels
enhancement New feature or request
🧐 Problem Description

Fast-LLM defines the train-validation-test split through weights (e.g., [99, 1, 0]), which are primarily used to check that training is proceeding as expected (e.g., that loss decreases on the validation split). However, this approach has limitations:

  • The split is performed randomly and isn't easily reproducible outside Fast-LLM.
  • It doesn't allow for evaluation on specific, held-out datasets often used in modern pipelines (e.g., C4, The Pile, Wikitext 103).
  • Evaluations are currently handled through separate workflows depending on exported checkpoints, which involve multiple manual steps and external tools like HF Hub, leading to inefficiencies and coordination overhead.

In modern training setups (e.g., OLMo), evaluations are performed on specific datasets directly during validation, allowing for streamlined processes and quicker insights. Fast-LLM's current setup does not support this, which slows down decision-making and increases the risk of wasted compute due to delayed evaluations.

💡 Proposed Solution

Introduce an Evaluator abstraction and framework into Fast-LLM, inspired by OLMo's approach, to enable both LM-loss evaluation and downstream benchmark testing directly during training.

This framework would:

  1. Support evaluation of LM loss on specific held-out datasets during validation (e.g., C4 val split, The Pile val split, multilingual datasets like French, German, etc.).
  2. Include Evaluator modules for downstream benchmarks (e.g., MMLU, ARC, PIQA, Humaneval).
  3. Use Fast-LLM's inference implementation for generative benchmarks, ensuring compatibility without model conversion overhead.
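A minimal sketch of what such an Evaluator abstraction could look like. All names, signatures, and the model/task APIs below are illustrative assumptions, not the actual Fast-LLM or OLMo interfaces:

```python
import math
from abc import ABC, abstractmethod
from typing import Dict


class Evaluator(ABC):
    """Hypothetical base class: one evaluator per held-out dataset or benchmark."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def evaluate(self, model, step: int) -> Dict[str, float]:
        """Run the evaluation and return a flat dict of metrics to log."""
        ...


class LMLossEvaluator(Evaluator):
    """Computes LM loss and perplexity on a fixed held-out split (e.g., C4 val)."""

    def __init__(self, name: str, dataloader):
        super().__init__(name)
        self.dataloader = dataloader

    def evaluate(self, model, step: int) -> Dict[str, float]:
        total_loss, total_tokens = 0.0, 0
        for batch in self.dataloader:
            # Assumes the model exposes a forward pass returning the summed LM loss
            # and the number of predicted tokens for the batch (illustrative API).
            loss_sum, n_tokens = model.lm_loss(batch)
            total_loss += loss_sum
            total_tokens += n_tokens
        mean_loss = total_loss / max(total_tokens, 1)
        return {
            f"eval/{self.name}/lm_loss": mean_loss,
            f"eval/{self.name}/perplexity": math.exp(mean_loss),
        }


class DownstreamEvaluator(Evaluator):
    """Phase-2 idea: wraps a generative benchmark (e.g., MMLU, HumanEval) on top
    of Fast-LLM's own inference path, avoiding checkpoint export and conversion."""

    def __init__(self, name: str, task):
        super().__init__(name)
        self.task = task  # e.g., an lm-eval-harness-style task object (hypothetical)

    def evaluate(self, model, step: int) -> Dict[str, float]:
        scores = self.task.run(model)  # hypothetical task API returning {metric: value}
        return {f"eval/{self.name}/{metric}": value for metric, value in scores.items()}
```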

Key functionality:

  • Define datasets for evaluation in the training config, similar to OLMo's approach.
  • Run evaluations during validation without the need for separate workflows.
  • Extend to handle generative benchmarks alongside LM loss evaluation.
  • Automatically log results into the same tracking tool (WandB) as other training metrics, simplifying reporting and enabling near-real-time analysis.
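As a rough sketch of how these pieces could fit together (the config keys and helper names are illustrative assumptions, not Fast-LLM's actual schema), validation would iterate over the configured evaluators and log everything into the same WandB run as the training metrics:

```python
import wandb  # assumes the training run already calls wandb.init(...)

# Hypothetical config section: the evaluation suite is declared up front in the
# training config, mirroring OLMo's approach (keys are illustrative).
eval_config = {
    "evaluators": [
        {"type": "lm_loss", "name": "c4_val", "dataset": "c4", "split": "validation"},
        {"type": "lm_loss", "name": "pile_val", "dataset": "the_pile", "split": "validation"},
        {"type": "downstream", "name": "mmlu", "task": "mmlu"},
    ]
}


def run_validation(model, evaluators, step: int) -> None:
    """Run every configured evaluator at a validation step and log the results
    into the same WandB run as the training metrics (no separate workflow)."""
    metrics = {}
    for evaluator in evaluators:
        metrics.update(evaluator.evaluate(model, step))
    wandb.log(metrics, step=step)
```

Instantiating the evaluator objects from the config would be a thin factory on the Fast-LLM side; the point is that the evaluation suite lives in the training config and shares the run's existing logging.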

Implementation Milestones:

  1. Phase 1: Start with LM-loss evaluations for a core set of datasets.
  2. Phase 2: Add generative downstream benchmarks as a separate phase, contingent on phase 1 success and demand.

Example: The OLMoE team has used this approach effectively: their Evaluators allow all evaluation metrics to be logged seamlessly in a single WandB run alongside the training metrics (WandB Report).

🔄 Alternatives Considered

  1. Maintain Current Workflow: Continue evaluating exported checkpoints externally via scripts and HF Hub. While this avoids modifying Fast-LLM, it introduces inefficiencies, coordination overhead, and fragmented results, often leading to delayed or skipped evaluations.

  2. Perform Comprehensive Evaluation in Fast-LLM (this proposal): It addresses the inefficiencies above but comes with trade-offs:

    • Opportunity Cost: Adding Evaluators requires development effort.
    • Fixed Evaluations: The set of evaluations must be decided upfront before training starts. Changes or additions later will require external tools like lighteval or lm-eval-harness.

To mitigate these trade-offs, phase 1 should focus on core, stable benchmarks (e.g., LM-loss on key datasets and MMLU) that rarely change.

📈 Potential Benefits

  • Streamlined Process: Integrating evaluation into Fast-LLM removes reliance on external workflows, reducing overhead, errors, and delays.
  • Faster Decision-Making: On-the-fly evaluation provides immediate feedback on key benchmarks, preventing wasted compute on suboptimal models.
  • Reproducibility: Standardized evaluation ensures consistency and repeatability across runs.
  • Unified Metrics: Results are logged into the same tracking tool (WandB) as training metrics, eliminating the need for custom post-processing.

📝 Additional Context

  • OLMo's implementation of Evaluator modules: OLMo Evaluator Reference.
  • WandB report for OLMoE training: WandB Report.
  • Current workflow challenges:
    • Requires exporting checkpoints.
    • Involves multiple manual steps and external dependencies.
    • Coordination issues with separate compute resources for evaluation.

Adding this feature would modernize Fast-LLM and align it with industry-leading practices, improving training efficiency and reducing the risks of delayed insights.
