🧐 Problem Description
Fast-LLM defines the train-val-test split through weights (e.g., [99,1,0]), which is primarily used to check whether training is proceeding as expected (e.g., the validation loss decreases). However, this approach has limitations:
The split is performed randomly and isn't easily reproducible outside Fast-LLM.
It doesn't allow for evaluation on specific, held-out datasets often used in modern pipelines (e.g., C4, The Pile, WikiText-103).
Evaluations are currently handled through separate workflows that depend on exported checkpoints and involve multiple manual steps and external tools like the HF Hub, leading to inefficiencies and coordination overhead.
In modern training setups (e.g., OLMo), evaluations are performed on specific datasets directly during validation, allowing for streamlined processes and quicker insights. Fast-LLM's current setup does not support this, which slows down decision-making and increases the risk of wasted compute due to delayed evaluations.
💡 Proposed Solution
Introduce an Evaluator abstraction and framework into Fast-LLM, inspired by OLMo's approach, to enable both LM-loss evaluation and downstream benchmark testing directly during training (a rough interface sketch follows the list below).
This framework would:
Support evaluation of LM loss on specific held-out datasets during validation (e.g., the C4 and The Pile validation splits, or datasets in other languages such as French and German).
Include Evaluator modules for downstream benchmarks (e.g., MMLU, ARC, PIQA, HumanEval).
Use Fast-LLM's inference implementation for generative benchmarks, ensuring compatibility without model conversion overhead.
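To make the proposal concrete, below is a minimal sketch of what such an abstraction could look like. Everything in it (class and method names, the model and dataloader interfaces) is an illustrative assumption, not Fast-LLM's or OLMo's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable

import torch
import torch.nn.functional as F


class Evaluator(ABC):
    """Hypothetical base class: turns a model into a dict of named metrics."""

    name: str

    @abstractmethod
    def run(self, model: torch.nn.Module, step: int) -> Dict[str, float]:
        """Evaluate `model` at training step `step`; return metric name -> value."""


class LMLossEvaluator(Evaluator):
    """LM loss on a fixed held-out dataset (e.g., a C4 or The Pile validation split)."""

    def __init__(self, name: str, dataloader: Iterable):
        self.name = name
        self.dataloader = dataloader  # yields dicts with "input_ids" and "labels"

    @torch.no_grad()
    def run(self, model: torch.nn.Module, step: int) -> Dict[str, float]:
        model.eval()
        total_loss, total_tokens = 0.0, 0
        for batch in self.dataloader:
            logits = model(batch["input_ids"])  # assumed shape: (batch, seq, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten())
            tokens = batch["labels"].numel()
            total_loss += loss.item() * tokens
            total_tokens += tokens
        model.train()
        return {f"eval/{self.name}/lm_loss": total_loss / total_tokens}
```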
Key functionality:
Define datasets for evaluation in the training config, similar to OLMo's approach (a config sketch follows this list).
Run evaluations during validation without the need for separate workflows.
Extend to handle generative benchmarks alongside LM loss evaluation.
Automatically log results into the same tracking tool (WandB) as other training metrics, simplifying reporting and enabling near-real-time analysis.
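For illustration, the evaluation section of the training config could look roughly like the sketch below (written as a Python dict here; the actual schema, key names, and file format would be decided during implementation):

```python
# Hypothetical `evaluations` section of a training config.
# Keys, dataset names, and paths are illustrative, not Fast-LLM's actual schema.
evaluations = {
    "interval": 1000,  # run every N training steps, alongside regular validation
    "lm_loss": [
        {"name": "c4_val", "path": "/data/c4/validation"},
        {"name": "pile_val", "path": "/data/the_pile/validation"},
        {"name": "wikitext103_val", "path": "/data/wikitext103/validation"},
    ],
    "downstream": [  # Phase 2: generative benchmarks
        {"name": "mmlu", "num_fewshot": 5},
        {"name": "arc_challenge", "num_fewshot": 0},
    ],
}
```

The trainer would build one Evaluator per entry and run them all at each validation step.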
Implementation Milestones:
Phase 1: Start with LM-loss evaluations for a core set of datasets.
Phase 2: Add generative downstream benchmarks as a separate phase, contingent on phase 1 success and demand.
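As a Phase 2 sketch only: a generative benchmark could be just another subclass of the Evaluator base class sketched earlier, scoring model completions. The generate_fn below is a stand-in for whatever text-generation entry point Fast-LLM's inference implementation ends up exposing, not an existing API.

```python
class GenerativeBenchmarkEvaluator(Evaluator):
    """Phase 2 sketch: accuracy on a generative benchmark.

    `examples` is a list of (prompt, is_correct) pairs, where `is_correct` is a
    callable that checks a completion; `generate_fn(model, prompt, max_new_tokens)`
    is a placeholder for Fast-LLM's own generation entry point.
    """

    def __init__(self, name, examples, generate_fn):
        self.name = name
        self.examples = examples
        self.generate_fn = generate_fn

    def run(self, model, step):
        correct = 0
        for prompt, is_correct in self.examples:
            completion = self.generate_fn(model, prompt, max_new_tokens=256)
            correct += int(is_correct(completion))
        return {f"eval/{self.name}/accuracy": correct / len(self.examples)}
```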
Example: The OLMoE team has effectively used this approach, where Evaluators allow all evaluation metrics to be seamlessly logged in a single WandB run alongside training metrics (WandB Report).
🔄 Alternatives Considered
Maintain Current Workflow: Continue evaluating exported checkpoints externally via scripts and HF Hub. While this avoids modifying Fast-LLM, it introduces inefficiencies, coordination overhead, and fragmented results, often leading to delayed or skipped evaluations.
Perform Comprehensive Evaluation in Fast-LLM: This proposed approach addresses the inefficiencies but has trade-offs:
Opportunity Cost: Adding Evaluators requires development effort.
Fixed Evaluations: The set of evaluations must be decided upfront, before training starts. Changes or additions later will still require external tools like lighteval or lm-eval-harness (the external route is sketched below).
To mitigate these trade-offs, Phase 1 should focus on core, stable benchmarks (e.g., LM loss on key datasets and MMLU) that rarely change.
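For comparison, the external route mentioned above looks roughly like this: export a checkpoint to Hugging Face format, then point lm-eval-harness at it. The checkpoint path is a placeholder, and exact argument names depend on the installed lm-eval-harness version.

```python
# Sketch of the existing external route: evaluate an exported HF checkpoint
# with lm-eval-harness, outside the training run.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/exported_hf_checkpoint",  # placeholder path
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])
```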
📈 Potential Benefits
Streamlined Process: Integrating evaluation into Fast-LLM removes reliance on external workflows, reducing overhead, errors, and delays.
Faster Decision-Making: On-the-fly evaluation provides immediate feedback on key benchmarks, preventing wasted compute on suboptimal models.
Reproducibility: Standardized evaluation ensures consistency and repeatability across runs.
Unified Metrics: Results are logged into the same tracking tool (WandB) as training metrics, eliminating the need for custom post-processing.
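Because evaluators run inside the training process, logging is just another wandb.log call against the same run and the same global step as the training loss. A minimal sketch, assuming evaluators, model, and step come from the surrounding training loop:

```python
import wandb

# Inside the trainer's validation hook (sketch): run each configured evaluator
# and log its metrics into the same WandB run, at the current global step.
for evaluator in evaluators:
    metrics = evaluator.run(model, step)  # e.g., {"eval/c4_val/lm_loss": 2.31}
    wandb.log(metrics, step=step)
```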
📝 Additional Context
Today's external evaluation workflow involves multiple manual steps and external dependencies, and requires coordinating separate compute resources for evaluation.
For reference, OLMo's Evaluator modules: OLMo Evaluator Reference.
Adding this feature would modernize Fast-LLM and align it with industry-leading practices, improving training efficiency and reducing the risks of delayed insights.