🐾 Process-supervised RM Trainer (#2127)
* initial skeleton
* tokenize fn
* adding bos and eos to tokenization fn
* prmtrainer
* fixing small typo in tokenize
* typo in input_ids and labels construction
* numpy dimension
* introduce the stepwise reward trainer
* update markdown files
* let user decide post step separator in config
* doc post_step_separator
* do not add post step_tokens to last step of the reasoning process
* renaming prm to stepwisereward
* formatting
* fix tokenize kwargs
* adapt test to the new post_token args
* adding example script
* fix small typo
* add create_model_card and renaming
* fixing booleans
* Adding the new stepwise_preference instead of placeholders for datasets
* formatting
* Update docs/source/_toctree.yml (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update examples/scripts/stepwise_reward_modeling.py (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update trl/trainer/stepwise_reward_trainer.py (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update trl/trainer/stepwise_reward_trainer.py (Co-authored-by: Quentin Gallouédec <[email protected]>)
* update push to hub (Co-authored-by: Quentin Gallouédec <[email protected]>)
* step_separator can't be None (Co-authored-by: Quentin Gallouédec <[email protected]>)
* fix suggested typos
* add citation
* reformat doc
* reordering init
* push to hub prm800k
* changing dataset in example
* change dataset format to align with the sky is blue example
* fix tokenization column names
* fix num labels in openai example
* add support for conversational dataset
* remove trailing whitespace
* replace tokenizer with processing class
* Update docs/source/dataset_formats.mdx (Co-authored-by: Quentin Gallouédec <[email protected]>)
* remove openai_prm800k
* Update trl/trainer/stepwise_reward_trainer.py (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update trl/trainer/stepwise_reward_trainer.py (Co-authored-by: Quentin Gallouédec <[email protected]>)
* Update docs/source/stepwise_reward_trainer.mdx (Co-authored-by: lewtun <[email protected]>)
* Update docs/source/stepwise_reward_trainer.mdx (Co-authored-by: lewtun <[email protected]>)
* renaming (Co-authored-by: lewtun <[email protected]>)
* renaming (Co-authored-by: lewtun <[email protected]>)
* minor renamings in docs
* using prm800k instead of openai_prm800k
* update num labels to 2 following the new format
* changing doc examples to math examples
* change reference to dataset_formats.mdx
* changing dataset config in test
* remove conversational dataset support
* remove conv dataset support
* fix bos token
* fix scriptarguments in example
* completion to completions
* remove valuerror for step_separator inside steps
* run precommit
* remove conv dataset support (Co-authored-by: Quentin Gallouédec <[email protected]>)
* renaming zen dataset
* remove unused printing
* unknown label column
* introduce the train on last step arg
* _tokenize support train_on_last_step
* incorporate train_on_last_step to tests
* formatting
* remove comments in trainer
* Refactor `tokenize_row`
* Update max_completion_length parameter in StepwiseRewardConfig
* Collator
* Update comment
* Update type hint
* fix table
* Remove collator
* don't need pad token id
* add error back
* max length args
* use tokenizer arg
* Update doc
* label -> labels
* fixing tokenization issues in tokenize_row
* correct labels for token classification
* adding max_length to tokenize_row
* reformat tests
* adding tests for tokenize_row
* fixing typos in comments
* update doc (Co-authored-by: Kashif Rasul <[email protected]>)
* Add math_shepherd.py script for dataset processing
* split the dataset
* formatting
* same evaluation method for the two training methods
* adding filtering to example script
* formatting
* Add features to avoid casting labels to bool in dataset tokenization
* Update docs/source/stepwise_reward_trainer.mdx [ci skip]
* Add learning_rate parameter to StepwiseRewardConfig class
* update doc
* Remove unused setup_chat_format function
* Fix warning message in stepwise_reward_modeling.py
* Update logging steps in stepwise_reward_trainer.mdx
* little doc change [ci skip]
* Fix copyrights
* fix space after copyrights
* Update dataset loading in stepwise_reward_modeling.py
* refine compute_accuracy and proper test
* fix tests
* style
* renamings
* renaming in init
* doc renaming
* fix sorting and tag
* experimental [ci skip]
* trigger CI
* other doc fix

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Kashif Rasul <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
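Several entries above (`tokenize_row`, the `step_separator` config option, `train_on_last_step`, and "correct labels for token classification") revolve around one idea: each reasoning step in a completion ends with a separator token, and a 0/1 correctness label is attached at that separator position so the model can be trained as a token classifier. The sketch below illustrates that labeling scheme with plain token-id lists. It is a minimal illustration written for this commit summary, not TRL's actual implementation; the function name, arguments, and toy token ids are all assumptions.

```python
# Hypothetical sketch of step-wise label construction: labels are -100
# (ignored by the loss) everywhere except at each step-separator position,
# where the step's 0/1 correctness label is placed.
def tokenize_row(prompt_ids, steps_ids, step_labels, sep_id,
                 train_on_last_step=False, ignore_index=-100):
    input_ids = list(prompt_ids)
    labels = [ignore_index] * len(prompt_ids)
    for i, (step, label) in enumerate(zip(steps_ids, step_labels)):
        input_ids.extend(step)
        labels.extend([ignore_index] * len(step))
        input_ids.append(sep_id)  # separator marks the end of this step
        if train_on_last_step and i < len(steps_ids) - 1:
            labels.append(ignore_index)  # supervise only the final step
        else:
            labels.append(int(label))
    return {"input_ids": input_ids, "labels": labels}

# Toy example: a 2-token prompt, two steps (correct, then incorrect),
# with 999 standing in for the separator token id.
example = tokenize_row(
    prompt_ids=[101, 7592],
    steps_ids=[[200, 201], [202]],
    step_labels=[1, 0],
    sep_id=999,
)
# input_ids: [101, 7592, 200, 201, 999, 202, 999]
# labels:    [-100, -100, -100, -100, 1, -100, 0]
```

With `train_on_last_step=True`, only the final separator keeps its label, matching the "train on last step" option the commits add; everything else stays masked out.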