Super simple scripts for supervised fine-tuning and preference-tuning, mostly based on Unsloth. Fork it, and do whatever you want with it.
- Install Unsloth.
- Install other dependencies: `pip install fire`.
Experiment with different learning rates, beta values, and the DPO vs. IPO loss. The best values will be highly data-dependent. Start with a learning rate that's approximately 10 times smaller than the one you normally use for QLoRA SFT.
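For orientation, here is a minimal, hypothetical sketch of where those knobs live when using TRL's `DPOConfig`, which Unsloth's DPO support builds on. The argument set and values are illustrative assumptions, not settings from this repo, and in older TRL versions `beta` and `loss_type` are passed to `DPOTrainer` rather than `DPOConfig`:

```python
# Illustrative hyperparameter sketch using TRL's DPOConfig.
# All values are starting points to tune against your own data.
from trl import DPOConfig

dpo_args = DPOConfig(
    output_dir="out/dpo",             # hypothetical output path
    learning_rate=5e-7,               # ~10x below a typical QLoRA SFT learning rate
    beta=0.1,                         # strength of the KL penalty against the reference model
    loss_type="sigmoid",              # "sigmoid" = DPO loss, "ipo" = IPO loss
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
)
```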
Monitor `(train|eval)/rewards/accuracies`, `(train|eval)/rewards/margins`, and `train/loss`. They should not improve too quickly; if they do, lower the learning rate.
Never trust your `(train|eval)/rewards` metrics alone; perform end-to-end testing on a dataset that does not overlap with your SFT and DPO data.
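Since TRL's `DPOTrainer` reports these metrics through the standard Hugging Face `Trainer` logging callbacks, one way to catch a too-hot learning rate early is a small custom callback. This is a hypothetical sketch; the class and its thresholds are not part of this repo:

```python
# Hypothetical callback that flags reward accuracy saturating too early,
# which usually means the DPO learning rate is too high.
from transformers import TrainerCallback


class RewardMonitorCallback(TrainerCallback):
    def __init__(self, warn_threshold: float = 0.95, early_steps: int = 100):
        self.warn_threshold = warn_threshold  # accuracy considered "saturated"
        self.early_steps = early_steps        # how early counts as "too early"

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        acc = logs.get("rewards/accuracies", logs.get("eval_rewards/accuracies"))
        if acc is not None and state.global_step <= self.early_steps and acc >= self.warn_threshold:
            print(
                f"[step {state.global_step}] rewards/accuracies={acc:.3f} is near-saturated "
                "very early in training; consider lowering the learning rate."
            )
```

Attach it with `trainer.add_callback(RewardMonitorCallback())` before calling `trainer.train()`.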
Here are some example DPO runs of DreamGen V1 7B using different learning rates:
Here are some example DPO Bagel runs:
- Comparison of DPO, IPO, KTO, and various hyperparameters
- DPO tutorial from Hugging Face
- DPO tutorial from Maxime Labonne
- DPO paper
End-to-end evals on data that's disjoint from your SFT and DPO training data are crucial for assessing real improvements. Ideally, your evals should be as close as possible to what you intend to use the final model for. If you don't have that, you can use one of the existing broad-spectrum auto-evals: