BCOTrainer conversational dataset support #2107

Merged

8 commits merged on Sep 24, 2024

Changes from all commits
45 changes: 2 additions & 43 deletions docs/source/bco_trainer.mdx
@@ -6,49 +6,8 @@ For a full example have a look at [`examples/scripts/bco.py`].

## Expected dataset format

-The BCO trainer expects a very specific format for the dataset as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, model completion, and a label to indicate whether the completion is "good" or "bad", we expect a dataset with the following columns:
-
-- `prompt`
-- `completion`
-- `label`
-
-for example:
-
-```
-bco_dataset_dict = {
-    "prompt": [
-        "Hey, hello",
-        "How are you",
-        "What is your name?",
-        "What is your name?",
-        "Which is the best programming language?",
-        "Which is the best programming language?",
-        "Which is the best programming language?",
-    ],
-    "completion": [
-        "hi nice to meet you",
-        "leave me alone",
-        "I don't have a name",
-        "My name is Mary",
-        "Python",
-        "C++",
-        "Java",
-    ],
-    "label": [
-        True,
-        False,
-        False,
-        True,
-        True,
-        False,
-        False,
-    ],
-}
-```
-
-where `prompt` contains the context inputs, `completion` contains the corresponding responses, and `label` contains the flag that indicates whether the generated completion is desired (`True`) or undesired (`False`).
-A prompt can have multiple responses, which is reflected in the repeated entries in the dictionary's value arrays. The dataset must contain at least one desirable and one undesirable completion.
-
+The [`BCOTrainer`] requires an [unpaired preference dataset](dataset_formats#unpaired-preference).
+The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational-dataset-format) and [standard](dataset_formats#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
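
To make the two accepted layouts concrete, here is a minimal sketch (the values are invented for illustration; the column names follow the unpaired preference spec):

```python
# Minimal sketch of the two supported unpaired preference layouts
# (illustrative values only).
from datasets import Dataset

# Standard format: plain-text prompt/completion strings.
standard = Dataset.from_dict({
    "prompt": ["The sky is"],
    "completion": [" blue."],
    "label": [True],
})

# Conversational format: lists of chat messages; the trainer applies the
# tokenizer's chat template to these automatically.
conversational = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": "What color is the sky?"}]],
    "completion": [[{"role": "assistant", "content": "It is blue."}]],
    "label": [True],
})
```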

## Expected model format
The BCO trainer expects a model of type `AutoModelForCausalLM`, whereas PPO expects `AutoModelForCausalLMWithValueHead` for the value function.
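
In practice that just means loading a causal LM checkpoint for both the policy and the reference model. A minimal sketch, reusing the model id from the example script in this PR (any causal LM works):

```python
# Hedged sketch: loading a policy and reference model for BCOTrainer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```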
101 changes: 9 additions & 92 deletions examples/scripts/bco.py
@@ -17,7 +17,9 @@

# Full training:
python examples/scripts/bco.py \
-    --model_name_or_path=nnheui/stablelm-2-1_6b-sft-full \
+    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
+    --trust_remote_code \
+    --dataset_name trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness \
Member: I wonder if we can switch this to an unpaired version of trl-lib/ultrafeedback_binarized? That way we have two simple datasets that should "just" work.
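
A sketch of what that unpairing could look like, assuming TRL's `unpair_preference_dataset` helper from its data utilities (an assumption, not part of this diff):

```python
# Hedged sketch: convert a paired preference dataset into the unpaired
# (prompt, completion, label) layout that BCO expects. Assumes TRL exposes
# unpair_preference_dataset, which maps chosen -> label=True and
# rejected -> label=False.
from datasets import load_dataset
from trl import unpair_preference_dataset

paired = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
unpaired = unpair_preference_dataset(paired)
```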

Member (Author): Good idea. How is trl-lib/ultrafeedback_binarized generated? Which "aspect" (helpfulness, ...) is used?

Member (Author): I'll merge the PR, but we can work on it in a follow-up PR.

    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 32 \
    --num_train_epochs 1 \
@@ -66,88 +68,15 @@
    --lora_alpha=16
"""

-import logging
-from dataclasses import dataclass
-from functools import partial
-from typing import Literal, Optional

import torch
import torch.nn.functional as F
-from accelerate import Accelerator, PartialState
-from datasets import Dataset, load_dataset
+from accelerate import Accelerator
+from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, PreTrainedModel

-from trl import BCOConfig, BCOTrainer, ModelConfig, get_peft_config, setup_chat_format


-# Define and parse arguments.
-@dataclass
-class ScriptArguments:
-    """
-    The arguments for the BCO training script.
-    """
-
-    llm_name: Literal["gpt-3.5-turbo", "llama-2-7b-chat", "llama-2-70b-chat"] = "gpt-3.5-turbo"
-
-
-def build_helpfulness_dataset(llm_name: str, num_proc: Optional[int] = None) -> Dataset:
-    """
-    Filter `llm_name` completions and binarize given their helpfulness score.
-    If helpfulness score is 5, it is desirable. Otherwise, it is undesirable.
-    """
-
-    def get_model_rating(example, metric: str, llm_name: str):
-        try:
-            model_index = example["models"].index(llm_name)
-            return {metric: int(example["completions"][model_index]["annotations"][metric]["Rating"])}
-        except ValueError as e:
-            logging.warning(e)
-            return -1
-
-    def get_model_response(example, llm_name: str):
-        try:
-            model_index = example["models"].index(llm_name)
-            return {"response": example["completions"][model_index]["response"]}
-        except ValueError as e:
-            logging.warning(e)
-            return -1
-
-    dataset = load_dataset("openbmb/UltraFeedback")["train"]
-
-    dataset = dataset.filter(lambda example: llm_name in example["models"], batched=False, num_proc=num_proc)
-    dataset = dataset.filter(
-        lambda example: len(example["models"]) == len(example["completions"]), batched=False, num_proc=num_proc
-    )
-
-    METRIC = "helpfulness"
-
-    dataset = dataset.map(
-        get_model_rating,
-        batched=False,
-        fn_kwargs={"metric": METRIC, "llm_name": llm_name},
-        num_proc=num_proc,
-    )
-
-    dataset = dataset.map(
-        get_model_response,
-        batched=False,
-        fn_kwargs={"llm_name": llm_name},
-        num_proc=num_proc,
-    )
-
-    dataset = dataset.select_columns(["source", "instruction", "response", "helpfulness"])
-
-    dataset = dataset.rename_columns({"instruction": "prompt", "response": "completion"})
-    dataset = dataset.map(lambda example: {"label": example["helpfulness"] >= 5}, batched=False, num_proc=num_proc)
-
-    dataset = dataset.map(
-        lambda example: {"prompt": [{"role": "user", "content": example["prompt"]}]},
-        batched=False,
-        num_proc=num_proc,
-    )
-    dataset = dataset.train_test_split(test_size=0.05, seed=42)
-
-    return dataset
+from trl import BCOConfig, BCOTrainer, DPOScriptArguments, ModelConfig, get_peft_config, setup_chat_format


def embed_prompt(input_ids: torch.LongTensor, attention_mask: torch.LongTensor, model: PreTrainedModel):
@@ -174,8 +103,8 @@ def mean_pooling(model_output, attention_mask):


if __name__ == "__main__":
-    parser = HfArgumentParser((ScriptArguments, BCOConfig, ModelConfig))
-    script_args, training_args, model_args = parser.parse_args_into_dataclasses()
+    parser = HfArgumentParser((DPOScriptArguments, BCOConfig, ModelConfig))
+    args, training_args, model_args = parser.parse_args_into_dataclasses()

    training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}

@@ -197,19 +126,7 @@ def mean_pooling(model_output, attention_mask):
    if tokenizer.chat_template is None:
        model, tokenizer = setup_chat_format(model, tokenizer)

-    # Apply chat template
-    def format_dataset(example):
-        example["prompt"] = tokenizer.apply_chat_template(
-            example["prompt"], tokenize=False, add_generation_prompt=True
-        )
-        return example
-
-    # Compute that only on the main process for faster data processing.
-    # see: https://github.com/huggingface/trl/pull/1255
-    with PartialState().local_main_process_first():
-        # Load the dataset
-        dataset = build_helpfulness_dataset(script_args.llm_name, num_proc=training_args.dataset_num_proc)
-        dataset = dataset.map(format_dataset, batched=False, num_proc=training_args.dataset_num_proc)
+    dataset = load_dataset(args.dataset_name)

    accelerator = Accelerator()
    embedding_model = AutoModel.from_pretrained(
20 changes: 7 additions & 13 deletions tests/test_bco_trainer.py

@@ -49,19 +49,19 @@

    @parameterized.expand(
        [
-            ["gpt2", True, True],
-            ["gpt2", True, False],
-            ["gpt2", False, True],
-            ["gpt2", False, False],
+            ["gpt2", True, True, "standard_unpaired_preference"],
+            ["gpt2", True, False, "standard_unpaired_preference"],
+            ["gpt2", False, True, "standard_unpaired_preference"],
+            ["gpt2", False, False, "standard_unpaired_preference"],
+            ["gpt2", True, True, "conversational_unpaired_preference"],
        ]
    )
-    def test_bco_trainer(self, name, pre_compute, eval_dataset):
+    def test_bco_trainer(self, name, pre_compute, eval_dataset, config_name):
        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = BCOConfig(
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=1,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -70,7 +70,7 @@ def test_bco_trainer(self, name, pre_compute, eval_dataset):
                report_to="none",
            )

-            dummy_dataset = load_dataset("trl-internal-testing/zen", "standard_unpaired_preference")
+            dummy_dataset = load_dataset("trl-internal-testing/zen", config_name)

            if name == "gpt2":
                model = self.model
@@ -129,7 +129,6 @@ def test_tokenize_and_process_tokens(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=1,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -192,7 +191,6 @@ def test_bco_trainer_without_providing_ref_model(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=4,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -230,7 +228,6 @@ def test_bco_trainer_udm(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=4,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -289,7 +286,6 @@ def test_bco_trainer_without_providing_ref_model_with_lora(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=4,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -330,7 +326,6 @@ def test_bco_trainer_generate_during_eval_no_wandb(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=1,
                learning_rate=9e-1,
                eval_strategy="steps",
@@ -376,7 +371,6 @@ def test_bco_lora_save(self):
                output_dir=tmp_dir,
                per_device_train_batch_size=2,
                max_steps=3,
-                remove_unused_columns=False,
                gradient_accumulation_steps=4,
                learning_rate=9e-1,
                eval_strategy="steps",
9 changes: 9 additions & 0 deletions trl/trainer/bco_trainer.py
@@ -46,6 +46,7 @@
from transformers.trainer_utils import EvalLoopOutput, has_length
from transformers.utils import is_peft_available

+from ..data_utils import maybe_apply_chat_template
from ..models import PreTrainedModelWrapper, create_reference_model
from .bco_config import BCOConfig
from .utils import (
@@ -562,6 +563,14 @@ def make_inputs_require_grad(module, input, output):
        self.embedding_tokenizer = embedding_tokenizer

        with PartialState().local_main_process_first():
+            # Apply the chat template if needed
+            train_dataset = train_dataset.map(
+                maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer}, num_proc=args.dataset_num_proc
+            )
+            if eval_dataset is not None:
+                eval_dataset = eval_dataset.map(
+                    maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer}, num_proc=args.dataset_num_proc
+                )
            # Shuffle the datasets
            train_dataset = train_dataset.shuffle(seed=args.data_seed)
            if eval_dataset is not None:
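
For context, `maybe_apply_chat_template` is a no-op on standard (string) examples and renders conversational (message-list) examples to text. A simplified sketch of that behavior, an approximation rather than the actual TRL implementation:

```python
# Simplified sketch of maybe_apply_chat_template's behavior for unpaired
# preference examples (assumption: not the exact TRL implementation).
def maybe_apply_chat_template_sketch(example, tokenizer):
    if not isinstance(example["prompt"], list):
        return example  # standard format: fields are already plain strings
    # Conversational format: render the messages with the chat template.
    prompt = tokenizer.apply_chat_template(
        example["prompt"], tokenize=False, add_generation_prompt=True
    )
    # Render prompt + completion together, then strip the prompt prefix so
    # that special tokens around the prompt are not duplicated.
    full = tokenizer.apply_chat_template(
        example["prompt"] + example["completion"], tokenize=False
    )
    return {**example, "prompt": prompt, "completion": full[len(prompt):]}
```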