
Commit ad597db

Fix misleading variable "epoch" in the training loop of the PPOTrainer doc (#1171)

* Fix misleading variable "epoch" in the PPOTrainer doc.

The use of the variable "epoch" in the original doc is misleading: the dataloader holds the data for a single epoch, not for all epochs, so in
"for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader))"
the variable does not actually store the epoch number; it stores the batch index (a toy sketch below illustrates this).

The corrected version comes from the TRL PPO notebook tutorial
(https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb), which uses an outer loop over the epochs.

I also posted the question on the forum: https://discuss.huggingface.co/t/confusing-and-possibly-misleading-ppo-trainer-code-from-trl-api-doc-tutorial/67531

* Remove batch_id
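
As a quick illustration of the point above, a toy sketch (not TRL code; a plain list stands in for `ppo_trainer.dataloader`): `enumerate` numbers the batches of a single pass over the data, so the first value it yields is a batch index rather than an epoch counter.

```py
# Toy sketch: a plain list standing in for ppo_trainer.dataloader (assumption,
# not TRL code). enumerate() counts batches within one pass over the data, so
# the first value it yields is the batch index, not the epoch number.
toy_dataloader = [{"input_ids": [1, 2]}, {"input_ids": [3, 4]}, {"input_ids": [5, 6]}]

for epoch, batch in enumerate(toy_dataloader):
    print(epoch)  # prints 0, 1, 2 -- batch indices within a single epoch
```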
Jfhseh authored Jan 8, 2024
1 parent d57d0f9 commit ad597db
Showing 1 changed file with 17 additions and 17 deletions.
34 changes: 17 additions & 17 deletions docs/source/ppo_trainer.mdx
@@ -115,22 +115,22 @@ We can then loop over all examples in the dataset and generate a response for ea

```py
from tqdm import tqdm

-for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
-    query_tensors = batch["input_ids"]
-
-    #### Get response from SFTModel
-    response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
-    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
-
-    #### Compute reward score
-    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
-    pipe_outputs = reward_model(texts)
-    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
-
-    #### Run PPO step
-    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
-    ppo_trainer.log_stats(stats, batch, rewards)
+for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
+    for batch in tqdm(ppo_trainer.dataloader):
+        query_tensors = batch["input_ids"]
+        #### Get response from SFTModel
+        response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
+        batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
+        #### Compute reward score
+        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
+        pipe_outputs = reward_model(texts)
+        rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
+        #### Run PPO step
+        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
+        ppo_trainer.log_stats(stats, batch, rewards)

#### Save model
ppo_trainer.save_model("my_ppo_model")
@@ -148,4 +148,4 @@ While training and evaluating we log the following metrics:

[[autodoc]] PPOTrainer

-[[autodoc]] PPOConfig
+[[autodoc]] PPOConfig
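
For readers trying the updated snippet outside the full doc: the loop assumes `reward_model`, `tokenizer`, and `generation_kwargs` are defined in earlier sections of `ppo_trainer.mdx`. A minimal sketch of one plausible setup, modeled on the TRL sentiment examples; the checkpoint names and keyword arguments here are assumptions, not part of this commit:

```py
from transformers import AutoTokenizer, pipeline

# Assumed setup, modeled on the TRL sentiment examples; not part of this commit.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

# A sentiment classifier used as the reward model. top_k=None returns a
# {label, score} dict for every label instead of only the top prediction, and
# function_to_apply="none" returns raw logits rather than softmax probabilities.
# Depending on the transformers version the per-label dicts may be sorted by
# score, so selecting the "POSITIVE" entry by label is safer than by position.
reward_model = pipeline(
    "text-classification",
    model="lvwerra/distilbert-imdb",
    top_k=None,
    function_to_apply="none",
)

# Example sampling settings passed through to ppo_trainer.generate(); adjust as needed.
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 32,
}
```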
