
feat: support add tokens to tokenizer. #498

Open · congchan wants to merge 6 commits into base: main from dev_additional_special_tokens
Conversation

@congchan (Contributor) commented Jun 6, 2023

To improve compatibility across models initialized from different open-source checkpoints, people may want to add tokens to the tokenizer for downstream tuning purposes.

For example, to improve our policy's adherence to our chat format, we may want to add ChatML tokens such as "<|system|>", "<|assistant|>", "<|user|>", and "<|end|>" to the policy tokenizer.

Special tokens are ignored by the decode phase of PPO, because decoding needs to skip certain special tokens such as EOS. Therefore, this PR only adds normal tokens.
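For context, a minimal sketch of the intended usage, written directly against the Hugging Face transformers API (the gpt2 checkpoint and the exact call sites are assumptions for illustration, not this PR's config plumbing):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add ChatML-style markers as normal tokens, so that PPO's decode phase
# (which skips special tokens such as EOS) still emits them.
num_added = tokenizer.add_tokens(["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"])

# The embedding matrix must grow to cover the newly assigned ids.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))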

@maxreciprocate (Collaborator) left a comment


Thanks Cong, that's a nice QoL improvement! There is one minor issue with it, however, which I hope you can resolve.

self.tokenizer.add_special_tokens(
    {"additional_special_tokens": self.additional_special_tokens}
)
self.model.base_model.resize_token_embeddings(len(self.tokenizer))
@maxreciprocate (Collaborator)

To improve compatibility with other modified tokenizers, I think it would be great if resizing happened by default, regardless of this if condition. Also, for PPO, the reference model/head should be resized likewise; otherwise, this error occurs:

../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [93,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/trlx/examples/ppo_sentiments.py", line 58, in <module>
    main(hparams)
  File "/trlx/examples/ppo_sentiments.py", line 47, in main
    trlx.train(
  File "/trlx/trlx/trlx.py", line 133, in train
    trainer.learn()
  File "/trlx/trlx/trainer/accelerate_base_trainer.py", line 506, in learn
    self.prepare_learning()
  File "trlx/trlx/trainer/accelerate_ppo_trainer.py", line 239, in prepare_learning
    self.make_experience(self.config.method.num_rollouts)
  File "/trlx/trlx/trainer/accelerate_ppo_trainer.py", line 427, in make_experience
    ref_logprobs = logprobs_of_labels(ref_logits[:, :-1, :], all_tokens[:, 1:])
  File "/trlx/trlx/utils/modeling.py", line 224, in logprobs_of_labels
    logprobs_labels = torch.gather(logprobs, dim=-1, index=labels.unsqueeze(-1))
RuntimeError: CUDA error: device-side assert triggered
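For intuition, the assert fires because token ids produced with the enlarged policy vocabulary are gathered against the reference model's unresized output. A minimal CPU sketch of the same out-of-bounds gather, with made-up sizes:

import torch

# Hypothetical sizes: the reference model still has a 10-token vocab,
# but the policy emitted token id 12 after new tokens were added.
logprobs = torch.log_softmax(torch.randn(1, 1, 10), dim=-1)
labels = torch.tensor([[12]])

# Raises "index out of bounds" on CPU; on CUDA the same mismatch
# surfaces as the device-side assert shown in the traceback above.
torch.gather(logprobs, dim=-1, index=labels.unsqueeze(-1))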

@congchan (Contributor, Author)

Thanks for your review, I will resolve it later.
My plan is:

  • if hydra heads are used (hasattr(self.model, "frozen_head")), resize self.model.frozen_head.decoder_blocks, as in the snippet below;
  • if not, just resize self.ref_model.

@congchan congchan changed the title from "feat: support add special tokens to tokenizer." to "feat: support add tokens to tokenizer." on Jun 8, 2023
@congchan congchan requested a review from maxreciprocate June 9, 2023 08:29
if hasattr(self.model, "frozen_head"):
    self.model.frozen_head.resize_token_embeddings(len(self.tokenizer))
else:
    # resize a reference model when hydra heads are not used
    self.ref_model.resize_token_embeddings(len(self.tokenizer))
@maxreciprocate (Collaborator) commented Jun 12, 2023

When hydra heads are not used, ref_model gets instantiated in AcceleratePPOTrainer, so maybe we can move this line there:

if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)
    self.ref_model.to(self.accelerator.device)
    self.ref_model.eval()
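Combined, the suggestion would read roughly like this inside AcceleratePPOTrainer (a sketch assuming the resize call slots in before the device move, not the final diff):

if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)
    # Keep the reference model's vocabulary in sync with the
    # (possibly extended) tokenizer before moving it to the device.
    self.ref_model.resize_token_embeddings(len(self.tokenizer))
    self.ref_model.to(self.accelerator.device)
    self.ref_model.eval()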

@congchan (Contributor, Author)

Yeah that's better.

@congchan congchan force-pushed the dev_additional_special_tokens branch from dcd45d5 to 134bbf9 on June 14, 2023 14:12
@congchan congchan requested a review from maxreciprocate June 15, 2023 06:59
congchan and others added 5 commits on August 11, 2023 18:43:

* Resize the model by default
* Adding special tokens is ignored by the decode phase of PPO, because it needs to skip certain special tokens such as EOS. Therefore, only add normal tokens.
* Move hydra heads' and ref_model's resize_token_embeddings calls to AcceleratePPOTrainer
@congchan congchan force-pushed the dev_additional_special_tokens branch from fd58c49 to e7fc3e3 on August 11, 2023 10:53