feat: support add tokens to tokenizer. #498
base: main
Conversation
Thanks Cong, that's a nice QoL improvement! There is one minor issue with it, though, which I hope you can resolve.
self.tokenizer.add_special_tokens(
    {"additional_special_tokens": self.additional_special_tokens}
)
self.model.base_model.resize_token_embeddings(len(self.tokenizer))
To improve compatibility with other modified tokenizers, I think it would be great if resizing happened by default, regardless of this if condition. Also, for PPO, the reference model/head should be resized likewise; otherwise, this error occurs:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [93,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "/trlx/examples/ppo_sentiments.py", line 58, in <module>
main(hparams)
File "/trlx/examples/ppo_sentiments.py", line 47, in main
trlx.train(
File "/trlx/trlx/trlx.py", line 133, in train
trainer.learn()
File "/trlx/trlx/trainer/accelerate_base_trainer.py", line 506, in learn
self.prepare_learning()
File "trlx/trlx/trainer/accelerate_ppo_trainer.py", line 239, in prepare_learning
self.make_experience(self.config.method.num_rollouts)
File "/trlx/trlx/trainer/accelerate_ppo_trainer.py", line 427, in make_experience
ref_logprobs = logprobs_of_labels(ref_logits[:, :-1, :], all_tokens[:, 1:])
File "/trlx/trlx/utils/modeling.py", line 224, in logprobs_of_labels
logprobs_labels = torch.gather(logprobs, dim=-1, index=labels.unsqueeze(-1))
RuntimeError: CUDA error: device-side assert triggered
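For illustration, here is a minimal self-contained reproduction of the mismatch (gpt2 is only a stand-in checkpoint): once the tokenizer hands out a new token id, any model that was not resized will index past its embedding table, which surfaces as the device-side assert above on CUDA (or an IndexError on CPU).

```python
import copy
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = copy.deepcopy(policy)

tokenizer.add_tokens(["<|user|>"])              # vocabulary grows by one id
policy.resize_token_embeddings(len(tokenizer))  # policy embedding table grows
# ref_model is deliberately *not* resized here

ids = tokenizer("<|user|> hello", return_tensors="pt").input_ids
policy(ids)     # fine: the new id fits in the resized embedding table
ref_model(ids)  # IndexError on CPU, device-side assert on CUDA
```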
Thanks for your review, I will resolve it later.
My plan is:
- if hydra heads are used (hasattr(self.model, "frozen_head")), then resize self.model.frozen_head.decoder_blocks,
- if not, just resize self.ref_model.
    self.model.frozen_head.resize_token_embeddings(len(self.tokenizer))
else:
    # resize a reference model when hydra heads are not used
    self.ref_model.resize_token_embeddings(len(self.tokenizer))
When hydra heads are not used, ref_model gets instantiated in AcceleratePPOTrainer, so maybe we can move this line there:
trlx/trlx/trainer/accelerate_ppo_trainer.py
Lines 71 to 74 in 404217b
if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)
    self.ref_model.to(self.accelerator.device)
    self.ref_model.eval()
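A sketch of what that could look like (not the final code; it assumes self.tokenizer is already set up at this point in the trainer's __init__):

```python
if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)
    # keep the reference model's vocabulary in sync with the (possibly
    # extended) tokenizer before moving it to the device
    self.ref_model.resize_token_embeddings(len(self.tokenizer))
    self.ref_model.to(self.accelerator.device)
    self.ref_model.eval()
```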
Yeah that's better.
Force-pushed from dcd45d5 to 134bbf9:
* Resize the model by default.
* Only add normal tokens: tokens added as special tokens would be dropped by PPO's decode phase, which has to skip certain special tokens such as the EOS token.
Move the hydra head's and ref_model's resize_token_embeddings calls to AcceleratePPOTrainer.
Force-pushed from fd58c49 to e7fc3e3.
To improve the compatibility of models initialized from different open-source checkpoints, people may want to add tokens for better downstream tuning. For example, to improve our policy's adherence to our chat format, we may want to add ChatML tokens such as "<|system|>", "<|assistant|>", "<|user|>", and "<|end|>" to the policy tokenizer.
Adding them as special tokens is not an option here, because the decode phase of PPO needs to skip certain special tokens, such as the EOS token, and would drop these as well. Therefore this PR only adds normal tokens.
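As a small sketch of that distinction (gpt2 is only a placeholder checkpoint): tokens registered as special tokens are dropped when decoding with skip_special_tokens=True, which the PPO decode phase relies on, while tokens added as normal tokens survive decoding.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# registered as a special token: removed by skip_special_tokens=True
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>"]})
# added as a normal token: kept in the decoded text
tokenizer.add_tokens(["<|user|>"])

ids = tokenizer("<|system|> hi <|user|>").input_ids
print(tokenizer.decode(ids, skip_special_tokens=True))
# "<|system|>" is dropped from the output, while "<|user|>" is kept
```

Either way, the policy (and, for PPO, the reference model or frozen hydra head) still needs resize_token_embeddings(len(tokenizer)) afterwards.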