Dpo penalty update #138

Eugene-hu · 2023-08-25T18:55:12Z

Adds additional checks for empty strings and very short completions.
Adds a penalty for repeated tokens, akin to Hugging Face's repetition penalty during generation (https://github.com/huggingface/transformers/blob/v4.32.0/src/transformers/generation/logits_process.py#L279).
The penalty deters repeated tokens and grows heavier each time a token recurs.
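
For readers skimming the thread, here is a minimal sketch of how a count-scaled repetition penalty could work. It is not the code from this PR: the function name `apply_repeat_penalty`, the plain-list logits, and the choice of raising the penalty to the power of the occurrence count are assumptions made for illustration. Only the sign convention (multiply negative scores, divide positive ones) follows the linked Hugging Face processor.

```python
from collections import Counter


def apply_repeat_penalty(logits, previous_tokens, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens penalised.

    Hypothetical helper: each token id in `previous_tokens` is penalised by
    penalty ** count, so a token seen twice is pushed down further than a
    token seen once.
    """
    penalised = list(logits)
    for token_id, count in Counter(previous_tokens).items():
        factor = penalty ** count  # assumption: penalty compounds per occurrence
        score = penalised[token_id]
        # Negative scores are multiplied and positive scores divided, so the
        # score always moves down regardless of its sign.
        penalised[token_id] = score * factor if score < 0 else score / factor
    return penalised


logits = [2.0, -1.0, 0.5, 3.0]
once = apply_repeat_penalty(logits, [0])       # token 0 seen once
twice = apply_repeat_penalty(logits, [0, 0])   # token 0 seen twice
print(logits[0], once[0], twice[0])
```

In this sketch, tokens that never appeared keep their original score, and the input list is left unmodified.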

shibshib · 2023-08-25T19:01:37Z

openvalidators/reward/dpo.py

@@ -33,17 +33,23 @@ def name(self) -> str: return RewardModelType.dpo.value
    def __init__(self, device: str):
        super().__init__()
        self.device = device
+        self.penalty = 1.2


Do we have any data for why we picked 1.2 for the penalty?

It is the same default parameter used by Hugging Face and was taken from this paper (https://arxiv.org/pdf/2305.14314.pdf).

I think adding this reference to the code as a one-line comment could help clarify future doubts.

p-ferreira · 2023-08-25T19:12:24Z

openvalidators/reward/dpo.py

+
+            # Check if completion is 
+            if completion.strip() == '' or len(completion) <= 5:
+                return -11 # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)


I'm not sure I understand why it is -11. Could you elaborate so I can follow the reasoning?

exp(-11) is the base value given to empty or short responses; -11 is the largest integer whose exponential is below the uniform probability across all logits (1/50257, roughly 2e-5).
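
The arithmetic behind that choice can be checked with a few lines. This is only a worked check of the numbers quoted in the comment; the variable names are illustrative and do not come from the PR.

```python
import math

vocab_size = 50257                      # vocabulary size quoted in the comment
uniform_prob = 1 / vocab_size           # probability of one token under a uniform distribution
log_uniform = math.log(uniform_prob)    # natural log of that probability, between -11 and -10

# exp(-10) is still above the uniform probability, exp(-11) is below it,
# so -11 is the first integer log-value under the uniform baseline.
print(math.exp(-10) > uniform_prob)
print(math.exp(-11) < uniform_prob)
print(math.floor(log_uniform))
```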

Would it be feasible to calculate this at runtime from something like 1 / model.vocab_size? That way the value would be computed dynamically and the code would not depend on the model used.

Yes, this can be done in a future update, and it will be necessary if the DPO model's tokenizer is changed to something non-standard.
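
A rough sketch of what that dynamic calculation might look like, assuming the vocabulary size is available as an integer. The function name `short_completion_reward` is hypothetical, and how the size is read from the model or tokenizer is deliberately left out.

```python
import math


def short_completion_reward(vocab_size: int) -> int:
    """Largest integer n such that exp(n) < 1 / vocab_size.

    Hypothetical helper: generalises the hard-coded -11 to any
    vocabulary size instead of assuming 50257 tokens.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    # log(1 / vocab_size) is never an integer for vocab_size >= 2,
    # so flooring it gives a value strictly below the uniform baseline.
    return math.floor(-math.log(vocab_size))


print(short_completion_reward(50257))
```

With a vocabulary of 50257 tokens this reproduces the -11 used in the diff; a larger vocabulary would yield a more negative value.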

Got it. I will create an issue for that so we don't lose track of it.

Eugene-hu added 2 commits August 25, 2023 11:41

penalty update

2c10303

dpo 1.2

e8a65f2

Eugene-hu requested review from shibshib, steffencruz, p-ferreira, isabella618033 and opentaco August 25, 2023 18:55

shibshib approved these changes Aug 25, 2023

p-ferreira reviewed Aug 25, 2023

p-ferreira approved these changes Aug 25, 2023

comment

b156a6d

Eugene-hu merged commit bd315ec into staging Aug 25, 2023

This was referenced Aug 28, 2023

Adjust vocab size calculation of DPO model to be dynamic #141

Open

V.1.2.0 Release #142

Merged


