Introduce Knowledge Distillation Base #417
Summary
Thanks to the nice suggestions from @Tcc0403 and @hongpeng-guo. This PR is the first split from #408, focusing solely on introducing the Knowledge Distillation base class. As a result, this PR does not include any tests at the moment.
Code Changes
1. Refactor `beta` into two weights, `weight_hard_loss` and `weight_soft_loss`, used as the coefficients for `hard_loss` and `soft_loss`. @Tcc0403 also pointed out that we could use `torch.lerp` if applicable.
2. Pass `teacher_logits` and `student_logits` directly to the divergence loss function. This avoids redundant computations of converting logits to log probabilities and then reverting them to raw logits. However, note that we are not reusing the `student_log_probs` value calculated during `ce_loss` in the distillation base (see `get_batch_logps` in `test/utils.py`).
3. Modify `chunking` dimensions from `B` to `B * T`. Thanks to @hongpeng-guo's great advice.
4. Normalize the `distillation_loss` using `(full_target != ignore_index).sum()`. A minimal sketch of how items 1 and 4 fit together follows this list.
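Rough illustration only; the function below and its argument names are assumptions, not the PR's actual code. It shows the two weights blending the hard and soft losses (optionally via `torch.lerp`) and the normalization by the count of non-ignored target tokens.

```python
import torch

# Illustrative sketch; names are assumptions, not the PR's actual API.
def combine_and_normalize(hard_loss, soft_loss, weight_hard_loss,
                          weight_soft_loss, full_target, ignore_index=-100):
    # Weighted blend of the supervised (hard) and distillation (soft) losses.
    loss = weight_hard_loss * hard_loss + weight_soft_loss * soft_loss
    # If the two weights sum to 1, the same blend can be written as:
    # loss = torch.lerp(hard_loss, soft_loss, weight_soft_loss)
    # Normalize by the number of tokens that are not ignored.
    return loss / (full_target != ignore_index).sum()
```

Splitting the single `beta` into two explicit weights also permits blends whose coefficients do not sum to 1.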
.TODO
Knowledge Distillation
Knowledge Distillation (KD; Hinton et al. 2015, Gou et al. 2020) is a straightforward way to build a smaller, cheaper model (“student model”) to speed up inference by transferring skills from a pre-trained expensive model (“teacher model”) into the student.
In knowledge distillation, a student model is trained to replicate the outputs of a teacher model using a distillation loss. Neural networks typically include a softmax layer; for instance, a large language model produces a probability distribution over tokens. Let $z_t$ and $z_s$ represent the logits before the softmax layer for the teacher and student models, respectively. The distillation loss reduces the discrepancy between the two softmax outputs at a high temperature $T$. When ground truth labels $y$ are available, this approach can be combined with a supervised learning objective, such as cross-entropy, to compare the student's outputs with the ground truth. The combined loss function is defined as:
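(A standard form is shown here with the two weights introduced in this PR; the $T^2$ scaling from Hinton et al. and the generic divergence $D$ are assumptions rather than details taken from this PR's code.)

```math
\mathcal{L} = w_{\text{hard}} \cdot \mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big) \;+\; w_{\text{soft}} \cdot T^{2} \cdot D\big(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\big)
```

where $w_{\text{hard}}$ and $w_{\text{soft}}$ correspond to `weight_hard_loss` and `weight_soft_loss`, and $D$ is the divergence implemented by a subclass (e.g. KL or JSD).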
Here, we directly pass in `logits` rather than `logprobs`. @Tcc0403
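For instance, a soft loss that consumes raw logits might look like the sketch below; the function name and the choice of forward KL with temperature scaling are assumptions for illustration, not the divergence this PR ships.

```python
import torch.nn.functional as F

# Illustrative sketch: the soft loss takes raw logits from both models,
# so callers never convert logits to log-probabilities and back.
def soft_loss_from_logits(student_logits, teacher_logits, temperature=1.0):
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # log_target=True lets kl_div accept the teacher in log space as well.
    return F.kl_div(student_logp, teacher_logp,
                    reduction="sum", log_target=True)
```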
Shared DistillationBase
To support various distillation learning objectives, this PR aims to add a `LigerFusedLinearDistillationBase`, which is basically the same as proposed by @hongpeng-guo in this discussion: #371 (comment). Thank you @hongpeng-guo for thinking this through.
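To make the intended shape concrete, here is a rough sketch of such a base class. Every name, signature, and default below is an assumption for illustration and does not reflect the actual `LigerFusedLinearDistillationBase` API; in particular, the real class fuses the linear projection and chunks the computation, which this sketch does not.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only; names and signatures are assumptions.
class DistillationBaseSketch:
    """Blends a supervised hard loss with a subclass-defined soft loss."""

    def __init__(self, weight_hard_loss=0.5, weight_soft_loss=0.5,
                 ignore_index=-100, temperature=1.0):
        self.weight_hard_loss = weight_hard_loss
        self.weight_soft_loss = weight_soft_loss
        self.ignore_index = ignore_index
        self.temperature = temperature

    def distillation_loss(self, student_logits, teacher_logits):
        # Subclasses implement a concrete divergence (KL, JSD, ...) on logits.
        raise NotImplementedError

    def __call__(self, student_hidden, student_weight,
                 teacher_hidden, teacher_weight, target):
        # Flatten batch and sequence so downstream work runs over B * T rows.
        student_hidden = student_hidden.reshape(-1, student_hidden.size(-1))
        teacher_hidden = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
        target = target.reshape(-1)

        # Project hidden states to vocabulary logits; no gradient for teacher.
        student_logits = student_hidden @ student_weight.t()
        with torch.no_grad():
            teacher_logits = teacher_hidden @ teacher_weight.t()

        hard_loss = F.cross_entropy(student_logits, target,
                                    ignore_index=self.ignore_index,
                                    reduction="sum")
        soft_loss = self.distillation_loss(student_logits / self.temperature,
                                           teacher_logits / self.temperature)
        loss = (self.weight_hard_loss * hard_loss
                + self.weight_soft_loss * soft_loss)
        # Normalize by the number of non-ignored target tokens.
        return loss / (target != self.ignore_index).sum()
```

Under this sketch, a JSD-based variant would only need to override `distillation_loss`.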
Testing Done

I'll post JSD tests and benchmark results in the next PR.
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence