Introduce Knowledge Distillation Base #417
Summary
Thanks to the nice suggestions from @Tcc0403 and @hongpeng-guo. This PR is the first split from #408, focusing solely on introducing the Knowledge Distillation base class. As a result, this PR does not include any tests at the moment.
Code Changes
1. Refactor `beta` into two weights, `weight_hard_loss` and `weight_soft_loss`, used as the coefficients for `hard_loss` and `soft_loss`. @Tcc0403 also pointed out that we could use `torch.lerp` if applicable.
2. Pass `teacher_logits` and `student_logits` directly to the divergence loss function. This avoids redundant computations of converting logits to log probabilities and then reverting them to raw logits. However, note that we are not reusing the `student_log_probs` value calculated during `ce_loss` in the distillation base (see `get_batch_logps` in `test/utils.py`).
3. Modify `chunking` dimensions from `B` to `B * T`. Thanks to @hongpeng-guo's great advice.
4. Normalize the `distillation_loss` using `(full_target != ignore_index).sum()`. A minimal sketch of how items 1 and 4 fit together follows this list.
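Rough illustration only; the function below and its argument names are assumptions, not the PR's actual code. It shows the two weights blending the hard and soft losses (optionally via `torch.lerp`) and the normalization by the count of non-ignored target tokens.

```python
import torch

# Illustrative sketch; names are assumptions, not the PR's actual API.
def combine_and_normalize(hard_loss, soft_loss, weight_hard_loss,
                          weight_soft_loss, full_target, ignore_index=-100):
    # Weighted blend of the supervised (hard) and distillation (soft) losses.
    loss = weight_hard_loss * hard_loss + weight_soft_loss * soft_loss
    # If the two weights sum to 1, the same blend can be written as:
    # loss = torch.lerp(hard_loss, soft_loss, weight_soft_loss)
    # Normalize by the number of tokens that are not ignored.
    return loss / (full_target != ignore_index).sum()
```

Splitting the single `beta` into two explicit weights also permits blends whose coefficients do not sum to 1.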
.TODO
Knowledge Distillation
Knowledge Distillation (KD; Hinton et al. 2015, Gou et al. 2020) is a straightforward way to build a smaller, cheaper model (“student model”) to speed up inference by transferring skills from a pre-trained expensive model (“teacher model”) into the student.
In knowledge distillation, a student model is trained to replicate the outputs of a teacher model using a distillation loss. Neural networks typically include a softmax layer; for instance, a large language model produces a probability distribution over tokens. Let $z_t$ and $z_s$ represent the logits before the softmax layer for the teacher and student models, respectively. The distillation loss reduces the discrepancy between the two softmax outputs at a high temperature $T$. When ground truth labels $y$ are available, this approach can be combined with a supervised learning objective, such as cross-entropy, to compare the student's outputs with the ground truth. The combined loss function is defined as:
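(A standard form is shown here with the two weights introduced in this PR; the $T^2$ scaling from Hinton et al. and the generic divergence $D$ are assumptions rather than details taken from this PR's code.)

```math
\mathcal{L} = w_{\text{hard}} \cdot \mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big) \;+\; w_{\text{soft}} \cdot T^{2} \cdot D\big(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\big)
```

where $w_{\text{hard}}$ and $w_{\text{soft}}$ correspond to `weight_hard_loss` and `weight_soft_loss`, and $D$ is the divergence implemented by a subclass (e.g. KL or JSD).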
Here, we directly pass in `logits` rather than `logprobs`. @Tcc0403
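For instance, a soft loss that consumes raw logits might look like the sketch below; the function name and the choice of forward KL with temperature scaling are assumptions for illustration, not the divergence this PR ships.

```python
import torch.nn.functional as F

# Illustrative sketch: the soft loss takes raw logits from both models,
# so callers never convert logits to log-probabilities and back.
def soft_loss_from_logits(student_logits, teacher_logits, temperature=1.0):
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # log_target=True lets kl_div accept the teacher in log space as well.
    return F.kl_div(student_logp, teacher_logp,
                    reduction="sum", log_target=True)
```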
Shared DistillationBase
To support various distillation learning objectives, this PR aims to add a `LigerFusedLinearDistillationBase`, which is basically the same as proposed by @hongpeng-guo in this discussion: #371 (comment). Thank you @hongpeng-guo for thinking this through.
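To make the intended shape concrete, here is a rough sketch of such a base class. Every name, signature, and default below is an assumption for illustration and does not reflect the actual `LigerFusedLinearDistillationBase` API; in particular, the real class fuses the linear projection and chunks the computation, which this sketch does not.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only; names and signatures are assumptions.
class DistillationBaseSketch:
    """Blends a supervised hard loss with a subclass-defined soft loss."""

    def __init__(self, weight_hard_loss=0.5, weight_soft_loss=0.5,
                 ignore_index=-100, temperature=1.0):
        self.weight_hard_loss = weight_hard_loss
        self.weight_soft_loss = weight_soft_loss
        self.ignore_index = ignore_index
        self.temperature = temperature

    def distillation_loss(self, student_logits, teacher_logits):
        # Subclasses implement a concrete divergence (KL, JSD, ...) on logits.
        raise NotImplementedError

    def __call__(self, student_hidden, student_weight,
                 teacher_hidden, teacher_weight, target):
        # Flatten batch and sequence so downstream work runs over B * T rows.
        student_hidden = student_hidden.reshape(-1, student_hidden.size(-1))
        teacher_hidden = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
        target = target.reshape(-1)

        # Project hidden states to vocabulary logits; no gradient for teacher.
        student_logits = student_hidden @ student_weight.t()
        with torch.no_grad():
            teacher_logits = teacher_hidden @ teacher_weight.t()

        hard_loss = F.cross_entropy(student_logits, target,
                                    ignore_index=self.ignore_index,
                                    reduction="sum")
        soft_loss = self.distillation_loss(student_logits / self.temperature,
                                           teacher_logits / self.temperature)
        loss = (self.weight_hard_loss * hard_loss
                + self.weight_soft_loss * soft_loss)
        # Normalize by the number of non-ignored target tokens.
        return loss / (target != self.ignore_index).sum()
```

Under this sketch, a JSD-based variant would only need to override `distillation_loss`.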
Testing Done

I'll post JSD tests and benchmark results in the next PR.
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence