
Add support for Deepseek V2 #2224

Merged
merged 1 commit into main from feature/deepseek-v2 on Jul 19, 2024
Conversation

@danieldk (Member) commented Jul 12, 2024

What does this PR do?

Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models (the routing, YaRN scaling, and attention padding are sketched below):

  • Grouped top-K in expert selection.
  • `mscale` in YaRN is calculated using the `mscale` and `mscale_all_dim` configuration options.
  • `mscale_all_dim` is also used to scale the attention softmax.
  • The query/key representations are permuted before applying rotary embeddings.
  • Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end, `{Weights,WeightsLoader}.get_weight` was added.
  • The query/key head dimensionality differs from that of the value, so we need to pad during attention.
  • Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
  • Shared experts.
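
For illustration, here is a minimal sketch of grouped top-K expert selection, assuming the usual Deepseek V2 scheme: each expert group is scored by its best expert, the best groups are kept, and the final top-K experts are chosen only from those groups. Names and shapes are illustrative, not the exact code in this PR.

```python
import torch


def grouped_topk(
    scores: torch.Tensor,  # [n_tokens, n_experts] router probabilities
    n_groups: int,
    topk_groups: int,
    topk: int,
):
    n_tokens, n_experts = scores.shape
    # Score each expert group by its best-scoring expert.
    group_scores = scores.view(n_tokens, n_groups, -1).max(dim=-1).values
    # Keep only the `topk_groups` best groups per token.
    group_idx = torch.topk(group_scores, k=topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)
    # Broadcast the group mask back to expert granularity and zero out
    # experts that live in the discarded groups.
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(n_tokens, n_groups, n_experts // n_groups)
        .reshape(n_tokens, n_experts)
    )
    masked_scores = scores.masked_fill(expert_mask == 0, 0.0)
    # Finally, pick the top-k experts among the surviving groups.
    weights, indices = torch.topk(masked_scores, k=topk, dim=-1)
    return weights, indices
```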
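
A similarly hedged sketch of how `mscale` and `mscale_all_dim` interact under YaRN, assuming the formula from the Deepseek V2 reference implementation (`0.1 * mscale * log(factor) + 1.0`); the softmax scale is corrected with `mscale_all_dim` squared because it affects both the query and key sides. Function names here are illustrative, not necessarily the ones used in this repository.

```python
import math


def yarn_get_mscale(factor: float, mscale: float) -> float:
    # YaRN magnitude correction; no correction below the original context length.
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def deepseek_v2_scales(
    qk_head_dim: int,       # e.g. 192 = 128 ("nope") + 64 (rotary)
    factor: float,          # rope_scaling["factor"]
    mscale: float,          # rope_scaling["mscale"]
    mscale_all_dim: float,  # rope_scaling["mscale_all_dim"]
):
    # The magnitude applied to the rotary embeddings combines both options.
    rope_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
    # The attention softmax scale is additionally corrected with
    # `mscale_all_dim` (squared: once for queries, once for keys).
    m = yarn_get_mscale(factor, mscale_all_dim)
    softmax_scale = qk_head_dim**-0.5 * m * m
    return rope_mscale, softmax_scale
```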
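
Finally, a hedged illustration of the query/key vs. value head-size mismatch: value heads (128) are zero-padded to the query/key head size (192) so a kernel that expects equal head sizes can be used, and the padding is sliced off the output afterwards. `scaled_dot_product_attention` merely stands in for the paged attention kernel; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F


def attend_with_padded_values(
    query: torch.Tensor,  # [n_tokens, n_heads, 192] ("nope" + rotary dims)
    key: torch.Tensor,    # [n_tokens, n_heads, 192]
    value: torch.Tensor,  # [n_tokens, n_heads, 128]
    v_head_dim: int = 128,
) -> torch.Tensor:
    qk_head_dim = query.shape[-1]
    # Zero-pad the value heads up to the query/key head size.
    value = F.pad(value, (0, qk_head_dim - value.shape[-1]))
    # Placeholder for the real (paged) attention kernel, which assumes
    # query, key, and value all share the same head size.
    out = F.scaled_dot_product_attention(
        query.transpose(0, 1), key.transpose(0, 1), value.transpose(0, 1)
    ).transpose(0, 1)
    # Drop the padded tail so the output matches the value head size again.
    return out[..., :v_head_dim]
```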

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@danieldk force-pushed the feature/deepseek-v2 branch 3 times, most recently from 0dbb6e9 to 032c479 on July 15, 2024 13:11
@danieldk marked this pull request as draft on July 15, 2024 15:27
@danieldk marked this pull request as ready for review on July 15, 2024 15:27
@drbh (Collaborator) left a comment

Looks great, just a couple of small syntax comments.

@danieldk force-pushed the feature/deepseek-v2 branch from 032c479 to e5cd109 on July 19, 2024 08:55
@danieldk force-pushed the feature/deepseek-v2 branch from e5cd109 to 836a2e2 on July 19, 2024 08:57
@danieldk requested a review from drbh on July 19, 2024 09:02
@drbh (Collaborator) left a comment

lgtm!

@danieldk merged commit e52be9b into main on Jul 19, 2024
9 checks passed
@danieldk deleted the feature/deepseek-v2 branch on July 19, 2024 15:23
ErikKaum pushed a commit that referenced this pull request Jul 25, 2024
ErikKaum pushed a commit that referenced this pull request Jul 26, 2024
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024