
Add support for Deepseek V2 #2224

Merged
merged 1 commit into main from feature/deepseek-v2 on Jul 19, 2024
Conversation

@danieldk (Member) commented Jul 12, 2024

What does this PR do?

Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models (the routing, YaRN scaling, and attention padding are sketched below):

  • Grouped top-K in expert selection.
  • `mscale` in YaRN is calculated using the `mscale` and `mscale_all_dim` configuration options.
  • `mscale_all_dim` is also used to scale the attention softmax.
  • The query/key representations are permuted before applying rotary embeddings.
  • Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`), so we need weight loading that supports quantized weights. To this end, `{Weights,WeightsLoader}.get_weight` was added.
  • The query/key head dimensionality differs from that of the value, so we need to pad during attention.
  • Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size.
  • Shared experts.
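
For illustration, here is a minimal sketch of grouped top-K expert selection, assuming the usual Deepseek V2 scheme: each expert group is scored by its best expert, the best groups are kept, and the final top-K experts are chosen only from those groups. Names and shapes are illustrative, not the exact code in this PR.

```python
import torch


def grouped_topk(
    scores: torch.Tensor,  # [n_tokens, n_experts] router probabilities
    n_groups: int,
    topk_groups: int,
    topk: int,
):
    n_tokens, n_experts = scores.shape
    # Score each expert group by its best-scoring expert.
    group_scores = scores.view(n_tokens, n_groups, -1).max(dim=-1).values
    # Keep only the `topk_groups` best groups per token.
    group_idx = torch.topk(group_scores, k=topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)
    # Broadcast the group mask back to expert granularity and zero out
    # experts that live in the discarded groups.
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(n_tokens, n_groups, n_experts // n_groups)
        .reshape(n_tokens, n_experts)
    )
    masked_scores = scores.masked_fill(expert_mask == 0, 0.0)
    # Finally, pick the top-k experts among the surviving groups.
    weights, indices = torch.topk(masked_scores, k=topk, dim=-1)
    return weights, indices
```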
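
A similarly hedged sketch of how `mscale` and `mscale_all_dim` interact under YaRN, assuming the formula from the Deepseek V2 reference implementation (`0.1 * mscale * log(factor) + 1.0`); the softmax scale is corrected with `mscale_all_dim` squared because it affects both the query and key sides. Function names here are illustrative, not necessarily the ones used in this repository.

```python
import math


def yarn_get_mscale(factor: float, mscale: float) -> float:
    # YaRN magnitude correction; no correction below the original context length.
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def deepseek_v2_scales(
    qk_head_dim: int,       # e.g. 192 = 128 ("nope") + 64 (rotary)
    factor: float,          # rope_scaling["factor"]
    mscale: float,          # rope_scaling["mscale"]
    mscale_all_dim: float,  # rope_scaling["mscale_all_dim"]
):
    # The magnitude applied to the rotary embeddings combines both options.
    rope_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
    # The attention softmax scale is additionally corrected with
    # `mscale_all_dim` (squared: once for queries, once for keys).
    m = yarn_get_mscale(factor, mscale_all_dim)
    softmax_scale = qk_head_dim**-0.5 * m * m
    return rope_mscale, softmax_scale
```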
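
Finally, a hedged illustration of the query/key vs. value head-size mismatch: value heads (128) are zero-padded to the query/key head size (192) so a kernel that expects equal head sizes can be used, and the padding is sliced off the output afterwards. `scaled_dot_product_attention` merely stands in for the paged attention kernel; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F


def attend_with_padded_values(
    query: torch.Tensor,  # [n_tokens, n_heads, 192] ("nope" + rotary dims)
    key: torch.Tensor,    # [n_tokens, n_heads, 192]
    value: torch.Tensor,  # [n_tokens, n_heads, 128]
    v_head_dim: int = 128,
) -> torch.Tensor:
    qk_head_dim = query.shape[-1]
    # Zero-pad the value heads up to the query/key head size.
    value = F.pad(value, (0, qk_head_dim - value.shape[-1]))
    # Placeholder for the real (paged) attention kernel, which assumes
    # query, key, and value all share the same head size.
    out = F.scaled_dot_product_attention(
        query.transpose(0, 1), key.transpose(0, 1), value.transpose(0, 1)
    ).transpose(0, 1)
    # Drop the padded tail so the output matches the value head size again.
    return out[..., :v_head_dim]
```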

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@danieldk force-pushed the feature/deepseek-v2 branch 3 times, most recently from 0dbb6e9 to 032c479 on July 15, 2024 13:11
@danieldk marked this pull request as draft on July 15, 2024 15:27
@danieldk marked this pull request as ready for review on July 15, 2024 15:27
@drbh (Collaborator) left a comment

Looks great, just a couple of small syntax comments.

@danieldk force-pushed the feature/deepseek-v2 branch from 032c479 to e5cd109 on July 19, 2024 08:55
@danieldk force-pushed the feature/deepseek-v2 branch from e5cd109 to 836a2e2 on July 19, 2024 08:57
@danieldk requested a review from drbh on July 19, 2024 09:02
@drbh (Collaborator) left a comment

lgtm!

@danieldk merged commit e52be9b into main on Jul 19, 2024
9 checks passed
@danieldk deleted the feature/deepseek-v2 branch on July 19, 2024 15:23
ErikKaum pushed a commit that referenced this pull request Jul 25, 2024
ErikKaum pushed a commit that referenced this pull request Jul 26, 2024
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Sep 26, 2024