Add diffllama #34083

Open · wants to merge 57 commits into base: main
Changes from 49 commits (57 commits total)

Commits
3bd9e34
first adding diffllama
weak-kajuma Oct 11, 2024
269055e
add Diff Attention and other but still with errors
weak-kajuma Oct 11, 2024
dbbf073
complate make attention Diff-Attention
weak-kajuma Oct 16, 2024
c4ea9df
fix some bugs which may be caused by transformer-cli while adding model
weak-kajuma Oct 16, 2024
e072544
fix a bug caused by forgetting KV cache...
weak-kajuma Oct 16, 2024
674d7a2
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
9eac636
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
0e99dbd
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
1e445c7
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
cca6a5c
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
dd167af
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
23099cb
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
faac378
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 20, 2024
53e13aa
I found Attention missed implemented from paper still on e072544a3bfc…
weak-kajuma Oct 20, 2024
63b018a
re-implemented
weak-kajuma Oct 20, 2024
204bec8
adding groupnorm
weak-kajuma Oct 20, 2024
bce12e5
align with transformers code style
weak-kajuma Oct 20, 2024
44d8423
fix typo
weak-kajuma Oct 20, 2024
6dc6f81
adding groupnorm
weak-kajuma Oct 20, 2024
48b38e8
change SdpaAttention to DiffSdpaAttention
weak-kajuma Oct 20, 2024
997f561
fix bug
weak-kajuma Oct 20, 2024
107bd3c
Update src/transformers/models/diffllama/modeling_diffllama.py
weak-kajuma Oct 21, 2024
26307d9
fix bugs of places of "GroupNorm with scale" and etc
weak-kajuma Oct 21, 2024
22aa145
Revert "fix bugs of places of "GroupNorm with scale" and etc"
weak-kajuma Oct 21, 2024
cc472be
simplify multiple of attention (matmul) operations into one by repeat…
weak-kajuma Oct 22, 2024
e834129
simplify multiple of attention (matmul) operations into one by repeat…
weak-kajuma Oct 22, 2024
e9d94e5
simplify multiple of attention (matmul) operations into one by repeat…
weak-kajuma Oct 22, 2024
0352999
remove missed type
weak-kajuma Oct 22, 2024
843178a
add diffllama model_doc
weak-kajuma Oct 29, 2024
71c8d12
apply make style/quality
weak-kajuma Oct 29, 2024
fea95fa
apply review comment about model
weak-kajuma Oct 30, 2024
b3f8dd5
apply review comment about test
weak-kajuma Oct 30, 2024
50ce353
place diffllama alphabetically on the src/transformers/__init__.py
weak-kajuma Oct 30, 2024
6f25333
fix forgot code
weak-kajuma Oct 31, 2024
dd2282e
Supports parameters that are not initialized with standard deviation …
weak-kajuma Oct 31, 2024
9e7a9c3
add DiffLlamaConfig to CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPO…
weak-kajuma Oct 31, 2024
8c98d19
remove unused property of config
weak-kajuma Nov 1, 2024
cbf217d
add to supported model list
weak-kajuma Nov 1, 2024
c873982
add to spda supported model list
weak-kajuma Nov 1, 2024
b003a53
fix copyright, remove pretraining_tensor_parallel, and modify for ini…
weak-kajuma Nov 7, 2024
37c7a88
remove unused import and etc.
weak-kajuma Nov 7, 2024
ba92d5c
empty commit
weak-kajuma Nov 7, 2024
8cc823e
empty commit
weak-kajuma Nov 7, 2024
d47631d
empty commit
weak-kajuma Nov 7, 2024
c6932de
apply modular transformers but with bugs
weak-kajuma Nov 20, 2024
48e16cf
revert prev commit
weak-kajuma Dec 1, 2024
a44f95d
create src/transformers/model/diffllama/modular_diffllama.py
weak-kajuma Dec 1, 2024
c45aa59
run utils/modular_model_converter.py
weak-kajuma Dec 1, 2024
c5741eb
empty commit
weak-kajuma Dec 1, 2024
ea622ce
leaner modular diffllama
weak-kajuma Dec 6, 2024
e30c298
Merge branch 'huggingface:main' into add_diffllama
weak-kajuma Dec 6, 2024
3f85c22
remove more and more in modular_diffllama.pt
weak-kajuma Dec 6, 2024
87d034d
remove more and more in modular_diffllama.pt
weak-kajuma Dec 6, 2024
4660c6e
resolve missing docstring entries
weak-kajuma Dec 21, 2024
b4ff5f3
force reset
weak-kajuma Dec 21, 2024
484a493
Merge branch 'huggingface:main' into add_diffllama
weak-kajuma Dec 21, 2024
0ce2023
convert modular
weak-kajuma Dec 21, 2024
4 changes: 3 additions & 1 deletion docs/source/en/_toctree.yml
@@ -376,6 +376,8 @@
title: DeBERTa-v2
- local: model_doc/dialogpt
title: DialoGPT
- local: model_doc/diffllama
title: DiffLlama
- local: model_doc/distilbert
title: DistilBERT
- local: model_doc/dpr
@@ -969,4 +971,4 @@
- local: internal/time_series_utils
title: Utilities for Time Series
title: Internal Helpers
title: API
title: API
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -120,6 +120,7 @@ Flax), PyTorch, and/or TensorFlow.
| [DETA](model_doc/deta) | ✅ | ❌ | ❌ |
| [DETR](model_doc/detr) | ✅ | ❌ | ❌ |
| [DialoGPT](model_doc/dialogpt) | ✅ | ✅ | ✅ |
| [DiffLlama](model_doc/diffllama) | ✅ | ❌ | ❌ |
| [DiNAT](model_doc/dinat) | ✅ | ❌ | ❌ |
| [DINOv2](model_doc/dinov2) | ✅ | ❌ | ✅ |
| [DistilBERT](model_doc/distilbert) | ✅ | ✅ | ✅ |
59 changes: 59 additions & 0 deletions docs/source/en/model_doc/diffllama.md
@@ -0,0 +1,59 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# DiffLlama

## Overview

The DiffLlama model applies the differential attention mechanism proposed in [Differential Transformer](https://arxiv.org/abs/2410.05258) to the Llama architecture; this implementation was contributed by Kazuma Matsumoto.
In short, it combines the Llama model with the Differential Transformer's attention.

The abstract from the paper is the following:

*Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.*
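
Concretely, each attention head computes two softmax attention maps from two sets of query/key projections and subtracts them. The snippet below is a minimal, illustrative sketch of that computation for a single head; causal masking, the λ re-parameterization, the per-head GroupNorm, and the (1 − λ_init) scaling are omitted or simplified, so the names and shapes here are assumptions rather than the exact `modeling_diffllama.py` implementation.

```python
import torch
import torch.nn.functional as F


def differential_attention(q1, k1, q2, k2, v, lambda_val):
    """Single-head differential attention, illustrative only.

    q1, q2, k1, k2: (batch, seq_len, head_dim) -- two sets of query/key projections
    v:              (batch, seq_len, value_dim)
    lambda_val:     scalar weight applied to the second attention map
    """
    scale = q1.shape[-1] ** -0.5
    attn1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)  # first attention map
    attn2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)  # second attention map
    # Subtracting the two maps cancels attention allocated to irrelevant context.
    return (attn1 - lambda_val * attn2) @ v
```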

### Usage tips
The hyperparameters of this model are the same as those of the Llama model.
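
A minimal instantiation sketch follows; the configuration values are illustrative and no pretrained checkpoint is referenced.

```python
from transformers import DiffLlamaConfig, DiffLlamaForCausalLM

# Small, randomly initialized model purely for illustration.
config = DiffLlamaConfig(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=8,
)
model = DiffLlamaForCausalLM(config)
```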


## DiffLlamaConfig

[[autodoc]] DiffLlamaConfig

## DiffLlamaModel

[[autodoc]] DiffLlamaModel
- forward

## DiffLlamaForCausalLM

[[autodoc]] DiffLlamaForCausalLM
- forward

## DiffLlamaForSequenceClassification

[[autodoc]] DiffLlamaForSequenceClassification
- forward

## DiffLlamaForQuestionAnswering

[[autodoc]] DiffLlamaForQuestionAnswering
- forward

## DiffLlamaForTokenClassification

[[autodoc]] DiffLlamaForTokenClassification
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -43,6 +43,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
* [DiffLlama](https://huggingface.co/docs/transformers/model_doc/diffllama#transformers.DiffLlamaModel)
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
* [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
@@ -219,6 +220,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [data2vec_audio](https://huggingface.co/docs/transformers/main/en/model_doc/data2vec#transformers.Data2VecAudioModel)
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
* [DeiT](https://huggingface.co/docs/transformers/model_doc/deit#transformers.DeiTModel)
* [DiffLlama](https://huggingface.co/docs/transformers/model_doc/diffllama#transformers.DiffLlamaModel)
* [Dinov2](https://huggingface.co/docs/transformers/en/model_doc/dinov2)
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Dpr](https://huggingface.co/docs/transformers/model_doc/dpr#transformers.DprReader)
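
As context for these lists, an architecture that supports both backends can be loaded with an explicit `attn_implementation`. A hedged sketch for DiffLlama, where the checkpoint path is a placeholder rather than a released model:

```python
import torch
from transformers import AutoModelForCausalLM

# "path/to/diffllama-checkpoint" is a placeholder for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/diffllama-checkpoint",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # or "sdpa"
)
```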
20 changes: 20 additions & 0 deletions src/transformers/__init__.py
@@ -387,6 +387,7 @@
"models.depth_anything": ["DepthAnythingConfig"],
"models.detr": ["DetrConfig"],
"models.dialogpt": [],
"models.diffllama": ["DiffLlamaConfig"],
"models.dinat": ["DinatConfig"],
"models.dinov2": ["Dinov2Config"],
"models.distilbert": [
@@ -2072,6 +2073,16 @@
"DetrPreTrainedModel",
]
)
_import_structure["models.diffllama"].extend(
[
"DiffLlamaForCausalLM",
"DiffLlamaForQuestionAnswering",
"DiffLlamaForSequenceClassification",
"DiffLlamaForTokenClassification",
"DiffLlamaModel",
"DiffLlamaPreTrainedModel",
]
)
_import_structure["models.dinat"].extend(
[
"DinatBackbone",
@@ -5221,6 +5232,7 @@
)
from .models.depth_anything import DepthAnythingConfig
from .models.detr import DetrConfig
from .models.diffllama import DiffLlamaConfig
from .models.dinat import DinatConfig
from .models.dinov2 import Dinov2Config
from .models.distilbert import (
@@ -6824,6 +6836,14 @@
DetrModel,
DetrPreTrainedModel,
)
from .models.diffllama import (
DiffLlamaForCausalLM,
DiffLlamaForQuestionAnswering,
DiffLlamaForSequenceClassification,
DiffLlamaForTokenClassification,
DiffLlamaModel,
DiffLlamaPreTrainedModel,
)
from .models.dinat import (
DinatBackbone,
DinatForImageClassification,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -71,6 +71,7 @@
depth_anything,
detr,
dialogpt,
diffllama,
dinat,
dinov2,
distilbert,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -87,6 +87,7 @@
("depth_anything", "DepthAnythingConfig"),
("deta", "DetaConfig"),
("detr", "DetrConfig"),
("diffllama", "DiffLlamaConfig"),
("dinat", "DinatConfig"),
("dinov2", "Dinov2Config"),
("distilbert", "DistilBertConfig"),
@@ -385,6 +386,7 @@
("deta", "DETA"),
("detr", "DETR"),
("dialogpt", "DialoGPT"),
("diffllama", "DiffLlama"),
("dinat", "DiNAT"),
("dinov2", "DINOv2"),
("distilbert", "DistilBERT"),
5 changes: 5 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -86,6 +86,7 @@
("deit", "DeiTModel"),
("deta", "DetaModel"),
("detr", "DetrModel"),
("diffllama", "DiffLlamaModel"),
("dinat", "DinatModel"),
("dinov2", "Dinov2Model"),
("distilbert", "DistilBertModel"),
@@ -477,6 +478,7 @@
("ctrl", "CTRLLMHeadModel"),
("data2vec-text", "Data2VecTextForCausalLM"),
("dbrx", "DbrxForCausalLM"),
("diffllama", "DiffLlamaForCausalLM"),
("electra", "ElectraForCausalLM"),
("ernie", "ErnieForCausalLM"),
("falcon", "FalconForCausalLM"),
@@ -928,6 +930,7 @@
("data2vec-text", "Data2VecTextForSequenceClassification"),
("deberta", "DebertaForSequenceClassification"),
("deberta-v2", "DebertaV2ForSequenceClassification"),
("diffllama", "DiffLlamaForSequenceClassification"),
("distilbert", "DistilBertForSequenceClassification"),
("electra", "ElectraForSequenceClassification"),
("ernie", "ErnieForSequenceClassification"),
@@ -1021,6 +1024,7 @@
("data2vec-text", "Data2VecTextForQuestionAnswering"),
("deberta", "DebertaForQuestionAnswering"),
("deberta-v2", "DebertaV2ForQuestionAnswering"),
("diffllama", "DiffLlamaForQuestionAnswering"),
("distilbert", "DistilBertForQuestionAnswering"),
("electra", "ElectraForQuestionAnswering"),
("ernie", "ErnieForQuestionAnswering"),
@@ -1114,6 +1118,7 @@
("data2vec-text", "Data2VecTextForTokenClassification"),
("deberta", "DebertaForTokenClassification"),
("deberta-v2", "DebertaV2ForTokenClassification"),
("diffllama", "DiffLlamaForTokenClassification"),
("distilbert", "DistilBertForTokenClassification"),
("electra", "ElectraForTokenClassification"),
("ernie", "ErnieForTokenClassification"),
7 changes: 7 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -167,6 +167,13 @@
"DebertaV2TokenizerFast" if is_tokenizers_available() else None,
),
),
(
"diffllama",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("distilbert", ("DistilBertTokenizer", "DistilBertTokenizerFast" if is_tokenizers_available() else None)),
(
"dpr",
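
Given the mapping added above, `AutoTokenizer` should resolve a DiffLlama checkpoint to the Llama tokenizer classes; a hedged sketch with a placeholder path:

```python
from transformers import AutoTokenizer

# Placeholder path; per the mapping above this should resolve to
# LlamaTokenizerFast (or LlamaTokenizer if only sentencepiece is available).
tokenizer = AutoTokenizer.from_pretrained("path/to/diffllama-checkpoint")
```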
27 changes: 27 additions & 0 deletions src/transformers/models/diffllama/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_diffllama import *
from .modeling_diffllama import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)