
convert-diff-transformer CLI command / codepath #2197

Draft
wants to merge 25 commits into main from diff-transformer
Conversation

@djsaunde djsaunde commented Dec 17, 2024

Description

This PR implements the differential attention layer from the Differential Transformer paper.

Motivation and Context

We wanted to add this attention implementation to axolotl so users can swap out the existing attention layers in their models for this more performant version. We matched the official implementation details as closely as possible while adapting it to play nicely with the transformers attention implementations.

Since we were focused on converting existing LLMs to use these differential attention layers, we wanted the conversion to avoid degrading the performance of the (possibly pre-trained) LLM.

To this end, the conversion process doubles the dimensionality of the query and key projections (since differential attention requires both a positive and a negative attention component), copies the weights from the original attention modules into the positive components, and (optionally; pass --zero-init) initializes the weights of the negative components to zero.

When doing this, the converted network computes the same function as the original (pass --debug to confirm this), but may suffer from a vanishing gradient problem. The default behavior is thus to initialize the weights of the negative components of the differential attention layers to 0-centered normally distributed values with a small variance.
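
As a rough illustration of that widening-and-init step for a single attention module (a minimal sketch assuming bias-free q_proj/k_proj Linear layers; the function name and attribute layout here are illustrative, not the PR's actual code):

import torch
import torch.nn as nn

def widen_qk_projections(attn: nn.Module, zero_init: bool = False, std: float = 0.01) -> None:
    """Illustrative only: double the Q/K projections so the layer has a positive
    and a negative attention component, copying the pretrained weights into the
    positive half."""
    out_q, out_k = attn.q_proj.out_features, attn.k_proj.out_features

    new_q = nn.Linear(attn.q_proj.in_features, 2 * out_q, bias=False)
    new_k = nn.Linear(attn.k_proj.in_features, 2 * out_k, bias=False)

    with torch.no_grad():
        # Positive components: copy the original weights verbatim.
        new_q.weight[:out_q].copy_(attn.q_proj.weight)
        new_k.weight[:out_k].copy_(attn.k_proj.weight)

        if zero_init:
            # --zero-init: zero the negative components (see the discussion of
            # exact equivalence and vanishing gradients above).
            new_q.weight[out_q:].zero_()
            new_k.weight[out_k:].zero_()
        else:
            # Default: small 0-centered normal init for the negative components.
            new_q.weight[out_q:].normal_(mean=0.0, std=std)
            new_k.weight[out_k:].normal_(mean=0.0, std=std)

    attn.q_proj, attn.k_proj = new_q, new_k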

Relevant links:

How has this been tested?

Tested with SmolLM2-135M on an A40 RunPod instance on this feature branch. The workflow was:

  • Convert the model to use either eager or SDPA differential attention
    • With and without the --zero-init and --debug flags, to sanity-check exact model conversion (completions, logits, losses)
  • Run the new axolotl evaluate command on the small mhenrichsen/alpaca_2k_test dataset with both the original and converted models and check that their evaluation metrics match (an example invocation follows the conversion log below)

For example:

$ axolotl convert-diff-transformer ../configs/smollm.yaml --output-dir ../converted-model --zero-init --debug
...
[2024-12-17 05:15:26,910] [INFO] [axolotl.cli.convert_attention.convert_diff_transformer:75] [PID:94590] [RANK:0] Converting to differential attention...
[2024-12-17 05:15:26,910] [INFO] [axolotl.integrations.diff_transformer.convert.convert_module:97] [PID:94590] [RANK:0] Converting attention layer 0: LlamaSdpaAttention to LlamaDifferentialSdpaAttention
[2024-12-17 05:15:26,921] [DEBUG] [axolotl.integrations.diff_transformer.convert.copy_attention_weights:64] [PID:94590] [RANK:0] Copied positive attention weights from LlamaSdpaAttention to LlamaDifferentialSdpaAttention
[2024-12-17 05:15:26,921] [INFO] [axolotl.integrations.diff_transformer.convert.convert_module:97] [PID:94590] [RANK:0] Converting attention layer 1: LlamaSdpaAttention to LlamaDifferentialSdpaAttention
[2024-12-17 05:15:26,930] [DEBUG] [axolotl.integrations.diff_transformer.convert.copy_attention_weights:64] [PID:94590] [RANK:0] Copied positive attention weights from LlamaSdpaAttention to LlamaDifferentialSdpaAttention
...
[RANK:0] Converted 30 attention layers to differential attention
[2024-12-17 05:15:27,181] [INFO] [axolotl.cli.convert_attention.convert_diff_transformer:85] [PID:94590] [RANK:0] Testing converted model...
[2024-12-17 05:15:27,785] [INFO] [axolotl.cli.convert_attention.test_inference:43] [PID:94590] [RANK:0] Prompt: The quick brown fox
[2024-12-17 05:15:28,280] [INFO] [axolotl.cli.convert_attention.convert_diff_transformer:121] [PID:94590] [RANK:0] Generations match!
Model generation:
**************************************************
The quick brown fox jumps over the lazy dog

The quick brown fox jumps over the lazy dog.

The
**************************************************
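
The evaluation step of the workflow was run with the new evaluate command; roughly like the following (the config paths are illustrative, and the converted-model config filename is a placeholder):

$ axolotl evaluate ../configs/smollm.yaml
$ axolotl evaluate ../converted-model/config.yaml   # placeholder path for the converted model's config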

Types of changes

  • axolotl.integrations.diff_transformer module, which implements the differential attention layers for the Llama architecture and for various attention implementations (eager, SDPA, Flash Attention 2; see the sketch after this list), and
  • axolotl.cli.integrations.convert_diff_transformer module (and updates to axolotl.cli.main), which implements the convert-diff-transformer CLI command, and
  • Monkeypatch in axolotl.cli.integrations.convert_diff_transformer.patches (to be moved) for updating LLAMA_ATTENTION_CLASSES constant in transformers.models.llama.modeling_llama.
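
For reference, a minimal sketch of the core computation these layers implement, following the Differential Transformer paper; this is not the PR's actual class, and it omits the paper's per-head normalization and λ re-parameterization:

import torch
import torch.nn.functional as F

def differential_attention(q_pos, q_neg, k_pos, k_neg, v, lam):
    """Difference of two softmax attention maps, applied to the values.

    q_*/k_*/v: (batch, heads, seq_len, head_dim); lam: learnable scalar."""
    scale = q_pos.size(-1) ** -0.5
    # Positive and negative attention maps from the widened Q/K projections.
    attn_pos = F.softmax(q_pos @ k_pos.transpose(-2, -1) * scale, dim=-1)
    attn_neg = F.softmax(q_neg @ k_neg.transpose(-2, -1) * scale, dim=-1)
    # Subtracting the negative map cancels common-mode attention noise.
    return (attn_pos - lam * attn_neg) @ v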

TODO

  • Test coverage
  • Add Flash Attention 2 implementation
  • Move monkey patch
  • Refactor conversion module as plugin
  • Add conversion with same-sized Q, K projections
  • Experiments to demonstrate value
    • Blog post

@djsaunde djsaunde self-assigned this Dec 17, 2024
@djsaunde djsaunde force-pushed the diff-transformer branch 2 times, most recently from f2c37e7 to 2717b97 Compare December 20, 2024 20:41
def dump_yaml_preserved_order(
    data: Dict, reference_yaml_path: str, output_path: str
) -> None:
    """Dump YAML file while preserving nested order and normalized spacing."""
Collaborator

🔥

Contributor Author

We could similarly have a function to normalize any config yaml file to have some expected ordering / formatting.
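
A rough sketch of what that could look like, assuming PyYAML and top-level-only ordering (the function name and behavior are hypothetical, not part of this PR):

from typing import Dict

import yaml

def normalize_config_order(config_path: str, reference_yaml_path: str, output_path: str) -> None:
    """Hypothetical: rewrite a config YAML so its keys follow the ordering of a
    reference YAML, with any unknown keys appended at the end."""
    with open(config_path) as f:
        config: Dict = yaml.safe_load(f)
    with open(reference_yaml_path) as f:
        reference: Dict = yaml.safe_load(f)

    # Keys present in the reference come first, in reference order.
    ordered = {key: config[key] for key in reference if key in config}
    # Any remaining keys keep their original relative order at the end.
    ordered.update({key: value for key, value in config.items() if key not in ordered})

    with open(output_path, "w") as f:
        yaml.dump(ordered, f, sort_keys=False)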

@ehartford
Collaborator

I thought the differential transformer requires a model architecture change and modeling code changes? Does this somehow automatically implement a modeling.py for the model?

@djsaunde
Contributor Author

djsaunde commented Dec 21, 2024

I thought the differential transformer requires a model architecture change and modeling code changes? Does this somehow automatically implement a modeling.py for the model?

Good question. I've implemented a monkeypatch in src/axolotl/monkeypatch/attention/differential.py that updates the PreTrainedModel._autoset_attn_implementation function to be aware of the differential attention implementation. I think it's a bit of a hack, though, so using custom modeling code might be a good change before merge. Happy to hear your thoughts / feedback!
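
For context, the general shape of such a patch (a generic sketch of the pattern only, not the actual code in src/axolotl/monkeypatch/attention/differential.py; the implementation names are placeholders):

from transformers import PreTrainedModel

DIFFERENTIAL_IMPLS = {"differential_eager", "differential_sdpa"}  # placeholder names

_original_autoset = PreTrainedModel._autoset_attn_implementation

@classmethod
def _autoset_attn_implementation(cls, config, *args, **kwargs):
    # If a differential implementation was requested, skip transformers'
    # built-in allow-list check and keep the requested value on the config.
    if getattr(config, "_attn_implementation", None) in DIFFERENTIAL_IMPLS:
        return config
    # Otherwise defer to the original behavior.
    return _original_autoset.__func__(cls, config, *args, **kwargs)

PreTrainedModel._autoset_attn_implementation = _autoset_attn_implementation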

As for the architecture change, we have src/axolotl/cli/integrations/convert_diff_transformer.py, which does the actual swapping of (Llama-only, for now) attention layers with differential attention in the model.
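
The swap itself is conceptually just walking the decoder layers and replacing each self-attention module; a rough sketch (the import locations, constructor arguments, and copy_attention_weights signature here are assumptions, not the PR's exact API):

# Module paths follow the log output above; the attention-class import
# location is an assumption.
from axolotl.integrations.diff_transformer.convert import copy_attention_weights
from axolotl.integrations.diff_transformer.diff_attn import LlamaDifferentialSdpaAttention

def convert_attention_layers(model, zero_init: bool = False):
    """Illustrative: replace each Llama attention module with its differential
    counterpart and copy the pretrained weights into the positive components."""
    for layer_idx, layer in enumerate(model.model.layers):
        old_attn = layer.self_attn
        # Constructor arguments mirror LlamaSdpaAttention's (config, layer_idx).
        new_attn = LlamaDifferentialSdpaAttention(model.config, layer_idx=layer_idx)
        copy_attention_weights(old_attn, new_attn, zero_init=zero_init)  # assumed signature
        layer.self_attn = new_attn
    return model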

@ehartford
Collaborator

The monkey patch only works in the context of Axolotl - we will need a modeling.py to make inference work properly in the wild (transformers, TGI, vLLM, etc.), right? (If I understand correctly.)
