llama : support RWKV v6 models #8980

Merged · 53 commits · Sep 1, 2024
8d2eca3
convert_hf_to_gguf: Add support for RWKV v6
MollySophia Jul 31, 2024
dc0767f
Add RWKV tokenization
LaylBongers Apr 4, 2024
865167d
Fix build
MollySophia Jul 31, 2024
7cac72a
Do not use special tokens when matching in RWKV tokenizer
LaylBongers Apr 12, 2024
e92c74f
Fix model loading
LaylBongers Apr 15, 2024
a0aae8d
Add (broken) placeholder graph builder for RWKV
LaylBongers Apr 17, 2024
a866789
Add workaround for kv cache
LaylBongers Apr 19, 2024
4e23d97
Add logits conversion to rwkv5
LaylBongers Apr 23, 2024
5479588
Add rwkv5 layer norms
LaylBongers Apr 26, 2024
dd3aa3d
Add time mix KVRG & correct merge mistake
LaylBongers May 6, 2024
b409fd8
Add remaining time mix parameters
LaylBongers May 13, 2024
3cbeffc
Add time mix output loading
LaylBongers May 13, 2024
b3b17e0
Add placeholder llm_build_time_mix
LaylBongers May 14, 2024
700dad1
Fix build
MollySophia Aug 1, 2024
a180b63
Load more tensors for rwkv v6
MollySophia Aug 1, 2024
0e5ac34
Fix rwkv tokenizer
MollySophia Aug 2, 2024
5732de8
ggml: Add unary operator Exp
MollySophia Aug 2, 2024
0784a0c
RWKV v6 graph building
MollySophia Aug 2, 2024
8d498c7
Add ``rescale_every_n_layers`` parameter
MollySophia Aug 6, 2024
903089b
Add ``wkv.head_size`` key for RWKV
MollySophia Aug 7, 2024
98ce5f4
Fix offloading layers to CUDA
MollySophia Aug 7, 2024
01dcf4b
Fix parallel inferencing for RWKV
MollySophia Aug 9, 2024
6ae2f48
Remove trailing whitespaces
MollySophia Aug 11, 2024
8bc1f9a
build_rwkv: Avoid using inplace operations
MollySophia Aug 11, 2024
18decea
convert_hf_to_gguf: rwkv: Avoid using ``eval``
MollySophia Aug 11, 2024
7f2e370
convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually
MollySophia Aug 12, 2024
c695552
Update convert_hf_to_gguf.py
MollySophia Aug 12, 2024
8aa711a
ggml: Add backward computation for unary op ``exp``
MollySophia Aug 12, 2024
ae9936a
Update convert_hf_to_gguf.py
MollySophia Aug 12, 2024
5afa3ef
Update convert_hf_to_gguf.py
MollySophia Aug 12, 2024
12fbe1a
Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV
MollySophia Aug 12, 2024
276d53b
build_rwkv6: Simplify graph
MollySophia Aug 12, 2024
b0f4fe5
llama: rwkv6: Detect model.type
MollySophia Aug 13, 2024
683d70c
llama: rwkv6: Fix tensor loading for 7B/14B models
MollySophia Aug 13, 2024
ee1b78c
llama: rwkv6: Fix group_norm assertion failure with Metal
MollySophia Aug 13, 2024
c165e34
llama: rwkv6: Clean up
MollySophia Aug 13, 2024
6da6aa4
llama: rwkv6: Add quantization tensor exclusion
MollySophia Aug 13, 2024
f5d955d
llama: rwkv6: Use the new advanced batch splits
MollySophia Aug 23, 2024
57decb4
Update src/llama.cpp
MollySophia Aug 25, 2024
e94778a
llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``
MollySophia Aug 25, 2024
7756afd
llama: rwkv6: Apply code style and misc changes
MollySophia Aug 25, 2024
87a2901
converter: Use class name ``Rwkv6Model``
MollySophia Aug 25, 2024
c414a24
llama: rwkv6: Make use of key ``feed_forward_length``
MollySophia Aug 25, 2024
6d69fd7
llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``
MollySophia Aug 25, 2024
601b592
converter: Match ``new_name`` instead of ``name`` for float32 explici…
MollySophia Aug 26, 2024
e0ea511
llama: rwkv6: Keep ``time_mix_w1/w2`` as F32
MollySophia Aug 26, 2024
5f00c52
llama: rwkv6: Remove unused nodes
MollySophia Aug 26, 2024
7444046
llama: rwkv6: Apply code format changes
MollySophia Aug 26, 2024
7f2ef56
llama: rwkv6: Add lora for some supported tensors
MollySophia Aug 30, 2024
7004323
rwkv : speed-up tokenization using trie
ggerganov Aug 30, 2024
59dc2e7
minor : style + indentation
ggerganov Aug 30, 2024
5175375
llama: rwkv6: Avoid division by zero
MollySophia Aug 31, 2024
846358d
ggml: rwkv_wkv: Avoid copying the state
MollySophia Aug 31, 2024
84 changes: 83 additions & 1 deletion convert_hf_to_gguf.py
@@ -3,6 +3,7 @@

from __future__ import annotations

import ast
import logging
import argparse
import contextlib
@@ -298,9 +299,12 @@ def prepare_tensors(self):
                            gguf.MODEL_TENSOR.POS_EMBD,
                            gguf.MODEL_TENSOR.TOKEN_TYPES,
                            gguf.MODEL_TENSOR.SSM_CONV1D,
                            gguf.MODEL_TENSOR.TIME_MIX_FIRST,
                            gguf.MODEL_TENSOR.TIME_MIX_W1,
                            gguf.MODEL_TENSOR.TIME_MIX_W2,
                        )
                    )
-                   or not name.endswith(".weight")
+                   or not new_name.endswith(".weight")
                ):
                    data_qtype = gguf.GGMLQuantizationType.F32

@@ -2716,6 +2720,84 @@ class StarCoder2Model(Model):
    model_arch = gguf.MODEL_ARCH.STARCODER2


@Model.register("Rwkv6ForCausalLM")
class Rwkv6Model(Model):
model_arch = gguf.MODEL_ARCH.RWKV6

def set_vocab(self):
assert (self.dir_model / "rwkv_vocab_v20230424.txt").is_file()
vocab_size = self.hparams.get("vocab_size", 65536)

tokens: list[bytes] = ['<s>'.encode("utf-8")]
toktypes: list[int] = [gguf.TokenType.CONTROL]

with open(self.dir_model / "rwkv_vocab_v20230424.txt", "r", encoding="utf-8") as f:
lines = f.readlines()
for line in lines:
parts = line.split(' ')
assert len(parts) >= 3
token, token_len = ast.literal_eval(' '.join(parts[1:-1])), int(parts[-1])
token = token.encode("utf-8") if isinstance(token, str) else token
assert isinstance(token, bytes)
assert len(token) == token_len
token_text: str = repr(token)[2:-1] # "b'\xff'" -> "\xff"
tokens.append(token_text.encode("utf-8"))
                toktypes.append(gguf.TokenType.NORMAL)
        remainder = vocab_size - len(tokens)
        assert remainder >= 0
        for i in range(len(tokens), vocab_size):
            tokens.append(f"[PAD{i}]".encode("utf-8"))
            toktypes.append(gguf.TokenType.UNUSED)

        self.gguf_writer.add_tokenizer_model("rwkv")
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)

    def set_gguf_parameters(self):
        block_count = self.hparams["num_hidden_layers"]
        head_size = self.hparams["head_size"]
        hidden_size = self.hparams["hidden_size"]
        layer_norm_eps = self.hparams["layer_norm_epsilon"]
        rescale_every_n_layers = self.hparams["rescale_every"]
        intermediate_size = self.hparams["intermediate_size"] if self.hparams["intermediate_size"] is not None else int((hidden_size * 3.5) // 32 * 32)
        time_mix_extra_dim = 64 if hidden_size == 4096 else 32
        time_decay_extra_dim = 128 if hidden_size == 4096 else 64

        # RWKV isn't context limited
        self.gguf_writer.add_context_length(1048576)
        self.gguf_writer.add_embedding_length(hidden_size)
        self.gguf_writer.add_block_count(block_count)
        self.gguf_writer.add_layer_norm_eps(layer_norm_eps)
        self.gguf_writer.add_rescale_every_n_layers(rescale_every_n_layers)
        self.gguf_writer.add_wkv_head_size(head_size)
        self.gguf_writer.add_time_mix_extra_dim(time_mix_extra_dim)
        self.gguf_writer.add_time_decay_extra_dim(time_decay_extra_dim)
        self.gguf_writer.add_feed_forward_length(intermediate_size)
        self.gguf_writer.add_file_type(self.ftype)

        # required by llama.cpp, unused
        self.gguf_writer.add_head_count(0)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
        new_name = self.map_tensor_name(name)

        if not (new_name.endswith(".weight") or new_name.endswith(".bias")):
            new_name += ".weight"

if new_name.endswith("time_mix_w1.weight") or new_name.endswith("time_mix_decay_w1.weight") or new_name.endswith("time_mix_decay_w2.weight"):
data_torch = data_torch.transpose(0, 1)

if new_name.endswith("time_mix_w2.weight"):
data_torch = data_torch.permute(0, 2, 1)

rescale_every_n_layers = self.hparams["rescale_every"]
if rescale_every_n_layers > 0:
if new_name.endswith("time_mix_output.weight") or new_name.endswith("channel_mix_value.weight"):
data_torch = data_torch.div_(2 ** int(bid // rescale_every_n_layers))

yield (new_name, data_torch)


@Model.register("MambaForCausalLM", "MambaLMHeadModel", "FalconMambaForCausalLM")
class MambaModel(Model):
model_arch = gguf.MODEL_ARCH.MAMBA
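As a side note, the vocab-file parsing in Rwkv6Model.set_vocab above can be illustrated in isolation. The sketch below assumes each line of rwkv_vocab_v20230424.txt has the form "<id> <python literal for the token> <byte length>", which is what the parsing code implies; the sample line is illustrative, not copied from the actual file.

import ast

def parse_vocab_line(line: str) -> bytes:
    # Split on spaces; the token literal itself may contain spaces,
    # so everything between the leading id and the trailing byte
    # length is re-joined before evaluating it as a Python literal.
    parts = line.split(' ')
    token = ast.literal_eval(' '.join(parts[1:-1]))
    token_len = int(parts[-1])
    # Literals may be str or bytes; normalize to bytes.
    token = token.encode("utf-8") if isinstance(token, str) else token
    assert isinstance(token, bytes) and len(token) == token_len
    return token

print(parse_vocab_line("33 '!' 1"))  # b'!'

Note also that modify_tensors pre-divides the time_mix_output and channel_mix_value weights by 2 ** (bid // rescale_every), folding the per-layer rescaling used by upstream RWKV checkpoints (which halve activations every N layers to stay in fp16 range) into the converted tensors up front.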
19 changes: 19 additions & 0 deletions ggml/include/ggml.h
@@ -512,6 +512,7 @@ extern "C" {
        GGML_OP_WIN_UNPART,
        GGML_OP_GET_REL_POS,
        GGML_OP_ADD_REL_POS,
        GGML_OP_RWKV_WKV,

        GGML_OP_UNARY,

@@ -546,6 +547,7 @@ extern "C" {
        GGML_UNARY_OP_SILU,
        GGML_UNARY_OP_HARDSWISH,
        GGML_UNARY_OP_HARDSIGMOID,
        GGML_UNARY_OP_EXP,

        GGML_UNARY_OP_COUNT,
    };
@@ -1139,6 +1141,14 @@ extern "C" {
            struct ggml_context * ctx,
            struct ggml_tensor * a);

    GGML_API struct ggml_tensor * ggml_exp(
            struct ggml_context * ctx,
            struct ggml_tensor * a);

    GGML_API struct ggml_tensor * ggml_exp_inplace(
            struct ggml_context * ctx,
            struct ggml_tensor * a);

    // normalize along rows
    GGML_API struct ggml_tensor * ggml_norm(
            struct ggml_context * ctx,
@@ -1887,6 +1897,15 @@ extern "C" {
            struct ggml_tensor * pw,
            struct ggml_tensor * ph);

    GGML_API struct ggml_tensor * ggml_rwkv_wkv(
            struct ggml_context * ctx,
            struct ggml_tensor * k,
            struct ggml_tensor * v,
            struct ggml_tensor * r,
            struct ggml_tensor * tf,
            struct ggml_tensor * td,
            struct ggml_tensor * state);

    // custom operators

    typedef void (*ggml_unary_op_f32_t) (const int, float *, const float *);
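The header only declares the new operator; the kernel itself lives in the backends. As a rough reference for what a WKV-style update computes, here is a NumPy sketch of one token of the RWKV v6 recurrence for a single head. The names (k, v, r, plus tf for "time first" and td for "time decay") follow the declaration above, but the exact tensor layout and loop ordering inside ggml_rwkv_wkv may differ — treat this as a reading of the published RWKV v6 formula, not the ggml kernel. The new GGML_UNARY_OP_EXP fits the same picture: RWKV v6 parameterizes its decay in log space (roughly td = exp(-exp(w))), so the graph needs an elementwise exp.

import numpy as np

def wkv_step(state, k, v, r, tf, td):
    # state: (N, N) running outer-product state for one head
    # k, v, r: (N,) key, value, receptance for the current token
    # tf: (N,) per-channel bonus applied only to the current token
    # td: (N,) per-channel decay for this token, already in (0, 1)
    kv = np.outer(v, k)              # rank-1 update, shape (N, N)
    out = (tf * kv + state) @ r      # current token gets the tf bonus; past tokens come via state
    new_state = state * td + kv      # decay the old state, then accumulate kv
    return out, new_state

# Toy usage: run a few random tokens through one 4-channel head.
rng = np.random.default_rng(0)
N = 4
state = np.zeros((N, N))
for _ in range(3):
    out, state = wkv_step(state, rng.random(N), rng.random(N),
                          rng.random(N), rng.random(N), np.exp(-rng.random(N)))
    print(out)

Because the state is carried explicitly instead of a growing KV cache, per-token cost stays constant in sequence length — which is also why the converter above can advertise an effectively unlimited context length.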