merge upstream #43

Merged 80 commits on Oct 27, 2024.

Commits (changes shown from 1 commit):
c7499c5
examples : do not use common library in simple example (#9803)
slaren Oct 10, 2024
cf8e0a3
musa: add docker image support (#9685)
yeahdongcn Oct 10, 2024
0e9f760
rpc : add backend registry / device interfaces (#9812)
slaren Oct 10, 2024
7eee341
common : use common_ prefix for common library functions (#9805)
slaren Oct 10, 2024
9677640
ggml : move more prints to the ggml log system (#9839)
slaren Oct 11, 2024
943d20b
musa : update doc (#9856)
yeahdongcn Oct 12, 2024
11ac980
llama : improve infill support and special token detection (#9798)
ggerganov Oct 12, 2024
95c76e8
server : remove legacy system_prompt feature (#9857)
ggerganov Oct 12, 2024
1bde94d
server : remove self-extend features (#9860)
ggerganov Oct 12, 2024
edc2656
server : add option to time limit the generation phase (#9865)
ggerganov Oct 12, 2024
92be9f1
flake.lock: Update (#9870)
ggerganov Oct 13, 2024
c7181bd
server : reuse cached context chunks (#9866)
ggerganov Oct 13, 2024
d4c19c0
server : accept extra_context for the infill endpoint (#9874)
ggerganov Oct 13, 2024
13dca2a
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
agray3 Oct 14, 2024
a89f75e
server : handle "logprobs" field with false value (#9871)
VoidIsVoid Oct 14, 2024
4c42f93
readme : update bindings list (#9889)
srgtuszy Oct 15, 2024
dcdd535
server : update preact (#9895)
ggerganov Oct 15, 2024
fbc98b7
sampling : add XTC sampler (#9742)
MaggotHATE Oct 15, 2024
223c25a
server : improve infill context reuse (#9894)
ggerganov Oct 15, 2024
755a9b2
llama : add infill sampler (#9896)
ggerganov Oct 15, 2024
becfd38
[CANN] Fix cann compilation error (#9891)
leo-pony Oct 16, 2024
cd60b88
ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
danbev Oct 9, 2024
0e41b30
sync : ggml
ggerganov Oct 16, 2024
1f66b69
server : fix the disappearance of the end of the text (#9867)
z80maniac Oct 16, 2024
10433e8
llama : add tensor name for "result_norm" (#9907)
MollySophia Oct 16, 2024
66c2c93
grammar : fix JSON Schema for string regex with top-level alt. (#9903)
jemc Oct 16, 2024
dbf18e4
llava : fix typo in error message [no ci] (#9884)
danbev Oct 16, 2024
9e04102
llama : suppress conversion from 'size_t' to 'int' (#9046)
danbev Oct 16, 2024
73afe68
fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)
giladgd Oct 16, 2024
2194200
fix: allocating CPU buffer with size `0` (#9917)
giladgd Oct 16, 2024
f010b77
vulkan : add backend registry / device interfaces (#9721)
slaren Oct 17, 2024
3752217
readme : update bindings list (#9918)
ShenghaiWang Oct 17, 2024
99bd4ac
llama : infill sampling handle very long tokens (#9924)
ggerganov Oct 17, 2024
9f45fc1
llama : change warning to debug log
ggerganov Oct 17, 2024
17bb928
readme : remove --memory-f32 references (#9925)
ggerganov Oct 17, 2024
6f55bcc
llama : rename batch_all to batch (#8881)
danbev Oct 17, 2024
8901755
server : add n_indent parameter for line indentation requirement (#9929)
ggerganov Oct 18, 2024
60ce97c
add amx kernel for gemm (#8998)
mingfeima Oct 18, 2024
87421a2
[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
OuadiElfarouki Oct 18, 2024
afd9909
rpc : backend refactoring (#9912)
rgerganov Oct 18, 2024
cda0e4b
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
ngxson Oct 18, 2024
7cab208
readme : update infra list (#9942)
icppWorld Oct 20, 2024
45f0976
readme : update bindings list (#9951)
lcarrere Oct 20, 2024
1db8c84
fix mul_mat_vec_q and *_vec_q error (#9939)
NeoZhangJianyu Oct 21, 2024
bc21975
speculative : fix handling of some input params (#9963)
ggerganov Oct 21, 2024
55e4778
llama : default sampling changes + greedy update (#9897)
ggerganov Oct 21, 2024
d5ebd79
rpc : pack only RPC structs (#9959)
rgerganov Oct 21, 2024
f594bc8
ggml : add asserts for type conversion in fattn kernels (#9971)
ggerganov Oct 21, 2024
dbd5f2f
llama.vim : plugin for Neovim (#9787)
ggerganov Oct 21, 2024
94008cc
arg : fix attention non-causal arg value hint (#9985)
danbev Oct 21, 2024
994cfb1
readme : update UI list (#9972)
a-ghorbani Oct 21, 2024
e01c67a
llama.vim : move info to the right of screen [no ci] (#9787)
ggerganov Oct 21, 2024
e94a138
llama.vim : fix info text display [no ci] (#9787)
ggerganov Oct 21, 2024
674804a
arg : fix typo in embeddings argument help [no ci] (#9994)
danbev Oct 22, 2024
6b84473
[CANN] Adapt to dynamically loadable backends mechanism (#9970)
leo-pony Oct 22, 2024
4ff7fe1
llama : add chat template for RWKV-World + fix EOT (#9968)
MollySophia Oct 22, 2024
c421ac0
lora : warn user if new token is added in the adapter (#9948)
ngxson Oct 22, 2024
11d4705
Rwkv chat template fix (#10001)
MollySophia Oct 22, 2024
19d900a
llama : rename batch to ubatch (#9950)
danbev Oct 22, 2024
c8c07d6
llama : fix empty batch causing llama_batch_allocr to crash (#9966)
ngxson Oct 22, 2024
873279b
flake.lock: Update
github-actions[bot] Oct 20, 2024
4c9388f
metal : add POOL2D and fix IM2COL (#9943)
junhee-yoo Oct 23, 2024
ac113a0
llama.vim : add classic vim support (#9995)
m18coppola Oct 23, 2024
c19af0a
ggml : remove redundant set of contexts used field (ggml/978)
danbev Oct 16, 2024
80273a3
CUDA: fix 1D im2col, add tests (ggml/993)
JohannesGaessler Oct 18, 2024
2d3aba9
llama.vim : bump generation time limit to 3s [no ci]
ggerganov Oct 23, 2024
190a37d
sync : ggml
ggerganov Oct 23, 2024
0a1c750
server : samplers accept the prompt correctly (#10019)
wwoodsTM Oct 23, 2024
c39665f
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
JohannesGaessler Oct 24, 2024
167a515
CUDA: fix insufficient buffer clearing for MMQ (#10032)
JohannesGaessler Oct 24, 2024
40f2555
ci : fix cmake flags for SYCL
ggerganov Oct 24, 2024
958367b
server : refactor slot input data, move tokenizer to HTTP thread (#10…
ngxson Oct 24, 2024
bc5ba00
server : check that the prompt fits in the slot's context (#10030)
ggerganov Oct 25, 2024
2f8bd2b
llamafile : extend sgemm.cpp support for Q5_0 models (#10010)
Srihari-mcw Oct 25, 2024
d80fb71
llama: string_split fix (#10022)
Xarbirus Oct 25, 2024
ff252ea
llama : add DRY sampler (#9702)
wwoodsTM Oct 25, 2024
6687503
metal : support permuted matrix multiplications (#10033)
ggerganov Oct 25, 2024
9e4a256
scripts : fix amx sync [no ci]
ggerganov Oct 26, 2024
8c60a8a
increase cuda_cpy block size (ggml/996)
bssrdf Oct 23, 2024
cc2983d
sync : ggml
ggerganov Oct 26, 2024
Vectorize load instructions in dmmv f16 CUDA kernel (ggerganov#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
agray3 and JohannesGaessler authored Oct 14, 2024
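
To make the optimization concrete, here is a minimal self-contained sketch (not code from this PR; the kernel names and toy driver are illustrative only) contrasting the old pattern of two scalar half loads with the new pattern of a single 32-bit half2 load split via __low2float/__high2float. It assumes nothing beyond cuda_fp16.h from the CUDA toolkit and builds with nvcc:

// Hypothetical standalone demo (not from the PR): scalar vs. vectorized f16 loads.
// Build: nvcc -o demo demo.cu
#include <cuda_fp16.h>
#include <cstdio>

// Old pattern: two separate 16-bit loads per element pair.
__global__ void convert_scalar(const half * x, float2 * out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2*i + 1 < n) {
        out[i].x = __half2float(x[2*i + 0]);
        out[i].y = __half2float(x[2*i + 1]);
    }
}

// New pattern: one 32-bit half2 load, then split the two halves.
__global__ void convert_vectorized(const half * x, float2 * out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2*i + 1 < n) {
        // &x[2*i] is 4-byte aligned: the buffer comes from cudaMalloc and the index is even
        const half2 x_reg = *((const half2 *) &(x[2*i]));
        out[i].x = __low2float(x_reg);   // lower 16 bits -> float
        out[i].y = __high2float(x_reg);  // upper 16 bits -> float
    }
}

int main() {
    const int n = 1 << 20;
    half   * x   = nullptr;
    float2 * out = nullptr;
    cudaMalloc((void **) &x,   n     * sizeof(half));
    cudaMalloc((void **) &out, (n/2) * sizeof(float2));
    cudaMemset(x, 0, n * sizeof(half));
    const int threads = 256;
    const int blocks  = (n/2 + threads - 1) / threads;
    convert_scalar    <<<blocks, threads>>>(x, out, n);
    convert_vectorized<<<blocks, threads>>>(x, out, n);
    cudaDeviceSynchronize();
    printf("ok: %s\n", cudaGetLastError() == cudaSuccess ? "yes" : "no");
    cudaFree(x);
    cudaFree(out);
    return 0;
}

The vectorized version issues one 32-bit load instruction per element pair instead of two 16-bit ones, which is consistent with the larger gain reported above on bandwidth-rich HBM parts.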

commit 13dca2a54a394757d56fdd652b9f0df08f44ea22
ggml/src/ggml-cuda/dmmv.cu: 34 changes (25 additions, 9 deletions)
@@ -416,10 +416,11 @@ static __global__ void dequantize_mul_mat_vec_q6_k(const void * __restrict__ vx,
 
 static __device__ void convert_f16(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
     const half * x = (const half *) vx;
-
+    // load 2 halfs into register in a single instruction
+    const half2 x_reg = *((half2 *) &(x[ib + iqs]));
     // automatic half -> float type cast if dfloat == float
-    v.x = x[ib + iqs + 0];
-    v.y = x[ib + iqs + 1];
+    v.x = __low2float(x_reg);
+    v.y = __high2float(x_reg);
 }
 
 static constexpr __device__ dequantize_kernel_t get_dequantize_kernel(ggml_type type) {
@@ -476,13 +477,28 @@ static __global__ void dequantize_mul_mat_vec(const void * __restrict__ vx, cons
             // matrix multiplication
             // for qr = 2 the y index needs to increase by 1 per j iter because of y_offset = qk/2
 #ifdef GGML_CUDA_F16
-            tmp += __hmul2(v, {
-                y[iybs + iqs + j/qr + 0],
-                y[iybs + iqs + j/qr + y_offset]
-            });
+            if ( y_offset == 1 ) {
+                // load 2 dfloats into register in a single instruction
+                const dfloat2 y_reg = *((dfloat2 *) &(y[iybs + iqs + j/qr]));
+                tmp += __hmul2(v, y_reg);
+            }
+            else {
+                tmp += __hmul2(v, {
+                    y[iybs + iqs + j/qr + 0],
+                    y[iybs + iqs + j/qr + y_offset]
+                });
+            }
 #else
-            tmp += v.x * y[iybs + iqs + j/qr + 0];
-            tmp += v.y * y[iybs + iqs + j/qr + y_offset];
+            if ( y_offset == 1 ) {
+                // load 2 dfloats into register in a single instruction
+                const dfloat2 y_reg = *((dfloat2 *) &(y[iybs + iqs + j/qr]));
+                tmp += v.x * y_reg.x;
+                tmp += v.y * y_reg.y;
+            }
+            else {
+                tmp += v.x * y[iybs + iqs + j/qr + 0];
+                tmp += v.y * y[iybs + iqs + j/qr + y_offset];
+            }
 #endif // GGML_CUDA_F16
         }
     }
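
A note on the y_offset == 1 guard in the second hunk: as the diff shows, the single dfloat2 load is only used when the two y operands are adjacent in memory, i.e. when the per-iteration stride y_offset is 1 (the f16 path). When y_offset is larger, the two operands are strided apart, so the kernel keeps the original pair of scalar loads as a fallback. This is why the vectorization is conditional rather than unconditional.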