
merged from upstream #8

Merged 64 commits on Apr 16, 2024
Commits
75cd4c7
ci: bench: support sse and fix prompt processing time / server: add t…
phymbert Apr 6, 2024
57dd02c
Tests: Added integration tests for GBNF parser (#6472)
HanClinto Apr 6, 2024
b66aec6
backend : fix typo in scheduler documentation (ggml/781)
danbev Apr 3, 2024
54ea069
sync : ggml
ggerganov Apr 6, 2024
d4f220a
support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS…
NeoZhangJianyu Apr 7, 2024
9472bce
Run make to build the project (#6457)
limitedAtonement Apr 7, 2024
43e8995
scripts : sync ggml-cuda folder
ggerganov Apr 7, 2024
f77261a
ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020)
primenko-v Apr 4, 2024
c372477
sync : ggml
ggerganov Apr 7, 2024
e0717e7
Add GritLM as supported models. (#6513)
dranger003 Apr 7, 2024
b909236
flake.lock: Update (#6517)
ggerganov Apr 7, 2024
855f544
Change Windows AMD example to release build to make inference much fa…
thebaron88 Apr 7, 2024
d752327
Adding KodiBot to UI list (#6535)
firatkiral Apr 8, 2024
87fb5b4
remove row=1 cond (#6532)
abhilash1910 Apr 8, 2024
beea6e1
llama : save and restore kv cache for single seq id (#6341)
kaetemi Apr 8, 2024
e3c337d
llama : support negative ith in llama_get_ API (#6519)
TheFlipbook Apr 8, 2024
b73e564
quantize : fix precedence of cli args (#6541)
ggerganov Apr 8, 2024
cecd8d3
Comment explaining a decision (#6531)
kunnis Apr 8, 2024
cc4a954
llama : fix attention layer count sanity check (#6550)
ggerganov Apr 8, 2024
e11a899
license : update copyright notice + add AUTHORS (#6405)
ggerganov Apr 9, 2024
5dc9dd7
llama : add Command R Plus support (#6491)
RefractAI Apr 9, 2024
400d5d7
server : detect search query to start webchat (#6554)
Mardak Apr 9, 2024
c4a3a4f
sync : ggml
ggerganov Apr 9, 2024
1b67731
BERT tokenizer fixes (#6498)
cebtenzzre Apr 9, 2024
ba5e134
readme: fix typo in amdgpu target name (#6573)
Sejsel Apr 9, 2024
b231b37
readme : update UI list (#6560)
ylsdamxssjxxdd Apr 10, 2024
29122d3
readme : fix ROCm link (#6579)
artem-zinnatullin Apr 10, 2024
67fac4b
docs : how to add a model (#6565)
phymbert Apr 10, 2024
65c64dc
convert.py : add consolidated.safetensors for mixtral 8x22b (#6587)
slaren Apr 10, 2024
4f407a0
llama : add model types for mixtral (#6589)
slaren Apr 10, 2024
b3a96f2
minor layout improvements (#6572)
rsoika Apr 10, 2024
8228b66
gguf : add option to not check tensor data (#6582)
danbev Apr 10, 2024
b804b1e
eval-callback: Example how to use eval callback for debugging (#6576)
phymbert Apr 11, 2024
f4183af
scripts : add --outdir option to hf.sh (#6600)
danbev Apr 11, 2024
1bbdaf6
ci: download artifacts to release directory (#6612)
Hugi-R Apr 11, 2024
cbaadc9
grammars: 1.5x faster inference w/ complex grammars (vector reserves …
ochafik Apr 11, 2024
a474f50
Refactor Error Handling for CUDA (#6575)
nneubacher Apr 11, 2024
f7001cc
As suggested by @slaren, disabling Metal for test to fix CI build on …
HanClinto Apr 11, 2024
04a5ac2
Optimization: eliminate addition of redundant stacks when advancing g…
HanClinto Apr 12, 2024
9ed2737
ci : disable Metal for macOS-latest-cmake-x64 (#6628)
ggerganov Apr 12, 2024
81da18e
eval-callback: use ggml_op_desc to pretty print unary operator name (…
phymbert Apr 12, 2024
dee7f8d
Correct free memory and total memory. (#6630)
MasterYi1024 Apr 12, 2024
ef21ce4
imatrix : remove invalid assert (#6632)
ggerganov Apr 12, 2024
5c4d767
chore: Fix markdown warnings (#6625)
reneleonhardt Apr 12, 2024
91c7360
llama : add gguf_remove_key + remove split meta during quantize (#6591)
zj040045 Apr 12, 2024
24ee66e
server : coherent log output for KV cache full (#6637)
phymbert Apr 12, 2024
4cc120c
infill : add download instructions for model (#6626)
danbev Apr 12, 2024
fbbc030
metal : unify mul_mv_id kernels (#6556)
slaren Apr 12, 2024
ab9a324
JSON schema conversion: ⚡️ faster repetitions, min/maxLength for stri…
ochafik Apr 12, 2024
4bd0f93
model: support arch `DbrxForCausalLM` (#6515)
phymbert Apr 13, 2024
b5e7285
CUDA: fix matrix multiplication logic for tests (#6667)
JohannesGaessler Apr 13, 2024
de17e3f
fix memcpy() crash, add missed cmd in guide, fix softmax (#6622)
NeoZhangJianyu Apr 14, 2024
a4ec34e
convert : enable the `--use-temp-file` cli flag (#6645)
jac-jim Apr 14, 2024
e689fc4
[bug fix] convert github repository_owner to lowercase (#6673)
jaeminSon Apr 14, 2024
8800226
Fix --split-max-size (#6655)
CISC Apr 14, 2024
422c2af
Added support for GGML_OP_CLAMP in Metal (#6662)
dave-fl Apr 14, 2024
f184dd9
flake.lock: Update (#6669)
ggerganov Apr 14, 2024
04fbc5f
Add Command R chat template (#6650)
jc19chaoj Apr 14, 2024
1958f7e
llama : add missing kv clear in llama_beam_search (#6664)
dwrensha Apr 14, 2024
17e98d4
fix mul_mat_id() for new input, make the ut pass (#6682)
NeoZhangJianyu Apr 15, 2024
7fc16a2
swift : linux support (#6590)
spprichard Apr 15, 2024
3272896
server : revert "minor layout improvements" (#6684)
phymbert Apr 15, 2024
132f557
llama : fix restoring the number of outputs from state files (#6687)
compilade Apr 15, 2024
7593639
`main`: add --json-schema / -j flag (#6659)
ochafik Apr 15, 2024
20 changes: 13 additions & 7 deletions .github/workflows/bench.yml
@@ -79,12 +79,18 @@ jobs:
             sleep 0.1
           done
 
-      - name: Install k6
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.21'
+
+      - name: Install k6 and xk6-sse
         id: k6_installation
         run: |
           cd examples/server/bench
-          wget --quiet https://github.com/grafana/k6/releases/download/v0.49.0/k6-v0.49.0-linux-amd64.tar.gz
-          tar xzf k6*.tar.gz --strip-components=1
+          go install go.k6.io/xk6/cmd/xk6@latest
+          xk6 build master \
+          --with github.com/phymbert/xk6-sse
 
       - name: Build
         id: cmake_build
@@ -118,7 +124,7 @@ jobs:
 
           cd examples/server/bench
           source venv/bin/activate
-          BENCH_K6_BIN_PATH=./k6 python bench.py \
+          python bench.py \
             --runner-label ${{ env.RUNNER_LABEL }} \
             --name ${{ github.job }} \
             --branch ${{ github.head_ref || github.ref_name }} \
@@ -228,9 +234,9 @@ jobs:
             <summary>Expand details for performance related PR only</summary>
 
             - Concurrent users: ${{ env.N_USERS }}, duration: ${{ github.event.inputs.duration || env.DURATION }}
-            - HTTP request : avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(90)=${{ env.HTTP_REQ_DURATION_P_90_ }}ms fails=${{ env.HTTP_REQ_FAILED_PASSES }}, finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }} truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}
-            - Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_TOKENS_AVG }}tk/s p(90)=${{ env.LLAMACPP_PROMPT_TOKENS_P_90_ }}tk/s **total=${{ env.LLAMACPP_PROMPT_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
-            - Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(90)=${{ env.LLAMACPP_TOKENS_SECOND_P_90_ }}tk/s **total=${{ env.LLAMACPP_COMPLETION_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
+            - HTTP request : avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(95)=${{ env.HTTP_REQ_DURATION_P_95_ }}ms fails=${{ env.HTTP_REQ_FAILED_PASSES }}, finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }} truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}
+            - Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_PROCESSING_SECOND_AVG }}tk/s p(95)=${{ env.LLAMACPP_PROMPT_PROCESSING_SECOND_P_95_ }}tk/s
+            - Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(95)=${{ env.LLAMACPP_TOKENS_SECOND_P_95_ }}tk/s
             - ${{ env.BENCH_GRAPH_XLABEL }}
 
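The first bench.yml hunk replaces the prebuilt k6 tarball with a custom binary built through xk6, so that the benchmark can use the phymbert/xk6-sse extension for server-sent events, which stock k6 does not provide. A minimal sketch of the same provisioning flow outside CI (assuming Go 1.21+ on PATH; the `./k6 version` check is illustrative and not part of the workflow):

```sh
# Install the xk6 builder tool (requires Go >= 1.21 on PATH).
go install go.k6.io/xk6/cmd/xk6@latest

# Compile k6 from its master branch with the SSE extension bundled in;
# xk6 writes the resulting binary to ./k6 in the current directory.
xk6 build master \
  --with github.com/phymbert/xk6-sse

# The custom binary is invoked like stock k6 (illustrative check).
./k6 version
```

The dropped `BENCH_K6_BIN_PATH=./k6` prefix in the second hunk suggests bench.py now locates the freshly built `./k6` on its own, though that depends on the script's internals.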
22 changes: 15 additions & 7 deletions .github/workflows/build.yml
@@ -52,7 +52,7 @@ jobs:
         id: cmake_test
         run: |
           cd build
-          ctest -L main --verbose --timeout 900
+          ctest -L 'main|curl' --verbose --timeout 900
 
       - name: Determine tag name
         id: tag
@@ -101,7 +101,9 @@ jobs:
           sysctl -a
           mkdir build
           cd build
-          cmake -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON ..
+          # Metal is disabled due to intermittent failures with Github runners not having a GPU:
+          # https://github.com/ggerganov/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
+          cmake -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_METAL=OFF -DLLAMA_CURL=ON ..
           cmake --build . --config Release -j $(sysctl -n hw.logicalcpu)
 
       - name: Test
@@ -209,21 +211,21 @@ jobs:
         id: depends
         run: |
           sudo apt-get update
-          sudo apt-get install build-essential
+          sudo apt-get install build-essential libcurl4-openssl-dev
 
       - name: Build
         id: cmake_build
         run: |
           mkdir build
           cd build
-          cmake .. -DLLAMA_FATAL_WARNINGS=ON
+          cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON
           cmake --build . --config Release -j $(nproc)
 
       - name: Test
         id: cmake_test
         run: |
           cd build
-          ctest -L main --verbose --timeout 900
+          ctest -L 'main|curl' --verbose --timeout 900
 
       - name: Test llama2c conversion
         id: llama2c_test
@@ -938,6 +940,12 @@ jobs:
       - name: Download artifacts
         id: download-artifact
         uses: actions/download-artifact@v4
+        with:
+          path: ./artifact
+
+      - name: Move artifacts
+        id: move_artifacts
+        run: mkdir -p ./artifact/release && mv ./artifact/*/*.zip ./artifact/release
 
       - name: Create release
         id: create_release
@@ -956,15 +964,15 @@ jobs:
             const path = require('path');
             const fs = require('fs');
             const release_id = '${{ steps.create_release.outputs.id }}';
-            for (let file of await fs.readdirSync('./artifact')) {
+            for (let file of await fs.readdirSync('./artifact/release')) {
               if (path.extname(file) === '.zip') {
                 console.log('uploadReleaseAsset', file);
                 await github.repos.uploadReleaseAsset({
                   owner: context.repo.owner,
                   repo: context.repo.repo,
                   release_id: release_id,
                   name: file,
-                  data: await fs.readFileSync(`./artifact/${file}`)
+                  data: await fs.readFileSync(`./artifact/release/${file}`)
                 });
               }
             }
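The ctest invocations in build.yml now select labels with the regex `'main|curl'`, picking up the newly curl-labeled tests alongside the main suite; this pairs with the added `libcurl4-openssl-dev` dependency and the `-DLLAMA_CURL=ON` configure flag. A quick local sketch of how `-L` filtering behaves (label names per this diff; the actual test inventory varies by build configuration):

```sh
# ctest's -L flag takes a regex matched against test LABELS,
# so 'main|curl' runs anything labeled either main or curl.
ctest -L 'main|curl' --verbose --timeout 900

# List (without running) which tests carry those labels:
ctest -N -L 'main|curl'
```

The artifact shuffle further down exists because actions/download-artifact@v4 unpacks each artifact into its own subdirectory; the new "Move artifacts" step flattens the release zips into ./artifact/release so the upload script can iterate over a single directory.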
10 changes: 8 additions & 2 deletions .github/workflows/docker.yml
@@ -91,14 +91,20 @@ jobs:
             echo "name=${SAFE_NAME}-b${BUILD_NUMBER}-${SHORT_HASH}" >> $GITHUB_OUTPUT
           fi
 
+      - name: Downcase github.repository_owner
+        run: |
+          echo "repository_owner_lowercase=${GITHUB_REPOSITORY_OWNER@L}" >> $GITHUB_ENV
+        env:
+          GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'
+
       - name: Build and push Docker image (versioned)
         if: github.event_name == 'push'
         uses: docker/build-push-action@v4
         with:
           context: .
           push: true
           platforms: ${{ matrix.config.platforms }}
-          tags: "ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
+          tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
           file: ${{ matrix.config.dockerfile }}
 
       - name: Build and push Docker image (tagged)
@@ -107,5 +113,5 @@ jobs:
           context: .
           push: ${{ github.event_name == 'push' }}
           platforms: ${{ matrix.config.platforms }}
-          tags: "ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }},ghcr.io/${{ github.repository_owner }}/llama.cpp:${{ matrix.config.tag }}-${{ steps.tag.outputs.name }}"
+          tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }},ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ steps.tag.outputs.name }}"
           file: ${{ matrix.config.dockerfile }}
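The docker.yml fix works around ghcr.io rejecting image references with uppercase characters: `github.repository_owner` preserves the owner's casing, so the new step lowercases it with bash's `@L` parameter transformation (available in bash 5.1+) and exports it through `$GITHUB_ENV`. A standalone illustration (the owner value here is hypothetical):

```sh
# ${parameter@L} lowercases the expansion (bash 5.1+ parameter transformation).
GITHUB_REPOSITORY_OWNER='Some-Mixed-Case-Owner'   # hypothetical value
echo "${GITHUB_REPOSITORY_OWNER@L}"               # -> some-mixed-case-owner
```

Writing the result to `$GITHUB_ENV` makes it available to later steps as `env.repository_owner_lowercase`, which the `tags:` fields now reference instead of `github.repository_owner`.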
1 change: 1 addition & 0 deletions .gitignore
@@ -48,6 +48,7 @@ models-mnt
 /convert-llama2c-to-ggml
 /embd-input-test
 /embedding
+/eval-callback
 /gguf
 /gguf-llama-simple
 /gguf-split