merge upstream #34

Merged: 70 commits, Aug 28, 2024

Commits
5ef07e2
server : handle models with missing EOS token (#8997)
ggerganov Aug 12, 2024
d3ae0ee
py : fix requirements check '==' -> '~=' (#8982)
ggerganov Aug 12, 2024
2589292
Fix a spelling mistake (#9001)
Septa2112 Aug 12, 2024
df5478f
ggml: fix div-by-zero (#9003)
DavidKorczynski Aug 12, 2024
1262e7e
grammar-parser : fix possible null-deref (#9004)
DavidKorczynski Aug 12, 2024
84eb2f4
docs: introduce gpustack and gguf-parser (#8873)
thxCode Aug 12, 2024
0fd93cd
llama : model-based max number of graph nodes calculation (#8970)
nicoboss Aug 12, 2024
1f67436
ci : enable RPC in all of the released builds (#9006)
rgerganov Aug 12, 2024
fc4ca27
ci : fix github workflow vulnerable to script injection (#9008)
diogoteles08 Aug 12, 2024
828d6ff
export-lora : throw error if lora is quantized (#9002)
ngxson Aug 13, 2024
06943a6
ggml : move rope type enum to ggml.h (#8949)
danbev Aug 13, 2024
43bdd3c
cmake : remove unused option GGML_CURL (#9011)
ggerganov Aug 14, 2024
98a532d
server : fix segfault on long system prompt (#8987)
compilade Aug 14, 2024
5fd89a7
Vulkan Optimizations and Fixes (#8959)
0cc4m Aug 14, 2024
234b306
server : init stop and error fields of the result struct (#9026)
jpodivin Aug 15, 2024
d5492f0
ci : disable bench workflow (#9010)
ggerganov Aug 15, 2024
6bda7ce
llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850)
Exploder98 Aug 15, 2024
4af8420
common : remove duplicate function llama_should_add_bos_token (#8778)
kylo5aby Aug 15, 2024
37501d9
server : fix duplicated n_predict key in the generation_settings (#8994)
snowyu Aug 15, 2024
4b9afbb
retrieval : fix memory leak in retrieval query handling (#8955)
gtygo Aug 15, 2024
e3f6fd5
ggml : dynamic ggml_sched_max_splits based on graph_size (#9047)
nicoboss Aug 16, 2024
2a24c8c
Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)
suhara Aug 16, 2024
fb487bb
common : add support for cpu_get_num_physical_cores() on Windows (#8771)
Septa2112 Aug 16, 2024
c679e0c
llama : add EXAONE model support (#9025)
mscheong01 Aug 16, 2024
23fd453
gguf-py : bump version from 0.9.1 to 0.10.0 (#9051)
compilade Aug 16, 2024
c8ddce8
Fix inference example lacks required parameters (#9035)
Aisuko Aug 16, 2024
ee2984b
py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928)
farbodbj Aug 16, 2024
d565bb2
llava : support MiniCPM-V-2.6 (#8967)
tc-mb Aug 16, 2024
8b3befc
server : refactor middleware and /health endpoint (#9056)
ngxson Aug 16, 2024
2fb9267
Fix incorrect use of ctx_split for bias tensors (#9063)
suhara Aug 17, 2024
2339a0b
tests : add integration test for lora adapters (#8957)
ltoniazzi Aug 18, 2024
554b049
flake.lock: Update (#9068)
ggerganov Aug 18, 2024
18eaf29
rpc : prevent crashes on invalid input (#9040)
rgerganov Aug 19, 2024
1b6ff90
rpc : print error message when failed to connect endpoint (#9042)
rgerganov Aug 19, 2024
cfac111
cann: add doc for cann backend (#8867)
wangshuai09 Aug 19, 2024
90db814
tests : add missing comma in grammar integration tests (#9099)
fairydreaming Aug 20, 2024
4f8d19f
[SYCL] Fix SYCL `im2col` and `convert` Overflow with Large Dims (#9052)
zhentaoyu Aug 20, 2024
50addec
[SYCL] fallback mmvq (#9088)
airMeng Aug 20, 2024
2f3c146
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the L…
cyzero-kim Aug 20, 2024
8455340
llama : std::move llm_bigram_bpe from work_queue (#9062)
danbev Aug 21, 2024
f63f603
llava : zero-initialize clip_ctx structure fields with aggregate init…
fairydreaming Aug 21, 2024
b40eb84
llama : support for `falcon-mamba` architecture (#9074)
younesbelkada Aug 21, 2024
fc54ef0
server : support reading arguments from environment variables (#9105)
ngxson Aug 21, 2024
a1631e5
llama : simplify Mamba with advanced batch splits (#8526)
compilade Aug 21, 2024
1731d42
[SYCL] Add oneDNN primitive support (#9091)
luoyu-intel Aug 22, 2024
11b84eb
[SYCL] Add a space to supress a cmake warning (#9133)
qnixsynapse Aug 22, 2024
a07c32e
llama : use F32 precision in GLM4 attention and no FA (#9130)
piDack Aug 23, 2024
3ba780e
lora : fix llama conversion script with ROPE_FREQS (#9117)
ngxson Aug 23, 2024
8f824ff
quantize : fix typo in usage help of `quantize.cpp` (#9145)
joaodinissf Aug 24, 2024
e11bd85
CPU/CUDA: Gemma 2 FlashAttention support (#8542)
JohannesGaessler Aug 24, 2024
f91fc56
CUDA: fix Gemma 2 numerical issues for FA (#9166)
JohannesGaessler Aug 25, 2024
93bc383
common: fixed not working find argument --n-gpu-layers-draft (#9175)
GermanAizek Aug 25, 2024
436787f
llama : fix time complexity of string replacement (#9163)
jart Aug 26, 2024
f12ceac
ggml-ci : try to improve build time (#9160)
slaren Aug 26, 2024
0c41e03
metal : gemma2 flash attention support (#9159)
slaren Aug 26, 2024
e5edb21
server : update deps (#9183)
ggerganov Aug 26, 2024
7a3df79
ci : add VULKAN support to ggml-ci (#9055)
ggerganov Aug 26, 2024
879275a
tests : fix compile warnings for unreachable code (#9185)
ggerganov Aug 26, 2024
fc18425
ggml : add SSM Metal kernels (#8546)
ggerganov Aug 26, 2024
06658ad
metal : separate scale and mask from QKT in FA kernel (#9189)
ggerganov Aug 26, 2024
7d787ed
ggml : do not crash when quantizing q4_x_x with an imatrix (#9192)
slaren Aug 26, 2024
ad76569
common : Update stb_image.h to latest version (#9161)
arch-btw Aug 27, 2024
75e1dbb
llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)
nyxkrage Aug 27, 2024
2e59d61
llama : fix ChatGLM4 wrong shape (#9194)
CausalLM Aug 27, 2024
a77feb5
server : add some missing env variables (#9116)
ngxson Aug 27, 2024
78eb487
llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156)
compilade Aug 27, 2024
3246fe8
Fix minicpm example directory (#9111)
xyb Aug 27, 2024
231cff5
sync : ggml
ggerganov Aug 27, 2024
20f1789
vulkan : fix build (#0)
ggerganov Aug 27, 2024
23e298c
Merge branch 'ggerganov:master' into master
l3utterfly Aug 28, 2024
Files changed
44 changes: 44 additions & 0 deletions .devops/llama-cli-cann.Dockerfile
@@ -0,0 +1,44 @@
ARG ASCEND_VERSION=8.0.rc2.alpha003-910b-openeuler22.03-py3.8

FROM cosdt/cann:$ASCEND_VERSION AS build

WORKDIR /app

COPY . .

RUN yum install -y gcc g++ cmake make
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

# find libascend_hal.so, because the driver hasn't been mounted.
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/runtime/lib64/stub:$LD_LIBRARY_PATH

RUN echo "Building with static libs" && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh --force && \
cmake -B build -DGGML_CANN=ON -DBUILD_SHARED_LIBS=OFF && \
cmake --build build --config Release --target llama-cli

# TODO: use image with NNRT
FROM cosdt/cann:$ASCEND_VERSION AS runtime
COPY --from=build /app/build/bin/llama-cli /llama-cli

ENV LC_ALL=C.utf8

ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

ENTRYPOINT ["/llama-cli" ]
2 changes: 2 additions & 0 deletions .devops/llama-server-cuda.Dockerfile
@@ -24,6 +24,8 @@ ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ENV GGML_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

RUN make -j$(nproc) llama-server

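A minimal sketch of why this setting matters (the image tag, model path, and port are placeholders for illustration): with LLAMA_ARG_HOST=0.0.0.0 the server binds to all interfaces inside the container, so a published port is reachable from the host.

# run the CUDA server image and publish its port to the host
docker run --gpus all -p 8080:8080 -v /path/to/models:/models \
  local/llama.cpp:server-cuda -m /models/model.gguf
# without the 0.0.0.0 bind, the server would listen on 127.0.0.1 inside the
# container and this request from the host would fail
curl -f http://localhost:8080/health
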
2 changes: 2 additions & 0 deletions .devops/llama-server-intel.Dockerfile
@@ -26,6 +26,8 @@ RUN apt-get update && \
COPY --from=build /app/build/bin/llama-server /llama-server

ENV LC_ALL=C.utf8
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

2 changes: 2 additions & 0 deletions .devops/llama-server-rocm.Dockerfile
@@ -39,6 +39,8 @@ ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
ENV GGML_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

# Enable cURL
ENV LLAMA_CURL=1
2 changes: 2 additions & 0 deletions .devops/llama-server-vulkan.Dockerfile
@@ -23,6 +23,8 @@ RUN cp /app/build/bin/llama-server /llama-server && \
rm -rf /app

ENV LC_ALL=C.utf8
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

2 changes: 2 additions & 0 deletions .devops/llama-server.Dockerfile
@@ -21,6 +21,8 @@ RUN apt-get update && \
COPY --from=build /app/llama-server /llama-server

ENV LC_ALL=C.utf8
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

2 changes: 1 addition & 1 deletion .ecrc
@@ -1,5 +1,5 @@
{
"Exclude": ["^\\.gitmodules$"],
"Exclude": ["^\\.gitmodules$", "stb_image\\.h"],
"Disable": {
"IndentSize": true
}
.github/workflows/bench.yml
@@ -1,3 +1,6 @@
# TODO: there have been some issues with the workflow, so disabling for now
# https://github.com/ggerganov/llama.cpp/issues/7893
#
# Benchmark
name: Benchmark

@@ -129,6 +132,8 @@ jobs:

- name: Server bench
id: server_bench
env:
HEAD_REF: ${{ github.head_ref || github.ref_name }}
run: |
set -eux

@@ -137,7 +142,7 @@
python bench.py \
--runner-label ${{ env.RUNNER_LABEL }} \
--name ${{ github.job }} \
--branch ${{ github.head_ref || github.ref_name }} \
--branch $HEAD_REF \
--commit ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha }} \
--scenario script.js \
--duration ${{ github.event.inputs.duration || env.DURATION }} \
22 changes: 10 additions & 12 deletions .github/workflows/build.yml
@@ -47,7 +47,7 @@ jobs:
sysctl -a
mkdir build
cd build
cmake -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF ..
cmake -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF ..
cmake --build . --config Release -j $(sysctl -n hw.logicalcpu)

- name: Test
@@ -105,7 +105,7 @@ jobs:
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggerganov/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL=OFF -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF
cmake -B build -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL=OFF -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

- name: Test
@@ -222,7 +222,7 @@ jobs:
run: |
mkdir build
cd build
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF
cmake --build . --config Release -j $(nproc)

- name: Test
@@ -696,22 +696,20 @@ jobs:
strategy:
matrix:
include:
- build: 'rpc-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=ON'
- build: 'noavx-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DBUILD_SHARED_LIBS=ON'
- build: 'avx2-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=ON'
- build: 'avx-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX2=OFF -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX2=OFF -DBUILD_SHARED_LIBS=ON'
- build: 'avx512-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX512=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX512=ON -DBUILD_SHARED_LIBS=ON'
- build: 'openblas-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_BLAS=ON -DBUILD_SHARED_LIBS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BLAS=ON -DBUILD_SHARED_LIBS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'kompute-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON -DBUILD_SHARED_LIBS=ON'
- build: 'vulkan-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=ON'
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=ON'
- build: 'msvc-arm64'
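Since GGML_RPC=ON is now part of every released Windows build above, a hedged sketch of how the RPC backend is typically exercised (hostnames, ports, layer counts, and the model path are placeholders):

# on the machine contributing compute, start an RPC worker
./rpc-server -H 0.0.0.0 -p 50052
# on the client, offload layers to that worker over the network
./llama-cli -m ./model.gguf --rpc 192.168.1.10:50052 -ngl 99 -p "Hello"
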
6 changes: 2 additions & 4 deletions .github/workflows/python-check-requirements.yml
@@ -6,15 +6,13 @@ on:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -129,3 +129,6 @@ poetry.toml

# Scripts
!/scripts/install-oneapi.bat

# Test models for lora adapters
/lora-tests
5 changes: 4 additions & 1 deletion CMakePresets.json
@@ -28,6 +28,7 @@
{ "name": "release", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Release" } },
{ "name": "reldbg", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "RelWithDebInfo" } },
{ "name": "static", "hidden": true, "cacheVariables": { "GGML_STATIC": "ON" } },
{ "name": "sycl_f16", "hidden": true, "cacheVariables": { "GGML_SYCL_F16": "ON" } },

{
"name": "arm64-windows-msvc", "hidden": true,
@@ -60,6 +61,8 @@
{ "name": "x64-windows-msvc+static-release", "inherits": [ "base", "reldbg", "static" ] },

{ "name": "x64-windows-sycl-debug" , "inherits": [ "sycl-base", "debug" ] },
{ "name": "x64-windows-sycl-release", "inherits": [ "sycl-base", "release" ] }
{ "name": "x64-windows-sycl-debug-f16", "inherits": [ "sycl-base", "debug", "sycl_f16" ] },
{ "name": "x64-windows-sycl-release", "inherits": [ "sycl-base", "release" ] },
{ "name": "x64-windows-sycl-release-f16", "inherits": [ "sycl-base", "release", "sycl_f16" ] }
]
}
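
For reference, a hedged sketch of how the new SYCL FP16 presets would be invoked (the build directory name is an assumption based on the base preset's usual binaryDir of build-<preset>):

# configure and build with the SYCL FP16 release preset
cmake --preset x64-windows-sycl-release-f16
cmake --build build-x64-windows-sycl-release-f16 --config Release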
4 changes: 4 additions & 0 deletions Makefile
@@ -763,6 +763,10 @@ ifdef GGML_VULKAN_MEMORY_DEBUG
MK_CPPFLAGS += -DGGML_VULKAN_MEMORY_DEBUG
endif

ifdef GGML_VULKAN_PERF
MK_CPPFLAGS += -DGGML_VULKAN_PERF
endif

ifdef GGML_VULKAN_VALIDATE
MK_CPPFLAGS += -DGGML_VULKAN_VALIDATE
endif
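As with the other GGML_VULKAN_* switches in this Makefile, the new flag is passed as a make variable; a hedged example (the target and job count are illustrative):

# build with Vulkan enabled and per-op performance logging compiled in
make GGML_VULKAN=1 GGML_VULKAN_PERF=1 llama-cli -j$(nproc)
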
5 changes: 5 additions & 0 deletions README.md
@@ -105,6 +105,8 @@ Typically finetunes of the base models below are supported as well.
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)

(instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))

@@ -186,10 +188,12 @@ Unless otherwise noted these projects are open-source with permissive licensing:

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage

**Infrastructure:**

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs

**Games:**
- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
@@ -422,6 +426,7 @@ Please refer to [Build llama.cpp locally](./docs/build.md)
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
| [Vulkan](./docs/build.md#vulkan) | GPU |
| [CANN](./docs/build.md#cann) | Ascend NPU |

## Tools

29 changes: 17 additions & 12 deletions ci/run.sh
@@ -13,6 +13,9 @@
# # with SYCL support
# GG_BUILD_SYCL=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with VULKAN support
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#

if [ -z "$2" ]; then
echo "usage: $0 <output-dir> <mnt-dir>"
@@ -40,7 +43,7 @@ if [ ! -z ${GG_BUILD_METAL} ]; then
fi

if [ ! -z ${GG_BUILD_CUDA} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_CUDA=1"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native"
fi

if [ ! -z ${GG_BUILD_SYCL} ]; then
@@ -52,6 +55,10 @@ if [ ! -z ${GG_BUILD_SYCL} ]; then

CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_SYCL=1 DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON"
fi

if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_VULKAN=1"
fi
## helpers

# download a file if it does not exist or if it is outdated
@@ -107,7 +114,7 @@ function gg_run_ctest_debug {
gg_check_build_requirements

(time cmake -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

(time ctest --output-on-failure -L main -E test-opt ) 2>&1 | tee -a $OUT/${ci}-ctest.log

@@ -138,7 +145,7 @@ function gg_run_ctest_release {
gg_check_build_requirements

(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

if [ -z ${GG_BUILD_LOW_PERF} ]; then
(time ctest --output-on-failure -L main ) 2>&1 | tee -a $OUT/${ci}-ctest.log
@@ -266,7 +273,6 @@ function gg_sum_ctest_with_model_release {
}

# open_llama_7b_v2
# requires: GG_BUILD_CUDA

function gg_run_open_llama_7b_v2 {
cd ${SRC}
@@ -290,8 +296,8 @@ function gg_run_open_llama_7b_v2 {

set -e

(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} -DGGML_CUDA=1 .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

python3 ../examples/convert_legacy_llama.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -425,7 +431,7 @@ function gg_run_pythia_1_4b {
set -e

(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -535,7 +541,6 @@ function gg_sum_pythia_1_4b {
}

# pythia_2_8b
# requires: GG_BUILD_CUDA

function gg_run_pythia_2_8b {
cd ${SRC}
@@ -556,8 +561,8 @@ function gg_run_pythia_2_8b {

set -e

(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} -DGGML_CUDA=1 .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -692,7 +697,7 @@ function gg_run_embd_bge_small {
set -e

(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log

python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -761,7 +766,7 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
fi

if [ -z ${GG_BUILD_VRAM_GB} ] || [ ${GG_BUILD_VRAM_GB} -ge 8 ]; then
if [ -z ${GG_BUILD_CUDA} ]; then
if [ -z ${GG_BUILD_CUDA} ] && [ -z ${GG_BUILD_VULKAN} ]; then
test $ret -eq 0 && gg_run pythia_1_4b
else
test $ret -eq 0 && gg_run pythia_2_8b