Add llama.cpp backend #2723

Open
wants to merge 92 commits into base: main

Changes from 74 commits (92 commits total)
aa1fcba
feat(llamacpp): initial commit
mfuntowicz Oct 3, 2024
7d1f8a2
feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros
mfuntowicz Oct 3, 2024
52d57dc
feat(llamacpp): initial end2end build
mfuntowicz Oct 4, 2024
e4432d3
misc(cmake): add parameter to build specific cuda arch
mfuntowicz Oct 18, 2024
fa89d1e
misc(cmake): wut
mfuntowicz Oct 21, 2024
05ad684
feat(llamacpp): enable cuda
mfuntowicz Oct 21, 2024
0911076
feat(backend): correctly load llama.cpp model from llama api and not …
mfuntowicz Oct 22, 2024
098c669
feat(backend): tell cmake to build llama-common and link to it
mfuntowicz Oct 22, 2024
45d5a6a
feat(backend): add some initial decoding steps
mfuntowicz Oct 22, 2024
92bb113
feat(backend): use llama_token as TokenId type
mfuntowicz Oct 22, 2024
d4b5be1
feat(backend): minor refactor
mfuntowicz Oct 23, 2024
37faeb3
feat(backend): expose frequency and repetition penalties
mfuntowicz Oct 23, 2024
f9c2486
chore(backend): minor formatting
mfuntowicz Oct 23, 2024
355d8a5
feat(backend): wip Rust binding
mfuntowicz Oct 24, 2024
e4d803c
feat(backend): build and link through build.rs
mfuntowicz Oct 24, 2024
f0859c2
misc(build): handle different lib destination folder lib/lib64
mfuntowicz Oct 25, 2024
179309b
misc(build): refactor build type detection in cmake
mfuntowicz Oct 25, 2024
a316c53
feat(llamacpp): expose number of threads for the backend when constru…
mfuntowicz Oct 25, 2024
0c1dd0e
feat(llamacpp): wip explosion
mfuntowicz Oct 29, 2024
dbc5b7a
misc(offline): link correctly
mfuntowicz Oct 26, 2024
6115904
misc(offline): expose more parameters for generate
mfuntowicz Oct 28, 2024
b98c635
feat(backend): entirely rewrite backend
mfuntowicz Oct 30, 2024
6a5f6b0
misc(offline): update offline tester
mfuntowicz Oct 30, 2024
d52b4c4
feat(backend): full rework of the backend internal to safer c++
mfuntowicz Oct 31, 2024
3af2c68
misc(offline): match rework
mfuntowicz Oct 31, 2024
f39edc7
feat(backend): add mapping for ignore_eos_token stopping criteria
mfuntowicz Oct 31, 2024
d4aee42
feat(backend): add logit parameter in the callback fn
mfuntowicz Oct 31, 2024
612f2f9
feat(backend): bind incoming request to the server
mfuntowicz Oct 31, 2024
b50dcdd
feat(backend): avoid dropping the boxed stream at the end of the call…
mfuntowicz Nov 2, 2024
3e82f14
feat(backend): somewhat generates the final infer response
mfuntowicz Nov 2, 2024
bd8f0f1
feat(backend): fix invalid reference to ctx instead of context in rel…
mfuntowicz Nov 2, 2024
2cdfed9
feat(backend): correctly link to shared fmt and spdlog instead of static
mfuntowicz Nov 2, 2024
86a2ae6
chore: unused variables
mfuntowicz Nov 2, 2024
7b0a56f
feat(backend): fix memory leaking on llama_sampler when the decode ends
mfuntowicz Nov 3, 2024
31d9254
feat(backend): remove static from inner_fw visitor as it leads to inv…
mfuntowicz Nov 3, 2024
188442f
misc(lint): make clippy happier
mfuntowicz Nov 3, 2024
05ff551
feat(backend): add number of generated tokens in the callback
mfuntowicz Nov 3, 2024
06424aa
feat(backend): correctly handle the max_new_tokens case for is_eos
mfuntowicz Nov 3, 2024
11c593d
feat(backend): make eog clearer on c++ side
mfuntowicz Nov 3, 2024
5b7a951
feat(backend): refactor the callback to handle intermediate and end i…
mfuntowicz Nov 4, 2024
958c72a
misc(ffi): remove unused ffi mapping
mfuntowicz Nov 4, 2024
1473259
feat(backend): add early stopping criteria from TGI stream callback
mfuntowicz Nov 4, 2024
1149186
feat(backend): expose tokenizer to the GenerationContext to decode token
mfuntowicz Nov 4, 2024
52208f5
misc(backend): decrease log verbosity in callback
mfuntowicz Nov 4, 2024
62dba1a
misc(cmake): use url deps and not git repo
mfuntowicz Nov 5, 2024
5884218
misc(backend): missing header <variant>
mfuntowicz Nov 5, 2024
a1154b1
feat(backend): avoid copy constructor
mfuntowicz Nov 5, 2024
7eec0f7
chore(backend): minor fixes mostly format
mfuntowicz Nov 5, 2024
a7afde4
feat(backend): dockerfile
mfuntowicz Nov 5, 2024
2065282
feat(dockerfile): build process
mfuntowicz Nov 6, 2024
26d0266
feat(backend): handle all the tokenization failure and send back to t…
mfuntowicz Nov 6, 2024
cf17928
misc(cmake): remove dependency on fmt
mfuntowicz Nov 7, 2024
4f5397c
misc(cmake): use URL base llama.cpp repo
mfuntowicz Nov 7, 2024
86d30ae
feat(backend): simplify overall cpp structure
mfuntowicz Nov 9, 2024
6915fa3
feat(backend): remove reinterpret_cast converting from uint32_t to ll…
mfuntowicz Nov 9, 2024
7e2890f
feat(backend): remove unused function
mfuntowicz Nov 11, 2024
488ba93
feat(backend): fix invalid reference to context in release mode
mfuntowicz Nov 11, 2024
363d5e4
feat(backend): use std::ranges to map uint32_t to llama_token
mfuntowicz Nov 12, 2024
02cd6fe
chore(backend): minor improvements
mfuntowicz Nov 12, 2024
daf1631
dockerfile(backend): initial working version of llama.cpp container
mfuntowicz Nov 12, 2024
57b2154
feat(backend): simplify Rust callback
mfuntowicz Nov 12, 2024
6f059c4
feat(backend): wrap Arc tokenizer to avoid duplicating
mfuntowicz Nov 14, 2024
70c90ad
feat(backend): update llamacpp to 4077
mfuntowicz Nov 14, 2024
23d2bcf
misc(build): improve build process
mfuntowicz Nov 14, 2024
5335bf9
feat(backend): multistream inference on CPU
mfuntowicz Nov 20, 2024
50c3766
feat(backend): bind thread and memory affinity for thread
mfuntowicz Nov 21, 2024
84eead2
feat(backend): correctly setup llama_context providing n_threads and …
mfuntowicz Nov 21, 2024
5a85661
feat(backend): rely on multi-consumer queue to schedule workers
mfuntowicz Nov 22, 2024
30ae996
misc(docker): add numa lib as dependency
mfuntowicz Nov 22, 2024
2d9465d
misc(backend): allow rebinding numa core affinity
mfuntowicz Nov 22, 2024
4ee2ee5
misc(license): update LICENSE
mfuntowicz Nov 22, 2024
b9c04b9
misc(doc): c++ documentation
mfuntowicz Nov 22, 2024
862a519
misc(doc): rust documentation
mfuntowicz Nov 22, 2024
9025a26
chore: remove unrelated change to trtllm
mfuntowicz Nov 22, 2024
bbe95ca
Update Dockerfile.llamacpp as per review
mfuntowicz Nov 28, 2024
d918e6a
Update Dockerfile.llamacpp as per review
mfuntowicz Nov 28, 2024
274cfce
feat(backend): remove core overriding in the Rust backend
mfuntowicz Nov 28, 2024
8e89793
feat(backend): use the new batch api from llama
mfuntowicz Nov 28, 2024
298367c
feat(backend): fix when num_cores_per_instance equals zero with…
mfuntowicz Nov 28, 2024
929a2fc
feat(backend): add some test to the backend for core allocation
mfuntowicz Nov 28, 2024
df72c56
feat(backend): add guard in case top_k = 0
mfuntowicz Nov 28, 2024
9d659f1
feat(backend): add missing temperature parameter
mfuntowicz Nov 28, 2024
6c5a75b
misc(offline): update model creation as std::shared_ptr
mfuntowicz Nov 28, 2024
b1ebc8f
feat(backend): update llama.cpp to 4215
mfuntowicz Nov 28, 2024
dc6435e
feat(backend): create llama_context_params with default factory
mfuntowicz Nov 28, 2024
b10eaab
feat(backend): use new batch API to generate tokens
mfuntowicz Nov 28, 2024
59b0ef3
feat: Fix Cmakelist to allow building on Darwin platform (#2785)
Hugoch Nov 28, 2024
f5c4cee
feat(backend): correctly link to all libraries
mfuntowicz Nov 29, 2024
db41776
feat(backend): add mimalloc memory allocator to the container
mfuntowicz Nov 29, 2024
c9f6c3a
feat(backend): better map exception throw on C++ side
mfuntowicz Nov 29, 2024
e0dda9b
feat(backend): use c++ defined types for llama.cpp
mfuntowicz Nov 29, 2024
182ffaf
misc: use return Ok(())
mfuntowicz Dec 12, 2024
127 changes: 127 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -7,7 +7,7 @@ members = [
     "backends/trtllm",
     "launcher",
     "router"
-]
+, "backends/llamacpp"]
 default-members = [
     "benchmark",
     "backends/v2",
74 changes: 74 additions & 0 deletions Dockerfile.llamacpp
@@ -0,0 +1,74 @@
# Build dependencies resolver stage
FROM lukemathwalker/cargo-chef:latest AS chef
WORKDIR /usr/src/text-generation-inference/

FROM chef AS planner
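# Planner stage: computes the cargo-chef dependency recipe from the workspace manifests.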
COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY backends backends
COPY benchmark benchmark
COPY clients clients
COPY launcher launcher
COPY router router

RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
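# Builder stage: cooks the cached dependency layer, then compiles the llamacpp backend binary.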
ENV CMAKE_INSTALL_PREFIX=/usr/src/text-generation-inference/dist
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
apt update && DEBIAN_FRONTEND=noninteractive apt install -y \
clang \
cmake \
gcc g++ \
libc++-dev \
libnumactl-dev \
libopenmpi-dev \
libssl-dev \
ninja-build \
openssl \
python3-dev

RUN update-alternatives --install /usr/bin/cc cc /usr/bin/clang 10 \
&& update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang 10 \
&& update-alternatives --auto cc \
&& update-alternatives --auto c++ \
&& update-alternatives --display cc \
&& update-alternatives --display c++ \
&& cc --version \
&& c++ --version

COPY --from=planner /usr/src/text-generation-inference/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --package text-generation-backend-llamacpp --bin text-generation-backend-llamacpp --recipe-path recipe.json

COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY backends backends
COPY benchmark benchmark
COPY launcher launcher
COPY router router

ENV RUSTFLAGS="-L/usr/lib"
ENV CMAKE_INSTALL_PREFIX=/usr/src/text-generation-inference/dist
RUN cargo build --profile release-opt --package text-generation-backend-llamacpp --bin text-generation-backend-llamacpp --frozen

FROM ubuntu:22.04
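# Runtime stage: slim image carrying only the runtime libraries, the launcher binary and the dist artifacts.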
ENV DEBIAN_FRONTEND=noninteractive

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
apt update && \
apt upgrade -y && \
apt install -y \
numactl \
openssl \
python3.11-dev

COPY --from=builder /usr/src/text-generation-inference/target/release-opt/text-generation-backend-llamacpp /usr/src/text-generation-inference/text-generation-launcher
COPY --from=builder /usr/src/text-generation-inference/dist /usr/

ENV PORT=8080
WORKDIR /usr/src/text-generation-inference
ENTRYPOINT ["text-generation-launcher"]
3 changes: 2 additions & 1 deletion LICENSE
@@ -1,3 +1,4 @@
+
 Apache License
 Version 2.0, January 2004
 http://www.apache.org/licenses/
@@ -186,7 +187,7 @@
 same "printed page" as the copyright notice for easier
 identification within third-party archives.

-Copyright 2022 Hugging Face
+Copyright 2024 Hugging Face Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
62 changes: 62 additions & 0 deletions backends/llamacpp/CMakeLists.txt
@@ -0,0 +1,62 @@
cmake_minimum_required(VERSION 3.24)

project(tgi-llama-cpp-backend VERSION 1.0.0)
set(CMAKE_CXX_STANDARD 23)

include(FetchContent)

set(LLAMA_CPP_TARGET_VERSION "b3837" CACHE STRING "Version of llama.cpp to build against")
set(LLAMA_CPP_TARGET_CUDA_ARCHS "75-real;80-real;86-real;89-real;90-real" CACHE STRING "CUDA arch(s) to build")
option(LLAMA_CPP_BUILD_OFFLINE_RUNNER "Flag to build the standalone c++ backend runner")
option(LLAMA_CPP_BUILD_CUDA "Flag to build CUDA enabled inference through llama.cpp")

if (${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang" AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
message(STATUS "Targeting libc++")
set(CMAKE_CXX_FLAGS -stdlib=libc++ ${CMAKE_CXX_FLAGS})
else ()
message(STATUS "Not using libc++ ${CMAKE_CXX_COMPILER_ID} ${CMAKE_SYSTEM_NAME}")
endif ()

# Add dependencies
include(cmake/numa.cmake)
include(cmake/spdlog.cmake)

if (${LLAMA_CPP_BUILD_CUDA})
message(STATUS "Enabling llama.cpp CUDA support")

if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES ${LLAMA_CPP_TARGET_CUDA_ARCHS})
endif ()
set(GGML_CUDA ON)
endif ()

# Download llama.cpp repo at the specific version
fetchcontent_declare(
llama
URL https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4077.tar.gz
)

fetchcontent_makeavailable(llama)

add_library(tgi_llamacpp_backend_impl STATIC csrc/backend.hpp csrc/backend.cpp)
target_compile_features(tgi_llamacpp_backend_impl PRIVATE cxx_std_11)
target_link_libraries(tgi_llamacpp_backend_impl PUBLIC spdlog::spdlog llama)

if (NUMA_FOUND)
target_link_libraries(tgi_llamacpp_backend_impl PUBLIC numa)
endif ()

install(TARGETS tgi_llamacpp_backend_impl spdlog llama)

if (${CMAKE_BUILD_TYPE} STREQUAL "Debug")
target_compile_definitions(tgi_llamacpp_backend_impl PRIVATE TGI_LLAMACPP_BACKEND_DEBUG=1)
endif ()

if (${LLAMA_CPP_BUILD_OFFLINE_RUNNER})
message(STATUS "Building llama.cpp offline runner")
add_executable(tgi_llamacpp_offline_runner offline/main.cpp)

target_link_libraries(tgi_llamacpp_offline_runner PUBLIC tgi_llamacpp_backend_impl llama spdlog::spdlog)
endif ()
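The static library built above is consumed from Rust through the cxx crate (see the backend's Cargo.toml below). The following is a rough, hypothetical sketch of how such a bridge can be declared; the namespace, type, and function names are invented for illustration and are not taken from the PR:

#[cxx::bridge(namespace = "huggingface::tgi::backends::llamacpp")]
mod ffi {
    unsafe extern "C++" {
        include!("backends/llamacpp/csrc/backend.hpp");

        // Opaque handle over the C++ worker that owns the llama.cpp model.
        type LlamaCppWorker;

        // Hypothetical constructor: load a GGUF model with a fixed thread count.
        fn create_worker(gguf_path: &str, num_threads: u32) -> Result<UniquePtr<LlamaCppWorker>>;

        // Hypothetical decode loop: C++ invokes the Rust callback once per token
        // with (token_id, logit, is_end_of_generation, n_generated_tokens),
        // mirroring the callback parameters mentioned in the commit log.
        fn generate(
            self: Pin<&mut LlamaCppWorker>,
            tokens: &[u32],
            max_new_tokens: u32,
            on_token: fn(u32, f32, bool, usize),
        ) -> Result<usize>;
    }
}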


33 changes: 33 additions & 0 deletions backends/llamacpp/Cargo.toml
@@ -0,0 +1,33 @@
[package]
name = "text-generation-backend-llamacpp"
version.workspace = true
edition.workspace = true
authors.workspace = true
homepage.workspace = true

[dependencies]
async-trait = "0.1"
async-channel = "2.3"
clap = { version = "4.5.19", features = ["derive"] }
cxx = "1.0"
num_cpus = "1"
hf-hub = { workspace = true }
image = { version = "0.25.1", features = ["default-formats"] }
metrics = { workspace = true }
metrics-exporter-prometheus = { workspace = true }
serde_json = "1.0.128"
text-generation-router = { path = "../../router" }
thiserror = "1.0.64"
tokio = "1.40.0"
tokio-stream = "0.1.16"
tokenizers = { workspace = true }
tracing = "0.1"
tracing-opentelemetry = "0.27.0"
tracing-subscriber = { version = "0.3", features = ["json", "env-filter"] }
utoipa = { version = "4.2.3", features = ["axum_extras"] }
log = "0.4.22"

[build-dependencies]
cmake = "0.1"
cxx-build = { version = "1.0", features = ["parallel"] }
pkg-config = "0.3"
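
With cmake and cxx-build as build-dependencies, and given the commit "feat(backend): build and link through build.rs", the crate plausibly drives the CMake project and the bridge generation from a build script. Below is a minimal sketch under those assumptions; the library names, paths, and flags are illustrative, not the PR's actual build.rs:

// build.rs (hypothetical sketch)
fn main() {
    // Configure and install the CMake project at the crate root;
    // `dst` is the install prefix inside OUT_DIR.
    let dst = cmake::Config::new(".")
        .define("CMAKE_BUILD_TYPE", "Release")
        .build();

    // Generate and compile the cxx bridge declared in src/lib.rs.
    cxx_build::bridge("src/lib.rs")
        .flag_if_supported("-std=c++23")
        .compile("tgi_llamacpp_bridge");

    // Link against the installed static backend library; search both lib/
    // and lib64/, matching the commit that handles both destination folders.
    println!("cargo:rustc-link-search=native={}/lib", dst.display());
    println!("cargo:rustc-link-search=native={}/lib64", dst.display());
    println!("cargo:rustc-link-lib=static=tgi_llamacpp_backend_impl");

    // Rebuild when the bridge or the C++ sources change.
    println!("cargo:rerun-if-changed=src/lib.rs");
    println!("cargo:rerun-if-changed=csrc");
}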