llama2-server-docker-gpu

This repository contains scripts that make it easy to run a GPU-accelerated Llama 2 REST server in a Docker container. The server only runs models that are hosted on the HuggingFace Hub and are compatible with llama.cpp.

For GPU support, the llama-cpp-python build from https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX/cu122 (cuBLAS wheels for CUDA 12.2) is used.
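
If you want to reproduce the image setup manually, an index like that is normally passed to pip as an extra index; the exact command used in the Dockerfile may differ, so treat this as an illustration:

# install a prebuilt cuBLAS wheel of llama-cpp-python (illustrative, not the repo's exact command)
pip install llama-cpp-python --prefer-binary \
  --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX/cu122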

Pre-requisites

To actually run the server on the GPU, you need the NVIDIA GPU drivers and the NVIDIA container runtime installed on the host, in addition to Docker itself.

The scripts/rhel7-install-nvidia-runtime.sh script shows how those dependencies can be installed on RHEL 7.

To verify that the drivers and the runtime are visible from inside a container, run the following command:

docker run --rm --runtime=nvidia -it -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:12.2.0-devel-ubuntu20.04 nvidia-smi
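
If the container cannot see the GPU, it is usually worth checking the host itself. These are generic checks, not part of the repository's scripts:

# check that the driver is loaded and sees the GPU(s)
nvidia-smi

# check that the nvidia runtime is registered with Docker
docker info | grep -i nvidia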

Run llama server

First, build the Docker image:

cd docker && ./build.sh

Then you can run the server with the run-server.sh script, which takes the following parameters:

  • --hg-repo-id - ID of the HuggingFace repository containing the model
  • --hg-filename - Name of the file containing the model

I've tested this with the repositories used in the examples below.

The chosen model will be cached in the models directory, so it is downloaded only once.
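
For example, after the 7B example below has been run once, the downloaded GGML file should show up locally (the exact cache layout is an assumption):

# list the cached model files
ls models/
# llama-2-7b-chat.ggmlv3.q2_K.bin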

Examples

Run Llama-2-7B-Chat-GGML with q2_K quantization on the GPU

./run-server.sh --hg-repo-id TheBloke/Llama-2-7B-Chat-GGML --hg-filename llama-2-7b-chat.ggmlv3.q2_K.bin --n_gpu_layers 2048

Run Llama-2-70B-Chat-GGML with q5_K_S quantization on the GPU (tested on a g5.12xlarge instance)

./run-server.sh --hg-repo-id TheBloke/Llama-2-70B-Chat-GGML --hg-filename llama-2-70b-chat.ggmlv3.q5_K_S.bin --n_gpu_layers 2048 --n_gqa 8

Server parameters

run-server.sh is a wrapper around the llama-cpp-python server, so it accepts the same parameters as the original server. You can see the list of parameters by running:

./run-server.sh --help

You will need to play with them a bit to find the optimal configuration for your use case.
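
For instance, the context window and batch size are commonly tuned llama-cpp-python settings; the values below are illustrative, not recommendations:

# same 7B model as above, with a larger context window and batch size
./run-server.sh --hg-repo-id TheBloke/Llama-2-7B-Chat-GGML --hg-filename llama-2-7b-chat.ggmlv3.q2_K.bin \
    --n_gpu_layers 2048 --n_ctx 4096 --n_batch 512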

Llama client

llama_client.py contains a simple Python REST client for the Llama server. It can be used as follows:

from llama2_server.llama_client import (
    LlamaClient,
    CreateCompletionRequest,
    CreateChatCompletionRequest,
    ChatCompletionRequestMessage,
    CreateEmbeddingRequest,
)

client = LlamaClient("http://localhost:8080")

# completions
print(client.create_completions(CreateCompletionRequest(prompt="Name all planets in the solar system")))

# chat completions
print(client.create_chat_completion(CreateChatCompletionRequest(messages=[
    ChatCompletionRequestMessage(role="system", content="You are a well-known astronomer"),
    ChatCompletionRequestMessage(role="user", content="List all planets in the solar system"),
])))

# embeddings
print(client.create_embeddings(CreateEmbeddingRequest(input=["Hello world!"])))
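
Since the server is the llama-cpp-python server underneath, it exposes OpenAI-compatible REST endpoints, so plain HTTP clients work as well. Assuming the stock /v1 routes:

# call the completions endpoint directly over HTTP
curl -s http://localhost:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Name all planets in the solar system", "max_tokens": 128}'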

CLI

After running poetry install && poetry shell you should be able to call LlamaClient via the llama-cli command from your terminal:

Completions

llama-cli completion --llama-url 'http://localhost:8080' --prompt 'List all planets in the solar system'

Chat completions

llama-cli chat-completion --max_tokens 1024 --llama-url 'http://localhost:8080' --message 'system|You are a well-known astronomer' --message 'user|List all planets in the solar system'

Embeddings

llama-cli embeddings --llama-url 'http://localhost:8080' --text 'List all planets in the solar system'
