Windows llama.cpp

A PowerShell automation to rebuild llama.cpp for a Windows environment. It automates the following steps:

  1. Fetching and extracting a specific release of OpenBLAS
  2. Fetching the latest version of llama.cpp
  3. Fixing the OpenBLAS binding in the CMakeLists.txt
  4. Rebuilding the binaries with CMake
  5. Updating the Python dependencies
  6. Automatically detecting the best BLAS acceleration

BLAS support

This script currently supports OpenBLAS for CPU BLAS acceleration and CUDA for NVIDIA GPU BLAS acceleration.
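
The accelerator can also be selected explicitly via the build script's -blasAccelerator parameter, for example:

./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS"

See the Build section below for all supported values.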

Installation

1. Install Prerequisites

Download and install the latest versions of the required tools: Git, CMake, a Conda distribution and Visual Studio 2022. For NVIDIA GPU acceleration you also need the CUDA Toolkit.

Tip

When installing Visual Studio 2022 it is sufficient to install just the Build Tools for Visual Studio 2022 package. Also make sure that the Desktop development with C++ workload is enabled in the installer.
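
A minimal sketch of a command line installation via winget (the package ids are assumptions and should be verified with winget search; installing manually from the vendors' websites works just as well):

# Install the prerequisites via winget (package ids are assumptions, verify with: winget search <name>).
winget install --id Git.Git
winget install --id Kitware.CMake
winget install --id Anaconda.Miniconda3
winget install --id Microsoft.VisualStudio.2022.BuildTools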

2. Enable Hardware Accelerated GPU Scheduling (optional)

Execute the following in a PowerShell terminal with Administrator privileges to enable the Hardware Accelerated GPU Scheduling feature:

New-ItemProperty `
    -Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
    -Name "HwSchMode" `
    -Value "2" `
    -PropertyType DWORD `
    -Force

Then restart your computer to activate the feature.
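
To verify the setting after the restart you can read the value back; a HwSchMode value of 2 means the feature is enabled:

Get-ItemProperty `
    -Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
    -Name "HwSchMode"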

3. Clone the repository from GitHub

Clone the repository to a nice place on your machine via:

git clone --recurse-submodules git@github.com:countzero/windows_llama.cpp.git
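
If you have already cloned the repository without --recurse-submodules, you can fetch the llama.cpp submodule afterwards:

git submodule update --init --recursive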

4. Create a new Conda environment

Create a new Conda environment for this project with a specific version of Python:

conda create --name llama.cpp python=3.12

5. Initialize Conda for shell interaction

To make Conda available in your current shell execute the following:

conda init

Tip

You can always revert this via conda init --reverse.
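
You can then activate the new environment in a fresh shell:

conda activate llama.cpp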

6. Execute the build script

To build llama.cpp binaries for a Windows environment with the best available BLAS acceleration execute the script:

./rebuild_llama.cpp.ps1

Tip

If PowerShell is not configured to execute script files, allow it by executing the following in an elevated PowerShell: Set-ExecutionPolicy RemoteSigned

7. Download a large language model

Download a large language model (LLM) with weights in the GGUF format into the ./vendor/llama.cpp/models directory. You can for example download the gemma-2-9b-it model in a quantized GGUF format:
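
A minimal sketch of such a download from PowerShell (the Hugging Face repository and file name are assumptions; any GGUF model file placed into the models directory works):

# Download a quantized GGUF file into the models directory (the source URL is an assumption).
Invoke-WebRequest `
    -Uri "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf" `
    -OutFile ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"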

Tip

See the 🤗 Open LLM Leaderboard and LMSYS Chatbot Arena Leaderboard for best in class open source LLMs.

Usage

Chat via server script

You can easily chat with a specific model by using the .\examples\server.ps1 script:

.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"

Note

The script will automatically start the llama.cpp server with an optimal configuration for your machine.

Execute the following to get detailed help on further options of the server script:

Get-Help -Detailed .\examples\server.ps1

Chat via CLI

You can now chat with the model:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive

Chat via web interface

You can start llama.cpp as a web server:

./vendor/llama.cpp/build/bin/Release/llama-server `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33

You can then access llama.cpp via the web interface at the server's default address, http://127.0.0.1:8080.
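
The server also exposes an HTTP API. A minimal sketch of a request against the OpenAI-compatible chat completion endpoint, assuming the default address from above:

# Build a chat request and send it to the running llama.cpp server.
$body = @{
    messages = @(
        @{ role = "user"; content = "Hello!" }
    )
} | ConvertTo-Json -Depth 5

Invoke-RestMethod `
    -Uri "http://127.0.0.1:8080/v1/chat/completions" `
    -Method Post `
    -ContentType "application/json" `
    -Body $body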

Increase the context size

You can increase the context size of a model with a minimal quality loss by setting the RoPE parameters. The formula for the parameters is as follows:

context_scale = increased_context_size / original_context_size
rope_frequency_scale = 1 / context_scale
rope_frequency_base = 10000 * context_scale

Note

Increasing the context size of an openchat-3.6-8b-20240522 model from its original context size of 8192 to 32768 means that the context_scale is 4.0. The rope_frequency_scale is then 0.25 and the rope_frequency_base is 40000.
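
A minimal sketch of that calculation in PowerShell:

# RoPE parameters for extending an 8192 context to 32768.
$originalContextSize  = 8192
$increasedContextSize = 32768

$contextScale       = $increasedContextSize / $originalContextSize   # 4.0
$ropeFrequencyScale = 1 / $contextScale                               # 0.25
$ropeFrequencyBase  = 10000 * $contextScale                           # 40000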

To extend the context to 32k execute the following:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 32768 `
    --rope-freq-scale 0.25 `
    --rope-freq-base 40000 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive

Enforce JSON response

You can enforce a specific grammar for the response generation. The following will always return a JSON response:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --prompt "The scientific classification (Taxonomy) of a Llama: " `
    --grammar-file "./vendor/llama.cpp/grammars/json.gbnf" `
    --color

Measure model perplexity

Execute the following to measure the perplexity of the GGUF formatted model:

./vendor/llama.cpp/build/bin/Release/llama-perplexity `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --file "./vendor/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw"

Count prompt tokens

You can easily count the tokens of a prompt for a specific model by using the .\examples\count_tokens.ps1 script:

 .\examples\count_tokens.ps1 `
     -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
     -file ".\prompts\chat_with_llm.txt"

To inspect the actual tokenization result you can use the -debug flag:

 .\examples\count_tokens.ps1 `
     -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
     -prompt "Hello World!" `
     -debug

Note

The script is a simple wrapper for the tokenize.cpp example of the llama.cpp project.

Execute the following to get detailed help on further options of the count_tokens script:

Get-Help -Detailed .\examples\count_tokens.ps1

Build

Rebuild llama.cpp

Every time there is a new release of llama.cpp you can simply execute the script to automatically rebuild everything:

Command                                                Description
./rebuild_llama.cpp.ps1                                Automatically detects best BLAS acceleration
./rebuild_llama.cpp.ps1 -blasAccelerator "OFF"         Without any BLAS acceleration
./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS"    With CPU BLAS acceleration
./rebuild_llama.cpp.ps1 -blasAccelerator "CUDA"        With NVIDIA GPU BLAS acceleration

Build a specific version of llama.cpp

You can build a specific version of llama.cpp by specifying a git tag or commit:

Command                                       Description
./rebuild_llama.cpp.ps1                       The latest release
./rebuild_llama.cpp.ps1 -version "b1138"      The tag b1138
./rebuild_llama.cpp.ps1 -version "1d16309"    The commit 1d16309
