[pull] master from ggerganov:master #143

Closed · wants to merge 35 commits
35 commits
c83ad6d
ggml-backend : add device and backend reg interfaces (#9707)
slaren Oct 2, 2024
5639971
Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)
OuadiElfarouki Oct 3, 2024
841713e
rpc : enable vulkan (#9714)
rgerganov Oct 3, 2024
e3c355b
convert : handle tokenizer merges format from transformers 4.45 (#9696)
compilade Oct 3, 2024
d6fe7ab
ggml: unify backend logging mechanism (#9709)
bandoti Oct 3, 2024
a7ad553
ggml-backend : add device description to CPU backend (#9720)
slaren Oct 3, 2024
5d5ab1e
metal : fix compute pass descriptor autorelease crash (#9718)
jmousseau Oct 3, 2024
eee39bd
ggml: refactor cross entropy loss CPU impl. (ggml/976)
JohannesGaessler Oct 2, 2024
fabdc3b
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
JohannesGaessler Oct 3, 2024
1bb8a64
sync : ggml
ggerganov Oct 3, 2024
d5ed2b9
metal : remove abort (skip) (ggml/0)
ggerganov Oct 3, 2024
133c7b4
Fixed RNG seed docs (#9723)
d-kleine Oct 4, 2024
f3fdcfa
ci : fine-grant permission (#9710)
ngxson Oct 4, 2024
ff56576
ggml : fixes after sync (ggml/983)
slaren Oct 4, 2024
55951c0
ggml : fix typo in example usage ggml_gallocr_new (ggml/984)
danbev Oct 4, 2024
1788077
sync : ggml
ggerganov Oct 4, 2024
71967c2
Add Llama Assistant (#9744)
vietanhdev Oct 4, 2024
905f548
metal : zero-init buffer contexts (whisper/0)
ggerganov Oct 5, 2024
58b1669
sync : ggml
ggerganov Oct 5, 2024
8c475b9
rerank : use [SEP] token instead of [BOS] (#9737)
ggerganov Oct 5, 2024
b0915d5
vulkan : retry allocation with fallback flags (whisper/2451)
SRHMorris Oct 6, 2024
b6d6c52
sync : llama.cpp
ggerganov Oct 6, 2024
f4b2dcd
readme : fix typo [no ci]
ggerganov Oct 6, 2024
d5cb868
contrib : simplify + minor edits [no ci]
ggerganov Oct 6, 2024
96b6912
metal : single allocation of encode_async block (#9747)
ptsochantaris Oct 7, 2024
d5ac8cf
ggml : add metal backend registry / device (#9713)
ggerganov Oct 7, 2024
6279dac
flake.lock: Update (#9753)
ggerganov Oct 7, 2024
f1af42f
Update building for Android (#9672)
amqdn Oct 7, 2024
6374743
ggml : add backend registry / device interfaces to BLAS backend (#9752)
slaren Oct 7, 2024
fa42aa6
scripts : fix spelling typo in messages and comments (#9782)
standby24x7 Oct 8, 2024
458367a
server : better security control for public deployments (#9776)
ngxson Oct 8, 2024
dca1d4b
ggml : fix BLAS with unsupported types (#9775)
slaren Oct 8, 2024
3dc48fe
examples : remove llama.vim
ggerganov Oct 9, 2024
e702206
perplexity : fix integer overflow (#9783)
ggerganov Oct 9, 2024
c81f3bb
cmake : do not build common library by default when standalone (#9804)
slaren Oct 9, 2024
4 changes: 2 additions & 2 deletions .github/workflows/bench.yml.disabled
@@ -27,10 +27,10 @@ on:
push:
branches:
- master
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.c', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.cpp', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
pull_request_target:
types: [opened, synchronize, reopened]
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.c', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.cpp', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
schedule:
- cron: '04 2 * * *'

5 changes: 5 additions & 0 deletions .github/workflows/build.yml
@@ -19,6 +19,11 @@ concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true

# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
contents: write # for creating release

env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
GGML_NLOOP: 3
5 changes: 5 additions & 0 deletions .github/workflows/close-issue.yml
@@ -3,6 +3,11 @@ on:
schedule:
- cron: "42 0 * * *"

# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
issues: write

jobs:
close-issues:
runs-on: ubuntu-latest
7 changes: 7 additions & 0 deletions .github/workflows/nix-ci-aarch64.yml
@@ -21,6 +21,13 @@ concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true

# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
# https://github.com/DeterminateSystems/nix-installer-action?tab=readme-ov-file#with-flakehub
id-token: write
contents: read

jobs:
nix-build-aarch64:
runs-on: ubuntu-latest
7 changes: 7 additions & 0 deletions .github/workflows/nix-ci.yml
@@ -12,6 +12,13 @@ concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true

# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
# https://github.com/DeterminateSystems/nix-installer-action?tab=readme-ov-file#with-flakehub
id-token: write
contents: read

jobs:
nix-eval:
strategy:
6 changes: 3 additions & 3 deletions CMakeLists.txt
@@ -63,7 +63,7 @@ option(LLAMA_SANITIZE_ADDRESS "llama: enable address sanitizer" OFF)
option(LLAMA_SANITIZE_UNDEFINED "llama: enable undefined sanitizer" OFF)

# utils
option(LLAMA_BUILD_COMMON "llama: build common utils library" ON)
option(LLAMA_BUILD_COMMON "llama: build common utils library" ${LLAMA_STANDALONE})

# extra artifacts
option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
@@ -201,12 +201,12 @@ if (LLAMA_BUILD_COMMON)
add_subdirectory(common)
endif()

if (LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
if (LLAMA_BUILD_COMMON AND LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
include(CTest)
add_subdirectory(tests)
endif()

if (LLAMA_BUILD_EXAMPLES)
if (LLAMA_BUILD_COMMON AND LLAMA_BUILD_EXAMPLES)
add_subdirectory(examples)
add_subdirectory(pocs)
endif()
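
For orientation, here is a minimal configure sketch (the build directory and the explicit `-D` flags are illustrative assumptions, not part of this diff) showing how the optional targets can still be forced on when `llama.cpp` is embedded as a subproject, now that they follow `${LLAMA_STANDALONE}`:

```
# Sketch: in a standalone checkout these options default to ON; when llama.cpp is
# consumed via add_subdirectory they default to OFF and can be re-enabled explicitly.
cmake -B build \
    -DLLAMA_BUILD_COMMON=ON \
    -DLLAMA_BUILD_TESTS=ON \
    -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build --config Release
```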
11 changes: 5 additions & 6 deletions CONTRIBUTING.md
@@ -1,24 +1,23 @@
# Pull requests (for contributors)

- Test your changes:
- Using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
- Using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the `ggml` library
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that [you can mark as](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) `[X]` for your convenience
- Consider allowing write access to your branch for faster review
- Optionally rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, don't hesitate to ping the maintainers in the comments

# Pull requests (for collaborators)

- Squash-merge PRs
- Use the following format for the squashed commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
- Optionally, pick a `<module>` from here: https://github.com/ggerganov/llama.cpp/wiki/Modules
- Optionally pick a `<module>` from here: https://github.com/ggerganov/llama.cpp/wiki/Modules

# Coding guidelines

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- Avoid fancy-looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- Naming usually optimizes for common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
5 changes: 3 additions & 2 deletions Makefile
@@ -1054,10 +1054,11 @@ ggml/src/ggml-alloc.o: \
$(CC) $(CFLAGS) -c $< -o $@

ggml/src/ggml-backend.o: \
ggml/src/ggml-backend.c \
ggml/src/ggml-backend.cpp \
ggml/src/ggml-backend-impl.h \
ggml/include/ggml.h \
ggml/include/ggml-backend.h
$(CC) $(CFLAGS) -c $< -o $@
$(CXX) $(CXXFLAGS) -c $< -o $@

ggml/src/ggml-quants.o: \
ggml/src/ggml-quants.c \
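As a quick local sanity check for the rename (a sketch; assumes GNU make is invoked from the repository root), the backend object can be built on its own and should now go through the C++ compiler:

```
# Sketch: build only the renamed backend object; per the rule above it is now
# compiled from ggml/src/ggml-backend.cpp with $(CXX) instead of $(CC).
make ggml/src/ggml-backend.o
```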
2 changes: 1 addition & 1 deletion Package.swift
@@ -11,7 +11,7 @@ var sources = [
"src/unicode-data.cpp",
"ggml/src/ggml.c",
"ggml/src/ggml-alloc.c",
"ggml/src/ggml-backend.c",
"ggml/src/ggml-backend.cpp",
"ggml/src/ggml-quants.c",
"ggml/src/ggml-aarch64.c",
]
1 change: 1 addition & 0 deletions README.md
@@ -169,6 +169,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:
- [AIKit](https://github.com/sozercan/aikit) (MIT)
- [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

7 changes: 4 additions & 3 deletions ci/run.sh
@@ -1,4 +1,4 @@
#/bin/bash
#!/bin/bash
#
# sample usage:
#
@@ -751,7 +751,8 @@ function gg_run_rerank_tiny {

model_f16="${path_models}/ggml-model-f16.gguf"

(time ./bin/llama-embedding --model ${model_f16} -p "what is panda?</s><s>hi\nwhat is panda?</s><s>it's a bear\nwhat is panda?</s><s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." --pooling rank --embd-normalize -1 --verbose-prompt) 2>&1 | tee -a $OUT/${ci}-rk-f16.log
# for this model, the SEP token is "</s>"
(time ./bin/llama-embedding --model ${model_f16} -p "what is panda?</s></s>hi\nwhat is panda?</s></s>it's a bear\nwhat is panda?</s></s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." --pooling rank --embd-normalize -1 --verbose-prompt) 2>&1 | tee -a $OUT/${ci}-rk-f16.log

# sample output
# rerank score 0: 0.029
@@ -774,7 +775,7 @@ function gg_run_rerank_tiny {

check_score "rerank score 0" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 0")" "0.00" "0.05" | tee -a $OUT/${ci}-rk-f16.log
check_score "rerank score 1" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 1")" "0.00" "0.05" | tee -a $OUT/${ci}-rk-f16.log
check_score "rerank score 2" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 2")" "0.10" "0.15" | tee -a $OUT/${ci}-rk-f16.log
check_score "rerank score 2" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 2")" "0.10" "0.30" | tee -a $OUT/${ci}-rk-f16.log

set +e
}
18 changes: 16 additions & 2 deletions common/arg.cpp
@@ -911,7 +911,7 @@ gpt_params_context gpt_params_parser_init(gpt_params & params, llama_example ex,
).set_sparam());
add_opt(llama_arg(
{"-s", "--seed"}, "SEED",
format("RNG seed (default: %u, use random seed for %u)", params.sparams.seed, LLAMA_DEFAULT_SEED),
format("RNG seed (default: %d, use random seed for %d)", params.sparams.seed, LLAMA_DEFAULT_SEED),
[](gpt_params & params, const std::string & value) {
params.sparams.seed = std::stoul(value);
}
@@ -1838,9 +1838,23 @@ gpt_params_context gpt_params_parser_init(gpt_params & params, llama_example ex,
params.endpoint_metrics = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_METRICS"));
add_opt(llama_arg(
{"--slots"},
format("enable slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
[](gpt_params & params) {
params.endpoint_slots = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_SLOTS"));
add_opt(llama_arg(
{"--props"},
format("enable changing global properties via POST /props (default: %s)", params.endpoint_props ? "enabled" : "disabled"),
[](gpt_params & params) {
params.endpoint_props = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_PROPS"));
add_opt(llama_arg(
{"--no-slots"},
format("disables slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
"disables slots monitoring endpoint",
[](gpt_params & params) {
params.endpoint_slots = false;
}
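To make the new opt-in behavior concrete, a hedged usage sketch follows (the `llama-server` binary name and model path are assumptions for illustration, not taken from this diff):

```
# Sketch: the slots monitoring and POST /props endpoints are now disabled by default
# and must be enabled explicitly on startup.
./llama-server -m model.gguf --slots --props

# The same switches can also come from the environment variables registered above,
# assuming the usual set_env handling for boolean flags.
LLAMA_ARG_ENDPOINT_SLOTS=1 LLAMA_ARG_ENDPOINT_PROPS=1 ./llama-server -m model.gguf
```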
30 changes: 29 additions & 1 deletion common/common.cpp
@@ -838,6 +838,31 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {
return iparams;
}

if (params.reranking) {
bool ok = true;

if (llama_token_bos(model) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: model does not have a BOS token, reranking will not work\n", __func__);
ok = false;
}

if (llama_token_eos(model) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: model does not have an EOS token, reranking will not work\n", __func__);
ok = false;
}

if (llama_token_sep(model) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: model does not have a SEP token, reranking will not work\n", __func__);
ok = false;
}

if (!ok) {
llama_free_model(model);

return iparams;
}
}

auto cparams = llama_context_params_from_gpt_params(params);

llama_context * lctx = llama_new_context_with_model(model, cparams);
@@ -855,6 +880,7 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {
if (cvec.n_embd == -1) {
llama_free(lctx);
llama_free_model(model);

return iparams;
}

@@ -867,6 +893,7 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {
if (err) {
llama_free(lctx);
llama_free_model(model);

return iparams;
}
}
@@ -889,7 +916,7 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {
llama_lora_adapters_apply(lctx, iparams.lora_adapters);
}

if (params.sparams.ignore_eos && llama_token_eos(model) == -1) {
if (params.sparams.ignore_eos && llama_token_eos(model) == LLAMA_TOKEN_NULL) {
LOG_WRN("%s: warning: model does not have an EOS token, ignoring --ignore-eos\n", __func__);
params.sparams.ignore_eos = false;
}
@@ -930,6 +957,7 @@ struct llama_init_result llama_init_from_gpt_params(gpt_params & params) {

iparams.model = model;
iparams.context = lctx;

return iparams;
}

5 changes: 4 additions & 1 deletion common/common.h
@@ -290,7 +290,10 @@ struct gpt_params {
std::string ssl_file_key = ""; // NOLINT
std::string ssl_file_cert = ""; // NOLINT

bool endpoint_slots = true;
// "advanced" endpoints are disabled by default for better security
bool webui = true;
bool endpoint_slots = false;
bool endpoint_props = false; // only control POST requests, not GET
bool endpoint_metrics = false;

bool log_json = false;
83 changes: 55 additions & 28 deletions docs/android.md
@@ -2,55 +2,82 @@
# Android

## Build on Android using Termux
[Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).

[Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.

With Termux, you can install and run `llama.cpp` as if the environment were Linux. Once in the Termux shell:

```
$ apt update && apt upgrade -y
$ apt install git cmake
```

Then, follow the [build instructions](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md), specifically for CMake.

Once the binaries are built, download your model of choice (e.g., from Hugging Face). It's recommended to place it in the `~/` directory for best performance:

```
apt update && apt upgrade -y
apt install git make cmake
$ curl -L {model-url} -o ~/{model}.gguf
```

It's recommended to move your model inside the `~/` directory for best performance:
Then, if you are not already in the repo directory, `cd` into `llama.cpp` and:

```
cd storage/downloads
mv model.gguf ~/
$ ./build/bin/llama-simple -m ~/{model}.gguf -c {context-size} -p "{your-prompt}"
```

[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.
Here, we show `llama-simple`, but any of the executables under `examples` should work, in theory. Be sure to set `context-size` to a reasonable number (say, 4096) to start with; otherwise, memory could spike and kill your terminal.

To see what it might look like visually, here's an old demo of an interactive session running on a Pixel 5 phone:

https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4

## Cross-compile using Android NDK
It's possible to build `llama.cpp` for Android on your host system via CMake and the Android NDK. If you are interested in this path, ensure you already have an environment prepared to cross-compile programs for Android (i.e., install the Android SDK). Note that, unlike desktop environments, the Android environment ships with a limited set of native libraries, and so only those libraries are available to CMake when building with the Android NDK (see: https://developer.android.com/ndk/guides/stable_apis.)

## Building the Project using Android NDK
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.
Once you're ready and have cloned `llama.cpp`, invoke the following in the project directory:

Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
```
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
$ cmake \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DCMAKE_C_FLAGS="-march=armv8.7a" \
-DCMAKE_CXX_FLAGS="-march=armv8.7a" \
-DGGML_OPENMP=OFF \
-DGGML_LLAMAFILE=OFF \
-B build-android
```

Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).
Notes:
- While later versions of Android NDK ship with OpenMP, it must still be installed by CMake as a dependency, which is not supported at this time
- `llamafile` does not appear to support Android devices (see: https://github.com/Mozilla-Ocho/llamafile/issues/325)

The above command should configure `llama.cpp` with the most performant options for modern devices. Even if your device is not running `armv8.7a`, `llama.cpp` includes runtime checks for available CPU features it can use.

Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:
Feel free to adjust the Android ABI for your target. Once the project is configured:

(Assumed that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`)
```
$cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
$cd /data/data/com.termux/files/home/bin
$chmod +x ./*
$ cmake --build build-android --config Release -j{n}
$ cmake --install build-android --prefix {install-dir} --config Release
```

Download model [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf), and push it to `/sdcard/llama.cpp/`, then move it to `/data/data/com.termux/files/home/model/`
After installing, go ahead and download the model of your choice to your host system. Then:

```
$mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/home/model/
$ adb shell "mkdir /data/local/tmp/llama.cpp"
$ adb push {install-dir} /data/local/tmp/llama.cpp/
$ adb push {model}.gguf /data/local/tmp/llama.cpp/
$ adb shell
```

Now, you can start chatting:
In the `adb shell`:

```
$cd /data/data/com.termux/files/home/bin
$./llama-cli -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
$ cd /data/local/tmp/llama.cpp
$ LD_LIBRARY_PATH=lib ./bin/llama-simple -m {model}.gguf -c {context-size} -p "{your-prompt}"
```

Here's a demo of an interactive session running on Pixel 5 phone:
That's it!

https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
Be aware that Android will not find the library path `lib` on its own, so we must specify `LD_LIBRARY_PATH` in order to run the installed executables. Android does support `RPATH` in later API levels, so this could change in the future. Refer to the previous section for information about `context-size` (very important!) and running other `examples`.