Merge branch 'master' into layla-build
l3utterfly committed Jan 9, 2024
2 parents 2bd20ee + 128de35 commit c2bcfd8
Showing 16 changed files with 1,267 additions and 58 deletions.
2 changes: 1 addition & 1 deletion Package.swift
@@ -21,7 +21,7 @@ let package = Package(
name: "llama",
dependencies: ["ggml"],
path: ".",
exclude: [],
exclude: ["ggml-metal.metal"],
sources: [
"llama.cpp",
],
1 change: 1 addition & 0 deletions README.md
@@ -10,6 +10,7 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

 ### Hot topics
 
+- New SOTA quantized models, including pure 2-bits: https://huggingface.co/ikawrakow
 - Collecting Apple Silicon performance stats:
 - M-series: https://github.com/ggerganov/llama.cpp/discussions/4167
 - A-series: https://github.com/ggerganov/llama.cpp/discussions/4508
2 changes: 1 addition & 1 deletion common/common.cpp
@@ -920,7 +920,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
 #endif
 printf(" -gan N, --grp-attn-n N\n");
 printf(" group-attention factor (default: %d)\n", params.grp_attn_n);
-printf(" -gat N, --grp-attn-w N\n");
+printf(" -gaw N, --grp-attn-w N\n");
 printf(" group-attention width (default: %.1f)\n", (double)params.grp_attn_w);
 printf(" --verbose-prompt print prompt before generation\n");
 printf(" -dkvc, --dump-kv-cache\n");
13 changes: 9 additions & 4 deletions examples/llama.swiftui/README.md
@@ -1,7 +1,12 @@
-# llama.swiftui
+# llama.cpp/examples/llama.swiftui
 
-Local inference of llama.cpp on an iPhone.
-So far I only tested with starcoder 1B model, but it can most likely handle 7B models as well.
+Local inference of llama.cpp on an iPhone. This is a sample app that can be used as a starting
+point for more advanced projects.
 
-https://github.com/bachittle/llama.cpp/assets/39804642/e290827a-4edb-4093-9642-2a5e399ec545
+For usage instructions and performance stats, check the following discussion: https://github.com/ggerganov/llama.cpp/discussions/4508
 
+![image](https://github.com/ggerganov/llama.cpp/assets/1991296/2b40284f-8421-47a2-b634-74eece09a299)
+
+Video demonstration:
+
+https://github.com/bachittle/llama.cpp/assets/39804642/e290827a-4edb-4093-9642-2a5e399ec545
60 changes: 35 additions & 25 deletions examples/server/README.md
@@ -23,6 +23,7 @@ Command line options:
 - `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`.
 - `--port`: Set the port to listen. Default: `8080`.
 - `--path`: path from which to serve static files (default examples/server/public)
+- `--api-key`: Set an api key for request authorization. By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.
 - `--embedding`: Enable embedding extraction, Default: disabled.
 - `-np N`, `--parallel N`: Set the number of slots for process requests (default: 1)
 - `-cb`, `--cont-batching`: enable continuous batching (a.k.a dynamic batching) (default: disabled)
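
(Editorial aside, not part of the diff: a minimal sketch of how a client might authenticate against a server started with the new `--api-key` flag. The `/completion` endpoint path, port, key value, and request body fields are assumptions for illustration, not something this commit defines.)

```
// Hedged sketch (TypeScript, Node 18+ ESM): call a llama.cpp server that was
// started with `--api-key MY_SECRET_KEY`. Endpoint, port and body fields are
// illustrative assumptions.
const response = await fetch("http://127.0.0.1:8080/completion", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    // The api key is sent as a Bearer token in the Authorization header.
    Authorization: "Bearer MY_SECRET_KEY",
  },
  body: JSON.stringify({
    prompt: "Building a website can be done in 10 simple steps:",
    n_predict: 64,
  }),
});

if (!response.ok) {
  // Assumption: a missing or wrong key is rejected with a non-2xx status.
  throw new Error(`request rejected: ${response.status}`);
}
console.log((await response.json()).content);
```

A request without the header (or with a different key) would be rejected, while a server started without `--api-key` keeps responding to every request, as the option description above states.
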
@@ -174,35 +175,44 @@ node index.js

 `system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
 
-*Result JSON:*
+### Result JSON:
 
-Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
+* Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
 
-`content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
 
-`stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
+- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
 
-`generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`
-
-`model`: The path to the model loaded with `-m`
-
-`prompt`: The provided `prompt`
-
-`stopped_eos`: Indicating whether the completion has stopped because it encountered the EOS token
-
-`stopped_limit`: Indicating whether the completion stopped because `n_predict` tokens were generated before stop words or EOS was encountered
-
-`stopped_word`: Indicating whether the completion stopped due to encountering a stopping word from `stop` JSON array provided
-
-`stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
-
-`timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
-
-`tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion (`n_past`)
-
-`tokens_evaluated`: Number of tokens evaluated in total from the prompt
-
-`truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
+```
+{
+  "content": "<the token selected by the model>",
+  "probs": [
+    {
+      "prob": float,
+      "tok_str": "<most likely token>"
+    },
+    {
+      "prob": float,
+      "tok_str": "<second most likely token>"
+    },
+    ...
+  ]
+},
+```
+Notice that each `probs` is an array of length `n_probs`.
+
+- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
+- `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
+- `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`
+- `model`: The path to the model loaded with `-m`
+- `prompt`: The provided `prompt`
+- `stopped_eos`: Indicating whether the completion has stopped because it encountered the EOS token
+- `stopped_limit`: Indicating whether the completion stopped because `n_predict` tokens were generated before stop words or EOS was encountered
+- `stopped_word`: Indicating whether the completion stopped due to encountering a stopping word from `stop` JSON array provided
+- `stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
+- `timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
+- `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion (`n_past`)
+- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
+- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
 
 - **POST** `/tokenize`: Tokenize a given text.
 
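
(Editorial aside, not part of the diff: a hedged sketch of how a client might consume the `completion_probabilities` field documented above. The `/completion` path, port, prompt, and the `n_probs` request field are assumptions for illustration; the response shape follows the JSON structure shown in the README hunk.)

```
// Hedged sketch (TypeScript, Node 18+ ESM): request per-token probabilities
// and print the top candidates next to each selected token.
interface TokenProb {
  prob: number;    // probability of this candidate token
  tok_str: string; // candidate token as a string
}
interface CompletionToken {
  content: string;    // the token actually selected by the model
  probs: TokenProb[]; // array of length n_probs, as described above
}

const res = await fetch("http://127.0.0.1:8080/completion", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "The quick brown fox",
    n_predict: 8,
    n_probs: 3, // assumption: ask for the top-3 candidates per generated token
  }),
});
const json = await res.json();

// One entry per predicted token (array length n_predict, per the README text).
for (const item of (json.completion_probabilities ?? []) as CompletionToken[]) {
  const candidates = item.probs
    .map((p) => `${p.tok_str} (${p.prob.toFixed(3)})`)
    .join(", ");
  console.log(`${item.content}  <-  ${candidates}`);
}
```
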
