Add a README for llama subtitle sumerizer pipeline example
- Add a README file to guide users to build/run the pipeline example. Signed-off-by: Suyeon Kim <[email protected]>
Showing 1 changed file with 160 additions and 0 deletions.
@@ -0,0 +1,160 @@
# Example of GStreamer/NNStreamer subtitle summarizer pipeline using llama.cpp

## Description

This example shows how to run llama.cpp in a GStreamer/NNStreamer pipeline on Tizen/RPI4. Users can run their own ML model or app in a GStreamer/NNStreamer pipeline by implementing it as a [C++ class](https://github.com/nnstreamer/nnstreamer/blob/main/ext/nnstreamer/tensor_filter/tensor_filter_cpp.hh). Here, [llama.cpp](https://github.com/ggerganov/llama.cpp) is wrapped in such a class and used in the pipeline as a cpp class `tensor_filter`.

The C++ class wrapping llama.cpp is implemented in https://github.com/yeonykim2/llama.cpp/tree/nnstreamer_llama_subtitle_summarizer.

## Prerequisites

- RPI4 flashed with the latest tizen-headed (64-bit) image.
- Tizen GBS tools.

## Build / Install guide

- Build the nnstreamer-llama-cpp rpm package.
```bash
$ git clone https://github.com/yeonykim2/llama.cpp && cd llama.cpp && git checkout nnstreamer_llama_subtitle_summarizer
$ gbs build -A aarch64

# check contents of the RPM file
$ ls ~/GBS-ROOT/local/repos/tizen/aarch64/RPMS
> nnstreamer-llama-68m-gguf-1.0.0-0.aarch64.rpm ...
$ cp ~/GBS-ROOT/local/repos/tizen/aarch64/RPMS/nnstreamer-llama-68m-gguf-1.0.0-0.aarch64.rpm .
$ rpm2cpio nnstreamer-llama-68m-gguf-1.0.0-0.aarch64.rpm | cpio -idmv
> ./usr/lib/nnstreamer/bin/big_buck_bunny_trailer_480p.webm
> ./usr/lib/nnstreamer/bin/libnnstreamer-llama.so
> ./usr/lib/nnstreamer/bin/models
> ./usr/lib/nnstreamer/bin/subtitles.srt
```
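
After extracting the RPM, it may be worth confirming that the payload matches the listing above before pushing anything to the device. The following is a minimal sketch, not part of the original example: `check_payload` is a hypothetical helper name, and the four paths are simply the ones printed by `cpio` above.

```bash
#!/bin/sh
# Hypothetical helper (not part of the example repository).
# Checks that the directory given as $1 (default: current directory)
# contains the files listed in the cpio output above.
check_payload() {
    root="${1:-.}"
    missing=0
    for f in \
        usr/lib/nnstreamer/bin/big_buck_bunny_trailer_480p.webm \
        usr/lib/nnstreamer/bin/libnnstreamer-llama.so \
        usr/lib/nnstreamer/bin/models \
        usr/lib/nnstreamer/bin/subtitles.srt
    do
        if [ ! -e "$root/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Running `check_payload .` in the directory where `rpm2cpio ... | cpio -idmv` was executed prints one `missing: ...` line per absent path and returns a non-zero status if any path is absent.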

- Download the model file `llama-68m-chat-v1.fp16.gguf` from [here](https://huggingface.co/afrideva/Llama-68M-Chat-v1-GGUF), then install the package, the model, and the sample files on the device.
```bash
# install the rpm package
$ sdb push nnstreamer-llama-68m-gguf-1.0.0-0.aarch64.rpm /root/
$ sdb shell rpm -ivh /root/nnstreamer-llama-68m-gguf-1.0.0-0.aarch64.rpm

# install model, sample video and sample srt file
$ sdb push models/llama-68m-chat-v1.fp16.gguf /usr/lib/nnstreamer/bin/models/
$ sdb push subtitles.srt /usr/lib/nnstreamer/bin/
$ sdb push big_buck_bunny_trailer_480p.webm /usr/lib/nnstreamer/bin/
```

## Run pipeline in sdb shell

The following gst-launch example makes llama.cpp summarize the sample `subtitles.srt` with the `llama-68m-chat-v1.fp16.gguf` model.
- `ORC_DEBUG=` suppresses the debug messages from gst-orc.
- `LD_LIBRARY_PATH=.` makes `libnnstreamer-llama.so` visible to GStreamer.
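
The pipeline in this section passes the subtitle text through `tensor_converter input-dim=48000:1:1:1`. Assuming that the leading `48000` is the number of bytes available for the text in one tensor frame (an assumption; the README does not state this), a subtitle file larger than that would not fit in a single frame. The sketch below checks the file size up front; `fits_input_dim` is a hypothetical helper name.

```bash
#!/bin/sh
# Hypothetical helper (not part of the example repository).
# Succeeds if the file "$1" is at most "$2" bytes long.
# ASSUMPTION: "$2" corresponds to the first value of input-dim (48000 here).
fits_input_dim() {
    file="$1"
    limit="$2"
    size=$(wc -c < "$file") || return 2
    size=$((size))   # normalize: some wc implementations pad with spaces
    if [ "$size" -le "$limit" ]; then
        echo "ok: $size of $limit bytes used"
    else
        echo "too large: $size bytes exceeds $limit"
        return 1
    fi
}
```

For example, `fits_input_dim subtitles.srt 48000` could be run in `/usr/lib/nnstreamer/bin/` before launching the pipeline.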

```bash
# sdb shell
$ cd /usr/lib/nnstreamer/bin/
$ ORC_DEBUG= LD_LIBRARY_PATH=. gst-launch-1.0 -v filesrc location=big_buck_bunny_trailer_480p.webm ! decodebin ! videoconvert ! autovideosink filesrc location=subtitles.srt ! text/x-raw, format=utf8 ! tensor_converter input-dim=48000:1:1:1 ! tensor_filter framework=cpp model=nnstreamer_llama_filter,libnnstreamer-llama.so ! fakesink

> ...
nnstreamer_llama_filter: build = 0 (unknown)
nnstreamer_llama_filter: built with clang version 17.0.6 for x86_64-tizen-linux-gnu
nnstreamer_llama_filter: seed = 10441
nnstreamer_llama_filter: llama backend init
nnstreamer_llama_filter: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 21 key-value pairs and 21 tensors from models/llama-68m-chat-v1.fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = active
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 768
llama_model_loader: - kv 4: llama.block_count u32 = 2
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 12
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 12
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama

...

prompt: *******

"<|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user Please, summarize the sutitle.
1
00:00:00,005 --> 00:00:03,800
The peach open movie project presents
2
00:00:06,001 --> 00:00:09,001
One big rabbit
3
00:00:10,900 --> 00:00:13,001
Three rodents
4
00:00:16,400 --> 00:00:18,954
And one giant payback
5
00:00:22,950 --> 00:00:25,001
Get ready
6
00:00:26,700 --> 00:00:30,000
Big Buck Bunny
7
00:00:30,001 --> 00:00:31,100
Coming soon
8
00:00:31,101 --> 00:00:32,509
www.bigbuckbunny.org
<|im_end|>"
*******

sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 128, n_keep = 1

...

######################
# Generation Results #
######################


<|im_start|>assistant
To get a new job, a film or a video
The most important part of the work is to look at the film's structure, theme, and content. You can start with a good movie title, a short summary of the original film, or a short summary.
1000<|im_end|>
[end of text]
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
Redistribute latency...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:32.480586369
Setting pipeline to NULL ...
Freeing pipeline ...
llama_print_timings: load time = 128.68 ms
llama_print_timings: sample time = 9.85 ms / 82 runs ( 0.12 ms per token, 8329.10 tokens per second)
llama_print_timings: prompt eval time = 1057.57 ms / 362 tokens ( 2.92 ms per token, 342.29 tokens per second)
llama_print_timings: eval time = 2462.21 ms / 81 runs ( 30.40 ms per token, 32.90 tokens per second)
llama_print_timings: total time = 36241.92 ms / 443 tokens
...
```
- This run shows that the llama-68M model can run on the Tizen/RPI4 device.
- It does not establish how well LLMs perform on this device: the generated text above does not reflect the subtitle content.
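
The per-token figures in the `llama_print_timings` lines follow from the elapsed time and the count on the same line: for the `eval time` line, 2462.21 ms / 81 runs ≈ 30.40 ms per token, and 81 × 1000 / 2462.21 ≈ 32.90 tokens per second. The sketch below recomputes these values from a saved log; `timings_summary` is a hypothetical helper name, and it assumes the line layout shown in the log above. Because the logged milliseconds are already rounded, recomputed values for very short timings (such as `sample time`) can differ slightly from the logged ones.

```bash
#!/bin/sh
# Hypothetical helper (not part of the example repository).
# Reads a log on stdin and, for every llama_print_timings line that
# reports a per-token figure, recomputes ms/token and tokens/second
# from the elapsed time and the run/token count on that line.
timings_summary() {
    awk -F'=' '
    /llama_print_timings:/ && /per token/ {
        label = $1
        sub(/^.*llama_print_timings:[ \t]*/, "", label)
        sub(/[ \t]+$/, "", label)
        # After "=": "<ms> ms / <count> runs|tokens ( ... )"
        split($2, f, " ")
        ms = f[1] + 0
        count = f[4] + 0
        if (ms > 0 && count > 0)
            printf "%s: %.2f ms per token, %.2f tokens per second\n", \
                   label, ms / count, count * 1000 / ms
    }'
}
```

For example, `timings_summary < run.log` prints one summary line per matching timing entry, where `run.log` is a hypothetical file holding the saved gst-launch output.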