Releases: ngxson/wllama
1.15.0
New features
downloadModel()
Download a model to the cache without loading it. The use case is to allow an application to have a "model manager" screen that lets the user (see the sketch after this list):
- Download a model via `downloadModel()`
- List all downloaded models using `CacheManager.list()`
- Delete a downloaded model using `CacheManager.delete()`
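A minimal sketch of how these calls might fit together. It assumes `downloadModel()` takes a model URL and that the cache is reachable via `wllama.cacheManager`; the model URL, wasm paths, and cache-entry fields below are illustrative and the exact signatures may differ:

```ts
import { Wllama } from '@wllama/wllama';

// Paths to the wllama wasm binaries (assumed layout; adjust to your bundler setup)
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
};
const wllama = new Wllama(CONFIG_PATHS);

// Hypothetical model URL, for illustration only
const MODEL_URL = 'https://example.com/tinyllama-q4_k_m.gguf';

// 1. Download the model to the cache without loading it into memory
await wllama.downloadModel(MODEL_URL);

// 2. List everything currently in the cache
const entries = await wllama.cacheManager.list();
console.log(entries);

// 3. Delete one cached entry (the key used here is an assumption)
await wllama.cacheManager.delete(entries[0].name);
```

Later, the same URL can be passed to `loadModelFromUrl()`, which should pick the file up from the cache instead of re-downloading it.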
KV cache reuse in createCompletion
When calling `createCompletion`, you can pass `useCache: true` as an option. It will reuse the KV cache from the last `createCompletion` call. This is equivalent to the `cache_prompt` option on the llama.cpp server.
```js
wllama.createCompletion(input, {
  useCache: true,
  ...
});
```
For example:
- On the first call, you have 2 messages: `user: hello`, `assistant: hi`
- On the second call, you add one message: `user: hello`, `assistant: hi`, `user: who are you?`

Then, only the added message `user: who are you?` will need to be evaluated.
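A rough sketch of this two-call pattern. It assumes the chat template has already been applied so that `createCompletion` receives plain prompt strings; the template tokens and `nPredict` value are illustrative:

```ts
// First call: the whole prompt is evaluated and fills the KV cache
const firstPrompt = '<|user|>hello<|assistant|>hi';
await wllama.createCompletion(firstPrompt, { useCache: true, nPredict: 64 });

// Second call: shares its prefix with the first prompt, so only the
// appended "who are you?" turn needs to be evaluated
const secondPrompt = firstPrompt + '<|user|>who are you?<|assistant|>';
const answer = await wllama.createCompletion(secondPrompt, {
  useCache: true,
  nPredict: 64,
});
console.log(answer);
```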
What's Changed
- Add `downloadModel` function by @ngxson in #95
- fix log print and `downloadModel` by @ngxson in #100
- Add `main` example (chat UI) by @ngxson in #99
- Improve main UI example by @ngxson in #102
- implement KV cache reuse by @ngxson in #103
Full Changelog: 1.14.2...1.15.0
1.14.2
Update to latest upstream llama.cpp source code:
- Fix support for llama-3.1, phi 3 and SmolLM
Full Changelog: 1.14.0...1.14.2
1.14.0
1.13.0
What's Changed
- Update README.md by @flatsiedatsie in #78
- sync with upstream llama.cpp source code (+gemma2 support) by @ngxson in #81
- Fix exit() function crash if model is not loaded by @flatsiedatsie in #84
- Improve cache API by @ngxson in #80
- v1.13.0 by @ngxson in #85
New Contributors
- @flatsiedatsie made their first contribution in #78
Full Changelog: 1.12.1...1.13.0
1.12.1
1.12.0
Important
In prior versions, if you initialized wllama with `embeddings: true`, you were still able to generate completions.
From v1.12.0, if you start wllama with `embeddings: true`, an error will be thrown when you try to use `createCompletion`. You must call `wllama.setOptions({ embeddings: false })` to turn off embeddings first.
More details: This feature was introduced in ggerganov/llama.cpp#7477, which allows models like GritLM to be used for both embeddings and text generation.
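A short sketch of the new flow, assuming the model is loaded with `embeddings: true` and that `createEmbedding` is available; the model URL and `nPredict` value are illustrative:

```ts
// Hypothetical URL for a model that supports both embeddings and generation
const MODEL_URL = 'https://example.com/gritlm-7b-q4_k_m.gguf';

// Load with embeddings enabled (option taken from the notes above)
await wllama.loadModelFromUrl(MODEL_URL, { embeddings: true });

// Embeddings work as usual
const vector = await wllama.createEmbedding('hello world');

// Since v1.12.0, createCompletion would throw here unless embeddings
// mode is switched off first
await wllama.setOptions({ embeddings: false });
const reply = await wllama.createCompletion('Who are you?', { nPredict: 32 });
```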
What's Changed
- Add `wllama.setOptions` by @ngxson in #73
- v1.12.0 by @ngxson in #74
- warn user if embeddings is incorrectly set by @ngxson in #75
Full Changelog: 1.11.0...1.12.0
1.11.0
What's Changed
- Internally generate the model URL array when the URL provided to the `loadModelFromUrl` method is a single shard of a model split with the `gguf-split` tool by @felladrin in #61 (see the sketch after this list)
- Allow loading a model using a relative path by @felladrin in #64
- Git ignore also .DS_Store which are created by MacOS Finder by @felladrin in #65
- v1.11.0 by @ngxson in #68
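A minimal sketch of the shard behavior from #61, using a hypothetical URL that follows the `gguf-split` naming convention; wllama is expected to derive the remaining shard URLs from it:

```ts
// Point at one shard of a split model; the other shard URLs are generated
// internally from the -0000X-of-0000Y pattern (URL is hypothetical)
await wllama.loadModelFromUrl(
  'https://example.com/models/mistral-7b-q4_k_m-00001-of-00003.gguf'
);
```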
Full Changelog: 1.10.0...1.11.0
1.10.0
What's Changed
- `loadModel()` now also accepts `Blob` or `File` (see the sketch below)
- Added `GGUFRemoteBlob` that can stream a Blob from a remote URL
- Added example for loading local gguf files
- Implement OPFS for cache

Note: Optionally, you can clear the `CacheStorage` used by the previous version.
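A rough sketch of loading a local gguf file picked by the user, assuming `loadModel()` accepts an array of `Blob`/`File` objects (the element id and array-wrapping are assumptions):

```ts
// Load a GGUF file chosen via an <input type="file"> element
const input = document.querySelector('#gguf-file') as HTMLInputElement;
const file = input.files?.[0];
if (file) {
  // loadModel() accepts Blob/File since 1.10.0; an array is used here to
  // cover multi-shard models as well
  await wllama.loadModel([file]);
}
```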
Pull requests:
- fix small typo in README by @ngxson in #51
- sync with latest llama.cpp source code by @ngxson in #59
- add Blob support + OPFS + load from local file(s) by @ngxson in #52
- v1.10.0 by @ngxson in #60
Full Changelog: 1.9.0...1.10.0
1.9.0
1.8.1
What's Changed
HeapFS allows us to save more memory while loading the model. It also avoids doing a memcpy, so loading the model will be a bit faster.
- Make the `config` parameter of the `loadModelFromUrl` function optional by @felladrin in #32 (see the sketch after this list)
- Remove prebuilt esm by @ngxson in #33
- Improve error handling on abort() by @ngxson in #34
- add tool for debugging memory by @ngxson in #37
- sync to upstream llama.cpp source code by @ngxson in #46
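A minimal sketch of the now-optional `config` parameter (the model URL is hypothetical):

```ts
// Since #32, the second (config) argument can simply be omitted
await wllama.loadModelFromUrl('https://example.com/tinyllama-q4_k_m.gguf');
```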
Full Changelog: 1.8.0...1.8.1