A WIP library that provides Java bindings for llama.cpp.
The project is split into two parts: a fluent API for managing Llama models and contexts, and a lower-level API that provides direct access to the "public" interface of llama.cpp.
Built against tag b1698.
jLlama may be compatible with newer versions of llama.cpp if you use a custom build, so long as no breaking changes have been made to llama.h and its implementation.
Maven artifacts are provided for the following architectures:
- Mac arm64
- Windows x64
- Linux x64
Artifacts targeting additional architectures and features (e.g. CUDA) will be available soon. Custom builds of llama.cpp can be used for features that are not yet available in the artifacts.
The project is still in an early form and is currently published only to the OSSRH Snapshots repository, located at https://s01.oss.sonatype.org/content/repositories/snapshots/.
If using Gradle, add the following to your repositories:
maven {
    name = 'OSSRH'
    url = 'https://s01.oss.sonatype.org/content/repositories/snapshots/'
}
If using Maven, add the following to your <repositories /> element:
<repository>
    <id>ossrh</id>
    <url>https://s01.oss.sonatype.org/content/repositories/snapshots/</url>
</repository>
jLlama has three dependencies that must be included:
- jni-bindings
  - Provides the Java APIs for accessing a native bridge to llama.cpp.
- jni-implementation
  - A bridge written in native code to access llama.cpp.
  - Provided per supported architecture.
- llama-cpp
  - A build of llama.cpp for a given architecture.
Gradle Dependencies Example:
implementation "net.jllama:jni-bindings:0.0.1-SNAPSHOT"
implementation "net.jllama:jni-implementation:0.0.1-SNAPSHOT:${classifier}"
implementation "net.jllama:llama-cpp:0.0.1-SNAPSHOT:${classifier}"
Maven Dependencies Example:
<dependency>
    <groupId>net.jllama</groupId>
    <artifactId>jni-bindings</artifactId>
    <version>0.0.1-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>net.jllama</groupId>
    <artifactId>jni-implementation</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <classifier>${classifier}</classifier>
</dependency>
<dependency>
    <groupId>net.jllama</groupId>
    <artifactId>llama-cpp</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <classifier>${classifier}</classifier>
</dependency>
For the native dependencies, there are currently three classifiers available, one per supported architecture:
- windows_x64
- linux_x64
- os_x_arm64
jLlama attempts to load native code from the system library path before falling back to the libraries bundled in the artifacts. The expected library file name and the library-path environment variable for each supported platform are as follows:
- Mac
  - library file name: libllama.dylib
  - append the path to the library to the environment variable DYLD_LIBRARY_PATH
- Linux
  - library file name: libllama.so
  - append the path to the library to the environment variable LD_LIBRARY_PATH
- Windows
  - library file name: llama.dll
  - append the path to the library to the environment variable PATH
The root of the API is the net.jllama.api.Llama class. All access to the library flows from here, starting with Llama.library().
A Model represents a trained LlaMA model. It can be used to create contexts, which are used to generate tokens. Any number of models can be loaded at once, bounded only by the amount of memory available on the machine. JVM heap space is not used for this allocation.
Model model = Llama.library()
    .newModel()
    .with()
    .path("/path/to/model.gguf")
    .load();
A Context represents a single working instance of a Model. A context is stateful, containing the current relationships between all available tokens. These relationships are defined through "sequences." A context can contain one or more sequences of tokens at a given time.
Context context = model.newContext()
    .withDefaults() // use with() if you don't want to prepopulate with default values
    .contextLength(1000)
    .seed(ThreadLocalRandom.current().nextInt())
    .create();
A token represents a word-piece, a piece of the model's vocabulary. Llama operates on tokens using contexts, so text must be converted to and from tokens when interacting with the context. A token can represent multiple characters, or only part of a multi-byte character. The encoding is UTF-8.
The model provides access to special tokens and to tokenization functionality.
List<Integer> tokens = model.tokens().tokenize(text);
byte[] detokenized = model.tokens().detokenize(tokens);
To transform the detokenized bytes into a Java String, use:
String stringPiece = new String(detokenized, StandardCharsets.UTF_8);
Care must be taken with detokenized bytes: they need to be buffered so that incomplete UTF-8 characters are not included in the Java String. Java will not recognize an incomplete character and will render each of its bytes as "�".
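One way to do this buffering is sketched below using java.nio.charset.CharsetDecoder; the Utf8TokenBuffer helper is hypothetical and not part of the jLlama API. The decoder stops before a trailing incomplete character, so those bytes are held back and prepended to the next detokenized piece.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the jLlama API): buffers detokenized bytes and
// only emits text once complete UTF-8 characters are available.
class Utf8TokenBuffer {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    // Append the bytes of one detokenized piece and return whatever decodes cleanly;
    // the trailing bytes of an incomplete character are held back for the next call.
    String append(byte[] piece) {
        pending.write(piece, 0, piece.length);
        byte[] bytes = pending.toByteArray();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        // With endOfInput = false, the decoder stops before a trailing incomplete character.
        StandardCharsets.UTF_8.newDecoder().decode(in, out, false);
        pending.reset();
        pending.write(bytes, in.position(), bytes.length - in.position());
        out.flip();
        return out.toString();
    }
}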
Sequences are added to a context with the evaluation of a "batch" of tokens. Evaluation adds the batched sequences (or sequence updates) and internally calculates the probabilities for the next token.
int seqId = 0; // NOTE: the id must be unique per sequence
Sequence sequence = Sequence.tokenSequence(seqId);
Batch batch = context.batch()
    .type(SequenceType.TOKEN)
    .get();
batch.stage(sequence.piece(tokens));
context.evaluate(batch);
Evaluation internally calculates probabilities, known as "logits," for the next token in the sequence. These logits are the probability for each token in the vocabulary to be the next token in the sequence.
List<Float> logits = context.getLogits(sequence);
After logits are calculated, a token must be picked from the probabilities. The context provides a sampler to optionally apply sampling criteria before sampling (picking) the next token.
int token = context.sampler(logits)
    .keepTopK(50)
    .applyTemperature(1.1f)
    .sample();
This token can then be translated back to readable text by invoking model.tokens().detokenize(token).
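Putting these pieces together, the following is a hedged sketch of a simple generation loop built only from the calls shown above. It assumes that each newly sampled token is fed back to the context by staging a one-token piece on the same sequence, and it uses an arbitrary cap on the number of generated tokens; see jllama-examples for authoritative usage.

// A sketch of a simple generation loop; the feedback of each sampled token via a
// one-token piece on the same sequence is an assumption, not a documented contract.
Model model = Llama.library()
    .newModel()
    .with()
    .path("/path/to/model.gguf")
    .load();
Context context = model.newContext()
    .withDefaults()
    .create();

// Tokenize the prompt and stage it as the first piece of a new sequence.
List<Integer> promptTokens = model.tokens().tokenize("Once upon a time");
Sequence sequence = Sequence.tokenSequence(0);
Batch batch = context.batch()
    .type(SequenceType.TOKEN)
    .get();
batch.stage(sequence.piece(promptTokens));

StringBuilder output = new StringBuilder();
for (int i = 0; i < 64; i++) { // arbitrary generation limit
    context.evaluate(batch);
    List<Float> logits = context.getLogits(sequence);
    int nextToken = context.sampler(logits)
        .keepTopK(50)
        .applyTemperature(1.1f)
        .sample();
    // NOTE: buffer incomplete UTF-8 characters as described above before printing.
    byte[] piece = model.tokens().detokenize(nextToken);
    output.append(new String(piece, StandardCharsets.UTF_8));
    // Assumption: subsequent evaluations only need the newly sampled token staged.
    batch.stage(sequence.piece(List.of(nextToken)));
}
System.out.println(output);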
For examples of use, including a simple prompt evaluator and a Spring-based chatbot, see jllama-examples.
Logging is provided by log4j2-api, which can be integrated with the logger implementation of your choice.
A core library is also provided, which endeavors to provide a close-to-one-to-one mapping to llama.cpp's C interface.
More documentation will be available soon.
jLlama is currently built by a single Docker image that is configured with toolchains for Mac, Linux, and Windows. The Docker image can be found under docker/jllama-build and is built and run by GitHub Actions workflows defined under .github.