AWS Lambda Implementations of llama.cpp and Ollama for serverless inference.
AWS Lambda has huge potential for deploying serverless LLMs using llama.cpp or Ollama. There's minimal configuration, inherent scaling, and easy integration with the rest of the AWS ecosystem. There's also a very generous free tier to help ease the cost of running an LLM.
Phi-2 and TinyLlama are small enough to run CPU-only inference at "reasonable" speed: roughly 120+ s for initialization and 20-40 s per inference in testing. There should be no expectation of using this for chat; treat it as batch inference only.
There is an example in the ollama-lambda project of using a Mistral 7B model variant:
- openhermes-2.5-neural-chat-v3-3-slerp.Q4_K_M.gguf
This takes anywhere from 6 to 9 minutes to load. Inference after init is more "reasonable", around 30-40 s in testing.
2024-03-22 Update
There is now an arm branch. llama.cpp has been updated to build for arm64. Ollama doesn't play well with colima and hasn't been modified. So far there seems to be no noticeable difference between arm64 and amd64.
The models above are all Q4_K_M quantizations, with the exception of TinyLlama, which uses Q5_K_M.
Use the Ollama project. In testing, Ollama seems to have better startup times than using llama.cpp directly, even when accounting for the Ollama server starting up.
This project relies on the following:
- Serverless Framework
- AWS
  - default profile for credentials.
- Lambda Runtime Emulator
  - To keep image sizes smaller this is not included in the Docker image. The expectation is that it is installed in the user's home folder. If it is installed elsewhere, edit the Makefile accordingly (see the install sketch below).
- make
- docker
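For reference, here is a minimal sketch of installing the AWS Lambda Runtime Interface Emulator into the home folder, based on the AWS documentation. The install path is an assumption, so adjust it to whatever the Makefile expects:
mkdir -p ~/.aws-lambda-rie
curl -Lo ~/.aws-lambda-rie/aws-lambda-rie https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie # x86_64 build; use aws-lambda-rie-arm64 for arm64
chmod +x ~/.aws-lambda-rie/aws-lambda-rie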
Pick which implementation you want to develop, ollama-lambda or llama-cpp-lambda, and cd to that directory.
Make a virtual env and install dependencies:
python -m venv env
source env/bin/activate
pip install -r requirements.txt
Create a .env file for local development:
export IMAGE=ollama # Used in docker build as the image name.
export MODEL_REPO=TheBloke/phi-2-dpo-GGUF # Model repo for fetching models.
export MODEL_GGUF=phi-2-dpo.Q4_K_M.gguf # The model quantization to fetch.
export MODEL_PATH=models/phi_dpo_4_K_M # Path to the local location of the model. Note you must create this.
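As an example of pointing at a different model, a .env for the TinyLlama Q5_K_M build might look like the following. The repo and file names are assumptions based on TheBloke's Hugging Face uploads, so verify them before fetching:
export IMAGE=ollama
export MODEL_REPO=TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF # Assumed repo name; verify on Hugging Face.
export MODEL_GGUF=tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf # Assumed file name within that repo.
export MODEL_PATH=models/tinyllama_5_K_M # Create this directory before fetching.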
If installing a new model, change to the project folder:
- Make the model path if it does not exist:
mkdir -p models/<NEW_MODEL>
- Run make fetch-model
- This will download the specified model into the directory given by the env var MODEL_PATH.
If using Ollama you will need to create a Modelfile. See the examples in the models folder. Additional parameters can be found in the Ollama Modelfile documentation.
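For illustration, a minimal Modelfile sketch for the Phi-2 GGUF above. The FROM path and parameter values are assumptions, so match them to where your model actually lives and to the existing examples:
FROM ./phi-2-dpo.Q4_K_M.gguf # Assumed relative path to the fetched GGUF.
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
SYSTEM "You are a helpful assistant."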
llama.cpp does not use a Modelfile. Instead, the parameters are set in the lambda handler. Example handlers can be found in the models folder.
To develop and test locally you should have a .env file in the project directory: ollama-lambda/.env or llama-cpp-lambda/.env.
If you wish to change the test prompt, edit prompt.json in the tests folder.
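The exact schema of prompt.json is defined by the handler in this repo, so treat the following only as a plausible shape; the key name is a placeholder, not the project's actual contract:
{"prompt": "Why is the sky blue?"}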
To build the docker image run:
make docker-build
You may need to comment or uncomment model paths in .dockerignore. This prevents other models from being included in the build context and adding to an already long build time.
To test the docker image, run the following to start the lambda function:
make docker-run
In a separate terminal run:
make docker-test
This will send the prompt found in tests/prompt.json to the lambda function.
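The local container exposes the standard Lambda Runtime Interface Emulator invocation endpoint, so you can also hit it directly with curl. The port below assumes the common 9000:8080 mapping, which may not match this project's Makefile:
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d @tests/prompt.json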
To deploy the lambda to AWS, change to the serverless-config directory, then into the subdirectory for the model type you want to deploy.
If you wish to change the model being deployed, edit the config.json file with the correct model path.
Ensure your AWS credentials are valid and run:
sls deploy
This will most likely take a few minutes.
Once deployed, there will be a URL in the output. Copy this URL and run:
curl -H "Content-Type: application/json" <LAMBDA_URL> -d @../../tests/prompt.json
This will invoke the lambda function. On first invocation, there is a substantial warmup period where the model is being loaded. This can take upwards of a minute or more.
Occasionally, on first invocation the lambda will return an Internal Server Error. This means the lambda function timeout was exceeded because the model is taking a long time to load. If you invoke it again, the model should already be loaded (or will reload) and the function will return a response from the LLM.
Subsequent invocations will be faster, around 20-40 s for additional prompts. If the lambda function has concurrent invocations, the model will be loaded once per concurrent execution environment.
Additional changes can be pushed using the same sls command:
sls deploy
To remove the lambda function, change to the serverless-config directory and the model type. Then run:
sls remove