Table of Contents
- Introduction
- Key Features
- Fine-tuned models
- Inferencing
- Requirements
- Getting Started
- Frontend
- Backend
- Code Formatting
Introduction
Sapien is a LLaMA 3.1 8B model fine-tuned using Low-Rank Adaptation (LoRA) on the Alpaca dataset. Training is optimized for 4-bit and 16-bit precision.
Watch a more detailed project walkthrough.
Key Features
- LoRA (Low-Rank Adaptation) for optimizing large language models.
- 4-bit & 16-bit precision fine-tuning using bitsandbytes quantization (the base checkpoint is the bnb-4bit build).
- Alpaca Dataset: Instruction-based fine-tuning dataset.
- Model Hosting: Push the trained model to Hugging Face for deployment.
Fine-tuned models
- My fine-tuned Llama model
- Official Meta Llama 3.2 for Ollama (released on 25th Sept 2024)
Model config:

    {
      "_name_or_path": "unsloth/meta-llama-3.1-8b-bnb-4bit",
      "architectures": ["LlamaForCausalLM"],
      "attention_bias": false,
      "attention_dropout": 0.0,
      "bos_token_id": 128000,
      "eos_token_id": 128001,
      "head_dim": 128,
      "hidden_act": "silu",
      "hidden_size": 4096,
      "initializer_range": 0.02,
      "intermediate_size": 14336,
      "max_position_embeddings": 131072,
      "mlp_bias": false,
      "model_type": "llama",
      "num_attention_heads": 32,
      "num_hidden_layers": 32,
      "num_key_value_heads": 8,
      "pad_token_id": 128004,
      "pretraining_tp": 1,
      "rms_norm_eps": 1e-5,
      "rope_scaling": {
        "factor": 8.0,
        "high_freq_factor": 4.0,
        "low_freq_factor": 1.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "llama3"
      },
      "rope_theta": 500000.0,
      "tie_word_embeddings": false,
      "torch_dtype": "float16",
      "transformers_version": "4.45.1",
      "unsloth_version": "2024.9.post3",
      "use_cache": true,
      "vocab_size": 128256
    }
Trainer stats:

    [
      60,
      0.8564618517955144,
      {
        "train_runtime": 441.2579,
        "train_samples_per_second": 1.088,
        "train_steps_per_second": 0.136,
        "total_flos": 5726714157219840.0,
        "train_loss": 0.8564618517955144,
        "epoch": 0.00927357032457496
      }
    ]
Inferencing
(This will work only when you have all the model files saved locally after running the trainer.)

    from unsloth import FastLanguageModel

    # `model`, `tokenizer`, and `alpaca_prompt` are the objects created during training
    FastLanguageModel.for_inference(model)  # enable optimized inference mode

    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Give me the first 10 digits of Pi",  # instruction
                "3.14159",                            # input
                "",                                   # output - leave blank for generation
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.batch_decode(outputs))
Requirements
To run Sapien, you need the following:
- Node.js (version 14 or higher)
- npm (version 6 or higher)
- Python (version 3.7 or higher)
- Ollama (version 1.7 or higher; used in Method 2 of the Backend section)
- llama.cpp (used in Method 3 of the Backend section)
Make sure you have these installed before proceeding.
Getting Started
To get started with Sapien, follow these steps:
- Clone the repository:

      git clone https://github.com/annalhq/sapien.git
      cd sapien

- Install dependencies:

      npm install

- Run the dev server:

      npm run dev

- For the backend, refer to the Backend section below.
Frontend
The frontend is deployed with Next.js and the shadcn/ui component library, alongside Vercel's AI SDK UI.
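The chat page itself is not reproduced in this README; below is a minimal sketch of what a streaming chat UI built on the AI SDK's useChat hook could look like. The file path, placeholder text, and the /api/chat route it posts to are assumptions for illustration, not the repository's actual code.

```tsx
// app/page.tsx — hypothetical chat page using the Vercel AI SDK UI helpers
"use client";

import { useChat } from "ai/react";

export default function Chat() {
  // useChat posts to /api/chat by default and streams tokens into `messages`
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <main>
      {messages.map((m) => (
        <p key={m.id}>
          <strong>{m.role === "user" ? "You: " : "Sapien: "}</strong>
          {m.content}
        </p>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask Sapien..." />
      </form>
    </main>
  );
}
```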
Code Formatting
These integrations ensure there are no server-side issues at deploy time and keep the code style consistent.
If you make changes to the code, run npm run format before committing; otherwise Husky will prevent you from committing to the repository. To skip the Husky hooks, add the --no-verify flag to your git command.
To run the lint and format checks manually:

    npm run check-lint
    npm run check-format
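The repository's actual script definitions are not shown here; the following is a hypothetical sketch of how npm run format, npm run check-format, and npm run check-lint might be wired up with Prettier and ESLint (the exact commands are assumptions):

```json
{
  "scripts": {
    "format": "prettier --write .",
    "check-format": "prettier --check .",
    "check-lint": "eslint . --ext ts,tsx"
  }
}
```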
Backend
There are three ways to run the model behind the chat backend.
Method 1: Hugging Face Serverless Inference API
The Serverless Inference API lets you easily run inference on my fine-tuned model, or on any other text-to-text generation model.
Getting tokens from Hugging Face
Log in to Hugging Face and get your tokens from here.
As recommended, it is preferable to create a fine-grained token scoped to "Make calls to the serverless Inference API".
See the official tokens guide by Hugging Face.
In v1.0.0 of this project, the HfInference client is used to handle inference from the model.

    import { HfInference } from "@huggingface/inference";

    // Read the token from the environment rather than hard-coding it
    const inference = new HfInference(process.env.HUGGINGFACE_API_KEY);

    // The fine-tuned model is a text-generation model, so call textGeneration
    const result = await inference.textGeneration({
      model: "annalhq/llama-3.1-8B-lora-alpaca",
      inputs: "Hi! How are you?",
    });

    console.log(result.generated_text);
Store your HF token in .env.local as:

    HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

This will use the serverless API for my model to stream text.
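The streaming call itself is not shown above; below is a minimal sketch of a Next.js route that streams tokens from the serverless API using textGenerationStream from @huggingface/inference. The route path and generation parameters are assumptions for illustration.

```ts
// app/api/generate/route.ts — hypothetical route streaming tokens from the
// Hugging Face Serverless Inference API.
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(process.env.HUGGINGFACE_API_KEY);

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      // textGenerationStream yields token chunks as they are generated
      for await (const chunk of inference.textGenerationStream({
        model: "annalhq/llama-3.1-8B-lora-alpaca",
        inputs: prompt,
        parameters: { max_new_tokens: 256 },
      })) {
        controller.enqueue(encoder.encode(chunk.token.text));
      }
      controller.close();
    },
  });

  return new Response(stream, { headers: { "Content-Type": "text/plain" } });
}
```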
Method 2: Ollama (recommended for running locally)
Here you can use Ollama to serve my model locally. Since Ollama alone does not handle the stream plumbing needed by the chat frontend, I've used the Vercel AI SDK together with ModelFusion: the Vercel AI SDK handles stream forwarding and rendering, while ModelFusion integrates Ollama with the Vercel AI SDK. (A simplified sketch that calls Ollama's REST API directly is shown after the steps below.)
- Install Ollama from the official site.
- Pull the model into Ollama. If you want to use my model in Ollama, follow these instructions:
  - Download HFDownloader.
  - Download my model in Safetensors format from HF:

        hf -m annalhq/llama-3.1-8B-lora-alpaca

  - Import the fine-tuned adapter from the Safetensors weights. First, create a Modelfile with a FROM command pointing at the base model you used for fine-tuning, and an ADAPTER command pointing to the directory with your Safetensors adapter:

        FROM <base annalhq/llama-3.1-8B-lora-alpaca>
        ADAPTER /path/to/safetensors/adapter/directory

  - Create the model from the Modelfile:

        ollama create annalhq/llama-3.1-8B-lora-alpaca

  - Lastly, test the model:

        ollama run annalhq/llama-3.1-8B-lora-alpaca
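The ModelFusion wiring is not reproduced in this README; as a simplified sketch that bypasses ModelFusion and streams from Ollama's REST API directly (the route path and prompt handling are assumptions for illustration):

```ts
// app/api/ollama/route.ts — hypothetical route streaming completions
// from a locally running Ollama server (default port 11434).
export async function POST(req: Request) {
  const { prompt } = await req.json();

  // Ollama streams newline-delimited JSON objects, each with a `response` field
  const ollamaRes = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({
      model: "annalhq/llama-3.1-8B-lora-alpaca",
      prompt,
      stream: true,
    }),
  });

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      const reader = ollamaRes.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        // Each complete line is a JSON chunk; forward only the generated text
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? "";
        for (const line of lines) {
          if (!line.trim()) continue;
          const chunk = JSON.parse(line);
          controller.enqueue(encoder.encode(chunk.response ?? ""));
        }
      }
      controller.close();
    },
  });

  return new Response(stream, { headers: { "Content-Type": "text/plain" } });
}
```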
Method 3: llama.cpp
- Clone llama.cpp:

      git clone https://github.com/ggerganov/llama.cpp
      cd llama.cpp

- Compile llama.cpp using make:
  - On Linux or macOS:

        make

  - On Windows (x86/x64 only; arm64 requires cmake):
    - Download the latest Fortran version of w64devkit.
    - Extract w64devkit on your PC.
    - Run w64devkit.exe.
    - Use the cd command to reach the llama.cpp folder.
    - From here you can run:

          make

- Convert the Safetensors model files of my model to GGUF using these instructions.
- Start the llama.cpp server:

      ./server -m models/llama-3.1-8B-lora-alpaca.gguf
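The README stops at starting the server; as a minimal sketch of how a client could query it (this assumes llama.cpp's built-in HTTP server and its /completion endpoint on the default port 8080; the prompt and parameters are illustrative):

```ts
// Hypothetical client for the llama.cpp HTTP server started above.
async function complete(prompt: string): Promise<string> {
  // The llama.cpp server exposes a /completion endpoint (port 8080 by default)
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 128 }),
  });
  const data = await res.json();
  return data.content; // the generated text
}

complete("Give me the first 10 digits of Pi").then(console.log);
```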