
AIDocks

The AI Trainer's Dry Dock.

Features

  • πŸš€ Fine-Tune Embeddings, ReRankerings & Large Language Models (LLMs),
  • πŸš€ Dataset templates,
  • πŸš€ Build-Your-Own Mixture-of-Experts (MoE),
  • πŸš€ Optimize LLMs with LASER-Random Matrix Theory,
  • πŸš€ Quantize models for optimal model size &
  • πŸš€ Publish models to πŸ€— HuggingFace Hub.

Roadmap

(unsorted)

  • Auto Hardware Detection -> Model recommendation for fine-tuning and inference
  • Combined LLM & retrieval model fine-tuning with human feedback
  • The Truth Tables: Distributed (private & shared) knowledge/document management in Chroma over a super- and sub-domain graph in Neo4j.
  • Model Conditioning: chat-based LLM alignment for domain (field) expertise with auto & human scoring of retrieval relevance, AI reasoning & conclusions.
    • Memory & History
    • Domain-specific knowledge retrieval & expert prompting
    • Multiple conversations
    • Multiple human & AI participants
    • General & agent-specific knowledge attachment by domain tags
    • Auto & human evaluation of retrieval, reasoning & conclusion results
  • AI Task Library

Disclaimer: AIDocks is in a very early development stage, so feedback and contributions are highly appreciated!

Pre-Requisites

  1. CUDA-capable GPU
  2. Docker & docker-compose
  3. NVIDIA Container Toolkit

Quick Start

git clone https://github.com/l4b4r4b4b4/AIDocks
cd AIDocks
docker-compose up -d && \
docker-compose ps && \
docker-compose logs -f

Go to the interactive API documentation to explore all available endpoints & features!

Services

  • Docks WebApp
  • Docks API
  • Vision: LLaVA 1.6 service incl. Gradio frontend, controller & model worker
  • llm-inference

Endpoints πŸš€

The following endpoints are exposed:

  1. /train
  2. /compose
  3. /optimize
  4. /quantize
  5. /publish

/train Training & Fine-Tuning

The training routes expose endpoints to fine-tune LLMs as well as the embedding and re-ranking models used for retrieval.

/train/llm LLM fine-tuning (DPO & SFT)

Try API endpoint. Fine-tune Mistral and Llama models 2-5x faster with 50% less memory using unsloth.
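
A minimal sketch of calling this endpoint from Python is shown below. The base URL (http://localhost:8000) and the request fields are assumptions for illustration only; the authoritative request schema is in the interactive API docs.

import requests

# Hypothetical request body for /train/llm; field names are illustrative,
# not the repository's actual schema.
payload = {
    "base_model": "unsloth/mistral-7b-bnb-4bit",
    "method": "sft",                         # or "dpo"
    "dataset": "your-org/your-chatml-dataset",
}

# Assumes the Docks API started via docker-compose is reachable on localhost:8000.
response = requests.post("http://localhost:8000/train/llm", json=payload, timeout=600)
response.raise_for_status()
print(response.json())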

Example datasets when using ChatML (see the record sketches after this list) for

  1. SFT
  2. DPO
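
For orientation, the records below sketch what ChatML-style SFT and DPO training examples commonly look like. The exact field names expected by the example datasets may differ; treat these as illustrations, not the repository's schema.

# Illustrative ChatML-style records; field names are assumptions.
sft_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does LoRA stand for?"},
        {"role": "assistant", "content": "Low-Rank Adaptation."},
    ]
}

dpo_record = {
    "prompt": "Explain Mixture-of-Experts in one sentence.",
    "chosen": "An MoE routes each token through a small subset of expert sub-networks.",
    "rejected": "It is a kind of database.",
}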

Supported Models

  • Llama,
  • Yi,
  • Mistral,
  • CodeLlama,
  • Qwen (llamafied),
  • DeepSeek, and their derived models (OpenHermes etc.).

Features

  1. All kernels are written in OpenAI's Triton language, with a manual backprop engine.
  2. 0% loss in accuracy - no approximation methods - all exact.
  3. No change of hardware required. Supports NVIDIA GPUs from 2018 onwards, minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20/30/40x, A100, H100, L40 etc.). Check your GPU! GTX 1070/1080 work, but are slow.
  4. Works on Linux and Windows via WSL.
  5. Download 4-bit models 4x faster from 🤗 HuggingFace, e.g. unsloth/mistral-7b-bnb-4bit.
  6. Supports 4-bit and 16-bit QLoRA / LoRA fine-tuning via bitsandbytes.

/train/emb Embeddings

LoRA-PEFT for embedding models using the peft and accelerate libraries.
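
As a rough sketch of what LoRA-PEFT on an embedding model involves (the model choice, target modules and hyper-parameters below are assumptions; the contrastive training loop and accelerate launch are omitted, and this is not the repository's exact pipeline):

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Hypothetical BERT-style embedding model; any similar encoder would do.
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in BERT-style encoders
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights remain trainable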

Supported Models

Example datasets

/train/rerank ReRankers

LoRA-PEFT for re-ranking models.
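
A comparable sketch for a cross-encoder re-ranker, which scores (query, passage) pairs; again, the base model and target modules are assumptions for illustration, not the repository's exact setup.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # hypothetical base re-ranker
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

model = get_peft_model(
    model,
    LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, target_modules=["query", "value"]),
)

# A re-ranker scores query/passage pairs; higher scores mean "more relevant".
batch = tokenizer(
    ["what is LoRA?"],
    ["LoRA adds trainable low-rank adapters to frozen weights."],
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    print(model(**batch).logits)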

Supported Models

Example datasets

/compose - BYO-MoE

Try API endpoint

/compose is an endpoint for combining Mistral or Llama models of the same size into Mixture-of-Experts models. The endpoint will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.

The /compose endpoint can be used with minimal or no GPU.

The /compose endpoint uses its own JSON configuration syntax, which looks like this (request body):

{
    "base_model": "cognitivecomputations/dolphin-2.6-mistral-7b-dpo",
    "gate_mode": "hidden",
    "dtype": "bfloat16",
    "experts": [
        {
            "source_model": "teknium/OpenHermes-2.5-Mistral-7B",
            "positive_prompts": [
                "instruction",
                "solutions",
                "chat",
                "questions",
                "comprehension"
            ]
        },
        {
            "source_model": "openaccess-ai-collective/DPOpenHermes-7B",
            "positive_prompts": [
                "mathematics",
                "optimization",
                "code",
                "step-by-step",
                "science"
            ],
            "negative_prompts": [
                "chat",
                "questions"
            ]
        }
    ]
}

Options:

gate_mode: hidden, cheap_embed, or random

dtype: float32, float16, or bfloat16

Gate Modes

There are three methods for populating the MoE gates implemented.

"hidden"

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model so you might not be able to use this on constrained hardware (depending on the model).

Coming Soon: use --load-in-8bit or --load-in-4bit to reduce VRAM usage.

"cheap_embed"

Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower end hardware.

"random"

Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
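
Once a configuration like the one above is written out, submitting it is a plain POST request. A minimal sketch follows; the base URL and the file name moe_config.json are assumptions for illustration.

import json
import requests

# Load the composition config shown above (hypothetical file name).
with open("moe_config.json") as f:
    config = json.load(f)

# Assumes the Docks API is reachable on localhost:8000; merging can take a while.
response = requests.post("http://localhost:8000/compose", json=config, timeout=None)
response.raise_for_status()
print(response.json())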

/optimize - LaserRMT

Try API endpoint. Request body:

{
    "base_model_name" : "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "laser_model_name": "TinyLaser",
    "top_k_layers": 15
}

LaserRMT optimizes LLMs by combining Layer-Selective Rank Reduction (LASER) with the Marchenko-Pastur law from Random Matrix Theory. This method targets model complexity reduction while maintaining or enhancing performance, making it more efficient than the traditional brute-force search method.

  1. LASER Framework Adaptation: LaserRMT adapts the LASER technique, which reduces the complexity of neural networks by selectively pruning the weights of a model's layers.
  2. Marchenko-Pastur Law Integration: The Marchenko-Pastur law, a concept from Random Matrix Theory used to determine the distribution of eigenvalues in large random matrices, guides the identification of redundant components in LLMs. This allows for effective complexity reduction without loss of key information.
  3. Enhanced Model Performance: By systematically identifying and eliminating less important components in the model's layers, LaserRMT can potentially enhance the model's performance and interpretability.
  4. Efficient Optimization Process: LaserRMT provides a more efficient and theoretically robust framework for optimizing large-scale language models, setting a new standard for language model refinement.

This approach opens new avenues for optimizing neural networks, underscoring the synergy between advanced mathematical theories and practical AI applications. LaserRMT sets a precedent for future developments in the field of LLM optimization.
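
To make the idea concrete, the sketch below shows one way the Marchenko-Pastur law can drive rank selection: keep only the singular values of a weight matrix that rise above the noise edge predicted for a random matrix of the same shape, and rebuild a low-rank approximation from them. This is a simplified illustration of the principle (the noise-scale estimate is a stated assumption), not the repository's exact implementation.

import numpy as np

def mp_rank_reduce(W: np.ndarray) -> np.ndarray:
    """Low-rank approximation keeping only singular values above the MP noise edge."""
    m, n = W.shape
    U, S, Vt = np.linalg.svd(W, full_matrices=False)

    # Rough noise-scale estimate (assumption: the median singular value is noise-dominated).
    sigma = np.median(S) / np.sqrt(max(m, n))

    # Marchenko-Pastur edge: the largest singular value of an m x n pure-noise matrix
    # with entry variance sigma^2 concentrates around sigma * (sqrt(m) + sqrt(n)).
    threshold = sigma * (np.sqrt(m) + np.sqrt(n))

    k = int(np.sum(S > threshold))          # number of "signal" components kept
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

# Toy example: a rank-8 signal buried in noise is recovered with roughly 8 components.
rng = np.random.default_rng(0)
signal = 0.1 * rng.standard_normal((256, 8)) @ rng.standard_normal((8, 512))
noise = 0.02 * rng.standard_normal((256, 512))
W_reduced = mp_rank_reduce(signal + noise)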

/quantize/{method}

Try API endpoint

AWQ

Generate AWQ quantizations optimized for GPU inference.
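
A minimal sketch of requesting an AWQ quantization via the {method} path parameter; the base URL and body fields are assumptions, so check the interactive API docs for the real schema.

import requests

# Hypothetical request body; the actual fields are defined by the API.
payload = {"model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}

response = requests.post("http://localhost:8000/quantize/awq", json=payload, timeout=None)
response.raise_for_status()
print(response.json())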

/publish to HuggingFace πŸ€—

Try API endpoint. Publish generated local models to the 🤗 HuggingFace Hub.
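
A minimal sketch of a publish request; the base URL, the field names and the use of an HF_TOKEN environment variable are assumptions for illustration only.

import os
import requests

payload = {
    "model_path": "outputs/TinyLaser",       # hypothetical local model directory
    "repo_id": "your-username/TinyLaser",    # hypothetical target repo on the Hub
    "hf_token": os.environ["HF_TOKEN"],      # hypothetical way to pass credentials
}

response = requests.post("http://localhost:8000/publish", json=payload)
response.raise_for_status()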

Explaining Resources

Some explanatory resources for the concepts, technologies and tools used in this repository.

  1. MergeKit Mixtral
  2. Mixture of Experts for Clowns (at a Circus)
  3. Fernando Fernandes Neto, David Golchinfar and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.
  4. The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
  5. An Empirical view of Marchenko-Pastur Theorem

About

LLM-Training-API: Including Embeddings & ReRankers, mergekit, LaserRMT
