Llama-3.1-405B-FP8

Llama 3.1 is a release in Meta's Llama series of large language models, built to understand and generate human-like text. This directory covers the 405B variant with FP8 weights, which is designed to give high-quality, context-aware responses.

Key Features

  • Enhanced Natural Language Understanding (NLU): Llama 3.1 has significantly improved NLU capabilities, enabling it to comprehend and process complex language patterns more effectively.
  • Contextual Awareness: It maintains context over extended conversations, ensuring coherent and relevant responses even in lengthy interactions.
  • Multilingual Support: Llama 3.1 supports multiple languages, making it versatile for global applications.
  • Customizability: Developers can fine-tune the model on specific datasets to cater to niche requirements, enhancing its adaptability.

Technical Specifications

  • Architecture: Decoder-only transformer with grouped-query attention.
  • Parameters: 405 billion, as the model name indicates; the FP8 variant stores weights in 8-bit floating point to reduce memory use.
  • Training Data: Trained on a diverse corpus of text, including books, articles, and websites, to ensure a broad understanding of language.
  • Fine-tuning: Supports fine-tuning on domain-specific data to improve performance in targeted applications.
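
As a rough sizing aid, the sketch below estimates weight memory from the parameter count and the bytes stored per parameter. The 405 billion figure comes from the model name; the function name and the 8-GPU split are illustrative assumptions, and the estimate covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: float, num_gpus: int = 1) -> float:
    """Approximate per-GPU weight memory in GB (1 GB = 10**9 bytes), weights only."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_params * bytes_per_param / num_gpus / 1e9


PARAMS_405B = 405e9  # taken from the model name

fp8_total = weight_memory_gb(PARAMS_405B, 1)                # FP8: 1 byte per parameter
bf16_total = weight_memory_gb(PARAMS_405B, 2)               # BF16: 2 bytes per parameter
fp8_per_gpu = weight_memory_gb(PARAMS_405B, 1, num_gpus=8)  # assumed even 8-way split

print(f"FP8 weights:  {fp8_total:.1f} GB total, {fp8_per_gpu:.3f} GB per GPU on 8 GPUs")
print(f"BF16 weights: {bf16_total:.1f} GB total")
```

Under these assumptions, FP8 halves the weight footprint relative to BF16, which is the main reason an FP8 checkpoint is easier to fit on a single multi-GPU node.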

Performance Metrics

  • Accuracy: Achieves high accuracy in language comprehension and generation benchmarks.
  • Latency: Optimized for low-latency responses, ensuring real-time interaction capabilities.
  • Scalability: Designed to scale efficiently across multiple servers, supporting high-demand applications.

Use Cases

  • Chatbots and Virtual Assistants: Provides human-like interactions, enhancing user experience.
  • Content Generation: Assists in creating articles, reports, and other textual content.
  • Language Translation: Offers accurate and context-aware translations across supported languages.
  • Customer Support: Automates responses to common queries, improving support efficiency.

Serve with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels
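
To illustrate the idea behind paged key/value memory, here is a toy block-table allocator. It is a conceptual sketch only, not vLLM's implementation: the class name, method names, and block sizes are all hypothetical. Each sequence receives fixed-size cache blocks on demand instead of one large contiguous reservation, so memory is wasted only in the last, partially filled block.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence to a list of fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve room for one more token and return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): take a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    allocator.append_token(seq_id=0)       # 5 tokens need 2 blocks of size 4
blocks_used = len(allocator.block_tables[0])
allocator.free(seq_id=0)
blocks_free_after = len(allocator.free_blocks)
print(blocks_used, blocks_free_after)
```

Because blocks return to a shared pool as soon as a sequence finishes, new requests can be admitted between decoding steps, which is what makes continuous batching practical.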

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
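
Because the server speaks the OpenAI API, a client needs only an HTTP POST. The sketch below builds a completion request with the Python standard library. It assumes a vLLM server is already listening at `http://localhost:8000`; the model name is a placeholder for whatever name the server was started with, and the helper function is hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-style /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    base_url="http://localhost:8000",   # assumed server address
    model="my-llama-3.1-405b-fp8",      # placeholder; use the served model name
    prompt="Hello, my name is",
)
print(request.full_url)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a live server.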

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Command-R (CohereForAI/c4ai-command-r-v01, etc.)
  • DBRX (databricks/dbrx-base, databricks/dbrx-instruct, etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
  • LLaMA, Llama 2, and Meta Llama 3 (meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Phi-3 (microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
  • Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
  • StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
  • Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Getting Started

Visit our documentation to get started.