I plan to implement several acceleration techniques for Large Language Models (LLMs): I enjoy building these myself and love the challenge of bringing research papers into real-world applications.
If there are any techniques you'd like to see developed or discussed, feel free to reach out. Thanks!
I'm excited to dive deeper into AI research!
- 2024/12/16: Add the Medusa-1 Training Script v2
- 2024/12/15: Add the Medusa-1 Training Script
- 2024/12/12: Update the KV Cache support for Speculative Decoding
- 2024/12/04: Add the Kangaroo Training Script v2
- 2024/11/26: Add the Kangaroo Training Script
- 2024/11/22: Update the Target Model Keep Generation Mechanism experiment
- 2024/11/18: Update the Self-Speculative Decoding experiment results of `google--gemma-2-9b-it`.
- 2024/11/12: Reviewing implementation challenges for Self-Speculative Decoding and evaluating model compatibility for improved efficiency.
- 2024/11/10: Initial setup for Self-Speculative Decoding completed; data pipeline in place for testing draft-and-verify.
- 2024/11/08: Speculative Decoding successfully implemented; verified improved inference time with no noticeable accuracy degradation (a minimal sketch of the loop follows below).
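For reference, here is a minimal sketch of the draft-and-verify loop mentioned above. It implements the greedy verification variant (accept the longest prefix where the draft's tokens match the target's argmax), not the paper's full rejection-sampling scheme; the model names, `gamma`, and the helper `speculative_generate` are illustrative assumptions, and KV caching is omitted for clarity, so it may also slightly overshoot `max_new_tokens`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pair; any small draft / large target sharing a tokenizer works.
DRAFT_NAME = "double7/vicuna-68m"       # assumption, not this repo's choice
TARGET_NAME = "lmsys/vicuna-7b-v1.3"    # assumption, not this repo's choice

@torch.no_grad()
def speculative_generate(target, draft, input_ids, max_new_tokens=64, gamma=4):
    """Greedy draft-and-verify: the draft proposes `gamma` tokens, the target
    verifies them in a single forward pass, the longest matching prefix is
    accepted, and the target always contributes one bonus token, so every
    iteration makes progress. Output matches target-greedy decoding exactly."""
    ids = input_ids
    while ids.shape[1] - input_ids.shape[1] < max_new_tokens:
        # 1) Draft proposes gamma tokens autoregressively (greedy, no cache).
        draft_ids = ids
        for _ in range(gamma):
            next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]                         # (1, gamma)

        # 2) Target scores the context plus proposal in one forward pass.
        tgt_logits = target(draft_ids).logits                          # (1, L, V)
        tgt_pred = tgt_logits[:, ids.shape[1] - 1 : -1, :].argmax(-1)  # (1, gamma)

        # 3) Accept the longest prefix where draft and target agree.
        n_accept = int((proposed == tgt_pred).int().cumprod(-1).sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=-1)

        # 4) Bonus token: the target's own prediction at the first mismatch
        #    (or one step past a fully accepted block) is always valid.
        bonus = tgt_logits[:, ids.shape[1] - 1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, bonus], dim=-1)
    return ids

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(TARGET_NAME)
    target = AutoModelForCausalLM.from_pretrained(TARGET_NAME)
    draft = AutoModelForCausalLM.from_pretrained(DRAFT_NAME)
    ids = tok("The capital of France is", return_tensors="pt").input_ids
    print(tok.decode(speculative_generate(target, draft, ids)[0]))
```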
- Batched Speculative Decoding: TBD.
- Prompt lookup decoding: Determine timeline after reviewing initial implementations (see the lookup sketch after this list).
- UAG Integration: Assess when to integrate after Medusa and Kangaroo are in place.
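Prompt lookup decoding replaces the draft model with a simple n-gram match against the context itself: the last few tokens are searched for earlier in the prompt, and whatever followed that earlier occurrence is copied as the draft. Below is a hedged sketch of just the candidate-generation step, assuming the copying heuristic from the prompt-lookup-decoding repository; the function name and defaults are mine, and the returned candidates would still go through the usual verification pass.

```python
import torch

def prompt_lookup_candidates(input_ids: torch.Tensor,
                             ngram_size: int = 3,
                             num_draft: int = 10) -> torch.Tensor:
    """Propose draft tokens by matching the last `ngram_size` tokens against
    earlier occurrences in the context and copying what followed the match.
    Returns an empty tensor when no match is found (fall back to normal decoding)."""
    seq = input_ids[0]
    if seq.shape[0] <= ngram_size:
        return seq.new_empty(0)
    pattern = seq[-ngram_size:]
    # All sliding windows over the context, excluding the trailing pattern itself.
    windows = seq.unfold(0, ngram_size, 1)[:-1]            # (L - n, n)
    matches = (windows == pattern).all(dim=1).nonzero().flatten()
    for start in reversed(matches.tolist()):               # prefer recent matches
        begin = start + ngram_size
        end = min(begin + num_draft, seq.shape[0])
        if end > begin:
            return seq[begin:end]                          # copied continuation
    return seq.new_empty(0)
```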
- 2024/11/08 | Complete Speculative Decoding, following the paper "Fast Inference from Transformers via Speculative Decoding"
- 2024/11/15 | Implement Self-Speculative Decoding, as per "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding"
    - LayerSkip model architecture
    - Bayesian Optimization for Layer Skip Selection (AR)
    - Adaptive Draft-Exiting Mechanism (see the draft-exiting sketch after this list)
    - Optimization
    - Bayesian Optimization for Layer Skip Selection (Speed)
    - `gemma-2-9b-it` experiment
- 2024/11/22 | Develop Kangaroo, following "Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting"
    - Kangaroo model
    - Training Script
    - Implement double early exits to improve speed
- 2024/11/29 | Implement Medusa, from "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" (see the Medusa-heads sketch after this list)
    - Medusa model
    - Training Script (Medusa-1)
    - Testing
- 2024/12/20 | Implement Eagle, from "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty"
- TBD | Implement Batched Speculative Decoding, from "The Synergy of Speculative Decoding and Batching in Serving Large Language Models"
- TBD | Implement prompt lookup decoding, from the prompt-lookup-decoding GitHub repository
- TBD | Implement UAG (Universal Assisted Generation), from the Universal Assisted Generation blog post
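The Adaptive Draft-Exiting Mechanism in the Self-Speculative Decoding item above (and the second of Kangaroo's two early exits) boils down to stopping the draft as soon as its confidence drops, so uncertain tokens are not drafted only to be rejected. A minimal sketch, assuming a `draft_step` callable that wraps the layer-skipped draft forward pass and returns next-token logits; the threshold value and all names here are illustrative, not this repo's:

```python
import torch

@torch.no_grad()
def adaptive_draft(draft_step, input_ids, max_draft=12, threshold=0.6):
    """Confidence-based draft exiting: keep proposing greedy tokens while the
    draft's top-1 probability stays above `threshold`. Easy spans get long
    drafts; hard spans exit early, wasting fewer target verifications.
    `draft_step(ids) -> (1, vocab) logits` stands in for the layer-skipped
    forward pass of the self-speculative (or Kangaroo) draft model."""
    ids = input_ids
    for _ in range(max_draft):
        probs = torch.softmax(draft_step(ids), dim=-1)   # (1, vocab)
        conf, tok = probs.max(dim=-1)                    # each shape (1,)
        ids = torch.cat([ids, tok.unsqueeze(0)], dim=-1)
        if conf.item() < threshold:
            break  # low confidence: stop drafting, hand off to the target
    return ids[:, input_ids.shape[1]:]                   # proposed tokens only
```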
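For the Medusa item, the core architectural idea is small: K extra heads on top of the backbone's last hidden state, where head k predicts the token k+1 steps ahead, so one forward pass yields several candidate future tokens for verification. Below is a sketch of Medusa-1-style heads (a residual block plus a linear head per position, trained with the backbone frozen); the class names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used inside each Medusa head."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """K extra decoding heads over the frozen backbone's last hidden state;
    head k predicts the token at position t + 1 + k, so a single forward
    pass proposes several future tokens for (tree or linear) verification."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(ResBlock(hidden_size),
                          nn.Linear(hidden_size, vocab_size, bias=False))
            for _ in range(num_heads)
        )

    def forward(self, last_hidden):  # last_hidden: (B, T, H)
        # Stack per-head logits: (K, B, T, V); head k's logits at step t are
        # its prediction for the token at position t + 1 + k.
        return torch.stack([head(last_hidden) for head in self.heads], dim=0)
```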