Feature Request: T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge #8485
Comments
The optimizations that they list on their GitHub page help with more efficient compute, but the bottleneck for GEMV kernels on the hardware that I'm interested in (GPUs, desktop CPUs) is not compute but memory bandwidth. Something like this is only going to be useful on e.g. phones that have comparatively little compute and where conserving battery life is very important. The specific comparisons they draw are also fishy.
Note that they did not compare with the types from #8151, so their numbers are inflated. See also #7931 (comment) and the associated commit replacing the q22_grid lookup table with SIMD-based unpacking (although the current implementation in #8151 doesn't shuffle anymore). But maybe there's something which can still be learned from their implementation. I don't have the same hardware on which they ran comparisons, so I can't directly compare the numbers.
Assuming that the gains in compute efficiency are real, their technique would be very useful for large-batch matrix-matrix multiplication or FlashAttention on CPUs, where you are compute-bound rather than I/O-bound. I would still assume that a CPU+GPU system would be better, but it would still be a very good addition to llama.cpp I think.
Adding to that, I would imagine a large 1.58-bit model (14B+) would be a lot more compute-bound than I/O-bound on CPU, or even GPU.
@sorasoras @JohannesGaessler Thank you for your interest in T-MAC. Our previous repo (as of the time of this post) may not fully showcase the benefits of T-MAC, and we have made some updates over the last month. To address your concerns here:
We are working on migrating to the latest llama.cpp version. Hopefully, our modifications can be merged into the mainline.
Thank you for the plots, those look much more convincing.
This is only true if the weights are dequantized to FP16/FP32 and the dot product is calculated using floating-point arithmetic. If you convert both the weights and the activations to 8 bit, you can use SIMD instructions on CPUs that are comparatively faster (make sure to report whether things like AVX2 are available on the tested hardware). Similarly, on CUDA GPUs the
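To make the 8-bit SIMD argument concrete, here is a minimal sketch (assuming unsigned 8-bit weights and signed 8-bit activations; the function name and test data are illustrative, and this is not llama.cpp's actual kernel): with AVX2, one vpmaddubsw + vpmaddwd pair performs 32 multiply-adds per iteration without any dequantization to FP16/FP32.

```cpp
// Illustrative int8 dot product using AVX2 (compile with -mavx2).
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Dot product of K unsigned 8-bit weights with K signed 8-bit activations.
// K must be a multiple of 32 for this sketch.
static int32_t dot_q8_avx2(const uint8_t *w, const int8_t *a, int K) {
    __m256i acc = _mm256_setzero_si256();
    const __m256i ones = _mm256_set1_epi16(1);
    for (int k = 0; k < K; k += 32) {
        __m256i vw = _mm256_loadu_si256((const __m256i *)(w + k));
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + k));
        // vpmaddubsw: unsigned * signed bytes -> pairwise 16-bit sums
        __m256i prod16 = _mm256_maddubs_epi16(vw, va);
        // vpmaddwd with ones: widen and pair-sum into 32-bit lanes
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(prod16, ones));
    }
    // Horizontal reduction of the eight 32-bit lanes.
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i) sum += lanes[i];
    return sum;
}

int main() {
    uint8_t w[32];
    int8_t  a[32];
    for (int i = 0; i < 32; ++i) { w[i] = 2; a[i] = (int8_t)(i - 16); }
    // Expected: 2 * sum_{i=0..31}(i - 16) = 2 * (496 - 512) = -32
    printf("dot = %d\n", dot_q8_avx2(w, a, 32));
    return 0;
}
```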
llama.cpp already converts the activations to 8-bit, but we still observe that after quantizing the weights to lower bits, the GEMV becomes more compute-bottlenecked. According to our black-box profiling, the LUT SIMD instructions (tbl/pshuf) also have better throughput (lower CPI) than the FMA instructions (dot/fma...). Additionally, T-MAC needs fewer TBL instructions at lower bit-widths. Results for an AVX2 CPU (Surface Book 3) are provided here. We agree with you on the statement regarding CUDA GPUs. From our insights, GPUs aren't well-suited for LUT due to their limited on-chip memory per core. Conversely, placing the LUT in shared memory leads to slow random access caused by bank conflicts.
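For readers unfamiliar with the table-lookup approach being discussed, here is a minimal scalar sketch of the idea under simplified assumptions (1-bit ±1 weights, int8 activations; this is not T-MAC's actual kernel): partial sums over groups of 4 activations are precomputed for all 16 possible weight patterns, so each weight group then costs one lookup and one add instead of multiply-adds. In a real kernel the 16-entry table would be quantized to fit a SIMD register and indexed with tbl/pshuf.

```cpp
// Scalar sketch of LUT-based low-bit GEMV (illustrative only).
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int G = 4;            // activations per lookup group
constexpr int TABLE = 1 << G;   // 16 entries per group

// Precompute, for each group of 4 activations, the partial sum for every
// possible 4-bit weight pattern. This cost is amortized over all output rows,
// which is where the multiply elimination comes from.
static std::vector<int32_t> build_lut(const int8_t *act, int K) {
    std::vector<int32_t> lut((K / G) * TABLE);
    for (int g = 0; g < K / G; ++g) {
        for (int pattern = 0; pattern < TABLE; ++pattern) {
            int32_t s = 0;
            for (int j = 0; j < G; ++j) {
                int8_t a = act[g * G + j];
                s += ((pattern >> j) & 1) ? a : -a;  // bit 1 -> +a, bit 0 -> -a
            }
            lut[g * TABLE + pattern] = s;
        }
    }
    return lut;
}

// One GEMV row: weights are packed one 4-bit group per nibble.
// No multiplications at all -- only table lookups and additions.
static int32_t lut_dot(const uint8_t *packed_w, const std::vector<int32_t> &lut, int K) {
    int32_t acc = 0;
    for (int g = 0; g < K / G; ++g) {
        uint8_t pattern = (packed_w[g / 2] >> ((g & 1) * 4)) & 0xF;
        acc += lut[g * TABLE + pattern];
    }
    return acc;
}

int main() {
    const int K = 8;
    int8_t act[K] = {1, 2, 3, 4, 5, 6, 7, 8};
    // Two weight groups: 0b1010 and 0b1111 (LSB-first within each nibble).
    uint8_t packed_w[1] = {uint8_t(0b1010 | (0b1111 << 4))};
    auto lut = build_lut(act, K);
    // Expected: (-1 + 2 - 3 + 4) + (5 + 6 + 7 + 8) = 28
    printf("dot = %d\n", lut_dot(packed_w, lut, K));
    return 0;
}
```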
@kaleid-liner I'd be interested in end-to-end performance comparisons between T-MAC and the TQ1_0 and TQ2_0 types. These don't use lookup tables, but are significantly faster than the other low-bit types in llama.cpp. I think T-MAC is still faster than these improved types, but I did not yet figure out how to build T-MAC on NixOS (especially regarding TVM), and I don't have machines in common with the ones tested in the profiling data for T-MAC. For example, on an AWS
Note that I've used the
I expect T-MAC to still be faster because of its more optimized memory layout and better tiling compared to these types.
@compilade Thanks! I will compare T-MAC against TQ1_0 and TQ2_0. I also expect T-MAC to still be faster because it requires much less computation.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Any updates?
@sorasoras @compilade Thanks for your close attention! We've just submitted a pull request to integrate it.
Prerequisites
Feature Description
https://arxiv.org/pdf/2407.00088
Answer
T-MAC (Table-based Matrix-Activation Computation) is an innovative method designed to enable efficient deployment of low-bit Large Language Models (LLMs) on edge devices using CPUs. Here are the key aspects of T-MAC:
Purpose: T-MAC addresses the challenge of deploying weight-quantized LLMs on edge devices with limited resources, focusing on efficient mixed-precision matrix multiplication (mpGEMM) without relying on GPUs.
Core Technique: It uses a lookup table (LUT)-based approach to directly support mpGEMM without the need for weight dequantization. This method transforms traditional data-type-centric multiplication into bit-wise table lookup operations.
Performance Improvements:
Up to 4x increase in throughput compared to llama.cpp
70% reduction in energy consumption
For BitNet-b1.58-3B model:
30 tokens/s with a single core on M2-Ultra
71 tokens/s with eight cores on M2-Ultra
11 tokens/s on Raspberry Pi 5
Key Features:
Scales linearly with weight bit-width (see the bit-plane sketch after this summary)
Eliminates multiplications and reduces additions
Supports various activation types (fp8, fp16, int8) using fast table lookup and add instructions
Implementation Techniques:
LUT-centric data layout for efficient on-chip memory usage
Table quantization and mirror consolidation to reduce table size
Utilization of tbl/pshuf instructions for fast table lookup on CPUs
Evaluation:
Tested on various edge devices including Apple M2 Ultra, Jetson AGX Orin, Surface Book 3, and Raspberry Pi 5
Achieved up to 6.6x speedup (average 3.6x) compared to llama.cpp
End-to-end LLM inference speedup of 2.8x for Llama-2-7B-2bit model
Significance: T-MAC provides a practical solution for deploying LLMs on edge devices using widely available CPUs, making LLM inference speed on CPUs comparable or even superior to GPUs on the same devices in some cases.
Availability: The T-MAC system is open-sourced and available on GitHub for further development and implementation.
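To illustrate the "scales linearly with weight bit-width" point referenced in the summary above, here is a sketch under the assumption of unsigned b-bit weights and int8 activations (not the paper's kernel, and the lookup is left scalar for clarity): a b-bit weight row is split into b one-bit planes, each plane is handled by the same 1-bit lookup machinery, and the plane results are combined with shifts, so doubling the bit-width only doubles the lookups.

```cpp
// Sketch of bit-plane decomposition for b-bit weights (illustrative only).
#include <cstdint>
#include <cstdio>

// Dot product of int8 activations with unsigned b-bit weights, computed plane
// by plane. A real kernel would precompute a 16-entry table per activation
// group and index it with tbl/pshuf instead of the inner scalar loop.
static int32_t bitplane_dot(const int8_t *act, const uint8_t *w, int K, int bits) {
    int32_t acc = 0;
    for (int b = 0; b < bits; ++b) {
        int32_t plane = 0;
        for (int k = 0; k < K; ++k) {
            plane += ((w[k] >> b) & 1) ? act[k] : 0;  // 1-bit plane contribution
        }
        acc += plane << b;                            // weight of this bit plane
    }
    return acc;
}

int main() {
    const int K = 4;
    int8_t  act[K] = {1, 2, 3, 4};
    uint8_t w[K]   = {3, 0, 1, 2};  // 2-bit weights
    // Reference: 1*3 + 2*0 + 3*1 + 4*2 = 14
    printf("dot = %d\n", bitplane_dot(act, w, K, /*bits=*/2));
    return 0;
}
```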
Motivation
Looks like a good addition to current Bitnet 1.58bit to speed it up even further
Possible Implementation
https://github.com/microsoft/T-MAC