TODO
fix the bug in async put: the server feedback message channel is shared and does not correctly demultiplex replies to the put/get threads (see the demux sketch after this block)
speed up encoding
reuse the CDF shared memory to hold the lengths during decoding
Add a disk loading option to the configuration file
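
One possible shape for the async put fix above (a minimal sketch only, not the actual LMCache code; the message format and all names here are assumptions): tag each request with an id and demultiplex server feedback into per-request queues, so the put and get threads never consume each other's replies.

    import queue
    import threading

    class FeedbackDemux:
        """Route replies from the shared feedback channel to per-request queues.

        Sketch only: assumes each server reply carries the id of the
        put/get request it answers.
        """

        def __init__(self, recv_fn):
            self._recv_fn = recv_fn      # blocking read on the shared channel
            self._queues = {}            # request id -> queue for its reply
            self._lock = threading.Lock()
            threading.Thread(target=self._pump, daemon=True).start()

        def register(self, request_id):
            q = queue.Queue()
            with self._lock:
                self._queues[request_id] = q
            return q

        def _pump(self):
            while True:
                request_id, payload = self._recv_fn()  # assumed (id, reply) pairs
                with self._lock:
                    q = self._queues.pop(request_id, None)
                if q is not None:
                    q.put(payload)       # wakes exactly the waiting thread

A put or get thread would then call register(request_id) before sending, and block on its own queue instead of reading the shared channel directly.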
------
(functionality) add a serde configuration option (torch/cachegen) to the config file
(deployment) Add torchac_cuda installation to Docker deployment and README
(functionality) Async prefetch during startup
(modeling) Model the throughput threshold for improvement
(refactoring, usability) move the vllm driver code into a separate repo, with better instructions and a better demo
- update the Dockerfile and test the Docker images
(functionality, performance) Non-blocking put implementation -- maybe implement it at the connector level (see the connector sketch after this list)
(performance) Pipeline get and deserialization (see the pipelining sketch after this list)
(functionality) CacheGen GPU compression
(functionality, correctness, tests) work with vllm's prefix caching -- when seq.num_computed_tokens() is not zero
(functionality, usability) instructions to install CacheGen and torchac_cuda
(functionality) graceful close for the connectors
(tests) spin up the redis/lmcache server during testing
(refactoring, tests) Refactor unit tests for lmcache engine
(benchmarking) use pytest-benchmark to benchmark the performance of each sub-component (see the example after this list)
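
For the non-blocking put item, one way it could look at the connector level (a sketch under assumptions: `inner` stands for any existing blocking connector with put(key, value) and get(key) methods, and is not the real LMCache interface):

    from concurrent.futures import ThreadPoolExecutor

    class NonBlockingConnector:
        """Make put() return immediately by delegating to a worker pool.

        Sketch only: a real version would also need locking around
        _pending and error handling for failed puts.
        """

        def __init__(self, inner, max_workers=4):
            self._inner = inner
            self._pool = ThreadPoolExecutor(max_workers=max_workers)
            self._pending = {}           # key -> in-flight put future

        def put(self, key, value):
            # Fire and forget; remember the future so get() can wait on it.
            self._pending[key] = self._pool.submit(self._inner.put, key, value)

        def get(self, key):
            # If a put for this key is still in flight, finish it first so
            # readers never observe a half-written entry.
            fut = self._pending.pop(key, None)
            if fut is not None:
                fut.result()
            return self._inner.get(key)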
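
For the get/deserialization pipelining item, a minimal sketch (fetch and deserialize are hypothetical stand-ins for the network get and the serde step): overlap fetching chunk k+1 with deserializing chunk k.

    from concurrent.futures import ThreadPoolExecutor

    def pipelined_get(keys, fetch, deserialize):
        """Overlap fetching chunk k+1 with deserializing chunk k.

        Sketch only: fetch(key) -> bytes and deserialize(bytes) -> object
        stand in for the real network get and deserialization steps.
        """
        if not keys:
            return
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(fetch, keys[0])
            for next_key in keys[1:]:
                raw = future.result()                  # wait for current chunk
                future = pool.submit(fetch, next_key)  # prefetch the next one
                yield deserialize(raw)                 # deserialize meanwhile
            yield deserialize(future.result())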
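
For the pytest-benchmark item, the fixture-based usage looks like this; serialize_kv is a hypothetical stand-in for whichever sub-component gets measured:

    import torch

    def serialize_kv(tensor):
        # Hypothetical stand-in for the sub-component under test
        # (e.g. one serde path).
        return tensor.numpy().tobytes()

    def test_serialize_kv_speed(benchmark):
        kv = torch.rand(2, 32, 256, 128)   # illustrative KV-cache-like shape
        # pytest-benchmark's `benchmark` fixture runs the callable many
        # times and reports timing statistics per test.
        result = benchmark(serialize_kv, kv)
        assert len(result) > 0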