TODO
fix the bug in async put: the server feedback message channel is shared and does not correctly demultiplex replies to the put/get threads (see the demux sketch after this block)
speed up encoding
reuse the CDF shared memory to hold the lengths during decoding
Add a disk loading option to the configuration file
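
One possible shape for the async put fix above (a minimal sketch only, not the actual LMCache code; the message format and all names here are assumptions): tag each request with an id and demultiplex server feedback into per-request queues, so the put and get threads never consume each other's replies.

    import queue
    import threading

    class FeedbackDemux:
        """Route replies from the shared feedback channel to per-request queues.

        Sketch only: assumes each server reply carries the id of the
        put/get request it answers.
        """

        def __init__(self, recv_fn):
            self._recv_fn = recv_fn      # blocking read on the shared channel
            self._queues = {}            # request id -> queue for its reply
            self._lock = threading.Lock()
            threading.Thread(target=self._pump, daemon=True).start()

        def register(self, request_id):
            q = queue.Queue()
            with self._lock:
                self._queues[request_id] = q
            return q

        def _pump(self):
            while True:
                request_id, payload = self._recv_fn()  # assumed (id, reply) pairs
                with self._lock:
                    q = self._queues.pop(request_id, None)
                if q is not None:
                    q.put(payload)       # wakes exactly the waiting thread

A put or get thread would then call register(request_id) before sending, and block on its own queue instead of reading the shared channel directly.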
------
(functionality) add a serde configuration option (torch/cachegen) to the config file
(deployment) Add torchac_cuda installation to Docker deployment and README
(functionality) Async prefetch during startup
(modeling) Model the throughput threshold for improvement
(refactoring, usability) move the vllm driver code into a separate repo, with better instructions and a better demo
- update the Dockerfile and test the Docker images
(functionality, performance) Non-blocking put implementation -- maybe implement it at the connector level (see the connector sketch after this list)
(performance) Pipeline get and deserialization (see the pipelining sketch after this list)
(functionality) CacheGen GPU compression
(functionality, correctness, tests) work with vllm's prefix caching -- when seq.num_computed_tokens() is not zero
(functionality, usability) instructions to install CacheGen and torchac_cuda
(functionality) graceful close for the connectors
(tests) spin up the redis/lmcache server during testing
(refactoring, tests) Refactor unit tests for lmcache engine
(benchmarking) use pytest-benchmark to benchmark the performance of each sub-component (see the example after this list)
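
For the non-blocking put item, one way it could look at the connector level (a sketch under assumptions: `inner` stands for any existing blocking connector with put(key, value) and get(key) methods, and is not the real LMCache interface):

    from concurrent.futures import ThreadPoolExecutor

    class NonBlockingConnector:
        """Make put() return immediately by delegating to a worker pool.

        Sketch only: a real version would also need locking around
        _pending and error handling for failed puts.
        """

        def __init__(self, inner, max_workers=4):
            self._inner = inner
            self._pool = ThreadPoolExecutor(max_workers=max_workers)
            self._pending = {}           # key -> in-flight put future

        def put(self, key, value):
            # Fire and forget; remember the future so get() can wait on it.
            self._pending[key] = self._pool.submit(self._inner.put, key, value)

        def get(self, key):
            # If a put for this key is still in flight, finish it first so
            # readers never observe a half-written entry.
            fut = self._pending.pop(key, None)
            if fut is not None:
                fut.result()
            return self._inner.get(key)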
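
For the get/deserialization pipelining item, a minimal sketch (fetch and deserialize are hypothetical stand-ins for the network get and the serde step): overlap fetching chunk k+1 with deserializing chunk k.

    from concurrent.futures import ThreadPoolExecutor

    def pipelined_get(keys, fetch, deserialize):
        """Overlap fetching chunk k+1 with deserializing chunk k.

        Sketch only: fetch(key) -> bytes and deserialize(bytes) -> object
        stand in for the real network get and deserialization steps.
        """
        if not keys:
            return
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(fetch, keys[0])
            for next_key in keys[1:]:
                raw = future.result()                  # wait for current chunk
                future = pool.submit(fetch, next_key)  # prefetch the next one
                yield deserialize(raw)                 # deserialize meanwhile
            yield deserialize(future.result())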
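
For the pytest-benchmark item, the fixture-based usage looks like this; serialize_kv is a hypothetical stand-in for whichever sub-component gets measured:

    import torch

    def serialize_kv(tensor):
        # Hypothetical stand-in for the sub-component under test
        # (e.g. one serde path).
        return tensor.numpy().tobytes()

    def test_serialize_kv_speed(benchmark):
        kv = torch.rand(2, 32, 256, 128)   # illustrative KV-cache-like shape
        # pytest-benchmark's `benchmark` fixture runs the callable many
        # times and reports timing statistics per test.
        result = benchmark(serialize_kv, kv)
        assert len(result) > 0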