-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prefix caching. #675
Prefix caching. #675
Conversation
406393f
if (!blocks.size()) { | ||
return nullptr; | ||
} | ||
auto hash_block = std::min_element(std::begin(blocks), std::end(blocks), block_is_less); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we need to store blocks in already sorted manner?
In this case get_lru_block
will take O(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we store blocks in a sorted structure like priority_queue
, get_lru_block()
will be O(1)
but get_block()
by hash will be O(n)
, as we will need to loop over all blocks and check hashes:
KVCacheBlock::Ptr get_block(size_t hash) { |
Currently with hash table get_block()
is O(1)
. We can probably have two structures in evictor both hash table and priority_queue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we can try this optimization.
But maybe let's first perform comparison with vLLM implementation? We can measure how much time this piece of code takes averagely.
|
||
Output schedule(std::vector<SequenceGroup::Ptr>& sequence_groups) { | ||
Output scheduler_output; | ||
|
||
if (m_config.enable_prefix_caching) | ||
_restore_cached_blocks(sequence_groups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to perform it on each iteration? As far as I could remember, vLLM does this prefix caching during creation of sequence groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, moved restoring of blocks to add_request()
.
if (sequence_group->get_num_processed_tokens() == 0) | ||
m_block_manager.allocate(sequence, num_required_blocks, sequence_group->get_prompt_ids()); | ||
else | ||
m_block_manager.append_slots(sequence_group); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need such condition based on get_num_processed_tokens ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append_slots()
is needed, because after blocks restoring we might not need to allocate a new block, this situation happens only if sequence_group->get_num_processed_tokens() > 0
.
But append_slots()
can actually be used in both cases, so I removed the condition and used append_slots()
.
} | ||
const char* data = reinterpret_cast<const char*>(content.data()); | ||
std::size_t size = content.size() * sizeof(content[0]); | ||
return std::hash<std::string_view>{}(std::string_view(data, size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here hash computation for next block will recompute all hashes from previous blocks. Do you think we can optimize it? E.g. only once compute all hashes block by block, where current block relies on hashes of previous blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, changed hash computation in PR https://github.com/openvinotoolkit/openvino.genai/pull/758/files
Now hash is commutated from previous hashes plus tokens of only current block.
Applied comments from #675
Applied comments from openvinotoolkit/openvino.genai#675
Port of #639