Feature Request: Apply LoRA adapters per-request #10377

Closed
4 tasks done
ngxson opened this issue Nov 18, 2024 · 1 comment · Fixed by #10994
Labels
enhancement New feature or request stale

Comments

@ngxson
Collaborator

ngxson commented Nov 18, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The server now supports hot-swapping LoRA adapters via the /lora-adapters endpoint, which changes the global adapter config.
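
For illustration, a request to that endpoint might be built as below. This is a minimal sketch, not a definitive client: the helper name `build_lora_request`, the base URL, and the payload shape (a JSON list of `id`/`scale` objects) are assumptions that should be checked against the server README for your build. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_lora_request(base_url, scales):
    """Build (but do not send) a POST that sets the global adapter scales.

    `scales` maps adapter id -> scale. The payload shape is an assumption.
    """
    payload = [{"id": i, "scale": s} for i, s in sorted(scales.items())]
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/lora-adapters",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server address; adjust to your deployment.
req = build_lora_request("http://localhost:8080", {0: 1.0, 1: 0.0})
# urllib.request.urlopen(req) would send it and change the config for every slot.
```

Because the change is global, it affects every slot at once, which is what makes the timing problem described next matter.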

Because the config is global, the only "safe" moment to apply LoRA changes is when all slots are idle.

However, this is not practical when the server handles a high number of requests (ref: #10374). With continuous batching, all slots are rarely idle at the same time.

Motivation

Possible Implementation

  1. Group only requests that use the same LoRA config into the same batch
  2. Call common_lora_adapters_apply before processing the batch (remember to clear the KV cache if needed)
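
The two steps above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the server's actual code: `Request`, `lora_key`, `group_by_lora`, `run`, and the `apply_adapters` / `process_batch` callbacks are all hypothetical names, with `apply_adapters` standing in for the call to common_lora_adapters_apply (plus any KV clearing).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A LoRA config reduced to a hashable key: sorted (adapter id, scale) pairs.
LoraKey = Tuple[Tuple[int, float], ...]


@dataclass
class Request:  # hypothetical request record
    req_id: int
    lora: Dict[int, float]  # adapter id -> scale


def lora_key(lora: Dict[int, float]) -> LoraKey:
    # Drop zero-scale adapters so equivalent configs compare equal.
    return tuple(sorted((i, s) for i, s in lora.items() if s != 0.0))


def group_by_lora(requests: List[Request]) -> List[Tuple[LoraKey, List[Request]]]:
    # Step 1: requests with the same LoRA config land in the same batch.
    groups: Dict[LoraKey, List[Request]] = {}
    for r in requests:
        groups.setdefault(lora_key(r.lora), []).append(r)
    return list(groups.items())


def run(requests: List[Request],
        apply_adapters: Callable[[LoraKey], None],
        process_batch: Callable[[List[Request]], None]) -> None:
    current = None
    for key, batch in group_by_lora(requests):
        # Step 2: re-apply adapters only when the config actually changes.
        if key != current:
            apply_adapters(key)
            current = key
        process_batch(batch)


reqs = [Request(1, {0: 1.0}), Request(2, {1: 0.5}), Request(3, {0: 1.0, 1: 0.0})]
applied, processed = [], []
run(reqs, applied.append, lambda b: processed.append([r.req_id for r in b]))
```

In this sketch requests 1 and 3 share a batch because a zero-scale adapter is treated as absent; whether the real server should normalize configs that way is a design choice, not something the issue specifies.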
@ngxson ngxson added the enhancement New feature or request label Nov 18, 2024
@michaellin99999

I think there needs to be another way.
It is odd to apply a LoRA swap only when the server is idle; the swap is only meaningful when actual users request it, e.g. "summarize this for me", "calculate this for me", etc. The need to swap adapters arises at the moment of the request. It is not possible to predict when users will need the swap, so the better way is to have the swap happen when they need it.
This functionality is critical, especially for small models that have to fit multiple use cases.

2 participants