Enable multiple LoRa adapters #2010
Conversation
Hey guys, since I switched to lorax and started contributing there a lot after the first license change, I'm happy to see this PR opened. I'd be glad if you're open to some questions and discussion about this, and I'd be happy to contribute here too.
hi @flozi00 thanks for the feedback! can you share more about the lorax style api? I see that in lorax you can specify the adapter via the …
Yes, I mean the "adapter_id" inside "parameters" for the TGI API (as you did, I see now), and the "model" in the OpenAI API :)
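For reference, a rough sketch of the two request shapes being discussed, using a placeholder adapter id; the field names follow the comment above, everything else is an assumption:

```python
# Sketch of the two payload shapes discussed above (placeholder adapter id).

# TGI-style API: the adapter is selected via "adapter_id" inside "parameters".
tgi_payload = {
    "inputs": "What is deep learning?",
    "parameters": {"adapter_id": "some-org/some-lora-adapter", "max_new_tokens": 40},
}

# OpenAI-style API: the adapter is selected via the "model" field.
openai_payload = {
    "model": "some-org/some-lora-adapter",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
}
```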
update: This PR's implementation has been updated to align with the great work done by the lorax team. The implementation tries to use the same layers where possible and only diverges to work with TGI's recent updates/improvements, and it limits LoRA loading to startup. The current changes allow weights to be loaded similarly to lorax; however, there are still issues with generation to be resolved, along with other refactors.
Looks like you are successfully adopting the lorax code.
@flozi00 generation with loras is mostly stable, just focusing on the rebase and then refactors now. And thank you 🙂 a review once the PR is ready would be super helpful!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the shoutout in the docs! It's quite interesting to see things come full circle; maybe we should chat about merging our projects.
of course @tgaddair, thank you for the awesome work! That's an interesting idea, and we are always aiming to improve TGI. We appreciate any contributions/discussions about features that may be helpful to our users.
I'd love to migrate to TGI again 👍 And of course I'll try to contribute here too @tgaddair
hi @xiadingZ in this PR lora adapters are loaded from the … Once this initial lora work is merged we'll follow up with other improvements, such as easier ways to specify the lora path, etc.
Hi @drbh, I can try your methods with a downloaded lora, but I have a lora adapter trained locally. It doesn't have a directory structure such as … I set HUGGINGFACE_HUB_CACHE as …
Review comments (outdated, resolved) on server/text_generation_server/models/custom_modeling/flash_llama_modeling.py
Forgot to add: we probably want an integration test as well.
Thanks for all the changes! Looks ready to merge to me after the small nit that breaks CI is fixed.
@danieldk thanks for the review! I've fixed the nits and CI passes. Going to go ahead and merge based on your last approval.
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md Fixed a typo
* Update lora.md Fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data

Co-authored-by: Derek <[email protected]>
This PR is a work in progress to add support for multiple LoRA adapters to be loaded at startup; each request can then use zero or one adapter by specifying the adapter id.
Example usage
download adapter without auto merging
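The original command is not preserved above. As a minimal sketch, assuming the adapter lives on the Hugging Face Hub (the repo id below is a placeholder), `snapshot_download` fetches the adapter files into the local cache without merging them into the base model:

```python
# Sketch: pre-fetch a LoRA adapter into the local Hugging Face cache.
# The repo id is a placeholder; snapshot_download copies the adapter files
# as-is and does not merge them into the base model weights.
from huggingface_hub import snapshot_download

adapter_dir = snapshot_download(repo_id="some-org/some-lora-adapter")
print(f"adapter cached at: {adapter_dir}")
```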
start server with multiple LoRA adapters
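The original launch command is also not preserved. A rough sketch, assuming the launcher flag for this feature is `--lora-adapters` (as in TGI's LoRA docs; verify with `text-generation-launcher --help`) and using placeholder model and adapter ids:

```python
# Sketch: launch the server with a base model and two LoRA adapters loaded
# at startup. Flag name and ids are assumptions; adjust to your setup.
# Runs in the foreground until the server exits.
import subprocess

subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "mistralai/Mistral-7B-v0.1",                    # placeholder base model
        "--lora-adapters", "some-org/adapter-a,some-org/adapter-b",   # placeholder adapter ids
        "--port", "8080",
    ],
    check=True,
)
```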
sending request without adapter id
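A sketch of a request that omits `adapter_id`, so the base model handles generation (local server URL assumed):

```python
# Sketch: plain /generate request; no adapter_id, so the base model is used.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 40}},
)
print(resp.json())
```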
with first LoRA adapter specified
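Same request, but with `adapter_id` set inside `parameters` to the first (placeholder) adapter:

```python
# Sketch: select the first LoRA adapter via parameters.adapter_id (placeholder id).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 40, "adapter_id": "some-org/adapter-a"},
    },
)
print(resp.json())
```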
with second LoRA adapter specified
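And the same call again, switching `adapter_id` to the second (placeholder) adapter:

```python
# Sketch: same request, now routed to the second LoRA adapter (placeholder id).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 40, "adapter_id": "some-org/adapter-b"},
    },
)
print(resp.json())
```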