Replies: 1 comment
-
@edwardjhu @microsoftopensource Could anyone please help with my question?
-
Context:
In our company, we have a requirement for a sequence classification model. We currently use finetuned DistilBERT models, one specific to each client. Hence, assuming we have 50 clients, we have a FastAPI microservice which:
Proposed Change:
After experiments with LoRA on larger language models (e.g. DeBERTa), we find that performance is comparable to the baseline (finetuned DistilBERT) for each client. This is appealing because, instead of a separate finetuned model per client, we can use a single large base model with a LoRA adapter for each client.
Our understanding is that a LoRA adapter is roughly 10x smaller than a finetuned DistilBERT model, leading to a large space saving. LoRA also allows an adapter to be attached to the base model at inference time (i.e. per API call). This should let us load only one base model into CPU memory at microservice initialization instead of 50 DistilBERTs, saving CPU and other resources. The time taken to attach a LoRA adapter to the base model was measured at under 0.02 seconds (manageable).
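To illustrate the pattern being proposed, here is a minimal dependency-free sketch of one shared base weight with per-client low-rank adapters (LoRA's delta_W = B @ A, rank r << d). The class and method names (`SharedBaseModel`, `register_adapter`, `forward`) are illustrative assumptions, not the PEFT or transformers API; a real service would use the library's adapter-loading calls instead.

```python
# Hypothetical sketch: one base weight shared across all clients,
# with each client contributing only a small low-rank (B, A) pair.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

class SharedBaseModel:
    """Base weight W loaded once; per-client adapters stored separately."""
    def __init__(self, W):
        self.W = W              # d_out x d_in, shared by every client
        self.adapters = {}      # client_id -> (B, A, scale)

    def register_adapter(self, client_id, B, A, alpha=1.0):
        r = len(A)              # rank of the low-rank update
        self.adapters[client_id] = (B, A, alpha / r)

    def forward(self, client_id, x):
        """Apply (W + scale * B @ A) @ x for the requested client."""
        B, A, scale = self.adapters[client_id]
        delta = matmul(B, A)    # d_out x d_in, but rank <= r
        W_eff = [[self.W[i][j] + scale * delta[i][j]
                  for j in range(len(self.W[0]))]
                 for i in range(len(self.W))]
        return [sum(W_eff[i][j] * x[j] for j in range(len(x)))
                for i in range(len(W_eff))]
```

The point of the sketch is the memory arithmetic: only `W` is held once at startup, while each client adds just `B` and `A` (2 * d * r numbers instead of d * d), and switching clients is a dictionary lookup plus a cheap low-rank merge per request.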
Question:
Thank you in advance for help provided.