In supporting concurrent requests, we won't at first assume concurrent inference capability at the model-weight level, so each loaded LLM will need to be guarded by a mutex. Since we'll want control over LLM supervision anyway, we might as well support hosting multiple LLM types at once.
This will probably require some sort of config, e.g. in TOML. For example, to mount two instances of Meta Llama and one of Mistral Nemo:
```toml
[llm]
# Nickname to HF repo or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
```
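A minimal sketch of how that config could be consumed, assuming Python with `mlx_lm` as the loading backend; `load_models` is a hypothetical helper, not an existing API:

```python
import tomllib           # Python 3.11+; use the tomli package on older versions
from mlx_lm import load  # assumes mlx_lm is the loading backend

def load_models(config_path: str) -> dict:
    """Load every model listed under [llm] into memory, keyed by nickname."""
    with open(config_path, "rb") as f:
        config = tomllib.load(f)

    instances = {}
    for nickname, hf_or_local_path in config["llm"].items():
        # Each entry becomes one resident (model, tokenizer) pair,
        # even if two nicknames point at the same repo.
        model, tokenizer = load(hf_or_local_path)
        instances[nickname] = (model, tokenizer)
    return instances
```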
Each instance would be resident in memory, so available RAM is a natural limit on how many can be mounted. The client would request a model by type, and the model supervisor would pick the first available instance of that type.
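One way to sketch that "first available" selection is a mutex per resident instance, with the requested model type matched against the instances' paths. This is only an illustrative shape, not a settled design; the class and method names are hypothetical:

```python
import threading
from dataclasses import dataclass, field

@dataclass
class Instance:
    nickname: str
    model_path: str             # HF repo or local path from the config
    model: object = None        # resident weights, e.g. from mlx_lm.load
    lock: threading.Lock = field(default_factory=threading.Lock)

class ModelSupervisor:
    """Routes a request to the first un-busy instance of the requested model type."""

    def __init__(self, instances: list[Instance]):
        self.instances = instances

    def acquire(self, model_path: str) -> Instance | None:
        # Non-blocking scan: take the first instance of this model type
        # whose mutex is free; return None if all are busy.
        for inst in self.instances:
            if inst.model_path == model_path and inst.lock.acquire(blocking=False):
                return inst
        return None

    def release(self, inst: Instance) -> None:
        inst.lock.release()
```

Under that sketch, a request for `mlx-community/Meta-Llama-3.1-8B-Instruct-8bit` would land on `llama3-8B-8bit-1` if it is idle, fall back to `llama3-8B-8bit-2`, and otherwise queue or be rejected, depending on the policy we choose.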
Of interest, to understand the expected deployment scenarios: