Process pool for multiple loaded LLMs, and a queuing system from the FastAPI/uvicorn workers #17

Open
uogbuji opened this issue Aug 6, 2024 · 0 comments
uogbuji commented Aug 6, 2024

In supporting concurrent requests, we won't at first assume any concurrent inference capability at the model-weights level, which means we'll have to mutex the LLMs. We'll want control over the LLM supervision anyway, in which case we might as well support hosting multiple LLM types at once.
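
For illustration, a minimal in-process sketch of that mutexing, assuming an asyncio-based server (e.g. under uvicorn). `generate_text` is a hypothetical stand-in for the real inference call; an actual process pool would wrap the same idea at the process level rather than with an in-process lock.

```python
import asyncio

def generate_text(model, prompt: str) -> str:
    '''Hypothetical placeholder for the real (blocking) inference call.'''
    return f'[{model}] {prompt}'

class MutexedLLM:
    '''Wrap one loaded model so only one request runs inference at a time.'''
    def __init__(self, model):
        self.model = model
        self.lock = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        # Concurrent callers queue on the lock until the model is free
        async with self.lock:
            # Run the blocking inference off the event loop thread
            return await asyncio.to_thread(generate_text, self.model, prompt)
```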

This will probably require some sort of config, likely TOML. For example, to mount two instances of Meta Llama and one of Mistral Nemo:

[llm]

# Nickname to HF or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
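
A sketch of how such a config might be consumed, assuming Python 3.11+ for `tomllib` and that `mlx_lm.load` ends up being the loading layer (both are assumptions, not settled choices):

```python
import tomllib  # Python 3.11+; tomli is a drop-in substitute on older versions

from mlx_lm import load  # assumption: mlx_lm is the model-loading layer

def load_llms(config_path: str) -> dict:
    '''Load every configured instance; each nickname gets its own copy resident in memory.'''
    with open(config_path, 'rb') as fp:
        config = tomllib.load(fp)

    instances = {}
    for nickname, path in config['llm'].items():
        model, tokenizer = load(path)  # HF repo ID or local path, per the config comment
        instances[nickname] = (model, tokenizer)
    return instances
```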

Each instance would be resident in memory, so that's a natural limit on how many can be mounted. The client would request a model by type, and the model supervisor would pick the first available instance of that type.
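
A rough sketch of that "first available" selection, under the same assumptions: each resident instance carries its own lock, and the supervisor hands a request to the first unlocked instance of the requested model type, so callers effectively queue when every instance is busy. The FastAPI endpoint, request shape, and names are illustrative only, not a committed API.

```python
import asyncio

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

class LLMInstance:
    '''One resident copy of a model, guarded by its own lock.'''
    def __init__(self, nickname: str, model_type: str):
        self.nickname = nickname
        self.model_type = model_type  # e.g. the HF path from the config
        self.lock = asyncio.Lock()

class ModelSupervisor:
    '''Hands each request to the first free instance of the requested model type.'''
    def __init__(self, instances: list[LLMInstance]):
        self.instances = instances

    async def acquire(self, model_type: str) -> LLMInstance:
        candidates = [i for i in self.instances if i.model_type == model_type]
        if not candidates:
            raise KeyError(model_type)
        while True:
            # Grab any instance that is free right now
            for inst in candidates:
                if not inst.lock.locked():
                    await inst.lock.acquire()
                    return inst
            # All busy: wait briefly and retry (a condition variable or a per-type
            # asyncio.Queue of instances would avoid this polling loop)
            await asyncio.sleep(0.05)

app = FastAPI()

# In practice these would be built from the loaded TOML config above
supervisor = ModelSupervisor([
    LLMInstance('llama3-8B-8bit-1', 'mlx-community/Meta-Llama-3.1-8B-Instruct-8bit'),
    LLMInstance('llama3-8B-8bit-2', 'mlx-community/Meta-Llama-3.1-8B-Instruct-8bit'),
    LLMInstance('mistral-nemo-8bit', 'mlx-community/Mistral-Nemo-Instruct-2407-8bit'),
])

class CompletionRequest(BaseModel):
    model: str  # model type (HF path or local path), not a nickname
    prompt: str

@app.post('/complete')  # hypothetical endpoint, for illustration only
async def complete(req: CompletionRequest):
    try:
        inst = await supervisor.acquire(req.model)
    except KeyError:
        raise HTTPException(status_code=404, detail='Unknown model type')
    try:
        # Placeholder: run the actual (blocking) inference off the event loop,
        # e.g. via asyncio.to_thread, against inst's loaded weights
        return {'served_by': inst.nickname, 'text': f'(stub completion of {req.prompt!r})'}
    finally:
        inst.lock.release()
```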

Of interest, to understand the expected deployment scenarios:

uogbuji self-assigned this Aug 6, 2024