Process pool for multiple loaded LLMs, and a queuing system from the FastAPI/uvicorn workers #17

Open
uogbuji opened this issue Aug 6, 2024 · 0 comments
uogbuji commented Aug 6, 2024

In supporting concurrent requests, we won't at first assume any concurrent inference capability at the model-weights level, which means we'll have to mutex the LLMs. We'll want control over the LLM supervision anyway, in which case we might as well support hosting multiple LLM types at once.
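
For illustration, a minimal in-process sketch of that mutexing, assuming an asyncio-based server (e.g. under uvicorn). `generate_text` is a hypothetical stand-in for the real inference call; an actual process pool would wrap the same idea at the process level rather than with an in-process lock.

```python
import asyncio

def generate_text(model, prompt: str) -> str:
    '''Hypothetical placeholder for the real (blocking) inference call.'''
    return f'[{model}] {prompt}'

class MutexedLLM:
    '''Wrap one loaded model so only one request runs inference at a time.'''
    def __init__(self, model):
        self.model = model
        self.lock = asyncio.Lock()

    async def generate(self, prompt: str) -> str:
        # Concurrent callers queue on the lock until the model is free
        async with self.lock:
            # Run the blocking inference off the event loop thread
            return await asyncio.to_thread(generate_text, self.model, prompt)
```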

This will probably require some sort of config, likely TOML. For example, to mount two instances of Meta Llama and one of Mistral Nemo:

[llm]

# Nickname to HF or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
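
A sketch of how such a config might be consumed, assuming Python 3.11+ for `tomllib` and that `mlx_lm.load` ends up being the loading layer (both are assumptions, not settled choices):

```python
import tomllib  # Python 3.11+; tomli is a drop-in substitute on older versions

from mlx_lm import load  # assumption: mlx_lm is the model-loading layer

def load_llms(config_path: str) -> dict:
    '''Load every configured instance; each nickname gets its own copy resident in memory.'''
    with open(config_path, 'rb') as fp:
        config = tomllib.load(fp)

    instances = {}
    for nickname, path in config['llm'].items():
        model, tokenizer = load(path)  # HF repo ID or local path, per the config comment
        instances[nickname] = (model, tokenizer)
    return instances
```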

Each instance would be resident in memory, so that's a natural limit on how many can be mounted. The client would request a model by type, and the model supervisor would pick the first available instance of that type.
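
A rough sketch of that "first available" selection, under the same assumptions: each resident instance carries its own lock, and the supervisor hands a request to the first unlocked instance of the requested model type, so callers effectively queue when every instance is busy. The FastAPI endpoint, request shape, and names are illustrative only, not a committed API.

```python
import asyncio

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

class LLMInstance:
    '''One resident copy of a model, guarded by its own lock.'''
    def __init__(self, nickname: str, model_type: str):
        self.nickname = nickname
        self.model_type = model_type  # e.g. the HF path from the config
        self.lock = asyncio.Lock()

class ModelSupervisor:
    '''Hands each request to the first free instance of the requested model type.'''
    def __init__(self, instances: list[LLMInstance]):
        self.instances = instances

    async def acquire(self, model_type: str) -> LLMInstance:
        candidates = [i for i in self.instances if i.model_type == model_type]
        if not candidates:
            raise KeyError(model_type)
        while True:
            # Grab any instance that is free right now
            for inst in candidates:
                if not inst.lock.locked():
                    await inst.lock.acquire()
                    return inst
            # All busy: wait briefly and retry (a condition variable or a per-type
            # asyncio.Queue of instances would avoid this polling loop)
            await asyncio.sleep(0.05)

app = FastAPI()

# In practice these would be built from the loaded TOML config above
supervisor = ModelSupervisor([
    LLMInstance('llama3-8B-8bit-1', 'mlx-community/Meta-Llama-3.1-8B-Instruct-8bit'),
    LLMInstance('llama3-8B-8bit-2', 'mlx-community/Meta-Llama-3.1-8B-Instruct-8bit'),
    LLMInstance('mistral-nemo-8bit', 'mlx-community/Mistral-Nemo-Instruct-2407-8bit'),
])

class CompletionRequest(BaseModel):
    model: str  # model type (HF path or local path), not a nickname
    prompt: str

@app.post('/complete')  # hypothetical endpoint, for illustration only
async def complete(req: CompletionRequest):
    try:
        inst = await supervisor.acquire(req.model)
    except KeyError:
        raise HTTPException(status_code=404, detail='Unknown model type')
    try:
        # Placeholder: run the actual (blocking) inference off the event loop,
        # e.g. via asyncio.to_thread, against inst's loaded weights
        return {'served_by': inst.nickname, 'text': f'(stub completion of {req.prompt!r})'}
    finally:
        inst.lock.release()
```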

Of interest, to understand the expected deployment scenarios:

uogbuji self-assigned this Aug 6, 2024