Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Loading model from S3 using RunAI Model Streamer as optional loader #10192

Merged
merged 17 commits into from
Dec 20, 2024

Conversation

omer-dayan
Copy link
Contributor

@omer-dayan omer-dayan commented Nov 10, 2024

The following PR is adding an option to load a model from S3 using Runai Model Streamer as a loader option, as well as from other storage options.

The RunAI Model Streamer is an open source model loader, that is able to stream tensors from any storage (NFS / Local dir / S3 / Object store) with concurrency (https://github.com/run-ai/runai-model-streamer).

Performance benchmarks:
image

Further reading can be found here: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf

In this PR we have made the following changes:

  1. Added a new option for --load-format flag - runai_streamer (+ Help description)
  2. When using the runai_streamer vLLM will load the model using RunaiModelStreamerLoader
  3. The RunaiModelStreamerLoader is working only with Safetensors files
  4. The RunaiModelStreamerLoader can be initialized with tunable parameters (Concurrency, and CPU memory limit)
  5. Lazy load of runai-model-streamer package
  6. Add runai-model-streamer to requirements
  7. For config.json and tokenizer files we pull the model (No weights files) from S3 to a temporary memory fs backed directory under /dev/shm
  8. Added documentation of how to use it

After this PR, given a directory on AWS S3 with the model files:
image

One can run the following command:
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
(Authorization to the S3 endpoint is done through regular AWS S3 authorization mechanism - ~/.aws/credentials / env var / etc)

The tensors will be streamed directly from the S3 into the GPU memory, without going to the storage

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 10, 2024
@omer-dayan omer-dayan changed the title Add RunAI Model Streamer as optional loader. [Core] Add RunAI Model Streamer as optional loader. Nov 10, 2024
@omer-dayan omer-dayan changed the title [Core] Add RunAI Model Streamer as optional loader. [Core] Add RunAI Model Streamer as optional loader Nov 11, 2024
Copy link

mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@omer-dayan omer-dayan changed the title [Core] Add RunAI Model Streamer as optional loader [Core] Loading model from S3 using RunAI Model Streamer as optional loader Nov 14, 2024
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from a8d45b4 to 0e5335d Compare November 14, 2024 09:14
Add it to the docs as well

Signed-off-by: OmerD <[email protected]>
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 6 times, most recently from 29aa9d8 to 2a9c864 Compare November 15, 2024 14:42
Signed-off-by: OmerD <[email protected]>
Copy link

mergify bot commented Nov 15, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 15, 2024
@simon-mo
Copy link
Collaborator

@pandyamarut can you help review this PR?

@mergify mergify bot removed the needs-rebase label Nov 16, 2024
Copy link

mergify bot commented Nov 17, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @omer-dayan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 17, 2024
vllm/config.py Outdated
@@ -191,6 +192,18 @@ def __init__(
f"'Please instead use `--hf-overrides '{hf_override!r}'`")
warnings.warn(DeprecationWarning(msg), stacklevel=2)

if is_s3(model):
Copy link
Contributor Author

@omer-dayan omer-dayan Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to weights files, to load model we need config.json and tokeinzer files.
If they are stored in s3 as well, the program needs to read it from there. Which before this change, is not possible.

There are 3 options for implementation for this:

image
(This image shows in high level the relevant chain of calls in the code)

Implement it like option 1, in the vllm/config.json means a single place of change, small as possible (Current implementation).
Option 2 means implement it in 2 files, seperately, in the tokeinzer and the config file. One may argue its prefered way, as model scope hub integrration is in this layer.
Option 3 means, lets implement it in HuggingFace library, lets make it able to get path from s3 bucket, and read the content from it.

In my opinion option 3 is the most transparent one and the most right one, however they are not interested in expanding their support (huggingface/transformers#19834 (comment))

WDYT?

@mergify mergify bot removed the needs-rebase label Dec 17, 2024
@kouroshHakha
Copy link

@simon-mo is this good to be merged?

@YaliEkstein
Copy link

Autoscaling llms could be a whole lot better with this addition! I'm happy to see this PR moving forward.

@simon-mo
Copy link
Collaborator

Looking into this PR now, a quick question, how does this work with a model on Huggingface Hub? Does the user need to manually mirror it to S3?

Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thanks for the work!

vllm/config.py Outdated Show resolved Hide resolved
vllm/transformers_utils/s3_utils.py Show resolved Hide resolved
vllm/transformers_utils/s3_utils.py Outdated Show resolved Hide resolved
vllm/model_executor/model_loader/loader.py Outdated Show resolved Hide resolved
vllm/model_executor/model_loader/loader.py Show resolved Hide resolved
@comaniac
Copy link
Collaborator

Looking into this PR now, a quick question, how does this work with a model on Huggingface Hub? Does the user need to manually mirror it to S3?

Looks like it will fallback to the current model loader if the model name is in HF format (org/model).

@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from 36c3f32 to 30af43e Compare December 19, 2024 18:08
@omer-dayan
Copy link
Contributor Author

@simon-mo

Looking into this PR now, a quick question, how does this work with a model on Huggingface Hub? Does the user need to manually mirror it to S3?

No, in case the model is in HuggingFace, we download it locally, basically we fallback to the default behavior as @comaniac said.

@simon-mo
Copy link
Collaborator

@omer-dayan out of curiosity, do you think it's possible to implement something to direct read from the hub in streaming fashion? is there limitations around this?

Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved to unblock. My main comment is about docstring. Others LGTM.

vllm/transformers_utils/s3_utils.py Show resolved Hide resolved
vllm/config.py Show resolved Hide resolved
docs/source/serving/runai_model_streamer.rst Outdated Show resolved Hide resolved
docs/source/serving/runai_model_streamer.rst Outdated Show resolved Hide resolved
@omer-dayan omer-dayan force-pushed the omer/run-loader branch 2 times, most recently from 2c2b9f2 to e5fae51 Compare December 19, 2024 21:28
Signed-off-by: OmerD <[email protected]>
@comaniac
Copy link
Collaborator

@simon-mo leave to you

@omer-dayan
Copy link
Contributor Author

@comaniac Thanks a lot on the review!

@simon-mo

out of curiosity, do you think it's possible to implement something to direct read from the hub in streaming fashion? is there limitations around this?

Technically I dont see a reason why not.
Although downloading the weights from HuggingFace Hub is not a good practice for production, where you would look for better loading time.
I mean the first thing one would do in order to improve loading time is making sure he is not coupled to the World Wide Web, and the weights are stored closely.

However, notice that HuggingFace Hub is just a git server, and every model is a git repository.
I do think that a good solution would be to stream, like we do in this PR, from any git repo.
That way we would get implicitly "stream from HuggingFace Hub without the need of filesystem" + In production people can store their model on a close git servers

@simon-mo simon-mo enabled auto-merge (squash) December 19, 2024 23:32
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 19, 2024
Copy link
Member

@mgoin mgoin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well, thanks for iterating and making the deps optional

@simon-mo simon-mo merged commit 995f562 into vllm-project:main Dec 20, 2024
76 checks passed
lucas-tucker pushed a commit to lucas-tucker/vllm-lucas-tucker that referenced this pull request Dec 21, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
@rnoons
Copy link

rnoons commented Jan 7, 2025

@omer-dayan I actually cutover from s3 mounths to RunAI streamer. Everything is working well with llama models, but deepseek lite coder v2 doesn't seem to work. Complains about not beeing able to find deepseek.py, but works just fine using safetensors with an s3 mount.

I was going to file an issue, but not immediately sure if the issue resideds on the openai side or on the runai side?

@huaxuan250
Copy link

@omer-dayan RunAI streamer works well for me when I use only 1 gpu through --tensor-parallel-size 1. However, when the number is more than 1, it failed because of the ERROR 01-07 20:44:51 engine.py:366] _pickle.PicklingError: Can't pickle <class 'botocore.client.S3'>: attribute lookup S3 on botocore.client failed.

I am not sure if there is something on my end that is problematic.

This is the issue: #11819

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.