Fix Windows and onnx dtype compatibility #1886
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 2c8c843 to 949743e (… in torch format before)
@@ -852,6 +863,39 @@ def raise_on_numpy_input_io_binding(self, use_torch: bool):
                " with model.use_io_binding = False, or pass torch.Tensor inputs instead."
            )

    def _prepare_onnx_inputs(
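For context, a minimal sketch of what an input-preparation helper along these lines could look like. The function name mirrors the diff, but the dtype table and logic here are illustrative assumptions, not the actual optimum implementation:

```python
import numpy as np
import torch

# Minimal mapping from ONNX Runtime input type strings to numpy dtypes
# (only the cases relevant to this PR).
ORT_TO_NP = {
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(bool)": np.bool_,
}


def prepare_onnx_inputs(session, inputs):
    """Cast user-provided inputs to the dtypes declared by the ONNX graph."""
    expected = {i.name: ORT_TO_NP[i.type] for i in session.get_inputs() if i.type in ORT_TO_NP}
    onnx_inputs = {}
    for name, value in inputs.items():
        if isinstance(value, torch.Tensor):
            value = value.cpu().numpy()
        # Only cast when the dtype actually differs, e.g. int32 input_ids from a
        # tokenizer on Windows vs. the int64 the exported model expects.
        if name in expected and value.dtype != expected[name]:
            value = value.astype(expected[name])
        onnx_inputs[name] = value
    return onnx_inputs
```

Since the cast is conditional, inputs that already have the expected dtype pass through untouched.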
Philipp did not like that kind of dynamicity with performance in mind, but I have no opinion; it sounds fine to me.
It only applies when needed, so performance-wise I think there's no added overhead compared to what used to be done.
Where things can actually be optimized, and what I think can be viewed as the optimization-oriented path, is I/O binding (see the sketch after this list):
- pre-allocation of output buffers before the forward pass rather than during it, with dynamic resizing of the buffers when the batch size changes.
- decoder models with I/O binding, where a static cache implementation can be used as output buffers to reduce the overhead of their creation.
- decoder models' input/output synchronization, which could happen before and after the generation loop instead of on each forward call (if that's possible).
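A rough sketch of the first point: reusing a pre-allocated output buffer across forward calls with ONNX Runtime I/O binding, reallocating only when the required shape changes. The model path, input/output names, and vocab size are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession("decoder_model.onnx", providers=["CUDAExecutionProvider"])
io_binding = session.io_binding()
logits_buffer = None


def forward(input_ids: torch.Tensor, vocab_size: int = 32000) -> torch.Tensor:
    global logits_buffer
    batch_size, seq_len = input_ids.shape
    needed_shape = (batch_size, seq_len, vocab_size)
    # Allocate the output buffer once; resize only when the batch/sequence size changes.
    if logits_buffer is None or tuple(logits_buffer.shape) != needed_shape:
        logits_buffer = torch.empty(needed_shape, dtype=torch.float32, device="cuda")

    input_ids = input_ids.to(device="cuda", dtype=torch.int64).contiguous()
    io_binding.bind_input(
        name="input_ids",
        device_type="cuda",
        device_id=0,
        element_type=np.int64,
        shape=tuple(input_ids.shape),
        buffer_ptr=input_ids.data_ptr(),
    )
    io_binding.bind_output(
        name="logits",
        device_type="cuda",
        device_id=0,
        element_type=np.float32,
        shape=needed_shape,
        buffer_ptr=logits_buffer.data_ptr(),
    )
    session.run_with_iobinding(io_binding)
    return logits_buffer
```

Binding the output to a torch tensor that already lives on the device avoids an allocation and a device-to-host copy on every call.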
I kind of agree; I'm just not sure what each ORTModelForXXX is for then.
FWIW, NumPy 2.0 has just been released, and the default int size on Windows is now 64-bit! https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer
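A quick way to see the platform default (assuming a 64-bit interpreter):

```python
import numpy as np

# Before NumPy 2.0 the default integer follows the C "long" type, which is
# 32-bit on Windows; NumPy 2.0 switches the Windows default to 64-bit.
print(np.array([1, 2, 3]).dtype)  # int32 on Windows with NumPy < 2.0, int64 otherwise
print(np.dtype(int))              # the default integer dtype used for Python ints
```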
What does this PR do?
Fixes multiple Windows-specific issues:
- NumPy's default integer type is int64 on Linux/macOS, but on Windows it is int32. This results in tokenizers returning input ids and attention masks in a different format than torch's default, which is int64/long (see the illustration below).
- The seq2seq decoder also used to return the past KV cache in torch format all the time.
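To illustrate the first issue, a hypothetical reproduction (the checkpoint name is arbitrary): tokenizers that return NumPy arrays inherit NumPy's default integer type, so on Windows with NumPy < 2.0 they can produce int32 where the exported ONNX graph declares int64 inputs.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("hello world", return_tensors="np")
print(encoded["input_ids"].dtype)  # int32 on Windows with NumPy 1.x, int64 elsewhere

# Casting to the dtype the model declares resolves the mismatch before inference.
onnx_ready = {name: arr.astype(np.int64) for name, arr in encoded.items()}
```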
Before submitting
Who can review?