Fix causal_lm cpp demo for llama architecture #71
As far as I remember, chatglm has a different index for the batch dimension:
https://github.com/openvinotoolkit/openvino.genai/pull/48/files#diff-bceea0d0f5f31cbb280e385123d098ecddc2ee4bcc57a87b6789bce72a8b481cR85
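For reference, a minimal sketch (mine, not code from this PR) of where that index shows up when a stateless demo pre-allocates empty KV-cache inputs. The layouts and the helper name are assumptions: llama-style caches are often `[batch, n_heads, seq_len, head_size]`, while chatglm reportedly keeps batch at another index.

```cpp
// Sketch only: build an empty KV-cache tensor for a stateless model where
// the batch dimension's index depends on the architecture (assumption:
// 0 for llama, a different index for chatglm).
#include <openvino/openvino.hpp>

ov::Tensor empty_kv_cache(const ov::PartialShape& cache_shape,
                          ov::element::Type type,
                          size_t batch_dim,   // 0 for llama; assumed different for chatglm
                          size_t batch_size) {
    ov::Shape shape(cache_shape.rank().get_length(), 0);
    for (size_t dim = 0; dim < shape.size(); ++dim) {
        // Dynamic dims (batch, past sequence length) start at 0; static dims
        // (heads, head size) keep their values.
        if (cache_shape[dim].is_static())
            shape[dim] = cache_shape[dim].get_length();
    }
    shape[batch_dim] = batch_size;  // the only architecture-dependent line
    return ov::Tensor{type, shape};
}
```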
Yes, this PR is only for llama; chatglm support will be added in another PR.
Apparently, exporting a model as stateful removes any difference between the dimensions. I was able to run chatglm2-6b and chatglm3-6b using Wovchena#8 and get meaningful output. Here's the PR updating convert.py: #52. My next step is to rewrite this causal_lm as stateful as well, so we can decide whether we need the stateless approach at all.
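To illustrate, a minimal sketch of the loop I have in mind, assuming a stateful export; the model path and token ids are placeholders, and a real export may also expect `attention_mask`/`position_ids` inputs:

```cpp
// Sketch, assuming a stateful export: the KV-cache lives in the model's
// internal ReadValue/Assign state, so the application never creates or
// reshapes past_key_values tensors, whatever the architecture's layout is.
#include <openvino/openvino.hpp>

#include <cstdint>
#include <vector>

int main() {
    ov::Core core;
    ov::InferRequest request =
        core.compile_model("openvino_model.xml", "CPU").create_infer_request();

    std::vector<int64_t> prompt{1, 2, 3};  // placeholder token ids
    request.set_tensor("input_ids",
                       ov::Tensor{ov::element::i64, {1, prompt.size()}, prompt.data()});
    request.infer();
    // Subsequent iterations feed one token at a time; the cache grows
    // inside the model.

    request.reset_state();  // drop the internal KV-cache before a new prompt
}
```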
I agree.
But we should wait until stateful model support is merged to HF, right?
It would be great to avoid referring to different commits to convert different models.
@Wovchena please consider that we may still need stateless models until both the CPU and GPU plugins support stateful models.
We are going to have stateful support for both CPU and GPU.
We are considering having separate samples for stateless models if they demonstrate popular architectures and require architecture-dependent KV-cache processing.
But the main focus is on stateful models.
So maybe we can merge the chatglm sample as is and treat it as a "sample for a stateless model of a popular architecture"?
Meanwhile, the current one will show stateful models and hence become more generic.
Greedy search with a stateful model: sammysun0711#1. That eliminates the need for a special implementation for chatglm.
I also had the idea of keeping a stateless model sample for an architecture that doesn't fit the general scenario. And there's still Qwen, which may require that (#43); I haven't tried it yet.
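For completeness, a hedged sketch of the greedy step itself, assuming logits shaped `[batch, seq_len, vocab_size]` with batch size 1 as in this demo; the function name is mine:

```cpp
// Sketch of greedy token selection over the last position's logits.
#include <openvino/openvino.hpp>

#include <algorithm>
#include <cstdint>

int64_t greedy_next_token(const ov::Tensor& logits) {
    const ov::Shape& shape = logits.get_shape();  // assumed {1, seq_len, vocab_size}
    size_t seq_len = shape[1];
    size_t vocab_size = shape[2];
    // Only the distribution over the last position picks the next token.
    const float* last = logits.data<float>() + (seq_len - 1) * vocab_size;
    return std::max_element(last, last + vocab_size) - last;
}
```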