
Fix causal_lm cpp demo for llama architecture #71

Merged

Conversation

sammysun0711 (Collaborator)
This PR aims to fix the causal_lm cpp demo for the llama architecture.

  • Change directory name from casual_lm to causal_lm
  • Fix causal_lm cpp demo input & output mismatch for the llama architecture (see the sketch after this list)
    • llama does not require position ids as input; only chatglm requires them
    • llama takes the past KV cache as input starting from index 2 instead of 3
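For illustration, a minimal sketch of how the input layout can be inspected with the OpenVINO C++ API ("model.xml" is a placeholder path, not a file from this PR); the printed order is what motivates the index change below:

#include <openvino/openvino.hpp>
#include <iostream>

int main() {
    ov::Core core;
    // "model.xml" is a placeholder for the converted llama IR.
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");
    // For llama the leading inputs are typically input_ids and attention_mask,
    // so the past key/value inputs start at index 2; chatglm also exposes
    // position_ids, which pushes its KV-cache inputs to index 3.
    std::vector<ov::Output<ov::Node>> inputs = model->inputs();
    for (size_t idx = 0; idx < inputs.size(); ++idx) {
        std::cout << idx << ": " << inputs.at(idx).get_any_name() << '\n';
    }
    return 0;
}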

This is an initial step to build a baseline for consolidating support for the following models into the causal_lm cpp demo:

@sammysun0711 (Collaborator, Author)

@Wovchena, @ilya-lavrenov, could you please review it?

text_generation/causal_lm/cpp/set_up_and_run.sh (outdated, review comments resolved)
text_generation/causal_lm/cpp/causal_lm.cpp (outdated, review comments resolved)
}}
};
std::vector<ov::Output<ov::Node>> inputs = model->inputs();
- for (size_t idx = 3; idx < inputs.size(); ++idx) {
+ for (size_t idx = 2; idx < inputs.size(); ++idx) {
ov::PartialShape shape = inputs.at(idx).get_partial_shape();
shape[0] = BATCH_SIZE;
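For context, a hedged reconstruction of the surrounding reshape pattern (input names, BATCH_SIZE, and the exact non-cache shapes are assumptions, not code quoted from this PR): the truncated initializer above presumably maps the non-cache input indices to fixed shapes, the loop adds the KV-cache inputs with the batch dimension pinned, and the result is applied via ov::Model::reshape.

// Sketch of the assumed surrounding code for the loop under review.
std::map<size_t, ov::PartialShape> shapes = {
    {0, ov::PartialShape{BATCH_SIZE, -1}},  // input_ids (assumed)
    {1, ov::PartialShape{BATCH_SIZE, -1}},  // attention_mask (assumed)
};
std::vector<ov::Output<ov::Node>> inputs = model->inputs();
for (size_t idx = 2; idx < inputs.size(); ++idx) {
    ov::PartialShape shape = inputs.at(idx).get_partial_shape();
    shape[0] = BATCH_SIZE;        // pin the batch dimension of each KV-cache input
    shapes.emplace(idx, shape);   // other dimensions stay as exported
}
model->reshape(shapes);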
@ilya-lavrenov (Contributor) commented Dec 13, 2023

@sammysun0711 (Collaborator, Author):

Yes, this PR is only for llama; chatglm support will be added in another PR.

@Wovchena (Collaborator) commented Dec 13, 2023:

Apparently exporting a model as stateful removes any difference between the dimensions. I was able to run chatglm2-6b and chatglm3-6b using Wovchena#8 and get meaningful output. Here's the PR updating convert.py: #52. My next step is to rewrite this causal_lm as stateful as well, so we can decide whether we need the stateless approach at all.
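For reference, with a stateful export the application no longer sets per-layer past key/value tensors at all; a minimal sketch of a prompt pass under that assumption (the tensor names and the helper function are hypothetical, not code from Wovchena#8):

#include <openvino/openvino.hpp>

// Hypothetical sketch: one prompt pass against a stateful model. The past
// key/value tensors live in the request's internal state, so only the
// regular inputs are fed and no per-layer cache plumbing is needed.
void run_prompt(ov::InferRequest& request,
                const ov::Tensor& input_ids,
                const ov::Tensor& attention_mask) {
    for (auto& state : request.query_state()) {
        state.reset();  // clear the internal KV cache before a new prompt
    }
    request.set_tensor("input_ids", input_ids);
    request.set_tensor("attention_mask", attention_mask);
    request.infer();
}

This is also why the stateful path generalizes across architectures: the cache-layout differences move into the exported model itself.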

Contributor:

I agree.
But we should wait until stateful model support is merged to HF, right?
It would be great to avoid referring to different commits to convert different models.

@sammysun0711 (Collaborator, Author):

@Wovchena, please consider that we may still need stateless models until both the CPU and GPU plugins support stateful models.

@slyalin (Collaborator) commented Dec 13, 2023:

We are going to have stateful support for both CPU and GPU.

@slyalin (Collaborator):

We are considering having separate samples for stateless models if they demonstrate popular architectures and require architecture-dependent KV-cache processing.

Collaborator:

But the main focus is on stateful models.

@ilya-lavrenov (Contributor):

So maybe we can merge the chatglm sample as is and treat it as a "sample for a stateless model of a popular architecture"?
The current one will then show stateful models and hence become more generic.

Collaborator:

Greedy search with a stateful model: sammysun0711#1. That eliminates the need for a special implementation for chatglm.

I also had the idea of keeping a stateless sample for an architecture that doesn't fit into the general scenario. And there's still Qwen, which may require that: #43. I haven't tried it yet.
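To make the greedy-search part concrete, its core is an argmax over the last position's logits; a minimal sketch assuming a [batch, seq_len, vocab_size] logits layout (an illustration, not the code from sammysun0711#1):

#include <openvino/openvino.hpp>
#include <algorithm>
#include <cstdint>

// Hypothetical helper: greedy selection of the next token from a logits
// tensor laid out as [batch, seq_len, vocab_size], with batch assumed to be 1.
int64_t greedy_next_token(ov::Tensor logits) {
    const ov::Shape shape = logits.get_shape();
    const size_t seq_len = shape[1];
    const size_t vocab_size = shape[2];
    // Point at the logits of the last generated position and take the argmax.
    const float* last = logits.data<float>() + (seq_len - 1) * vocab_size;
    return std::max_element(last, last + vocab_size) - last;
}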

@sammysun0711 (Collaborator, Author) commented Dec 14, 2023

@Wovchena, @ilya-lavrenov, @slyalin, thanks for your feedback; I've learned a lot from it.
Based on the discussion above, I will limit this PR's scope to only the following changes:

As @ilya-lavrenov suggested, we can keep the causal_lm sample for generic stateful model enabling.
Meanwhile, as @slyalin suggested, we can add ChatGLM3 and Qwen as stateless model samples that demonstrate popular architectures and require architecture-dependent KV-cache processing.

Please feel free to share any comments and feedback, thanks!

@ilya-lavrenov merged commit 424c46c into openvinotoolkit:master on Dec 14, 2023
1 check passed
@peterchen-intel added the "category: llm_bench" label (for the tool/llm_bench folder) on Dec 18, 2023