Fix causal_lm cpp demo for llama architecture #71

Merged
Changes from 3 commits
CMakeLists.txt:

```diff
@@ -2,25 +2,25 @@
 # SPDX-License-Identifier: Apache-2.0
 
 cmake_minimum_required(VERSION 3.15)
-project(casual_lm)
+project(causal_lm)
 
 # Build user_ov_extensions
 list(APPEND CUSTOM_OPERATIONS tokenizer)
 add_subdirectory(../../../thirdparty/openvino_contrib/modules/custom_operations/ "${CMAKE_CURRENT_BINARY_DIR}/custom_operations/")
 
-add_executable(casual_lm casual_lm.cpp)
-target_compile_definitions(casual_lm PRIVATE USER_OV_EXTENSIONS_PATH=\"$<TARGET_FILE:user_ov_extensions>\")
+add_executable(causal_lm causal_lm.cpp)
+target_compile_definitions(causal_lm PRIVATE USER_OV_EXTENSIONS_PATH=\"$<TARGET_FILE:user_ov_extensions>\")
 find_package(OpenVINO REQUIRED COMPONENTS Runtime)
-target_link_libraries(casual_lm PRIVATE openvino::runtime user_ov_extensions)
-set_target_properties(casual_lm PROPERTIES CXX_STANDARD 17)
-set_target_properties(casual_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
+target_link_libraries(causal_lm PRIVATE openvino::runtime user_ov_extensions)
+set_target_properties(causal_lm PROPERTIES CXX_STANDARD 17)
+set_target_properties(causal_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
 if(MSVC)
   target_compile_options(
-    casual_lm PRIVATE
+    causal_lm PRIVATE
     /Wall # Display all warnings
     /wd4710 /wd4711 # Disable the inline warnings
     /EHsc # Enable standard C++ stack unwinding, assume functions with extern "C" never throw
   )
 else()
-  target_compile_options(casual_lm PRIVATE -Wall) # Display all warnings
+  target_compile_options(causal_lm PRIVATE -Wall) # Display all warnings
 endif()
```
README.md:

```diff
@@ -53,8 +53,8 @@ python ./convert_tokenizers.py ./Llama-2-7b-hf/
 
 ## Run
 
-Usage: `casual_lm <openvino_model.xml> <tokenizer.xml> <detokenizer.xml> "<prompt>"`
+Usage: `causal_lm <openvino_model.xml> <tokenizer.xml> <detokenizer.xml> "<prompt>"`
 
-Example: `./build/casual_lm ./Llama-2-7b-hf/openvino_model.xml ./tokenizer.xml ./detokenizer.xml "Why is the Sun yellow?"`
+Example: `./build/causal_lm ./Llama-2-7b-hf/openvino_model.xml ./tokenizer.xml ./detokenizer.xml "Why is the Sun yellow?"`
 
 To enable Unicode characters for Windows cmd open `Region` settings from `Control panel`. `Administrative`->`Change system locale`->`Beta: Use Unicode UTF-8 for worldwide language support`->`OK`. Reboot.
```
causal_lm.cpp:

```diff
@@ -39,13 +39,10 @@ int main(int argc, char* argv[]) try {
         }},
         {1, ov::PartialShape{
             BATCH_SIZE, -1
-        }},
-        {2, ov::PartialShape{
-            BATCH_SIZE, -1
         }}
     };
     std::vector<ov::Output<ov::Node>> inputs = model->inputs();
-    for (size_t idx = 3; idx < inputs.size(); ++idx) {
+    for (size_t idx = 2; idx < inputs.size(); ++idx) {
        ov::PartialShape shape = inputs.at(idx).get_partial_shape();
        shape[0] = BATCH_SIZE;
        shapes.emplace(idx, shape);
```
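For context on the hunk above: the llama export used by this demo takes `input_ids` as input 0 and `attention_mask` as input 1 and has no separate `position_ids` input, so the past key/value inputs now start at index 2 instead of 3. A small hypothetical helper, not part of the diff, can confirm that layout for a given export; it assumes only the `std::shared_ptr<ov::Model>` the demo already holds:

```cpp
#include <openvino/openvino.hpp>

#include <iostream>
#include <memory>

// Hypothetical sanity check (not from this PR): list every model input so the
// expected layout (0: input_ids, 1: attention_mask, 2+: past key/values) can
// be verified before the reshape loop runs.
void print_input_layout(const std::shared_ptr<ov::Model>& model) {
    const std::vector<ov::Output<ov::Node>> inputs = model->inputs();
    for (size_t idx = 0; idx < inputs.size(); ++idx) {
        std::cout << idx << ": " << inputs.at(idx).get_any_name() << ' '
                  << inputs.at(idx).get_partial_shape() << '\n';
    }
}
```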
Review discussion on causal_lm.cpp:

**@sammysun0711** (Collaborator, author): Yes, this PR is only for llama; I will add chatglm support in another PR.

**@Wovchena** (Collaborator, Dec 13, 2023): Apparently exporting a model as stateful removes any difference between the dimensions. I was able to run chatglm2-6b and chatglm3-6b using Wovchena#8 and get meaningful output. Here's the PR updating convert.py: #52. My next step is to rewrite this causal_lm as stateful as well, so we can decide whether we need the stateless approach at all.

**A contributor**: I agree. But we should wait until stateful model support is merged to HF, right? It would be great to avoid referring to different commits to convert different models.

**@sammysun0711** (Collaborator, author): @Wovchena, please consider that we may still need stateless models until both the CPU and GPU plugins support stateful models.

**@slyalin** (Collaborator, Dec 13, 2023): We are going to have stateful support for both CPU and GPU.

**A collaborator**: We are considering having separate samples for stateless models if they demonstrate popular architectures and require architecture-dependent kv-cache processing.

**A collaborator**: But the main focus is on stateful models.

**A contributor**: So maybe we can merge the chatglm sample as is and treat it as a "sample for a stateless model of a popular architecture"? Meanwhile, the current one will show stateful models and hence become more generic.

**A collaborator**: Greedy search with a stateful model: sammysun0711#1. That eliminates the need for a special implementation for chatglm. I also had the idea of keeping a stateless sample for an architecture that doesn't fit into the general scenario. And there's still Qwen, which may require that: #43. I haven't tried it yet.
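To make the stateful direction discussed above concrete, here is a minimal hypothetical sketch of one greedy decoding step against a stateful export. It is not code from this PR or from sammysun0711#1; the tensor names and the use of the state API are assumptions carried over from the stateless demo. The point is that the past key/value tensors live inside the infer request's state, so none of the per-architecture kv-cache wiring shown in the diff below is needed:

```cpp
#include <openvino/openvino.hpp>

#include <algorithm>
#include <cstdint>

// Hypothetical greedy step for a stateful model (an assumption, not this PR's
// code): feed only the newest token; the KV cache for the previous `past_len`
// tokens persists in the request's state between infer() calls.
int64_t greedy_step(ov::InferRequest& ireq, int64_t token, size_t past_len) {
    ireq.get_tensor("input_ids").set_shape({1, 1});
    ireq.get_tensor("input_ids").data<int64_t>()[0] = token;
    ireq.get_tensor("attention_mask").set_shape({1, past_len + 1});
    std::fill_n(ireq.get_tensor("attention_mask").data<int64_t>(), past_len + 1, 1);
    ireq.infer();
    // Argmax over the vocabulary, exactly as the stateless demo does.
    float* logits = ireq.get_tensor("logits").data<float>();
    size_t vocab_size = ireq.get_tensor("logits").get_shape().back();
    return std::max_element(logits, logits + vocab_size) - logits;
}
// A new prompt would start with ireq.reset_state() to clear the KV cache.
```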

The second hunk of causal_lm.cpp removes the position_ids handling:

```diff
@@ -59,23 +56,18 @@ int main(int argc, char* argv[]) try {
     std::copy_n(input_ids.data<const int64_t>(), input_ids.get_size(), ireq.get_tensor("input_ids").data<int64_t>());
     ireq.get_tensor("attention_mask").set_shape(attention_mask.get_shape());
     std::fill_n(ireq.get_tensor("attention_mask").data<int64_t>(), attention_mask.get_size(), 1);
-    ireq.get_tensor("position_ids").set_shape(input_ids.get_shape());
-    std::iota(ireq.get_tensor("position_ids").data<int64_t>(), ireq.get_tensor("position_ids").data<int64_t>() + ireq.get_tensor("position_ids").get_size(), 0);
     ireq.infer();
     size_t vocab_size = ireq.get_tensor("logits").get_shape().back();
     float* logits = ireq.get_tensor("logits").data<float>() + (input_ids.get_size() - 1) * vocab_size;
     int64_t out_token = std::max_element(logits, logits + vocab_size) - logits;
 
     ireq.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
-    ireq.get_tensor("position_ids").set_shape({BATCH_SIZE, 1});
     constexpr int64_t SPECIAL_EOS_TOKEN = 2;  // There's no way to extract the value from the detokenizer for now
     while (out_token != SPECIAL_EOS_TOKEN) {
         ireq.get_tensor("input_ids").data<int64_t>()[0] = out_token;
         ireq.get_tensor("attention_mask").set_shape({BATCH_SIZE, ireq.get_tensor("attention_mask").get_shape()[1] + 1});
         std::fill_n(ireq.get_tensor("attention_mask").data<int64_t>(), ireq.get_tensor("attention_mask").get_size(), 1);
-        ireq.get_tensor("position_ids").data<int64_t>()[0] = ireq.get_tensor("attention_mask").get_size() - 2;
-        for (size_t idx = 3; idx < inputs.size(); ++idx) {
-            ireq.set_input_tensor(idx, ireq.get_output_tensor(idx - 2));
+        for (size_t idx = 2; idx < inputs.size(); ++idx) {
+            ireq.set_input_tensor(idx, ireq.get_output_tensor(idx - 1));
         }
         ireq.start_async();
         print_token(detokenizer, out_token);
```
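One index convention in the generation loop above is easy to miss: the past key/value inputs start at 2 (after `input_ids` and `attention_mask`), while the matching present key/value outputs start at 1 (after `logits`). That off-by-one is why `get_output_tensor(idx - 1)` feeds `set_input_tensor(idx)`. The layout below is inferred from those offsets rather than stated in the diff:

```cpp
// Inferred tensor layout (not stated explicitly in the demo):
//   inputs:  0 = input_ids, 1 = attention_mask, 2..N+1 = past key/values
//   outputs: 0 = logits,                        1..N   = present key/values
// Each step reuses this iteration's present kv outputs as the next
// iteration's past kv inputs, shifted by one slot because of `logits`.
for (size_t idx = 2; idx < inputs.size(); ++idx) {
    ireq.set_input_tensor(idx, ireq.get_output_tensor(idx - 1));
}
```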
The CI build-and-run script:

```diff
@@ -13,7 +13,7 @@ function abs_path() {
 cd "`abs_path`"
 
 mkdir ./ov/
-curl https://storage.openvinotoolkit.org/repositories/openvino/packages/2023.1/linux/l_openvino_toolkit_ubuntu20_2023.1.0.12185.47b736f63ed_x86_64.tgz | tar --directory ./ov/ --strip-components 1 -xz
+curl https://storage.openvinotoolkit.org/repositories/openvino/packages/2023.2/linux/l_openvino_toolkit_ubuntu20_2023.2.0.13089.cfd42bd2cb0_x86_64.tgz | tar --directory ./ov/ --strip-components 1 -xz
 sudo ./ov/install_dependencies/install_openvino_dependencies.sh
 
 source ./ov/setupvars.sh
@@ -23,4 +23,4 @@ cmake --build ./build/ --config Release -j
 wait
 
 python ./convert_tokenizers.py ./open_llama_3b_v2/
-./build/casual_lm ./open_llama_3b_v2/openvino_model.xml ./tokenizer.xml ./detokenizer.xml "return 0"
+./build/causal_lm ./open_llama_3b_v2/openvino_model.xml ./tokenizer.xml ./detokenizer.xml "Why is the Sun yellow?"
```