BabyLlama with CPP backend #2544
Conversation
…rsion errors. Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
build_transformer(&transformer, checkpoint_path);

char tokenizer_path[] =
    "/home/ubuntu/serve/cpp/src/examples/image_classifier/babyllama/"
The path is hard-coded at present -- read it from a config file instead.
@shrinath-suresh you can also add the tokenizer.bin as an additional file when creating the mar file and set the filename as load_model_request->model_dir + "tokenizer.bin"
Updated the code to read the tokenizer and model path from a config file
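For reference, a minimal sketch of how the two paths could be read from config.json. It uses nlohmann/json and a LoadConfig helper purely for illustration; the tokenizer_path key and the helper name are assumptions, and the actual handler may parse the file differently.

#include <fstream>
#include <string>

#include <nlohmann/json.hpp>  // illustration only; the real handler may use another JSON parser

struct LlmConfig {
  std::string checkpoint_path;
  std::string tokenizer_path;  // assumed key name
};

// Hypothetical helper: load both artifact paths from <model_dir>/config.json.
LlmConfig LoadConfig(const std::string& model_dir) {
  std::ifstream in(model_dir + "/config.json");
  nlohmann::json config = nlohmann::json::parse(in);
  return {config.at("checkpoint_path").get<std::string>(),
          config.at("tokenizer_path").get<std::string>()};
}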
Signed-off-by: Shrinath Suresh <[email protected]>
Hi, thanks for this great contribution! I've left some comments on the PR.
#ifndef LLM_HANDLER_HH_
#define LLM_HANDLER_HH_

#include "run.c"
Is there a reason why run.c gets included here? It's also listed in the CMake file as a source file, which probably did not work as there is no header file to declare its content. I would recommend removing it from the CMake file and including it in the .cc instead to localize visibility.
Moved the run.c include to the .cc file
cpp/src/examples/babyllama/run.c (outdated)
run.c was published under the MIT license, which requires that we include the copyright notice and license. My proposal is to create a subfolder and include run.c plus the original license file. @chauhang Is that a viable way to proceed?
Created a subdirectory named llama2.c and copied run.c along with the license file
float topp = 0.9f;  // top-p in nucleus sampling. 1.0 = off. 0.9 works well, but slower
int steps = 256;    // number of steps to run for
unsigned long long rng_seed = 0;
Initializing an RNG with 0 (all bits zero) can be problematic in some cases.
Removed the initialization
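A minimal sketch of the seeding fallback being discussed, assuming a hypothetical ChooseRngSeed helper; run.c itself falls back to a time-based seed when none is given.

#include <ctime>
#include <random>

// Hypothetical helper: never hand an all-zero seed to the sampler.
unsigned long long ChooseRngSeed(unsigned long long requested_seed = 0) {
  if (requested_seed != 0) {
    return requested_seed;  // caller asked for a reproducible run
  }
  std::random_device rd;  // non-deterministic entropy where available
  unsigned long long seed =
      (static_cast<unsigned long long>(rd()) << 32) ^
      static_cast<unsigned long long>(std::time(nullptr));
  return seed != 0 ? seed : 1;
}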
std::string msg = torchserve::Converter::VectorToStr(data_it->second);

char* msgCStr = new char[msg.size() + 1];  // +1 for the null terminator
Please use smart pointers when allocating dynamic memory and prefer new over malloc.
Something like
std::unique_ptr<int[]> prompt_tokens(new int[(strlen(msgCStr) + 3) * sizeof(int)]);
should work as well.
Updated code to use smart pointers in necessary places
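A minimal sketch of the smart-pointer pattern suggested above; the EncodePrompt name is hypothetical, and the +3 padding follows the reviewer's example (room for BOS/EOS and the terminator expected by llama2.c's encode).

#include <cstring>
#include <memory>
#include <string>

// Hypothetical helper showing RAII buffers instead of raw new[]/delete[].
void EncodePrompt(const std::string& msg) {
  // Null-terminated copy of the prompt for the C tokenizer API.
  auto msg_cstr = std::make_unique<char[]>(msg.size() + 1);
  std::strcpy(msg_cstr.get(), msg.c_str());

  // Token buffer is released automatically when it goes out of scope.
  auto prompt_tokens = std::make_unique<int[]>(std::strlen(msg_cstr.get()) + 3);
  // ... call llama2.c's encode() with prompt_tokens.get() here ...
}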
  long_vector.push_back(data_ptr[i]);
}

int* prompt_tokens = new int[num_elements];
see above
Updated code to use smart pointer
int* prompt_tokens = new int[num_elements];
for (int64_t i = 0; i < num_elements; ++i) {
  prompt_tokens[i] = static_cast<int>(long_vector[i]);
why can't we just copy the data from the tensor instead of going through long_vector?
Updated the logic to directly copy the data from tensor
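A minimal sketch of copying the token ids straight out of the tensor, as suggested above; the TokensFromTensor helper is illustrative, not the handler's actual code.

#include <cstring>
#include <memory>

#include <torch/torch.h>

// Hypothetical helper: copy token ids out of the tensor without an intermediate vector.
std::unique_ptr<int[]> TokensFromTensor(const torch::Tensor& tokens_tensor) {
  // Contiguous int32 storage makes the memcpy into int[] valid.
  torch::Tensor tokens = tokens_tensor.contiguous().to(torch::kInt32);
  const int64_t num_elements = tokens.numel();
  auto prompt_tokens = std::make_unique<int[]>(num_elements);
  std::memcpy(prompt_tokens.get(), tokens.data_ptr<int32_t>(),
              num_elements * sizeof(int));
  return prompt_tokens;
}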
    std::pair<std::string&, std::map<uint8_t, std::string>&>& idx_to_req_id,
    std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
  std::vector<torch::Tensor> tensor_vector;
  auto tokens_list_tensor = inputs[0].toTensor();
Can we extend this to batched processing or at least process all entries in the batch?
Working on the batch processing part. Will keep you posted once it is done
@mreso Thanks for your comments. I will address them.
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
… upfront Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
@mreso I have addressed most of your comments. Working on updating the code to enable batch processing. Once it is done, I will run a sanity test and update the steps/README with the details. Will keep you posted on this.
  return batch_ivalue;
}

torch::Tensor LlmHandler::Inference(
Could you add with torch.inference_mode(): ?
Is it c10::InferenceMode guard; in C++? See https://pytorch.org/cppdocs/notes/inference_mode.html
torch::InferenceMode is a high-level API, c10::InferenceMode is a low-level API. According to the libtorch docs, they are trying to use torch::xxx to unify the low-level APIs.
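A minimal sketch of scoping inference with the guard discussed here; the RunInference signature is simplified for illustration and is not the handler's actual interface.

#include <vector>

#include <torch/script.h>

// Hypothetical wrapper: disable autograd bookkeeping for the forward pass.
torch::Tensor RunInference(torch::jit::script::Module& model,
                           std::vector<torch::jit::IValue> inputs) {
  c10::InferenceMode guard;  // torch::InferenceMode is an alias in newer libtorch
  return model.forward(inputs).toTensor();
}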
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Updated code to process all the items in the batch. Please review when you find time and let me know if any other changes are needed.
@lxning Should we add the model and tokenizer download steps to the test script? Do we have any concept of setup and teardown in the cpp backend for each test case? For the unit tests to pass, these files are mandatory.
      manifest_->GetModel().serialized_file),
  *device));
The current cpp backend only supports one device id, which means there is no partitioning across GPU devices. I assume this example only works on a single GPU.
    std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
  c10::InferenceMode guard;
  std::vector<torch::Tensor> batch_output_vector;
  for (const torch::jit::IValue& input : inputs) {
This for loop predicts each inference request one by one. Can we optimize this section to leverage either C++ multithreading or GPU batching?
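As a minimal sketch of the multithreading option (not the PR's actual implementation), the per-request generation could be fanned out with std::async; predict_one stands in for whatever wraps the existing single-request logic and must be safe to call concurrently.

#include <functional>
#include <future>
#include <vector>

#include <torch/script.h>

// Hypothetical helper: run each request of the batch on its own task.
std::vector<torch::Tensor> PredictBatch(
    const std::vector<torch::jit::IValue>& inputs,
    const std::function<torch::Tensor(const torch::jit::IValue&)>& predict_one) {
  std::vector<std::future<torch::Tensor>> futures;
  futures.reserve(inputs.size());
  for (const auto& input : inputs) {
    futures.emplace_back(std::async(std::launch::async,
                                    [&predict_one, &input] { return predict_one(input); }));
  }
  std::vector<torch::Tensor> outputs;
  outputs.reserve(futures.size());
  for (auto& f : futures) {
    outputs.push_back(f.get());  // preserves the original request order
  }
  return outputs;
}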
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
When we remove the absolute paths (and the tests pass) this will be ready to go for now. We can add batched processing in a follow-up PR @lxning .
@@ -0,0 +1,5 @@
{
  "checkpoint_path" : "/home/ubuntu/serve/cpp/stories15M.bin",
Would be good to make these relative instead of absolute.
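One way to honor relative entries, sketched under the assumption that the handler knows the model directory (e.g. from load_model_request->model_dir); the ResolveArtifact name is hypothetical.

#include <filesystem>
#include <string>

// Hypothetical helper: resolve a config.json entry against the model directory
// so nothing absolute has to be baked into the mar file.
std::string ResolveArtifact(const std::string& model_dir,
                            const std::string& configured_path) {
  std::filesystem::path p(configured_path);
  if (p.is_absolute()) {
    return p.string();
  }
  return (std::filesystem::path(model_dir) / p).string();
}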
-rw-rw-r-- 1 ubuntu ubuntu 424K Sep  5 23:55 tokenizer.bin
-rw-rw-r-- 1 ubuntu ubuntu  58M Jul 27 04:09 stories15M.bin
These two files are needed at run time for the unit test case. I can think of two approaches:
- Download these files in build.sh and remove them once the unit test case passes.
- Add the download logic in cpp - serve/cpp/test/backends/torch_scripted/torch_scripted_backend_test.cc - when executing the test case.
The second option seems to be more reasonable. What's your opinion?
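A minimal sketch of the second option, assuming a gtest fixture in torch_scripted_backend_test.cc and shelling out to curl; the fixture name, target paths, and download URLs are assumptions and would need to match the handler's config.json.

#include <cstdlib>
#include <filesystem>

#include <gtest/gtest.h>

// Hypothetical fixture: fetch the model and tokenizer once before the test runs.
class BabyLlamaArtifactsTest : public ::testing::Test {
 protected:
  void SetUp() override {
    namespace fs = std::filesystem;
    if (!fs::exists("stories15M.bin")) {
      std::system(
          "curl -L -o stories15M.bin "
          "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin");  // assumed URL
    }
    if (!fs::exists("tokenizer.bin")) {
      std::system(
          "curl -L -o tokenizer.bin "
          "https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin");  // assumed URL
    }
  }
};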
Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Shrinath Suresh <[email protected]>
Closing this in favor of #2903, which picks up all changes and adds adjustments.
Description
Benchmarking Babyllama deployment with CPP Backend
Setup and Test
- Follow the instructions from README.md to set up the cpp backend environment.
- Download the stories model (stories15M.bin).
- Download the tokenizer.bin file.
- Create a config file named config.json with the paths of the downloaded model and tokenizer.
- Once the build is successful, the libllm_handler.so shared object file will be generated in the serve/cpp/test/resources/torchscript_model/babyllama/llm_handler folder.
- Copy the dummy.pt file to the llm_handler folder.
- Move to the llm_handler folder and run the following command to generate the mar file. The default timeout is 120000; when the context size is large, LLM generation takes more time to complete the request on a single-GPU machine.
- Update prompt.txt if needed and run the inference request.
Sample response
Benchmarking
Clone the llama2.c repo.
Move to the folder and compile; an executable named run will be generated.
Download the model.
Run inference using the following command.
babyllama_gcc_standalone.txt
The standalone version generates output at 55.29 tokens per second. The variation is due to the compiler options. Check PR karpathy/llama2.c#116 for cmake build support.
Clone the krrishnarraj/llama2.c branch from the pull request.
Follow the build instructions from that branch or run the following commands.
Once the build succeeds, run the following command.
The standalone cmake version generates 147.39 tokens per second.
torchserve curl request
ts_log.txt
Babyllama with the cpp backend generates 172.3 tokens per second.
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: