message LoadModelResponse {
// OPTIONAL - If nontrivial cost is involved in
// determining the size, return 0 here and
// do the sizing in the modelSize function
uint64 sizeInBytes = 1;
// EXPERIMENTAL - Applies only if limitModelConcurrency = true
// was returned from runtimeStatus rpc.
// See RuntimeStatusResponse.limitModelConcurrency for more detail
uint32 maxConcurrency = 2;
}
Hi, in model-runtime.proto, LoadModelResponse specifies the model's size in bytes and its max concurrency. Currently, the adapter hard-codes the size as the total size of the model files on disk, which may be reasonable for deep-learning weights but is inaccurate for, e.g., the Triton Python backend, where the files are small scripts yet the loaded model can be much larger. In addition, different models should be able to declare different max concurrency values.
Therefore, I propose that the adapter read these values from a separate config file within the model folder (much like the config.pbtxt file) so that they can be overridden per model.
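To make the proposal concrete, here is a minimal sketch of the adapter-side lookup, written in Go to match the adapter implementation. The file name model-config.json, the field names, and the zero-value fallback are all assumptions for illustration, not an existing adapter API; the actual format (JSON vs. YAML, naming) would be settled in the PR. An override file placed in the model folder might look like:

{
  "sizeInBytes": 2147483648,
  "maxConcurrency": 4
}

And the adapter could read it with something like:

package overrides

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// ModelOverrides mirrors the hypothetical per-model config file.
type ModelOverrides struct {
	SizeInBytes    uint64 `json:"sizeInBytes"`
	MaxConcurrency uint32 `json:"maxConcurrency"`
}

// Load looks for model-config.json inside the model folder. A missing
// file or field leaves the corresponding value at zero, so the adapter
// keeps its current behavior (summing file sizes, default concurrency)
// whenever no override is present.
func Load(modelDir string) (ModelOverrides, error) {
	var o ModelOverrides
	data, err := os.ReadFile(filepath.Join(modelDir, "model-config.json"))
	if err != nil {
		if os.IsNotExist(err) {
			return o, nil // no override file: fall back to defaults
		}
		return o, err
	}
	err = json.Unmarshal(data, &o)
	return o, err
}

When building the LoadModelResponse, the adapter would then return o.SizeInBytes instead of the summed file size whenever it is nonzero, and likewise pass o.MaxConcurrency through as maxConcurrency.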
I am open to creating a PR.