Kserve MNIST CI failure #3308

Closed
maaquib opened this issue Sep 9, 2024 · 1 comment

Comments

@maaquib
Collaborator

maaquib commented Sep 9, 2024

🐛 Describe the bug

The KServe CI workflow started failing recently because of a change introduced in this PR. That change added a new model parameter, startup_timeout, so starting the model server from any snapshot created before it (which lacks the parameter) throws an exception.

Error logs

Error trace:

2024-09-06T00:08:11,306 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
java.lang.NullPointerException: Cannot invoke "com.google.gson.JsonElement.getAsInt()" because the return value of "com.google.gson.JsonObject.get(String)" is null
        at org.pytorch.serve.wlm.Model.setModelState(Model.java:197)
        at org.pytorch.serve.wlm.ModelManager.createModel(ModelManager.java:493)
        at org.pytorch.serve.wlm.ModelManager.registerAndUpdateModel(ModelManager.java:98)
        at org.pytorch.serve.snapshot.SnapshotManager.initModels(SnapshotManager.java:137)
        at org.pytorch.serve.snapshot.SnapshotManager.restore(SnapshotManager.java:120)
        at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:162)
        at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:398)
        at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:124)
        at org.pytorch.serve.ModelServer.main(ModelServer.java:105)
INFO:root:Loading mnist .. 2 of 10 tries..
INFO:root:The model mnist is not ready
INFO:root:Sleep 30 seconds for load mnist..

Installation instructions

Ran CI locally on minikube: https://github.com/pytorch/serve/blob/master/.github/workflows/kserve_cpu_tests.yml

Model Packaging

https://github.com/pytorch/serve/blob/master/kubernetes/kserve/tests/configs/mnist_v2_cpu.yaml

config.properties

From gs://kfserving-examples/models/torchserve/image_classifier/v2

$ cat config/config.properties
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
enable_envvars_config=true
install_py_dep_per_model=true
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"mnist":{"1.0":{"defaultVersion":true,"marName":"mnist.mar","minWorkers":1,"maxWorkers":5,"batchSize":2,"maxBatchDelay":500,"responseTimeout":60}}}}
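A quick way to confirm that the snapshot above lacks the new parameter is to parse the model_snapshot JSON and list the keys each model version is missing. This is a minimal sketch, not part of TorchServe: the helper name `missing_keys` is hypothetical, and the key spelling `startupTimeout` is an assumption made by analogy with `responseTimeout` (the issue only names the parameter as startup_timeout).

```python
import json

# The model_snapshot value from the config.properties above, verbatim.
SNAPSHOT = (
    '{"name":"startup.cfg","modelCount":1,"models":{"mnist":{"1.0":'
    '{"defaultVersion":true,"marName":"mnist.mar","minWorkers":1,'
    '"maxWorkers":5,"batchSize":2,"maxBatchDelay":500,"responseTimeout":60}}}}'
)

# ASSUMPTION: the snapshot key is spelled "startupTimeout", by analogy with
# "responseTimeout"; the issue text only calls the parameter startup_timeout.
REQUIRED_KEYS = ("responseTimeout", "startupTimeout")


def missing_keys(snapshot_json, required=REQUIRED_KEYS):
    """Return {(model, version): [missing keys]} for entries lacking a required key."""
    snapshot = json.loads(snapshot_json)
    report = {}
    for model, versions in snapshot.get("models", {}).items():
        for version, settings in versions.items():
            absent = [key for key in required if key not in settings]
            if absent:
                report[(model, version)] = absent
    return report


print(missing_keys(SNAPSHOT))
```

Under that assumption, the only entry reported is mnist version 1.0, which matches the null lookup in the stack trace.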

Versions

$ git log -1 --format='%H' | cat
a2ba1c7127b96f4d14e1d79529e1f973c0fde3ee

Repro instructions

Ran https://github.com/pytorch/serve/blob/master/.github/workflows/kserve_cpu_tests.yml locally

Possible Solution

Short-term:

  • Update the gs://kfserving-examples/models/torchserve/**/config.properties files to add the new parameter wherever the server starts from a snapshot
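The short-term fix could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `add_startup_timeout` is hypothetical, the key spelling `startupTimeout` and the value 120 are placeholders not confirmed by this issue, and the snapshot is assumed to sit on a single unescaped line as in the config.properties shown above.

```python
import json

SNAPSHOT_PREFIX = "model_snapshot="
# ASSUMPTION: key spelling chosen by analogy with "responseTimeout".
STARTUP_TIMEOUT_KEY = "startupTimeout"


def add_startup_timeout(properties_text, timeout=120):
    """Add the startup timeout to every model version in the snapshot line.

    The default of 120 is a placeholder value, not taken from the issue.
    Existing values are left untouched, so the function is safe to re-run.
    """
    patched = []
    for line in properties_text.splitlines():
        if line.startswith(SNAPSHOT_PREFIX):
            snapshot = json.loads(line[len(SNAPSHOT_PREFIX):])
            for versions in snapshot.get("models", {}).values():
                for settings in versions.values():
                    settings.setdefault(STARTUP_TIMEOUT_KEY, timeout)
            # Compact separators keep the snapshot on one line, as in the original.
            line = SNAPSHOT_PREFIX + json.dumps(snapshot, separators=(",", ":"))
        patched.append(line)
    return "\n".join(patched) + "\n"
```

Each config.properties would be downloaded from the bucket, passed through this function, and uploaded again; all other properties pass through unchanged.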

Long-term:

  • Update Model.java to handle keys missing from the snapshot gracefully, i.e. fall back to the default value instead of dereferencing a null
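The long-term idea is a default-on-missing lookup. The sketch below illustrates the pattern in Python rather than in the Java of Model.java; it is not the actual patch. The helper `read_int` is hypothetical and the default values are placeholders. In the Java code the equivalent would be a null check on the result of `JsonObject.get(String)` before calling `getAsInt()`, which is the call that fails in the trace above.

```python
import json

# ASSUMPTION: placeholder defaults for illustration only; the real defaults
# live in the TorchServe source and are not stated in this issue.
DEFAULTS = {"responseTimeout": 120, "startupTimeout": 120}


def read_int(model_info, key, defaults=DEFAULTS):
    """Return model_info[key] as an int, or the default if the key is absent or null."""
    value = model_info.get(key)
    if value is None:
        return defaults[key]
    return int(value)


# A snapshot entry written before the new parameter existed.
old_entry = json.loads('{"minWorkers":1,"maxWorkers":5,"responseTimeout":60}')

print(read_int(old_entry, "responseTimeout"))  # value present in the snapshot
print(read_int(old_entry, "startupTimeout"))   # falls back to the default
```

With this pattern, old snapshots keep loading when new parameters are introduced, so stored config.properties files do not need to be rewritten for every new key.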
@agunapal
Collaborator

Resolved via #3328
