Expose cuda device health status in /healthz endpoint #1056

Open · wants to merge 5 commits into base: mainline

Conversation

papa99do (Collaborator) commented on Nov 28, 2024:

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Operational improvement

  • What is the current behavior? (You can also link to an open issue here)
    The CUDA device can silently fail, which causes unexpected behaviour in Marqo.

  • What is the new behavior (if this is a feature change)?
    Expose a /healthz endpoint that checks CUDA device status. This endpoint returns a 500 error when:

    • the CUDA device becomes unavailable
    • the CUDA device is out of memory

    It can be used by any scheduling framework as a liveness check for the Marqo container (a sketch of the endpoint follows the checklist below).
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    Yes

  • Related Python client changes (link commit/PR here)
    No

  • Related documentation changes (link commit/PR here)
    N/A

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)
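
A minimal sketch of the endpoint described above, assuming a FastAPI app and a module-level device_manager exposing cuda_device_health_check(); the import path and handler name are illustrative, not the exact PR code:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

# Illustrative import: the PR adds a device_manager module; its exact path may differ.
from marqo.core.inference.device_manager import device_manager

app = FastAPI()


@app.get("/healthz", include_in_schema=False)
def liveness_check() -> JSONResponse:
    # Raises a DeviceError subclass (e.g. CudaOutOfMemoryError) if any CUDA
    # device has become unavailable or is out of memory; the API layer turns
    # that error into a 500 response.
    device_manager.cuda_device_health_check()
    return JSONResponse(content={"status": "ok"}, status_code=200)

A Kubernetes livenessProbe pointing at /healthz would then restart the container whenever the probe receives a 500, which is the only way to recover from a lost CUDA device.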

@papa99do changed the title from "yihan/cuda-issue-mitigation" to "Support cuda device health check in k8s liveness check" on Nov 28, 2024
@papa99do changed the title from "Support cuda device health check in k8s liveness check" to "Expose cuda device health status in /healthz endpoint" on Nov 28, 2024
@papa99do papa99do marked this pull request as ready for review November 28, 2024 05:04
@@ -39,6 +40,7 @@ def __init__(

self.timeout = timeout
self.backend = backend if backend is not None else enums.SearchDb.vespa
# TODO [Refactoring device logic] deprecate default_device since it's not used
papa99do (Collaborator, Author):

To control the scope of this ticket and reduce the risk of refactoring without enough test coverage, I added TODOs for all the possible improvements needed to consolidate the device check logic.

logger = get_logger('device_manager')


class DeviceType(str, Enum):
Contributor:

We have an existing enum for this in tensor_search:

class Device(str, Enum):

although this might actually be a better place to put it.

papa99do (Collaborator, Author):

Yes, my plan is to use this one in the future. The one in enums.py will be removed.
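
For reference, a sketch of the new enum based on the diff context above (the cuda member value is an assumption):

from enum import Enum


class DeviceType(str, Enum):
    """Type of compute device tracked by the device manager."""
    cpu = 'cpu'
    cuda = 'cuda'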

@@ -109,6 +110,7 @@ def id_to_device(id):


class SetBestAvailableDevice:
# TODO [Refactoring device logic] move this logic to device manager, get rid of MARQO_BEST_AVAILABLE_DEVICE envvar
Contributor:

Good idea. We could remove the torch.cuda.is_available() call here and replace it with DeviceManager's _is_cuda_available_at_startup(), right?


    @classmethod
    def cpu(cls) -> 'Device':
        return Device(id=-1, name='cpu', type=DeviceType.cpu)
Contributor:

Do we have no total_memory for cpu?

papa99do (Collaborator, Author):

I used to populate it from psutil. But we never use it in this health check, so I removed it. We can add it easily if we see a use case.
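
A sketch of the Device model implied by this thread, with total_memory optional so it can stay unset for CPU; whether it is a pydantic model and the exact cuda() constructor are assumptions:

from typing import Optional

import torch
from pydantic import BaseModel


class Device(BaseModel):
    id: int
    name: str
    type: DeviceType                    # the enum sketched earlier
    total_memory: Optional[int] = None  # only populated for CUDA devices

    @classmethod
    def cpu(cls) -> 'Device':
        # total_memory is left unset; the health check never reads it for CPU.
        return cls(id=-1, name='cpu', type=DeviceType.cpu)

    @classmethod
    def cuda(cls, device_id: int) -> 'Device':
        props = torch.cuda.get_device_properties(device_id)
        return cls(id=device_id, name=props.name, type=DeviceType.cuda,
                   total_memory=props.total_memory)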

# CUDA devices could become unavailable/unreachable if the docker container running Marqo loses access
# to the device symlinks. There is no way to recover from this, we will need to restart the container.
# See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 for more details.
logger.error('Cuda device becomes unavailable')
Contributor:

We could fix the error message: CUDA device/s have become unavailable

@@ -112,3 +112,15 @@ class DuplicateDocumentError(AddDocumentsError):

class TooManyFieldsError(MarqoError):
pass


class DeviceError(MarqoError):
Contributor:

How is this handled by the API layer?
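
For illustration only (an assumption about the wiring, not necessarily what this PR does): a DeviceError raised by the health check could be mapped to a 500 via a FastAPI exception handler, keeping the /healthz route itself trivial:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


# DeviceError is the new exception class shown in the diff above.
@app.exception_handler(DeviceError)
def device_error_handler(request: Request, exc: DeviceError) -> JSONResponse:
    # Surface device failures as 500s so a liveness probe can act on them.
    return JSONResponse(content={"error": str(exc)}, status_code=500)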


def memory():
    return memory_profiler.get_memory_profile()


@app.get("/healthz", include_in_schema=False)
def check_health(marqo_config: config.Config = Depends(get_config)):
Contributor:

Is there an issue with having the same function name check_health as the health endpoint? We don't want this to conflict.

papa99do (Collaborator, Author):

Good point. I had overlooked this; I will change the function name.

f'Memory stats: {str(memory_stats)}')

torch.randn(3, device=cuda_device)
except RuntimeError as e:
Contributor:

Is my understanding correct that the very first error encountered will stop this loop? What if multiple CUDA devices are out of memory? Should we distinguish whether some of the devices are available and some are not? That could be useful. Maybe we could report on the status of each CUDA device.

papa99do (Collaborator, Author):

Yes, correct. We can check all CUDA devices and error out if any one of them is out of memory.
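
A sketch of the per-device loop being discussed: check every CUDA device, collect the unhealthy ones, and fail once at the end. CudaOutOfMemoryError follows the diff below; CudaDeviceNotAvailableError and the aggregation logic are assumptions:

import logging

import torch

logger = logging.getLogger('device_manager')


def cuda_device_health_check(cuda_devices: list) -> None:
    """Raise if any CUDA device is unavailable or out of memory."""
    if not torch.cuda.is_available():
        # Devices were present at startup but have since disappeared (e.g. lost
        # device symlinks in the container); only a restart recovers from this.
        raise CudaDeviceNotAvailableError('CUDA device/s have become unavailable')

    oom_devices = []
    for device in cuda_devices:
        try:
            # A tiny allocation is enough to surface OOM or device failures.
            torch.randn(3, device=f'cuda:{device.id}')
        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                oom_devices.append(device.name)
            else:
                raise CudaDeviceNotAvailableError(
                    f'CUDA device {device.name} failed the health check: {e}') from e
        except Exception as e:
            # Unknown, non-CUDA errors are logged but do not fail the check, so
            # an unrelated bug cannot crash-loop the container.
            logger.warning(f'Unexpected error when checking CUDA device {device.name}: {e}')

    if oom_devices:
        raise CudaOutOfMemoryError(f'CUDA devices out of memory: {", ".join(oom_devices)}')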

allocated_mem = memory_stats.get("allocated.all.current", None) if memory_stats else None
raise CudaOutOfMemoryError(f'Cuda device {device.name} is out of memory: '
f'({allocated_mem}/{device.total_memory})')
except Exception as e:
Contributor:

This exception catch seems too broad. It may mask CUDA issues in the future.

papa99do (Collaborator, Author):

The purpose is to catch any remaining CUDA or non-CUDA-related errors, so that we don't crash the container on unknown errors.

@@ -154,6 +154,7 @@ def _get_vespa_health(self, hostname_filter: Optional[str]) -> VespaHealthStatus
)

def get_cuda_info(self) -> MarqoCudaInfoResponse:
# TODO [Refactoring device logic] move this logic to device manager
Contributor:

It seems like this method overlaps a lot with the functionality of cuda_device_health_check. But yes, that refactoring can be another ticket.

        device_manager.cuda_device_health_check()
        self.assertEqual(0, len(mock_cuda.mock_calls))

    def test_cuda_health_check_should_pass_when_cuda_device_is_healthy(self):
Contributor:

Let's add a test where there are multiple CUDA devices: one or more are healthy, while some are unhealthy.
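
A sketch of such a test, mocking torch so that one device allocates fine while another raises a CUDA OOM error; how DeviceManager enumerates devices under these mocks, and the exact class names, are assumptions about the PR's test setup:

from unittest import TestCase
from unittest.mock import patch

import torch


class TestCudaHealthCheck(TestCase):

    @patch('torch.cuda.device_count', return_value=2)
    @patch('torch.cuda.is_available', return_value=True)
    @patch('torch.randn')
    def test_cuda_health_check_should_fail_when_some_devices_are_unhealthy(
            self, mock_randn, mock_is_available, mock_device_count):
        # cuda:0 allocates fine, cuda:1 raises a CUDA OOM error.
        mock_randn.side_effect = [
            torch.zeros(3),
            RuntimeError('CUDA error: out of memory'),
        ]

        device_manager = DeviceManager()  # assumed to discover both CUDA devices

        with self.assertRaises(CudaOutOfMemoryError):
            device_manager.cuda_device_health_check()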
