Expose cuda device health status in /healthz endpoint #1056
base: mainline
Conversation
Force-pushed c862f09 to d7c5d1c, then d7c5d1c to ae4278e.
@@ -39,6 +40,7 @@ def __init__(

        self.timeout = timeout
        self.backend = backend if backend is not None else enums.SearchDb.vespa
        # TODO [Refactoring device logic] deprecate default_device since it's not used
To control the scope of this ticket and reduce the risk of refactoring without enough test coverage, I added TODOs for all the possible improvements to consolidate device check logic.
logger = get_logger('device_manager')


class DeviceType(str, Enum):
We have an existing enum for this in tensor_search: class Device(str, Enum) in marqo/src/marqo/tensor_search/enums.py (line 30 in a262ede). Although this might actually be a better place to put it.
Yes, my plan is to use this one in the future. The one in enums.py will be removed.
@@ -109,6 +110,7 @@ def id_to_device(id):


class SetBestAvailableDevice:
    # TODO [Refactoring device logic] move this logic to device manager, get rid of MARQO_BEST_AVAILABLE_DEVICE envvar
Good idea. We could remove the torch.cuda.is_available() call here and replace it with DeviceManager's _is_cuda_available_at_startup(), right?
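A minimal sketch of that suggestion, assuming a device manager object that exposes the _is_cuda_available_at_startup() check named above; the class shape and the env var handling below are illustrative, not the PR's actual code:

```python
import os


class SetBestAvailableDevice:
    """Sketch: pick the best available device via the device manager
    instead of calling torch.cuda.is_available() directly."""

    # The env var name comes from the TODO above; the device manager
    # method name comes from this review thread.
    ENV_VAR = "MARQO_BEST_AVAILABLE_DEVICE"

    def __init__(self, device_manager):
        self._device_manager = device_manager

    def set(self) -> None:
        # Use the device manager's startup snapshot rather than re-probing torch.cuda here.
        best_device = "cuda" if self._device_manager._is_cuda_available_at_startup() else "cpu"
        os.environ[self.ENV_VAR] = best_device
```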
    @classmethod
    def cpu(cls) -> 'Device':
        return Device(id=-1, name='cpu', type=DeviceType.cpu)
Do we not have total_memory for cpu?
I used to populate it from psutil, but we never use it in this health check, so I removed it. We can add it back easily if we see a use case.
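For reference, a small sketch of how total_memory could be populated for the CPU device from psutil if a use case does appear (psutil is assumed to be available):

```python
import psutil  # third-party; only needed if we decide to report CPU memory


def cpu_total_memory_bytes() -> int:
    """Total physical system memory, i.e. what a CPU 'device' would report as total_memory."""
    return psutil.virtual_memory().total
```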
# CUDA devices could become unavailable/unreachable if the docker container running Marqo loses access
# to the device symlinks. There is no way to recover from this, we will need to restart the container.
# See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 for more details.
logger.error('Cuda device becomes unavailable')
We could fix the error message: CUDA device/s have become unavailable
@@ -112,3 +112,15 @@ class DuplicateDocumentError(AddDocumentsError):


class TooManyFieldsError(MarqoError):
    pass


class DeviceError(MarqoError):
How is this handled by the API layer?
src/marqo/tensor_search/api.py (outdated)
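For illustration only, one way a DeviceError could be surfaced as a 500 by the FastAPI layer; the handler below is an assumed sketch, not the actual code in api.py:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse


class DeviceError(Exception):
    """Stand-in for the DeviceError added in this PR."""


app = FastAPI()


@app.exception_handler(DeviceError)
def device_error_handler(request: Request, exc: DeviceError) -> JSONResponse:
    # An unhealthy device is a server-side condition, so report 500 and let
    # the orchestrator decide whether to restart the container.
    return JSONResponse(status_code=500, content={"message": str(exc)})
```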
def memory():
    return memory_profiler.get_memory_profile()


@app.get("/healthz", include_in_schema=False)
def check_health(marqo_config: config.Config = Depends(get_config)):
Is there an issue with having the same function name check_health as the health endpoint? We don't want this to conflict.
Good point. I had overlooked this; I will change the function name.
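A sketch of the rename, assuming the existing /health route keeps its check_health handler and the new /healthz route gets a distinct name; the name liveness_check, the config stub, and the way the health check is reached through the config object are all assumptions:

```python
from fastapi import Depends, FastAPI

app = FastAPI()


def get_config():
    """Placeholder for the real get_config dependency in marqo.tensor_search.api."""
    raise NotImplementedError


@app.get("/healthz", include_in_schema=False)
def liveness_check(marqo_config=Depends(get_config)):
    # Renamed from check_health so it cannot be confused with the /health handler.
    marqo_config.device_manager.cuda_device_health_check()
    return {"status": "ok"}
```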
                f'Memory stats: {str(memory_stats)}')

            torch.randn(3, device=cuda_device)
        except RuntimeError as e:
Is my understanding correct that the very first error encountered will stop this loop? What if multiple CUDA devices are out of memory? Should we distinguish cases where some of the devices are available and some are not? That could be useful. Maybe we could report the status of each CUDA device.
Yes, correct. We can check all CUDA devices and error out if any one is out of memory.
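A sketch of the per-device approach discussed here: probe every CUDA device, record each device's status, and only raise after all of them have been checked. The Device fields (id, name, total_memory) follow the snippets in this PR; the aggregation itself is an assumption:

```python
import torch


def check_all_cuda_devices(cuda_devices) -> dict:
    """Probe each CUDA device with a tiny allocation and return a name -> status map."""
    statuses = {}
    for device in cuda_devices:
        try:
            # Small allocation to verify the device is reachable and has free memory.
            torch.randn(3, device=f"cuda:{device.id}")
            statuses[device.name] = "healthy"
        except RuntimeError as e:
            # torch raises RuntimeError for CUDA out-of-memory; keep checking the rest.
            statuses[device.name] = f"out of memory: {e}"
        except Exception as e:
            # Unknown failure (e.g. the device became unreachable); record it
            # instead of crashing the container.
            statuses[device.name] = f"unavailable: {e}"
    return statuses
```

The caller can then raise CudaOutOfMemoryError (or include the whole map in the /healthz response) if any device reports an unhealthy status.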
            allocated_mem = memory_stats.get("allocated.all.current", None) if memory_stats else None
            raise CudaOutOfMemoryError(f'Cuda device {device.name} is out of memory: '
                                       f'({allocated_mem}/{device.total_memory})')
        except Exception as e:
This exception catch seems too broad. It may mask CUDA issues in the future.
The purpose is to catch any remaining error, CUDA-related or not, so that we don't crash the container on unknown errors.
@@ -154,6 +154,7 @@ def _get_vespa_health(self, hostname_filter: Optional[str]) -> VespaHealthStatus
        )

    def get_cuda_info(self) -> MarqoCudaInfoResponse:
        # TODO [Refactoring device logic] move this logic to device manager
It seems like this method overlaps a lot with the functionality of cuda_device_health_check. But yes, refactoring can be another ticket.
        device_manager.cuda_device_health_check()
        self.assertEqual(0, len(mock_cuda.mock_calls))

    def test_cuda_health_check_should_pass_when_cuda_device_is_healthy(self):
Let's add a test where there are multiple CUDA devices: one or more are healthy, while others are unhealthy.
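A sketch of such a test, simulating two CUDA devices where the first allocation succeeds and the second raises CUDA OOM; the import paths, the two-device setup, and the assumption that the health check probes both devices before raising are mine, not the PR's:

```python
from unittest import TestCase, mock

import torch

# Import paths below are assumptions based on the modules touched in this PR.
from marqo.core.exceptions import CudaOutOfMemoryError
from marqo.core.inference.device_manager import device_manager


class TestCudaHealthCheckMultipleDevices(TestCase):

    def test_cuda_health_check_should_fail_when_one_of_two_devices_is_unhealthy(self):
        # Assumes the device_manager under test was initialised with two CUDA
        # devices (that setup is omitted here for brevity).
        side_effects = [torch.zeros(3), RuntimeError("CUDA error: out of memory")]
        with mock.patch("torch.randn", side_effect=side_effects) as mock_randn:
            with self.assertRaises(CudaOutOfMemoryError):
                device_manager.cuda_device_health_check()
            # Both devices should have been probed before the error surfaced.
            self.assertEqual(2, mock_randn.call_count)
```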
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
Operational improvement
What is the current behavior? (You can also link to an open issue here)
A CUDA device can silently fail, which causes strange behaviour in Marqo.
What is the new behavior (if this is a feature change)?
Expose a /healthz endpoint that can be called to check CUDA device status. This endpoint returns a 500 error when a CUDA device has become unavailable or is out of memory.
This endpoint can be used by any scheduling framework as a liveness check for the Marqo container.
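For example, an orchestrator or a small sidecar script can poll the endpoint and treat any non-200 response as an unhealthy container. A minimal Python sketch (the default host/port is an assumption):

```python
import sys

import requests  # third-party HTTP client


def is_marqo_alive(base_url: str = "http://localhost:8882") -> bool:
    """Return True if Marqo's /healthz endpoint reports a healthy instance."""
    try:
        return requests.get(f"{base_url}/healthz", timeout=5).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    # Exit non-zero when unhealthy so a scheduler can restart the container.
    sys.exit(0 if is_marqo_alive() else 1)
```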
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Have unit tests been run against this PR? (Has there also been any additional testing?)
Yes
Related Python client changes (link commit/PR here)
No
Related documentation changes (link commit/PR here)
N/A
Other information:
Please check if the PR fulfills these requirements