Expose cuda device health status in /healthz endpoint #1056

Open · wants to merge 5 commits into base: mainline

Conversation

papa99do (Collaborator) commented on Nov 28, 2024:

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Operational improvement

  • What is the current behavior? (You can also link to an open issue here)
    The CUDA device can silently fail, which causes unexpected behaviour in Marqo.

  • What is the new behavior (if this is a feature change)?
    Expose a /healthz endpoint that checks CUDA device status. This endpoint returns a 500 error when:

    • the CUDA device becomes unavailable
    • the CUDA device is out of memory

    It can be used by any scheduling framework as a liveness check for the Marqo container (a sketch of the endpoint follows the checklist below).
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    Yes

  • Related Python client changes (link commit/PR here)
    No

  • Related documentation changes (link commit/PR here)
    N/A

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)
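
A minimal sketch of the endpoint described above, assuming a FastAPI app and a module-level device_manager exposing cuda_device_health_check(); the import path and handler name are illustrative, not the exact PR code:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

# Illustrative import: the PR adds a device_manager module; its exact path may differ.
from marqo.core.inference.device_manager import device_manager

app = FastAPI()


@app.get("/healthz", include_in_schema=False)
def liveness_check() -> JSONResponse:
    # Raises a DeviceError subclass (e.g. CudaOutOfMemoryError) if any CUDA
    # device has become unavailable or is out of memory; the API layer turns
    # that error into a 500 response.
    device_manager.cuda_device_health_check()
    return JSONResponse(content={"status": "ok"}, status_code=200)

A Kubernetes livenessProbe pointing at /healthz would then restart the container whenever the probe receives a 500, which is the only way to recover from a lost CUDA device.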

@papa99do changed the title from "yihan/cuda-issue-mitigation" to "Support cuda device health check in k8s liveness check" on Nov 28, 2024
@papa99do changed the title from "Support cuda device health check in k8s liveness check" to "Expose cuda device health status in /healthz endpoint" on Nov 28, 2024
@papa99do papa99do marked this pull request as ready for review November 28, 2024 05:04
@@ -39,6 +40,7 @@ def __init__(

self.timeout = timeout
self.backend = backend if backend is not None else enums.SearchDb.vespa
# TODO [Refactoring device logic] deprecate default_device since it's not used
papa99do (Collaborator, Author):

To control the scope of this ticket and reduce the risk of refactoring without enough test coverage, I added TODOs for all the possible improvements needed to consolidate the device check logic.

logger = get_logger('device_manager')


class DeviceType(str, Enum):
Contributor:

We have an existing enum for this in tensor_search:

class Device(str, Enum):

although this might actually be a better place to put it.

papa99do (Collaborator, Author):

Yes, my plan is to use this one in the future. The one in enums.py will be removed.
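
For reference, a sketch of the new enum based on the diff context above (the cuda member value is an assumption):

from enum import Enum


class DeviceType(str, Enum):
    """Type of compute device tracked by the device manager."""
    cpu = 'cpu'
    cuda = 'cuda'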

@@ -109,6 +110,7 @@ def id_to_device(id):


class SetBestAvailableDevice:
# TODO [Refactoring device logic] move this logic to device manager, get rid of MARQO_BEST_AVAILABLE_DEVICE envvar
Contributor:

Good idea. We could remove the torch.cuda.is_available() call here and replace it with DeviceManager's _is_cuda_available_at_startup(), right?


    @classmethod
    def cpu(cls) -> 'Device':
        return Device(id=-1, name='cpu', type=DeviceType.cpu)
Contributor:

Do we have no total_memory for cpu?

papa99do (Collaborator, Author):

I used to populate it from psutil. But we never use it in this health check, so I removed it. We can add it easily if we see a use case.
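
A sketch of the Device model implied by this thread, with total_memory optional so it can stay unset for CPU; whether it is a pydantic model and the exact cuda() constructor are assumptions:

from typing import Optional

import torch
from pydantic import BaseModel


class Device(BaseModel):
    id: int
    name: str
    type: DeviceType                    # the enum sketched earlier
    total_memory: Optional[int] = None  # only populated for CUDA devices

    @classmethod
    def cpu(cls) -> 'Device':
        # total_memory is left unset; the health check never reads it for CPU.
        return cls(id=-1, name='cpu', type=DeviceType.cpu)

    @classmethod
    def cuda(cls, device_id: int) -> 'Device':
        props = torch.cuda.get_device_properties(device_id)
        return cls(id=device_id, name=props.name, type=DeviceType.cuda,
                   total_memory=props.total_memory)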

# CUDA devices could become unavailable/unreachable if the docker container running Marqo loses access
# to the device symlinks. There is no way to recover from this, we will need to restart the container.
# See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 for more details.
logger.error('Cuda device becomes unavailable')
Contributor:

We could fix the error message: CUDA device/s have become unavailable

@@ -112,3 +112,15 @@ class DuplicateDocumentError(AddDocumentsError):

class TooManyFieldsError(MarqoError):
pass


class DeviceError(MarqoError):
Contributor:

How is this handled by the API layer?
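
For illustration only (an assumption about the wiring, not necessarily what this PR does): a DeviceError raised by the health check could be mapped to a 500 via a FastAPI exception handler, keeping the /healthz route itself trivial:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


# DeviceError is the new exception class shown in the diff above.
@app.exception_handler(DeviceError)
def device_error_handler(request: Request, exc: DeviceError) -> JSONResponse:
    # Surface device failures as 500s so a liveness probe can act on them.
    return JSONResponse(content={"error": str(exc)}, status_code=500)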


def memory():
    return memory_profiler.get_memory_profile()


@app.get("/healthz", include_in_schema=False)
def check_health(marqo_config: config.Config = Depends(get_config)):
Contributor:

Is there an issue with having the same function name check_health as the health endpoint? We don't want this to conflict.

papa99do (Collaborator, Author):

Good point. I had overlooked this; I will change the function name.

f'Memory stats: {str(memory_stats)}')

torch.randn(3, device=cuda_device)
except RuntimeError as e:
Contributor:

Is my understanding correct that the very first error encountered will stop this loop? What if multiple CUDA devices are out of memory? Should we distinguish whether some of the devices are available and some are not? That could be useful. Maybe we could report on the status of each CUDA device.

papa99do (Collaborator, Author):

Yes, correct. We can check all CUDA devices and error out if any one of them is out of memory.
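
A sketch of the per-device loop being discussed: check every CUDA device, collect the unhealthy ones, and fail once at the end. CudaOutOfMemoryError follows the diff below; CudaDeviceNotAvailableError and the aggregation logic are assumptions:

import logging

import torch

logger = logging.getLogger('device_manager')


def cuda_device_health_check(cuda_devices: list) -> None:
    """Raise if any CUDA device is unavailable or out of memory."""
    if not torch.cuda.is_available():
        # Devices were present at startup but have since disappeared (e.g. lost
        # device symlinks in the container); only a restart recovers from this.
        raise CudaDeviceNotAvailableError('CUDA device/s have become unavailable')

    oom_devices = []
    for device in cuda_devices:
        try:
            # A tiny allocation is enough to surface OOM or device failures.
            torch.randn(3, device=f'cuda:{device.id}')
        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                oom_devices.append(device.name)
            else:
                raise CudaDeviceNotAvailableError(
                    f'CUDA device {device.name} failed the health check: {e}') from e
        except Exception as e:
            # Unknown, non-CUDA errors are logged but do not fail the check, so
            # an unrelated bug cannot crash-loop the container.
            logger.warning(f'Unexpected error when checking CUDA device {device.name}: {e}')

    if oom_devices:
        raise CudaOutOfMemoryError(f'CUDA devices out of memory: {", ".join(oom_devices)}')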

allocated_mem = memory_stats.get("allocated.all.current", None) if memory_stats else None
raise CudaOutOfMemoryError(f'Cuda device {device.name} is out of memory: '
f'({allocated_mem}/{device.total_memory})')
except Exception as e:
Contributor:

This exception catch seems too broad. It may mask CUDA issues in the future.

papa99do (Collaborator, Author):

The purpose is to catch any remaining CUDA or non-CUDA-related errors, so that we don't crash the container on unknown errors.

@@ -154,6 +154,7 @@ def _get_vespa_health(self, hostname_filter: Optional[str]) -> VespaHealthStatus
)

def get_cuda_info(self) -> MarqoCudaInfoResponse:
# TODO [Refactoring device logic] move this logic to device manager
Contributor:

It seems like this method overlaps a lot with the functionality of cuda_device_health_check. But yes, that refactoring can be another ticket.

        device_manager.cuda_device_health_check()
        self.assertEqual(0, len(mock_cuda.mock_calls))

    def test_cuda_health_check_should_pass_when_cuda_device_is_healthy(self):
Contributor:

Let's add a test where there are multiple CUDA devices: one or more are healthy, while some are unhealthy.
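
A sketch of such a test, mocking torch so that one device allocates fine while another raises a CUDA OOM error; how DeviceManager enumerates devices under these mocks, and the exact class names, are assumptions about the PR's test setup:

from unittest import TestCase
from unittest.mock import patch

import torch


class TestCudaHealthCheck(TestCase):

    @patch('torch.cuda.device_count', return_value=2)
    @patch('torch.cuda.is_available', return_value=True)
    @patch('torch.randn')
    def test_cuda_health_check_should_fail_when_some_devices_are_unhealthy(
            self, mock_randn, mock_is_available, mock_device_count):
        # cuda:0 allocates fine, cuda:1 raises a CUDA OOM error.
        mock_randn.side_effect = [
            torch.zeros(3),
            RuntimeError('CUDA error: out of memory'),
        ]

        device_manager = DeviceManager()  # assumed to discover both CUDA devices

        with self.assertRaises(CudaOutOfMemoryError):
            device_manager.cuda_device_health_check()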
