Raise exception when CreateFleet doesn't return any instance
Raise an exception when CreateFleet returns no instances and the error list in the CreateFleet response contains exactly one entry.
The exception carries the error code from the CreateFleet response, so that the code can be set in the reason when the Slurm nodes are put into DOWN.
This avoids triggering the fast capacity failover (error code InsufficientInstanceCapacity) when CreateFleet returns no instances because of throttling (error code RequestLimitExceeded).

Signed-off-by: Luca Carrogu <[email protected]>
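
To illustrate the motivation in the commit message, here is a minimal sketch of a failover decision keyed on the error code. The names `CAPACITY_ERROR_CODES` and `should_trigger_fast_failover` are hypothetical and do not come from the repository; the real failover logic may consider more codes or other conditions.

```python
# Hypothetical set of error codes treated as capacity failures.
# Only InsufficientInstanceCapacity is named in the commit message.
CAPACITY_ERROR_CODES = {"InsufficientInstanceCapacity"}


def should_trigger_fast_failover(error_code):
    """Return True only when the error code signals a capacity problem."""
    return error_code in CAPACITY_ERROR_CODES


# A throttling error must not be mistaken for a capacity failure.
for code in ("InsufficientInstanceCapacity", "RequestLimitExceeded"):
    print(code, should_trigger_fast_failover(code))
```

With the error code propagated from CreateFleet, a check of this kind can tell throttling apart from a real capacity shortage.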
lukeseawalker committed Oct 17, 2023
1 parent 659d56d commit b57bee5
Showing 2 changed files with 14 additions and 1 deletion.
13 changes: 12 additions & 1 deletion src/slurm_plugin/fleet_manager.py
@@ -71,6 +71,14 @@ def __init__(self, message: str):
         super().__init__(message)


+class LaunchInstancesError(Exception):
+    """Represent an error during the launch of EC2 instances"""
+
+    def __init__(self, code: str, message: str = ""):
+        self.code = code
+        super().__init__(message)
+
+
 class FleetManagerFactory:
     @staticmethod
     def get_manager(
@@ -361,7 +369,8 @@ def _launch_instances(self, launch_params):

             instances = response.get("Instances", [])
             log_level = logging.WARNING if instances else logging.ERROR
-            for err in response.get("Errors", []):
+            err_list = response.get("Errors", [])
+            for err in err_list:
                 logger.log(
                     log_level,
                     "Error in CreateFleet request (%s): %s - %s",
@@ -375,6 +384,8 @@ def _launch_instances(self, launch_params):
             if partial_instance_ids:
                 logger.error("Unable to retrieve instance info for instances: %s", partial_instance_ids)

+            if not instances and len(err_list) == 1:
+                raise LaunchInstancesError(err_list[0].get("ErrorCode"), err_list[0].get("ErrorMessage"))
             return {"Instances": instances}
         except ClientError as e:
             logger.error("Failed CreateFleet request: %s", e.response.get("ResponseMetadata", {}).get("RequestId"))
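
The check added to `_launch_instances` can be exercised in isolation. The sketch below copies the exception class and the new condition from this diff into a hypothetical helper, `check_fleet_response`, and feeds it a hand-written dictionary that only mimics the `Instances` and `Errors` keys of a CreateFleet response. The dictionary contents are made up for illustration.

```python
class LaunchInstancesError(Exception):
    """Represent an error during the launch of EC2 instances (as in the diff)."""

    def __init__(self, code: str, message: str = ""):
        self.code = code
        super().__init__(message)


def check_fleet_response(response: dict) -> dict:
    """Hypothetical helper isolating the condition added by this commit."""
    instances = response.get("Instances", [])
    err_list = response.get("Errors", [])
    # Raise only when nothing was launched and there is exactly one error entry.
    if not instances and len(err_list) == 1:
        raise LaunchInstancesError(err_list[0].get("ErrorCode"), err_list[0].get("ErrorMessage"))
    return {"Instances": instances}


# Made-up response mimicking a throttled CreateFleet call.
throttled = {
    "Instances": [],
    "Errors": [{"ErrorCode": "RequestLimitExceeded", "ErrorMessage": "Request limit exceeded."}],
}

try:
    check_fleet_response(throttled)
except LaunchInstancesError as e:
    print(e.code, "-", e)
```

Responses with no errors, with several errors, or with at least one launched instance pass through without raising, which matches the condition in the diff.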
2 changes: 2 additions & 0 deletions src/slurm_plugin/instance_manager.py
@@ -1016,6 +1016,8 @@ def _launch_instances(
                 update_failed_nodes_parameters = {"nodeset": set(batch_nodes)}
                 if isinstance(e, ClientError):
                     update_failed_nodes_parameters["error_code"] = e.response.get("Error", {}).get("Code")
+                if isinstance(e, Exception) and hasattr(e, "code"):
+                    update_failed_nodes_parameters["error_code"] = e.code
                 self._update_failed_nodes(**update_failed_nodes_parameters)

                 if job and all_or_nothing_batch:
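
The caller-side change can be sketched the same way. In the example below, `FakeClientError` is a stand-in for `botocore.exceptions.ClientError` (assumed here only to expose a `response` dictionary), and `extract_error_code` is a hypothetical helper wrapping the two `if` branches from the diff. Neither name exists in the repository.

```python
class LaunchInstancesError(Exception):
    """Represent an error during the launch of EC2 instances (as in the diff)."""

    def __init__(self, code: str, message: str = ""):
        self.code = code
        super().__init__(message)


class FakeClientError(Exception):
    """Stand-in for botocore's ClientError: carries a `response` dict, no `code`."""

    def __init__(self, code: str):
        self.response = {"Error": {"Code": code}}
        super().__init__(code)


def extract_error_code(e: Exception):
    """Hypothetical helper mirroring the two branches in the diff."""
    params = {}
    if isinstance(e, FakeClientError):
        params["error_code"] = e.response.get("Error", {}).get("Code")
    # Any exception exposing a `code` attribute, such as LaunchInstancesError.
    if isinstance(e, Exception) and hasattr(e, "code"):
        params["error_code"] = e.code
    return params.get("error_code")


print(extract_error_code(LaunchInstancesError("RequestLimitExceeded", "throttled")))
print(extract_error_code(FakeClientError("InsufficientInstanceCapacity")))
print(extract_error_code(ValueError("no code attribute")))
```

Because the second branch relies on `hasattr(e, "code")` rather than on a specific class, any future exception type that sets a `code` attribute would be picked up as well.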