
[develop] Add support for Capacity Blocks for ML #591

Merged
merged 36 commits into from
Nov 9, 2023

Conversation

enrico-usai
Contributor

@enrico-usai enrico-usai commented Nov 3, 2023

Capacity Block for ML

Capacity Block reservations for Machine Learning (CB) allow AWS users to reserve blocks of GPU capacity for a future start date and for a duration of their choice. They are capacity reservations of type capacity-block, with additional attributes such as StartDate and EndDate, and additional states.

Capacity Block instances are managed by ParallelCluster as static nodes, but in a peculiar way: ParallelCluster permits creating a cluster even if the CB reservation is not yet active, and automatically launches the instances when it becomes active.

The Slurm nodes corresponding to compute resources associated with CB reservations are kept in maintenance until the CB start time is reached. While the CB is not active, the Slurm nodes stay in a Slurm reservation/maintenance state associated with the slurm admin user: the nodes can accept jobs, but the jobs stay pending until the reservation is removed.

Clustermgtd automatically creates/deletes Slurm reservations, putting the related CB nodes in or out of maintenance according to the CB state. The Slurm reservation is removed when the CB becomes active; the nodes then start and become available for pending jobs or new job submissions.

When the CB end time is reached, the nodes are put back into the reservation/maintenance state. It is up to the user to resubmit/requeue jobs to a new queue/compute resource when the CB time window ends.

CapacityBlockManager

CapacityBlockManager is a new class that performs the following actions:

  • when the CB is not active or is expired, a new Slurm reservation is created/updated
  • when the CB is active, the Slurm reservation is removed and the nodes become standard static instances

The Slurm reservation name will be pcluster-{capacity_block_reservation_id}.
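The two actions above can be sketched as follows. This is a minimal illustration under assumed names (`slurm_reservation_name` and `action_for_state` are hypothetical helpers, not the actual ParallelCluster functions):

```python
# Minimal sketch of the CapacityBlockManager decision logic described above.
# The state names follow the EC2 Capacity Block lifecycle; the helper names
# are hypothetical, not the actual ParallelCluster implementation.

SLURM_RESERVATION_PREFIX = "pcluster-"


def slurm_reservation_name(capacity_block_id: str) -> str:
    """Build the Slurm reservation name for a given Capacity Block id."""
    return f"{SLURM_RESERVATION_PREFIX}{capacity_block_id}"


def action_for_state(cb_state: str) -> str:
    """Decide what to do with the Slurm reservation for a given CB state."""
    if cb_state == "active":
        # Capacity is available: drop the reservation so the nodes can start.
        return "delete-reservation"
    # scheduled, expired, etc.: keep the nodes in maintenance.
    return "create-or-update-reservation"
```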

CapacityBlockManager will be initialized by clustermgtd with region, boto3_config and fleet_config info.
The fleet config is the existing fleet-config.json file, which has been extended to include capacity reservation ids and the capacity type (on-demand vs spot vs capacity-block).
The manager reloads fleet-config and capacity reservation info every time the daemon is restarted
or the config is modified.
The current logic refreshes CB reservation info every 10 minutes.

The manager removes the nodes associated with inactive CB reservations
from the list of unhealthy static nodenames.
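The 10-minute refresh mentioned above boils down to a simple elapsed-time check; a sketch, with the constant name taken from the PR discussion and the function names assumed:

```python
# Sketch of the periodic-refresh check. CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD
# appears in the PR discussion; the function names are assumptions.
CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD = 10  # minutes


def seconds_to_minutes(seconds: float) -> float:
    return seconds / 60


def is_time_to_update_capacity_blocks_info(seconds_since_last_update: float) -> bool:
    # Compare elapsed minutes against the period (not just truthiness).
    return seconds_to_minutes(seconds_since_last_update) > CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD
```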

CapacityBlock

CapacityBlock is a new class that stores internal info about a capacity block,
merging data from EC2 (e.g. the capacity block state) and from the config (i.e. the list of nodes associated with it).
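A minimal sketch of what such a class could hold; the field and method names are assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapacityBlock:
    """Internal info about a Capacity Block, merging EC2 and config data (sketch)."""

    capacity_block_id: str
    state: str = "scheduled"  # from ec2:DescribeCapacityReservations
    # queue/compute-resource -> nodenames, from the fleet config
    nodes_by_compute_resource: Dict[str, List[str]] = field(default_factory=dict)

    def add_nodes(self, queue: str, compute_resource: str, nodenames: List[str]) -> None:
        self.nodes_by_compute_resource[f"{queue}/{compute_resource}"] = nodenames

    def all_nodenames(self) -> List[str]:
        return [n for nodes in self.nodes_by_compute_resource.values() for n in nodes]

    def is_active(self) -> bool:
        return self.state == "active"
```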

Managing slurm reservations

Created a new set of commands (using scontrol) to manage Slurm reservations:

  • create a reservation
    • Use slurm as the default user rather than root, given that "slurm" is the admin user in the ParallelCluster setup.
  • update an existing reservation
    • Common steps to populate the create and update commands were factored out, since the two are very similar.
  • check if a reservation exists
  • delete a reservation
  • added the reservation_name attribute to the SlurmNode class
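Such a wrapper could look like the sketch below. The scontrol option syntax is standard Slurm, but the helper names and default values are assumptions:

```python
import subprocess

SCONTROL = "/opt/slurm/bin/scontrol"


def create_reservation_cmd(name, nodes, start_time="now", duration="infinite", user="slurm"):
    """Build the scontrol command to create a maintenance reservation (sketch)."""
    return [
        "sudo", SCONTROL, "create", "reservation",
        f"ReservationName={name}",
        f"Nodes={nodes}",
        f"StartTime={start_time}",
        f"Duration={duration}",
        f"Users={user}",  # "slurm" is the admin user in the ParallelCluster setup
        "Flags=MAINT",    # puts the nodes in MAINTENANCE+RESERVED state
    ]


def run_scontrol(cmd):
    """Run an scontrol command, capturing output for later parsing."""
    return subprocess.run(cmd, capture_output=True, text=True)
```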

Leftover slurm reservations

When a CB is removed from the cluster config during a cluster update, we need to remove the related Slurm reservations.
We retrieve all the Slurm reservations from Slurm and delete the ones no longer associated with existing CBs in the config.
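The cleanup reduces to a set difference; a sketch with assumed names:

```python
# Sketch of the leftover-reservation cleanup: delete every pcluster-managed
# Slurm reservation whose Capacity Block no longer appears in the config.
def find_leftover_reservations(slurm_reservation_names, config_capacity_block_ids,
                               prefix="pcluster-"):
    expected = {f"{prefix}{cb_id}" for cb_id in config_capacity_block_ids}
    return [
        name for name in slurm_reservation_names
        if name.startswith(prefix) and name not in expected  # ours, but no longer in config
    ]
```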

Nodes in maintenance state

After this patch, nodes in MAINTENANCE+RESERVED state are excluded from the list of static nodes eligible for replacement.
The "replacement list" now contains only nodes that are not yet up (as before) and nodes that are not in maintenance.

This mechanism permits the daemons (or the user) to avoid replacement of turned-down static nodes (DOWN) by putting them in maintenance as a preliminary step.
It works only for nodes in both MAINTENANCE and RESERVED state; nodes only in RESERVED state are replaced as before.

In the future, we can extend this mechanism to have an additional field to check (e.g. skip them only if the maintenance is set by the root user).
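The replacement rule above can be expressed as a small predicate; a sketch, not the actual clustermgtd code:

```python
# Sketch: a DOWN static node is skipped for replacement only when it carries
# BOTH the MAINTENANCE and RESERVED flags; RESERVED alone is not enough.
def should_replace_down_node(node_states: set) -> bool:
    in_maintenance = {"MAINTENANCE", "RESERVED"}.issubset(node_states)
    return "DOWN" in node_states and not in_maintenance
```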

boto3 layer

Added a boto3 layer; this code is taken from the CLI, with the caching mechanism removed.
It decouples the node daemon code from boto3 calls and adds an exception-catching mechanism.

Managing exceptions

Defined a new SlurmCommandError to identify errors coming from scontrol commands.
Defined a new SlurmCommandErrorHandler to handle SlurmCommandError and log messages.

Added a retry (max attempts: 2, wait: 1s) and the SlurmCommandErrorHandler decorator to all the reservation commands.
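A minimal version of that retry behaviour could look like this (a sketch; the actual code's decorator names and retry library may differ):

```python
import functools
import time


class SlurmCommandError(Exception):
    """Error coming from an scontrol command."""


def retry_slurm_command(max_attempts=2, wait_seconds=1):
    """Retry a reservation command on SlurmCommandError (sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except SlurmCommandError:
                    if attempt == max_attempts:
                        raise  # out of attempts: propagate to the error handler
                    time.sleep(wait_seconds)
        return wrapper
    return decorator
```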

The main method of the CapacityBlockManager (get_reserved_nodenames), called by clustermgtd, cannot raise any exception.
If an error in update_slurm_reservation happens while updating a single capacity block/Slurm reservation,
it is caught and the loop continues, updating the others.

If there is a generic error like AWSClientError (wrapped into CapacityBlockManagerError), the entire list of capacity_blocks and reserved_nodes is left unchanged and an error is logged.

If cleanup_leftover_slurm_reservation fails, the failure is logged but the process continues.

FleetManager changes

The main difference between on-demand, spot and capacity-block instances is that capacity-block requires MarketType=capacity-block, but this is added to the Launch Template at generation time by the CLI/CDK.
Capacity reservation id and capacity type info for the compute resources are saved in the fleet-config.json file, which is generated by the cookbook according to the cluster configuration.

Based on this file, the node can tell whether the compute resource is associated with a CB reservation, and adds the following additional info:

        "OnDemandOptions": {
            ...
            "CapacityReservationOptions": {"UsageStrategy": "use-capacity-reservations-first"},
        },
        "TargetCapacitySpecification": {...."DefaultTargetCapacityType": "capacity-block"},
    }

The FleetManager has been modified to support an empty AllocationStrategy, because CB does not support any of the existing options.
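Putting the pieces above together, the capacity-type-dependent part of the CreateFleet request could be assembled like this (a sketch; the field names follow the EC2 CreateFleet API, while the helper name is an assumption):

```python
def build_fleet_request(capacity_type: str, target_capacity: int) -> dict:
    """Assemble the capacity-type-dependent part of an EC2 CreateFleet request (sketch)."""
    request = {
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": capacity_type,  # "on-demand", "spot" or "capacity-block"
        },
    }
    if capacity_type == "capacity-block":
        # Consume the targeted Capacity Block reservation; no AllocationStrategy is set.
        request["OnDemandOptions"] = {
            "CapacityReservationOptions": {"UsageStrategy": "use-capacity-reservations-first"},
        }
    return request
```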

Tests

  • The new and modified code is verified by unit tests.

References

Logging

Slurm reservation creation

2023-11-07 12:43:55,361 - [slurm_plugin.capacity_block_manager:_retrieve_capacity_blocks_from_fleet_config] - INFO - Retrieving Capacity Blocks from fleet configuration.
2023-11-07 12:43:55,361 - [slurm_plugin.capacity_block_manager:_update_capacity_blocks_info_from_ec2] - INFO - Retrieving Capacity Blocks information from EC2 for cr-0296d8df657e57a7b
2023-11-07 12:43:55,384 - [aws.common:_log_boto3_calls] - INFO - Executing boto3 call: region=us-east-2, service=ec2, operation=DescribeCapacityReservations, params={'CapacityReservationIds': ['cr-0296d8df657e57a7b']}
2023-11-07 12:43:55,512 - [common.schedulers.slurm_reservation_commands:is_slurm_reservation] - INFO - Slurm reservation pcluster-cr-0296d8df657e57a7b not found.
2023-11-07 12:43:55,513 - [slurm_plugin.capacity_block_manager:_log_cb_info] - INFO - Capacity Block reservation cr-0296d8df657e57a7b is in state scheduled. Creating Slurm reservation pcluster-cr-0296d8df657e57a7b for nodes queue1-st-p5-1.
2023-11-07 12:43:55,513 - [common.schedulers.slurm_reservation_commands:create_slurm_reservation] - INFO - Creating Slurm reservation with command: sudo /opt/slurm/bin/scontrol create reservation
Reservation created: pcluster-cr-0296d8df657e57a7b
2023-11-07 12:43:55,573 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes queue1-st-p5-1 are associated with unactive Capacity Blocks, they will not be considered as unhealthy nodes.
2023-11-07 12:43:55,574 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state DOWN+CLOUD+NOT_RESPONDING+POWERING_UP:
2023-11-07 12:43:55,574 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

Not yet time to update: capacity_block_manager does nothing.

2023-11-07 17:30:51,134 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-11-07 17:30:51,135 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes associated with inactive Capacity Blocks and not considered as unhealthy nodes are: queue1-st-p5-1
2023-11-07 17:30:51,135 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED:
2023-11-07 17:30:51,135 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

Time to update, because the update period has passed (I set it to 2 minutes to test the behaviour):

2023-11-07 17:31:51,163 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-11-07 17:31:51,163 - [slurm_plugin.capacity_block_manager:_retrieve_capacity_blocks_from_fleet_config] - INFO - Retrieving Capacity Blocks from fleet configuration.
2023-11-07 17:31:51,163 - [slurm_plugin.capacity_block_manager:_update_capacity_blocks_info_from_ec2] - INFO - Retrieving Capacity Blocks information from EC2 for cr-0296d8df657e57a7b
2023-11-07 17:31:51,177 - [aws.common:_log_boto3_calls] - INFO - Executing boto3 call: region=us-east-2, service=ec2, operation=DescribeCapacityReservations, params={'CapacityReservationIds': ['cr-0296d8df657e57a7b']}
2023-11-07 17:31:51,310 - [slurm_plugin.capacity_block_manager:_log_cb_info] - INFO - Capacity Block reservation cr-0296d8df657e57a7b is in state scheduled. Nothing to do. Already existing Slurm reservation pcluster-cr-0296d8df657e57a7b for nodes queue1-st-p5-1.
2023-11-07 17:31:51,337 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes associated with inactive Capacity Blocks and not considered as unhealthy nodes are: queue1-st-p5-1
2023-11-07 17:31:51,338 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED:
2023-11-07 17:31:51,338 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

@enrico-usai enrico-usai requested review from a team as code owners November 3, 2023 14:37

codecov bot commented Nov 3, 2023

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (7ad9fa8) 90.17% compared to head (5508027) 90.90%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #591      +/-   ##
===========================================
+ Coverage    90.17%   90.90%   +0.72%     
===========================================
  Files           16       20       +4     
  Lines         2708     3134     +426     
===========================================
+ Hits          2442     2849     +407     
- Misses         266      285      +19     
Flag Coverage Δ
unittests 90.90% <95.44%> (+0.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/aws/ec2.py 100.00% <100.00%> (ø)
src/common/schedulers/slurm_commands.py 92.30% <100.00%> (+0.03%) ⬆️
src/common/time_utils.py 83.33% <100.00%> (+8.33%) ⬆️
src/common/utils.py 75.15% <100.00%> (+3.04%) ⬆️
src/slurm_plugin/capacity_block_manager.py 100.00% <100.00%> (ø)
src/slurm_plugin/clustermgtd.py 92.76% <100.00%> (+0.32%) ⬆️
src/slurm_plugin/fleet_manager.py 95.02% <100.00%> (+0.06%) ⬆️
src/slurm_plugin/slurm_resources.py 95.59% <100.00%> (+0.16%) ⬆️
...rc/common/schedulers/slurm_reservation_commands.py 98.85% <98.85%> (ø)
src/aws/common.py 77.90% <77.90%> (ø)


@enrico-usai enrico-usai added 3.x skip-security-exclusions-check Skip the checks regarding the security exclusions labels Nov 3, 2023
@enrico-usai enrico-usai changed the title [develop] Add support for Capacity Block reservations [develop] Add support for Capacity Blocks for ML Nov 3, 2023
@enrico-usai enrico-usai force-pushed the wip/cbr branch 2 times, most recently from f7149b8 to 0344d45 Compare November 3, 2023 15:39
@enrico-usai enrico-usai force-pushed the wip/cbr branch 3 times, most recently from e9d63b3 to 92d3331 Compare November 6, 2023 09:53
lukeseawalker
lukeseawalker previously approved these changes Nov 7, 2023
…ynamic nodes

Previously the logic was only applied to static nodes,
now the list of reserved nodes is evaluated for all the nodes.
Do not add reserved nodes to the list of all unhealthy nodes.

Added a new configuration parameter disable_capacity_blocks_management.

Signed-off-by: Enrico Usai <[email protected]>
Now all the main logic is in a try/except block.
The manager does not raise any exception; instead it keeps the
previous value for the list of reserved nodenames.

Added logic to manage AWSClientError when contacting boto3 and
converting it to CapacityBlockManagerError.

Signed-off-by: Enrico Usai <[email protected]>
…m reservations

Defined a new SlurmCommandError to identify errors coming from scontrol commands.
Defined a new SlurmCommandErrorHandler to handle SlurmCommandError and log messages.

Added retry (max attempts: 2, wait: 1s) and the SlurmCommandErrorHandler decorator
to all the reservation commands.

Improved is_slurm_reservation command to be able to parse stderr and stdout
to retrieve reservation information.

Now the main method `get_reserved_nodenames`, called by clustermgtd, cannot raise any exception.

If an error in `update_slurm_reservation` happens while updating a single capacity block/Slurm reservation,
it is caught and the loop continues, updating the others.

If there is a generic error like AWSClientError (wrapped into CapacityBlockManagerError),
the entire list of capacity_blocks and reserved_nodes won't be changed and an error is logged.

If cleanup_leftover_slurm_reservation fails, the failure is logged but the process continues.

Extended unit tests to cover command retries and error catching.

Signed-off-by: Enrico Usai <[email protected]>
…Nodes

From a SlurmNode perspective, "terminate_down/drain_nodes" does not make sense.
We can instead pass a flag to the is_healthy function to say whether we want to consider
down/drain nodes as unhealthy.

Signed-off-by: Enrico Usai <[email protected]>
Region is required by describe-capacity-reservations call.

Signed-off-by: Enrico Usai <[email protected]>
Do not log errors when checking for Slurm reservation existence
Do not log slurm commands

Signed-off-by: Enrico Usai <[email protected]>
It is an object, so it is not callable.
Fixing this I found that we were wrongly calling describe_capacity_reservations
by passing a dict rather than a list of ids.

Signed-off-by: Enrico Usai <[email protected]>
Start time is required and when adding a start time
the duration is required as well.

Signed-off-by: Enrico Usai <[email protected]>
Previously the node was considered reserved and removed from the unhealthy nodes
even if the Slurm reservation process was failing.
Now we check whether the Slurm reservation has been correctly created/updated.

If the Slurm reservation actions for ALL the CBs are failing,
we do not mark the nodes as reserved and do not update
the internal list of capacity blocks and the update time.

If one of the reservations is updated correctly,
the list of reserved nodes and the update-time attributes are updated.

Extended unit tests accordingly.

Signed-off-by: Enrico Usai <[email protected]>
Previously we were calling the scontrol command to update the reservation
every 10 minutes, but this is unnecessary.

The only case where we need to update the reservation is to update the list
of nodes; this can happen only at cluster update time, i.e. when Clustermgtd is restarted.
To identify this moment we use the `_capacity_blocks_update_time` attribute.

Defined a new `_is_initialized` method and using it for `_update_slurm_reservation`

Signed-off-by: Enrico Usai <[email protected]>
Previously we were checking whether `seconds_to_minutes()` was truthy.
Now we correctly check whether `seconds_to_minutes` is `> CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD`.

Signed-off-by: Enrico Usai <[email protected]>
In the log there was an entry for reserved instances saying:
```
WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1),
node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED
```

I'm removing the print for the reserved nodes.

Signed-off-by: Enrico Usai <[email protected]>
Previously every CB was associated with a single queue and compute resource.
Now the queue/compute-resource association is stored in an internal map.

Signed-off-by: Enrico Usai <[email protected]>