
[develop] Add support for Capacity Blocks for ML #591

Merged
merged 36 commits into from
Nov 9, 2023

Conversation

enrico-usai
Contributor

@enrico-usai enrico-usai commented Nov 3, 2023

Capacity Block for ML

Capacity Block reservations for Machine Learning (CB) allow AWS users to reserve blocks of GPU capacity for a future start date and for a duration of their choice. They are capacity reservations of type capacity-block, with additional attributes such as StartDate and EndDate, and additional states.

Capacity Block instances are managed by ParallelCluster as static nodes, but in a peculiar way: ParallelCluster permits creating a cluster even if the CB reservation is not yet active, and automatically launches the instances when it becomes active.

The Slurm nodes corresponding to compute resources associated with CB reservations are kept in maintenance until the CB start time is reached. While the CB is not active, the Slurm nodes stay in a Slurm reservation/maintenance state associated with the slurm admin user: the nodes can accept jobs, but the jobs stay pending until the reservation is removed.

Clustermgtd automatically creates/deletes Slurm reservations, putting the related CB nodes in or out of maintenance according to the CB state. The Slurm reservation is removed when the CB becomes active; the nodes then start and become available for pending jobs or new job submissions.

When the CB end time is reached, the nodes are put back into the reservation/maintenance state. It is up to the user to resubmit/requeue jobs to a new queue/compute resource when the CB time window ends.

CapacityBlockManager

CapacityBlockManager is a new class that performs the following actions:

  • when the CB is not active or is expired, a new Slurm reservation is created/updated
  • when the CB is active, the Slurm reservation is removed and the nodes become standard static instances

The Slurm reservation name will be pcluster-{capacity_block_reservation_id}.
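The two actions above can be sketched as follows. This is a minimal illustration under assumed names (`slurm_reservation_name` and `action_for_state` are hypothetical helpers, not the actual ParallelCluster functions):

```python
# Minimal sketch of the CapacityBlockManager decision logic described above.
# The state names follow the EC2 Capacity Block lifecycle; the helper names
# are hypothetical, not the actual ParallelCluster implementation.

SLURM_RESERVATION_PREFIX = "pcluster-"


def slurm_reservation_name(capacity_block_id: str) -> str:
    """Build the Slurm reservation name for a given Capacity Block id."""
    return f"{SLURM_RESERVATION_PREFIX}{capacity_block_id}"


def action_for_state(cb_state: str) -> str:
    """Decide what to do with the Slurm reservation for a given CB state."""
    if cb_state == "active":
        # Capacity is available: drop the reservation so the nodes can start.
        return "delete-reservation"
    # scheduled, expired, etc.: keep the nodes in maintenance.
    return "create-or-update-reservation"
```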

CapacityBlockManager will be initialized by clustermgtd with region, boto3_config and fleet_config info.
The fleet config is the existing fleet-config.json file, which has been extended to include capacity reservation ids and the capacity type (on-demand vs spot vs capacity-block).
The manager reloads fleet-config and capacity reservation info every time the daemon is restarted
or the config is modified.
The current logic refreshes CB reservation info every 10 minutes.

The manager removes the nodes associated with inactive CB reservations
from the list of unhealthy static nodenames.
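The 10-minute refresh mentioned above boils down to a simple elapsed-time check; a sketch, with the constant name taken from the PR discussion and the function names assumed:

```python
# Sketch of the periodic-refresh check. CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD
# appears in the PR discussion; the function names are assumptions.
CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD = 10  # minutes


def seconds_to_minutes(seconds: float) -> float:
    return seconds / 60


def is_time_to_update_capacity_blocks_info(seconds_since_last_update: float) -> bool:
    # Compare elapsed minutes against the period (not just truthiness).
    return seconds_to_minutes(seconds_since_last_update) > CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD
```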

CapacityBlock

CapacityBlock is a new class that stores internal info about a capacity block,
merging data from EC2 (e.g. the capacity block state) and from the config (i.e. the list of nodes associated with it).
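A minimal sketch of what such a class could hold; the field and method names are assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapacityBlock:
    """Internal info about a Capacity Block, merging EC2 and config data (sketch)."""

    capacity_block_id: str
    state: str = "scheduled"  # from ec2:DescribeCapacityReservations
    # queue/compute-resource -> nodenames, from the fleet config
    nodes_by_compute_resource: Dict[str, List[str]] = field(default_factory=dict)

    def add_nodes(self, queue: str, compute_resource: str, nodenames: List[str]) -> None:
        self.nodes_by_compute_resource[f"{queue}/{compute_resource}"] = nodenames

    def all_nodenames(self) -> List[str]:
        return [n for nodes in self.nodes_by_compute_resource.values() for n in nodes]

    def is_active(self) -> bool:
        return self.state == "active"
```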

Managing slurm reservations

Created a new set of commands (using scontrol) to manage Slurm reservations:

  • create a reservation
    • Use slurm as the default user rather than root, given that "slurm" is the admin user in the ParallelCluster setup.
  • update an existing reservation
    • Common steps to populate the create and update commands were factored out, since the two are very similar.
  • check if a reservation exists
  • delete a reservation
  • added the reservation_name attribute to the SlurmNode class
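Such a wrapper could look like the sketch below. The scontrol option syntax is standard Slurm, but the helper names and default values are assumptions:

```python
import subprocess

SCONTROL = "/opt/slurm/bin/scontrol"


def create_reservation_cmd(name, nodes, start_time="now", duration="infinite", user="slurm"):
    """Build the scontrol command to create a maintenance reservation (sketch)."""
    return [
        "sudo", SCONTROL, "create", "reservation",
        f"ReservationName={name}",
        f"Nodes={nodes}",
        f"StartTime={start_time}",
        f"Duration={duration}",
        f"Users={user}",  # "slurm" is the admin user in the ParallelCluster setup
        "Flags=MAINT",    # puts the nodes in MAINTENANCE+RESERVED state
    ]


def run_scontrol(cmd):
    """Run an scontrol command, capturing output for later parsing."""
    return subprocess.run(cmd, capture_output=True, text=True)
```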

Leftover slurm reservations

When a CB is removed from the cluster config during a cluster update, we need to remove the related Slurm reservations.
We retrieve all the Slurm reservations from Slurm and delete the ones no longer associated with existing CBs in the config.
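The cleanup reduces to a set difference; a sketch with assumed names:

```python
# Sketch of the leftover-reservation cleanup: delete every pcluster-managed
# Slurm reservation whose Capacity Block no longer appears in the config.
def find_leftover_reservations(slurm_reservation_names, config_capacity_block_ids,
                               prefix="pcluster-"):
    expected = {f"{prefix}{cb_id}" for cb_id in config_capacity_block_ids}
    return [
        name for name in slurm_reservation_names
        if name.startswith(prefix) and name not in expected  # ours, but no longer in config
    ]
```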

Nodes in maintenance state

After this patch, nodes in MAINTENANCE+RESERVED state are excluded from the list of static nodes eligible for replacement.
The "replacement list" now contains only nodes that are not yet up (as before) and nodes that are not in maintenance.

This mechanism permits the daemons (or the user) to avoid replacement of turned-down static nodes (DOWN) by putting them in maintenance as a preliminary step.
It works only for nodes in both MAINTENANCE and RESERVED state; nodes only in RESERVED state are replaced as before.

In the future, we can extend this mechanism to have an additional field to check (e.g. skip them only if the maintenance is set by the root user).
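The replacement rule above can be expressed as a small predicate; a sketch, not the actual clustermgtd code:

```python
# Sketch: a DOWN static node is skipped for replacement only when it carries
# BOTH the MAINTENANCE and RESERVED flags; RESERVED alone is not enough.
def should_replace_down_node(node_states: set) -> bool:
    in_maintenance = {"MAINTENANCE", "RESERVED"}.issubset(node_states)
    return "DOWN" in node_states and not in_maintenance
```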

boto3 layer

Added a boto3 layer; this code is taken from the CLI, with the caching mechanism removed.
It decouples the node daemon code from boto3 calls and adds an exception-catching mechanism.

Managing exceptions

Defined a new SlurmCommandError to identify errors coming from scontrol commands.
Defined a new SlurmCommandErrorHandler to handle SlurmCommandError and log messages.

Added a retry (max attempts: 2, wait: 1s) and the SlurmCommandErrorHandler decorator to all the reservation commands.
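A minimal version of that retry behaviour could look like this (a sketch; the actual code's decorator names and retry library may differ):

```python
import functools
import time


class SlurmCommandError(Exception):
    """Error coming from an scontrol command."""


def retry_slurm_command(max_attempts=2, wait_seconds=1):
    """Retry a reservation command on SlurmCommandError (sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except SlurmCommandError:
                    if attempt == max_attempts:
                        raise  # out of attempts: propagate to the error handler
                    time.sleep(wait_seconds)
        return wrapper
    return decorator
```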

The main method of the CapacityBlockManager (get_reserved_nodenames), called by clustermgtd, cannot raise any exception.
If an error in update_slurm_reservation happens while updating a single capacity block/Slurm reservation,
it is caught and the loop continues, updating the others.

If there is a generic error like AWSClientError (wrapped into CapacityBlockManagerError), the entire list of capacity_blocks and reserved_nodes is left unchanged and an error is logged.

If cleanup_leftover_slurm_reservation fails, the failure is logged but the process continues.

FleetManager changes

The main difference between on-demand, spot and capacity-block instances is that capacity-block requires MarketType=capacity-block, but this is added to the Launch Template at generation time by the CLI/CDK.
Capacity reservation id and capacity type info for the compute resources are saved in the fleet-config.json file, which is generated by the cookbook according to the cluster configuration.

Based on this file, the node can tell whether the compute resource is associated with a CB reservation, and adds the following additional info:

        "OnDemandOptions": {
            ...
            "CapacityReservationOptions": {"UsageStrategy": "use-capacity-reservations-first"},
        },
        "TargetCapacitySpecification": {...."DefaultTargetCapacityType": "capacity-block"},
    }

The FleetManager has been modified to support an empty AllocationStrategy, because CB does not support any of the existing options.
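Putting the pieces above together, the capacity-type-dependent part of the CreateFleet request could be assembled like this (a sketch; the field names follow the EC2 CreateFleet API, while the helper name is an assumption):

```python
def build_fleet_request(capacity_type: str, target_capacity: int) -> dict:
    """Assemble the capacity-type-dependent part of an EC2 CreateFleet request (sketch)."""
    request = {
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": capacity_type,  # "on-demand", "spot" or "capacity-block"
        },
    }
    if capacity_type == "capacity-block":
        # Consume the targeted Capacity Block reservation; no AllocationStrategy is set.
        request["OnDemandOptions"] = {
            "CapacityReservationOptions": {"UsageStrategy": "use-capacity-reservations-first"},
        }
    return request
```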

Tests

  • The new and modified code is verified by unit tests.

References

Logging

Slurm reservation creation

2023-11-07 12:43:55,361 - [slurm_plugin.capacity_block_manager:_retrieve_capacity_blocks_from_fleet_config] - INFO - Retrieving Capacity Blocks from fleet configuration.
2023-11-07 12:43:55,361 - [slurm_plugin.capacity_block_manager:_update_capacity_blocks_info_from_ec2] - INFO - Retrieving Capacity Blocks information from EC2 for cr-0296d8df657e57a7b
2023-11-07 12:43:55,384 - [aws.common:_log_boto3_calls] - INFO - Executing boto3 call: region=us-east-2, service=ec2, operation=DescribeCapacityReservations, params={'CapacityReservationIds': ['cr-0296d8df657e57a7b']}
2023-11-07 12:43:55,512 - [common.schedulers.slurm_reservation_commands:is_slurm_reservation] - INFO - Slurm reservation pcluster-cr-0296d8df657e57a7b not found.
2023-11-07 12:43:55,513 - [slurm_plugin.capacity_block_manager:_log_cb_info] - INFO - Capacity Block reservation cr-0296d8df657e57a7b is in state scheduled. Creating Slurm reservation pcluster-cr-0296d8df657e57a7b for nodes queue1-st-p5-1.
2023-11-07 12:43:55,513 - [common.schedulers.slurm_reservation_commands:create_slurm_reservation] - INFO - Creating Slurm reservation with command: sudo /opt/slurm/bin/scontrol create reservation
Reservation created: pcluster-cr-0296d8df657e57a7b
2023-11-07 12:43:55,573 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes queue1-st-p5-1 are associated with unactive Capacity Blocks, they will not be considered as unhealthy nodes.
2023-11-07 12:43:55,574 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state DOWN+CLOUD+NOT_RESPONDING+POWERING_UP:
2023-11-07 12:43:55,574 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

Not yet time to update: capacity_block_manager does nothing.

2023-11-07 17:30:51,134 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-11-07 17:30:51,135 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes associated with inactive Capacity Blocks and not considered as unhealthy nodes are: queue1-st-p5-1
2023-11-07 17:30:51,135 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED:
2023-11-07 17:30:51,135 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

Time to update, because the update period has passed (I set it to 2 minutes to test the behaviour):

2023-11-07 17:31:51,163 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2023-11-07 17:31:51,163 - [slurm_plugin.capacity_block_manager:_retrieve_capacity_blocks_from_fleet_config] - INFO - Retrieving Capacity Blocks from fleet configuration.
2023-11-07 17:31:51,163 - [slurm_plugin.capacity_block_manager:_update_capacity_blocks_info_from_ec2] - INFO - Retrieving Capacity Blocks information from EC2 for cr-0296d8df657e57a7b
2023-11-07 17:31:51,177 - [aws.common:_log_boto3_calls] - INFO - Executing boto3 call: region=us-east-2, service=ec2, operation=DescribeCapacityReservations, params={'CapacityReservationIds': ['cr-0296d8df657e57a7b']}
2023-11-07 17:31:51,310 - [slurm_plugin.capacity_block_manager:_log_cb_info] - INFO - Capacity Block reservation cr-0296d8df657e57a7b is in state scheduled. Nothing to do. Already existing Slurm reservation pcluster-cr-0296d8df657e57a7b for nodes queue1-st-p5-1.
2023-11-07 17:31:51,337 - [slurm_plugin.clustermgtd:_find_unhealthy_slurm_nodes] - INFO - The nodes associated with inactive Capacity Blocks and not considered as unhealthy nodes are: queue1-st-p5-1
2023-11-07 17:31:51,338 - [slurm_plugin.slurm_resources:_is_static_node_ip_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED:
2023-11-07 17:31:51,338 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

@enrico-usai enrico-usai requested review from a team as code owners November 3, 2023 14:37

codecov bot commented Nov 3, 2023

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (7ad9fa8) 90.17% compared to head (5508027) 90.90%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #591      +/-   ##
===========================================
+ Coverage    90.17%   90.90%   +0.72%     
===========================================
  Files           16       20       +4     
  Lines         2708     3134     +426     
===========================================
+ Hits          2442     2849     +407     
- Misses         266      285      +19     
Flag Coverage Δ
unittests 90.90% <95.44%> (+0.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/aws/ec2.py 100.00% <100.00%> (ø)
src/common/schedulers/slurm_commands.py 92.30% <100.00%> (+0.03%) ⬆️
src/common/time_utils.py 83.33% <100.00%> (+8.33%) ⬆️
src/common/utils.py 75.15% <100.00%> (+3.04%) ⬆️
src/slurm_plugin/capacity_block_manager.py 100.00% <100.00%> (ø)
src/slurm_plugin/clustermgtd.py 92.76% <100.00%> (+0.32%) ⬆️
src/slurm_plugin/fleet_manager.py 95.02% <100.00%> (+0.06%) ⬆️
src/slurm_plugin/slurm_resources.py 95.59% <100.00%> (+0.16%) ⬆️
...rc/common/schedulers/slurm_reservation_commands.py 98.85% <98.85%> (ø)
src/aws/common.py 77.90% <77.90%> (ø)


@enrico-usai enrico-usai added 3.x skip-security-exclusions-check Skip the checks regarding the security exclusions labels Nov 3, 2023
@enrico-usai enrico-usai changed the title [develop] Add support for Capacity Block reservations [develop] Add support for Capacity Blocks for ML Nov 3, 2023
@enrico-usai enrico-usai force-pushed the wip/cbr branch 2 times, most recently from f7149b8 to 0344d45 Compare November 3, 2023 15:39
@enrico-usai enrico-usai force-pushed the wip/cbr branch 3 times, most recently from e9d63b3 to 92d3331 Compare November 6, 2023 09:53
lukeseawalker
lukeseawalker previously approved these changes Nov 7, 2023
…ynamic nodes

Previously the logic was only applied to static nodes,
now the list of reserved nodes is evaluated for all the nodes.
Do not add reserved nodes to the list of all unhealthy nodes.

Added a new configuration parameter disable_capacity_blocks_management.

Signed-off-by: Enrico Usai <[email protected]>
Now all the main logic is in a try/except block.
The manager does not raise any exception; instead it keeps the
previous value for the list of reserved nodenames.

Added logic to manage AWSClientError when contacting boto3 and
converting it to CapacityBlockManagerError.

Signed-off-by: Enrico Usai <[email protected]>
…m reservations

Defined a new SlurmCommandError to identify errors coming from scontrol commands.
Defined a new SlurmCommandErrorHandler to handle SlurmCommandError and log messages.

Added retry (max attempts: 2, wait: 1s) and the SlurmCommandErrorHandler decorator
to all the reservation commands.

Improved is_slurm_reservation command to be able to parse stderr and stdout
to retrieve reservation information.

Now the main method `get_reserved_nodenames`, called by clustermgtd, cannot raise any exception.

If an error in `update_slurm_reservation` happens while updating a single capacity block/Slurm reservation,
it is caught and the loop continues, updating the others.

If there is a generic error like AWSClientError (wrapped into CapacityBlockManagerError),
the entire list of capacity_blocks and reserved_nodes won't be changed and an error is logged.

If cleanup_leftover_slurm_reservation fails, the failure is logged but the process continues.

Extended unit tests to cover command retries and error catching.

Signed-off-by: Enrico Usai <[email protected]>
…Nodes

From a SlurmNode perspective, "terminate_down/drain_nodes" does not make sense.
We can instead pass a flag to the is_healthy function to say whether we want to consider
down/drain nodes as unhealthy.

Signed-off-by: Enrico Usai <[email protected]>
Region is required by describe-capacity-reservations call.

Signed-off-by: Enrico Usai <[email protected]>
Do not log errors when checking for Slurm reservation existence
Do not log slurm commands

Signed-off-by: Enrico Usai <[email protected]>
It is an object, so it is not callable.
Fixing this I found that we were wrongly calling describe_capacity_reservations
by passing a dict rather than a list of ids.

Signed-off-by: Enrico Usai <[email protected]>
Start time is required and when adding a start time
the duration is required as well.

Signed-off-by: Enrico Usai <[email protected]>
Previously the node was considered reserved and removed from the unhealthy nodes
even if the Slurm reservation process was failing.
Now we check whether the Slurm reservation has been correctly created/updated.

If the Slurm reservation actions for ALL the CBs are failing,
we do not mark the nodes as reserved and do not update
the internal list of capacity blocks and the update time.

If one of the reservations is updated correctly,
the list of reserved nodes and the update-time attributes are updated.

Extended unit tests accordingly.

Signed-off-by: Enrico Usai <[email protected]>
Previously we were calling the scontrol command to update the reservation
every 10 minutes, but this is unnecessary.

The only case where we need to update the reservation is to update the list
of nodes; this can happen only at cluster update time, i.e. when Clustermgtd is restarted.
To identify this moment we use the `_capacity_blocks_update_time` attribute.

Defined a new `_is_initialized` method and using it for `_update_slurm_reservation`

Signed-off-by: Enrico Usai <[email protected]>
Previously we were checking whether `seconds_to_minutes()` was truthy.
Now we correctly check whether `seconds_to_minutes` is `> CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD`.

Signed-off-by: Enrico Usai <[email protected]>
In the log there was an entry for reserved instances saying:
```
WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1),
node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED
```

I'm removing the print for the reserved nodes.

Signed-off-by: Enrico Usai <[email protected]>
Previously every CB was associated with a single queue and compute resource.
Now the queue/compute-resource association is stored in an internal map.

Signed-off-by: Enrico Usai <[email protected]>