[develop] Add support for Capacity Blocks for ML #591
Merged
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           develop     #591      +/-   ##
===========================================
+ Coverage    90.17%   90.90%     +0.72%
===========================================
  Files           16       20         +4
  Lines         2708     3134       +426
===========================================
+ Hits          2442     2849       +407
- Misses         266      285        +19
```
- enrico-usai added the `3.x` and `skip-security-exclusions-check` (Skip the checks regarding the security exclusions) labels on Nov 3, 2023
- enrico-usai changed the title from "[develop] Add support for Capacity Block reservations" to "[develop] Add support for Capacity Blocks for ML" on Nov 3, 2023
- enrico-usai force-pushed the wip/cbr branch 2 times, most recently from f7149b8 to 0344d45 on November 3, 2023 at 15:39
- This was referenced Nov 3, 2023
- enrico-usai force-pushed the wip/cbr branch 3 times, most recently from e9d63b3 to 92d3331 on November 6, 2023 at 09:53
- lukeseawalker previously approved these changes on Nov 7, 2023
- enrico-usai force-pushed the wip/cbr branch 3 times, most recently from e235f4a to 99ca3ce on November 7, 2023 at 10:07
- lukeseawalker previously approved these changes on Nov 7, 2023
- enrico-usai force-pushed the wip/cbr branch 4 times, most recently from a7099ee to f9f2a57 on November 8, 2023 at 11:15
…ynamic nodes Previously the logic was only applied to static nodes; now the list of reserved nodes is evaluated for all the nodes. Do not add reserved nodes to the list of all unhealthy nodes. Added a new configuration parameter disable_capacity_blocks_management. Signed-off-by: Enrico Usai <[email protected]>
…_fleet_config Signed-off-by: Enrico Usai <[email protected]>
…pacity block Signed-off-by: Enrico Usai <[email protected]>
Now all the main logic is in a try/except block. The manager does not raise any exception; instead it keeps the previous value for the list of reserved nodenames. Added logic to manage AWSClientError when contacting boto3 and to convert it to CapacityBlockManagerError. Signed-off-by: Enrico Usai <[email protected]>
…m reservations Defined a new SlurmCommandError to identify errors coming from scontrol commands. Defined a new SlurmCommandErrorHandler to handle SlurmCommandError and log messages. Added retry (max-attempt: 2, wait: 1s) and the SlurmCommandErrorHandler decorator to all the reservation commands. Improved the is_slurm_reservation command to parse stderr and stdout to retrieve reservation information. The main method `get_reserved_nodenames`, called by clustermgtd, cannot raise any exception. If an error in `update_slurm_reservation` happens when updating a single capacity block/slurm reservation, it is caught and the loop continues, updating the others. If there is a generic error like AWSClientError (wrapped into CapacityBlockManagerError), the entire list of capacity_blocks and reserved_nodes won't be changed, logging an error. If cleanup_leftover_slurm_reservation fails, this will be logged, but the process will continue. Extended unit tests to cover command retries and error catching. Signed-off-by: Enrico Usai <[email protected]>
…eservation_name Signed-off-by: Enrico Usai <[email protected]>
Signed-off-by: Enrico Usai <[email protected]>
Signed-off-by: Enrico Usai <[email protected]>
…Nodes From a SlurmNode perspective "terminate_down/drain_nodes" does not make any sense. We can instead pass a flag to the is_healthy function to say whether we want to consider down/drain nodes as unhealthy. Signed-off-by: Enrico Usai <[email protected]>
Signed-off-by: Enrico Usai <[email protected]>
…dd unit tests Signed-off-by: Enrico Usai <[email protected]>
Region is required by describe-capacity-reservations call. Signed-off-by: Enrico Usai <[email protected]>
Do not log errors when checking for Slurm reservation existence. Do not log slurm commands. Signed-off-by: Enrico Usai <[email protected]>
It is an object, so it is not callable. While fixing this, I found that we were wrongly calling describe_capacity_reservations by passing a dict rather than a list of ids. Signed-off-by: Enrico Usai <[email protected]>
Start time is required and when adding a start time the duration is required as well. Signed-off-by: Enrico Usai <[email protected]>
Previously the node was considered as reserved and removed from unhealthy nodes, even if the slurm reservation process was failing. Now we're checking if the slurm reservation has been correctly created/updated. If the slurm reservation actions for ALL the CBs are failing, we're not marking the nodes as reserved and not updating the internal list of capacity blocks and update time. If one of the reservations is updated correctly, the list of reserved nodes and the update time attributes are updated. Extended unit tests accordingly. Signed-off-by: Enrico Usai <[email protected]>
Previously we were calling the scontrol command to update the reservation every 10 minutes, but this is useless. The only case where we need to update the reservation is to update the list of nodes; this can happen only at cluster update time, i.e. when Clustermgtd is restarted. To identify this moment we use the `_capacity_blocks_update_time` attribute. Defined a new `_is_initialized` method and used it for `_update_slurm_reservation`. Signed-off-by: Enrico Usai <[email protected]>
Previously we were checking if `seconds_to_minutes()` was `True`. Now we're correctly checking if `seconds_to_minutes` is `> CAPACITY_BLOCK_RESERVATION_UPDATE_PERIOD` Signed-off-by: Enrico Usai <[email protected]>
In the log there was an entry for reserved instances saying: ``` WARNING - Node state check: static node without nodeaddr set, node queue1-st-p5-1(queue1-st-p5-1), node state IDLE+CLOUD+MAINTENANCE+POWERED_DOWN+RESERVED ``` I'm removing the print for the reserved nodes. Signed-off-by: Enrico Usai <[email protected]>
Previously every CB was associated with a single queue and compute resource. Now the queue/compute-resource association is stored in an internal map. Signed-off-by: Enrico Usai <[email protected]>
- lukeseawalker approved these changes on Nov 9, 2023
- This was referenced Nov 10, 2023
## Capacity Block for ML
Capacity Block reservations for Machine Learning (CB) allow AWS users to reserve blocks of GPU capacity for a future start date and for a time duration of their choice. They are capacity reservations of type `capacity-block`, with additional attributes like `StartDate` and `EndDate`, and additional states.

Capacity Block instances are managed by ParallelCluster as static nodes, but in a peculiar way: ParallelCluster permits creating a cluster even if the CB reservation is not yet active and automatically launches the instances when it becomes active.

The Slurm nodes corresponding to compute resources associated with CB reservations are kept in maintenance until the CB start time is reached. While the CB is not active, the Slurm nodes stay in a Slurm reservation/maintenance state associated with the slurm admin user; this means the nodes can accept jobs, but jobs will stay pending until the reservation is removed.

Clustermgtd automatically creates/deletes Slurm reservations, putting the related CB nodes in maintenance or taking them out of maintenance according to the CB state. The Slurm reservation is removed when the CB becomes active; nodes then start and become available for pending jobs or new job submissions.

When the CB end time is reached, the nodes are put back into a reservation/maintenance state. It's up to the user to resubmit/requeue the jobs to a new queue/compute-resource when the CB time has ended.
### CapacityBlockManager

`CapacityBlockManager` is a new class that performs the following actions (a minimal sketch follows the list):

- The Slurm reservation name is `pcluster-{capacity_block_reservation_id}`.
- `CapacityBlockManager` is initialized by clustermgtd with region, boto3_config and fleet_config info.
- Fleet config is the existing `fleet-config.json` file, which has been extended to include capacity reservation ids and capacity type (ondemand vs spot vs capacity-block). The manager reloads fleet-config info and capacity reservation info every time the daemon is restarted or the config is modified.
- The current logic updates CB reservation info every 10 minutes.
- The manager removes the nodes associated with CB reservations that are not active from the list of unhealthy static nodenames.
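A minimal sketch of that last filtering step. Only `get_reserved_nodenames` and the `pcluster-` naming convention come from this PR; the function name, the argument it receives and the surrounding shape are illustrative assumptions, not the actual implementation:

```python
# Sketch only: how clustermgtd could drop nodes backed by inactive Capacity Blocks
# from the unhealthy static node list.
SLURM_RESERVATION_NAME_PREFIX = "pcluster-"  # pcluster-{capacity_block_reservation_id}


def filter_out_capacity_block_nodes(capacity_block_manager, unhealthy_static_nodenames):
    """Return unhealthy static nodenames, excluding nodes reserved for inactive CBs."""
    # get_reserved_nodenames is the manager entry point named in this PR;
    # the argument passed here is an assumption for the sake of the example.
    reserved = set(capacity_block_manager.get_reserved_nodenames(unhealthy_static_nodenames))
    return [nodename for nodename in unhealthy_static_nodenames if nodename not in reserved]
```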
### CapacityBlock

`CapacityBlock` is a new class that stores internal info about a capacity block, merging data coming from EC2 (e.g. the capacity block state) and from the config (i.e. the list of nodes associated with it).
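A rough sketch of the kind of data such a class can hold; the field and method names below are illustrative assumptions, not the actual attributes of the class:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapacityBlockSketch:
    """Illustrative container merging EC2 info and cluster-config info for a Capacity Block."""

    capacity_block_id: str  # EC2 capacity reservation id (type capacity-block)
    state: str  # capacity block state as returned by EC2, e.g. "active"
    queue_compute_resources: Dict[str, List[str]] = field(default_factory=dict)  # from fleet config
    nodenames: List[str] = field(default_factory=list)  # Slurm nodes backed by this CB

    def slurm_reservation_name(self) -> str:
        # Naming convention described in this PR: pcluster-{capacity_block_reservation_id}
        return f"pcluster-{self.capacity_block_id}"

    def is_active(self) -> bool:
        return self.state == "active"
```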
### Managing slurm reservations

Created a new set of commands (using scontrol) to manage the Slurm reservations, using `slurm` as the default user rather than root, given that "slurm" is the admin user in the ParallelCluster setup. The related maintenance/reservation state checks are exposed through the `SlurmNode` class.
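A hedged sketch of what such scontrol-based helpers can look like. The command strings follow standard scontrol reservation syntax; the function names, defaults and error handling here are illustrative rather than the PR's actual code:

```python
import subprocess


def _run_scontrol(args):
    """Run an scontrol command and return stdout; raises CalledProcessError on failure."""
    return subprocess.run(["scontrol"] + args, check=True, capture_output=True, text=True).stdout


def create_slurm_reservation(name, nodes, user="slurm", start_time="now", duration="infinite"):
    # flags=maint keeps the nodes in MAINTENANCE+RESERVED state while the CB is not active.
    _run_scontrol([
        "create", "reservation", f"ReservationName={name}", f"users={user}",
        f"nodes={nodes}", f"starttime={start_time}", f"duration={duration}", "flags=maint",
    ])


def update_slurm_reservation(name, nodes):
    _run_scontrol(["update", f"ReservationName={name}", f"nodes={nodes}"])


def delete_slurm_reservation(name):
    _run_scontrol(["delete", f"ReservationName={name}"])


def is_slurm_reservation(name):
    """Check reservation existence by looking at the scontrol return code."""
    try:
        _run_scontrol(["show", "reservation", name])
        return True
    except subprocess.CalledProcessError:
        return False
```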
### Leftover slurm reservations

When a CB is removed from the cluster config during a cluster update, we need to remove the related Slurm reservations. We retrieve all the Slurm reservations and delete the ones that are no longer associated with existing CBs in the config.
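A minimal sketch of this cleanup idea, reusing the `pcluster-` naming convention above; the parsing and function shape are assumptions, not the PR's actual code:

```python
import re
import subprocess


def cleanup_leftover_slurm_reservations(capacity_block_ids):
    """Delete pcluster-managed Slurm reservations whose Capacity Block is no longer configured."""
    expected = {f"pcluster-{cb_id}" for cb_id in capacity_block_ids}
    output = subprocess.run(
        ["scontrol", "show", "reservations"], check=True, capture_output=True, text=True
    ).stdout
    for name in re.findall(r"ReservationName=(\S+)", output):
        # Only touch reservations created by ParallelCluster that no longer match a configured CB.
        if name.startswith("pcluster-") and name not in expected:
            subprocess.run(["scontrol", "delete", f"ReservationName={name}"], check=True)
```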
### Nodes in maintenance state

After this patch, nodes in `MAINTENANCE+RESERVED` state are excluded from the static nodes replacement list. The "replacement list" will contain only nodes that are not yet up (as before) and nodes that are not in maintenance.

This mechanism permits the daemons (or the user) to avoid replacement of turned-down static nodes (`DOWN`) by putting them in maintenance as a preliminary step. This works only for nodes in both `MAINTENANCE` and `RESERVED` state; nodes only in `RESERVED` state will be replaced as before. In the future, we can extend this mechanism with an additional field to check (e.g. skip the nodes only if the maintenance was set by the `root` user).
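A sketch of the state check this implies. Slurm reports compound states such as `IDLE+CLOUD+MAINTENANCE+RESERVED`; the real check lives in the `SlurmNode` class, so the `state` attribute used below is an assumption for illustration:

```python
def is_in_maintenance_and_reserved(node_state: str) -> bool:
    """True only when BOTH the maintenance and reserved flags are present in the node state."""
    flags = set(node_state.upper().split("+"))
    return "MAINTENANCE" in flags and "RESERVED" in flags


def static_nodes_to_replace(unhealthy_static_nodes):
    # Keep the pre-existing behaviour (replace nodes that are not yet up) but skip nodes
    # that were deliberately put in maintenance, e.g. for an inactive Capacity Block.
    return [node for node in unhealthy_static_nodes if not is_in_maintenance_and_reserved(node.state)]
```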
### boto3 layer

Added a boto3 layer; this code is taken from the CLI, with the caching mechanism removed. This decouples the node daemons code from boto3 calls and adds an exception-catching mechanism.
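A hedged sketch of the kind of wrapper such a layer provides, built on the standard boto3 EC2 `describe_capacity_reservations` call. The exception name `AWSClientError` comes from this PR; the function shape is illustrative:

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError


class AWSClientError(Exception):
    """Error raised by the boto3 layer instead of leaking botocore exceptions to the daemons."""


def describe_capacity_reservations(capacity_reservation_ids, region, boto3_config=None):
    """Return Capacity Reservation details (including capacity-block state, StartDate, EndDate)."""
    try:
        ec2 = boto3.client("ec2", region_name=region, config=boto3_config)
        response = ec2.describe_capacity_reservations(CapacityReservationIds=capacity_reservation_ids)
        return response["CapacityReservations"]
    except (ClientError, BotoCoreError) as e:
        # Convert low-level errors so callers only deal with AWSClientError.
        raise AWSClientError(f"describe_capacity_reservations failed: {e}") from e
```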
### Managing exceptions
- Defined a new `SlurmCommandError` to identify errors coming from scontrol commands.
- Defined a new `SlurmCommandErrorHandler` to handle `SlurmCommandError` and log messages.
- Added retry (max attempts: 2, wait: 1s) and the `SlurmCommandErrorHandler` decorator to all the reservation commands (a minimal sketch follows this list).
- The main method of the `CapacityBlockManager` (`get_reserved_nodenames`), called by clustermgtd, cannot raise any exception.
- If an error in `update_slurm_reservation` happens while updating a single capacity block/Slurm reservation, it is caught and the loop continues, updating the others.
- If there is a generic error like `AWSClientError` (wrapped into `CapacityBlockManagerError`), the entire list of `capacity_blocks` and `reserved_nodes` is left unchanged and an error is logged.
- If `cleanup_leftover_slurm_reservation` fails, this is logged, but the process continues.
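A minimal sketch of the retry/handler combination described above, written as plain decorators for illustration; the actual decorators, libraries and signatures in the PR may differ:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


class SlurmCommandError(Exception):
    """Error raised when an scontrol command fails."""


def retry_on_slurm_error(max_attempts=2, wait_seconds=1):
    """Retry decorator mirroring the 'max attempts: 2, wait: 1s' policy described above."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except SlurmCommandError:
                    if attempt == max_attempts:
                        raise
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


def slurm_command_error_handler(func):
    """Log SlurmCommandError with context before re-raising, so callers can decide what to do."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except SlurmCommandError as e:
            logger.error("Failed when running %s: %s", func.__name__, e)
            raise
    return wrapper


@retry_on_slurm_error()
@slurm_command_error_handler
def update_reservation_example(name, nodes):
    ...  # would call the scontrol update helper sketched earlier
```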
### FleetManager changes
The main difference between on-demand, spot and capacity-block instances is that capacity-block requires `MarketType=capacity-block`, but this is added to the Launch Template at generation time by the CLI/CDK.

The capacity reservation id and capacity type info for the compute resources are saved in the `fleet-config.json` file, which is generated by the cookbook according to the cluster configuration. From this file the node is able to understand whether a compute resource is associated with a CB reservation and to add the corresponding additional info when launching instances.

The FleetManager has been modified to support an empty `AllocationStrategy`, because CB does not support any of the existing options.
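For illustration only, a hedged sketch of launching into a Capacity Block with plain boto3; the actual FleetManager code path and parameters may differ, and `MarketType=capacity-block` itself is expected to come from the Launch Template as noted above:

```python
import boto3


def run_capacity_block_instances(region, launch_template_id, capacity_reservation_id, count=1):
    """Launch instances targeting a specific Capacity Block reservation (illustrative only)."""
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(
        MinCount=count,
        MaxCount=count,
        LaunchTemplate={"LaunchTemplateId": launch_template_id},
        # Explicitly target the Capacity Block; the capacity-block market type
        # is assumed to be baked into the Launch Template by the CLI/CDK.
        CapacityReservationSpecification={
            "CapacityReservationTarget": {"CapacityReservationId": capacity_reservation_id}
        },
    )
```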
### Tests

### References
### Logging

Slurm reservation creation:

- No time to update: `capacity_block_manager` doesn't do anything.
- Time to update, because the update period has passed (I set it to 2 minutes for testing the behaviour).