[develop] Add best-effort launch strategy for job-level scaling #560
Merged
Conversation
Codecov Report
Additional details and impacted files:
@@            Coverage Diff             @@
##           develop     #560      +/-  ##
===========================================
+ Coverage    89.53%   89.70%   +0.17%
===========================================
  Files           16       16
  Lines         2656     2681      +25
===========================================
+ Hits          2378     2405      +27
+ Misses         278      276       -2
lukeseawalker force-pushed the wip/nodeSharingJLS branch from bbf3251 to d12fbe3 on September 14, 2023 09:31
lukeseawalker force-pushed the wip/nodeSharingJLS branch 3 times, most recently from 963c387 to bb6aaf3 on September 15, 2023 09:53
NSsirena previously approved these changes on Sep 15, 2023
lukeseawalker force-pushed the wip/nodeSharingJLS branch from bb6aaf3 to bcbc3f0 on September 15, 2023 14:41
Add best-effort launch strategy for job-level scaling. All-or-nothing is now the default (all_or_nothing_batch = True); when the flag is set to False, a best-effort launch is performed instead. Small refactoring of log string messages.

Tests done. Given the following submission commands:

```
sbatch --wrap "sleep 10" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4 --exclusive
sbatch --wrap "sleep 10" -N 2 --constraint="[(c5.4xlarge)*1&(p4d.24xlarge)*1]" -p q4 --exclusive
sbatch --wrap "sleep 10" -N 3 --constraint="[(c5.4xlarge)*3]" -p q4 --exclusive
```

where there is capacity for c5.4xlarge but not for p4d.24xlarge, both scaling strategies were tested.

all_or_nothing_batch = true
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*
resume log:

```
2023-09-14 09:03:09,530 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup. 2023-09-14 09:03:09,531 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf 2023-09-14 09:03:09,533 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=True, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f75379b6d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a') 2023-09-14 09:03:09,533 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 185, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 186,
'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 187, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[5-7]', 'nodes_resume': 'q4-dy-c4-1-[5-7]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2]'} 2023-09-14 09:03:09,537 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:02:27.308559+00:00 2023-09-14 09:03:09,538 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2] 2023-09-14 09:03:09,594 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-5', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-6', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-7', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')] 2023-09-14 09:03:09,620 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU 2023-09-14 09:03:09,660 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7'] 2023-09-14 09:03:09,661 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 7, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:03:12,930 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa', 'i-03bdd4851aa584786', 'i-0b5adaef26df1187d', 'i-08584b017f57195b0'] 2023-09-14 09:03:12,931 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] 2023-09-14 09:03:12,931 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. 
Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 2, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:03:13,971 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (7e76aa68-8d69-42a8-bead-7de1a50f9037): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 185 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 185 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 185 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-14 09:03:14,072 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 185 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:03:15,050 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 185 - Error in CreateFleet request (044cbd43-2925-4874-af52-40ca1240e179): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 185 - Releasing booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 186 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2'] 2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 186 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']: 2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 186 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-2'] 2023-09-14 09:03:15,152 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 186 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:03:16,162 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 186 - Error in CreateFleet request (f1829b1d-4426-4dfa-8f27-3cf306b784e1): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 186 - Releasing booked instances (x1) ["('q4', 'c4-1', EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 187 - The job nodes_resume list is (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7'] 2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 187 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']: 2023-09-14 09:03:16,280 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 187 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-5', EC2Instance(id='i-03bdd4851aa584786', private_ip='192.168.107.163', hostname='ip-192-168-107-163', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-6', EC2Instance(id='i-0b5adaef26df1187d', private_ip='192.168.106.37', hostname='ip-192-168-106-37', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-7', EC2Instance(id='i-08584b017f57195b0', private_ip='192.168.110.115', hostname='ip-192-168-110-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:03:16,281 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Saving assigned hostnames in DynamoDB 2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Database update: COMPLETED 2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 
2023-09-14 09:03:16,652 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - DNS records update: COMPLETED 2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 187 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7'] 2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:_terminate_unassigned_launched_instances] - INFO - Terminating unassigned launched instances: {'q4': {'c4-1': [EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None)]}} 2023-09-14 09:03:16,662 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x4) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa'] 2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7'] 2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x6) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-2-2', 'q4-dy-c4-2-1', 'q4-dy-c4-1-2'] 2023-09-14 09:03:17,131 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-2', 'q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes 2023-09-14 09:03:17,149 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-2'] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes 2023-09-14 09:03:17,169 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```

all_or_nothing_batch = false
expected nodes running at the end of the resume call: (x7) q4-dy-c4-1-*
resume log:

```
2023-09-14 09:08:09,554 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-14 09:08:09,555 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf 2023-09-14 09:08:09,556 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=False, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7fed57aa1d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a') 2023-09-14 09:08:09,557 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 188, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 189, 'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 190, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[8-10]', 'nodes_resume': 'q4-dy-c4-1-[8-10]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2]'} 2023-09-14 09:08:09,561 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:07:27.471205+00:00 2023-09-14 09:08:09,561 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2] 2023-09-14 09:08:09,616 - 
[slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-8', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-9', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-10', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')] 2023-09-14 09:08:09,643 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU 2023-09-14 09:08:09,683 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10'] 2023-09-14 09:08:09,683 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:08:12,914 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-0250d4e661b9d86eb', 'i-0d23930fc5b09fd33', 'i-07dad6e5f1eed664d', 'i-0ad5528556d13495b', 'i-0365529c953588fab', 'i-03a19e86c0d73e84b', 'i-05b6109e7c0940a9c'] 2023-09-14 09:08:12,915 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] 2023-09-14 09:08:12,915 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:08:14,152 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (af6b0eb4-086f-46ad-b58b-6c5f811d8280): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 188 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 188 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 188 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-14 09:08:14,254 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 188 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:08:15,274 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 188 - Error in CreateFleet request (ff2ac807-49a8-41b4-8af9-2dcea2ed6dfb): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 188 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-0250d4e661b9d86eb', private_ip='192.168.111.231', hostname='ip-192-168-111-231', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-0d23930fc5b09fd33', private_ip='192.168.110.38', hostname='ip-192-168-110-38', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-07dad6e5f1eed664d', private_ip='192.168.104.249', hostname='ip-192-168-104-249', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Saving assigned hostnames in DynamoDB 2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Database update: COMPLETED 2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 
2023-09-14 09:08:15,743 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - DNS records update: COMPLETED 2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 188 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 189 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2'] 2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 189 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']: 2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 189 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-2'] 2023-09-14 09:08:15,744 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 189 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-14 09:08:16,696 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 189 - Error in CreateFleet request (63180bc8-cad1-4754-a1c6-0b93fdd36461): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 189 - Nodes are now configured with instances: (x1) ["('q4-dy-c4-1-4', EC2Instance(id='i-0ad5528556d13495b', private_ip='192.168.104.152', hostname='ip-192-168-104-152', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Saving assigned hostnames in DynamoDB 2023-09-14 09:08:16,821 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Database update: COMPLETED 2023-09-14 09:08:16,822 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 
2023-09-14 09:08:16,950 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - DNS records update: COMPLETED 2023-09-14 09:08:16,951 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 189 - Successful launched partial instances for nodes (x1) ['q4-dy-c4-1-4'] 2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 190 - The job nodes_resume list is (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10'] 2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 190 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']: 2023-09-14 09:08:16,969 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 190 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-8', EC2Instance(id='i-0365529c953588fab', private_ip='192.168.108.102', hostname='ip-192-168-108-102', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-9', EC2Instance(id='i-03a19e86c0d73e84b', private_ip='192.168.105.222', hostname='ip-192-168-105-222', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-10', EC2Instance(id='i-05b6109e7c0940a9c', private_ip='192.168.111.72', hostname='ip-192-168-111-72', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-14 09:08:16,970 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Saving assigned hostnames in DynamoDB 2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Database update: COMPLETED 2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 2023-09-14 09:08:17,141 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - DNS records update: COMPLETED 2023-09-14 09:08:17,142 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 190 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10'] 2023-09-14 09:08:17,142 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10'] 2023-09-14 09:08:17,143 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] 2023-09-14 09:08:17,143 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes 2023-09-14 09:08:17,162 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```

Signed-off-by: Luca Carrogu <[email protected]>
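The practical difference between the two strategies is visible in the CreateFleet parameters logged above: the all-or-nothing run requests MinTargetCapacity equal to the full TotalTargetCapacity, while the best-effort run requests MinTargetCapacity of 1 so any partial capacity is accepted. The sketch below is illustrative only (it is not the actual slurm_plugin.fleet_manager code); it just mirrors the parameters shown in the logs:

```
# Illustrative sketch, not the actual slurm_plugin.fleet_manager implementation.
# It mirrors the CreateFleet "instant" parameters visible in the resume logs above.

def build_fleet_request(launch_template_name, instance_type, subnet_id,
                        target_capacity, all_or_nothing_batch):
    """Build parameters for an EC2 CreateFleet 'instant' request."""
    return {
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": instance_type, "SubnetId": subnet_id}],
        }],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": "on-demand",
        },
        "Type": "instant",
        "OnDemandOptions": {
            "AllocationStrategy": "lowest-price",
            "SingleInstanceType": True,
            "SingleAvailabilityZone": True,
            # All-or-nothing: the fleet must return every requested instance or none.
            # Best-effort: accept whatever capacity is available (minimum 1).
            "MinTargetCapacity": target_capacity if all_or_nothing_batch else 1,
            "CapacityReservationOptions": {
                "UsageStrategy": "use-capacity-reservations-first"
            },
        },
    }

# All-or-nothing request for 7 c5.4xlarge nodes, as in the first log above:
# MinTargetCapacity == TotalTargetCapacity == 7.
aon_params = build_fleet_request("bootstrap-q4-c4-1", "c5.4xlarge",
                                 "subnet-0b48ed99988e56110", 7, all_or_nothing_batch=True)
# Best-effort request for the same nodes: MinTargetCapacity == 1.
be_params = build_fleet_request("bootstrap-q4-c4-1", "c5.4xlarge",
                                "subnet-0b48ed99988e56110", 7, all_or_nothing_batch=False)
```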
lukeseawalker force-pushed the wip/nodeSharingJLS branch from bcbc3f0 to fd1a374 on September 15, 2023 14:43
NSsirena approved these changes on Sep 15, 2023
Description of changes
Add best-effort launch strategy for job-level scaling. All-or-nothing is now the default (all_or_nothing_batch = True); when the flag is set to False, a best-effort launch is performed instead. Includes a small refactoring of log string messages.
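The flag is read from the Slurm resume plugin configuration file whose path appears in the resume logs quoted above. A minimal sketch of how it could be read, assuming the file follows the usual INI layout; the section name used here is an assumption, not confirmed by this PR:

```
# Illustrative sketch only: reading the scaling flags from the resume plugin config.
from configparser import ConfigParser

config = ConfigParser()
config.read("/etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf")

# Section name "slurm_resume" is an assumption for this example.
all_or_nothing_batch = config.getboolean("slurm_resume", "all_or_nothing_batch", fallback=True)
job_level_scaling = config.getboolean("slurm_resume", "job_level_scaling", fallback=True)
```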
Tests
Given the submission commands listed in the commit message above (three exclusive jobs mixing c5.4xlarge and p4d.24xlarge constraints), where there is capacity for c5.4xlarge but not for p4d.24xlarge, both scaling strategies were tested:
all_or_nothing_batch = true: expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*
all_or_nothing_batch = false: expected nodes running at the end of the resume call: (x7) q4-dy-c4-1-*
The full resume logs for both runs are included in the commit message above.
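For context on the per-job assignment messages in those logs (booking already launched instances, releasing them when an all-or-nothing job cannot be fully satisfied, keeping partial capacity in best-effort mode), here is a minimal illustrative sketch; the function name and data structures are assumptions, not the actual slurm_plugin.instance_manager API:

```
# Illustrative sketch of the per-job assignment step visible in the logs above.
# Names are hypothetical; this is not the real instance_manager implementation.

def assign_instances_for_job(job_nodes, launched_pool, all_or_nothing_batch):
    """Book instances already launched by the batch call, then decide
    whether the job can use them."""
    booked = {}
    for node in job_nodes:
        if node in launched_pool:            # instance already available from the batch launch
            booked[node] = launched_pool.pop(node)

    if len(booked) == len(job_nodes):
        return booked                        # every node of the job is satisfied

    if all_or_nothing_batch:
        # All-or-nothing: release the partial capacity back to the pool
        # (later terminated if no other job can use it); the job fails to resume.
        launched_pool.update(booked)
        return {}

    # Best-effort: keep whatever was launched; the remaining nodes are set DOWN
    # by the resume program.
    return booked
```

In the all-or-nothing run above, this is the step that releases the three c5.4xlarge instances booked for JobID 185 once its p4d.24xlarge launch fails, while in the best-effort run JobID 188 keeps them and only the p4d.24xlarge nodes go DOWN.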
References
n/a
Checklist
Make sure you are pointing to the right branch (develop) and add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.