Add best-effort launch strategy for job-level scaling
Add best-effort launch strategy for job-level scaling.
All-or-nothing is now the default launch strategy. When `all_or_nothing_batch` is set to `false`, a best-effort launch is performed instead.
Minor refactoring of log message strings.
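
For context, a minimal sketch of how the flag maps to the EC2 CreateFleet request, inferred from the resume logs below (an illustration, not the plugin's actual code): all-or-nothing sets `MinTargetCapacity` equal to the full target, so the fleet launches every instance or none, while best-effort sets it to 1 and accepts partial capacity.

```python
# Illustrative sketch inferred from the CreateFleet parameters visible in the
# resume logs below; not the actual slurm_plugin code.
def on_demand_target_capacity(node_count: int, all_or_nothing_batch: bool) -> dict:
    return {
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": node_count,
            "DefaultTargetCapacityType": "on-demand",
        },
        "Type": "instant",
        "OnDemandOptions": {
            "AllocationStrategy": "lowest-price",
            "SingleInstanceType": True,
            "SingleAvailabilityZone": True,
            # All-or-nothing: every requested instance must launch, or the
            # whole request fails. Best-effort: accept any capacity (>= 1).
            "MinTargetCapacity": node_count if all_or_nothing_batch else 1,
        },
    }
```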

Tests done:
given the following submission commands:
```
sbatch --wrap "sleep 10" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4 --exclusive
sbatch --wrap "sleep 10" -N 2 --constraint="[(c5.4xlarge)*1&(p4d.24xlarge)*1]" -p q4 --exclusive
sbatch --wrap "sleep 10" -N 3 --constraint="[(c5.4xlarge)*3]" -p q4 --exclusive
```

where there is capacity for c5.4xlarge but not for p4d.24xlarge (each `(instance-type)*n` term in the constraint requests n nodes carrying that instance-type feature), the two scaling strategies were tested:

`all_or_nothing_batch = true`
Expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*

resume log:
```
2023-09-14 09:03:09,530 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-14 09:03:09,531 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-14 09:03:09,533 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=True, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f75379b6d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-14 09:03:09,533 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 185, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 186, 'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 187, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[5-7]', 'nodes_resume': 'q4-dy-c4-1-[5-7]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2]'}
2023-09-14 09:03:09,537 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:02:27.308559+00:00
2023-09-14 09:03:09,538 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2]
2023-09-14 09:03:09,594 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-5', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-6', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-7', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-14 09:03:09,620 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-14 09:03:09,660 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:09,661 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 7, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:12,930 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa', 'i-03bdd4851aa584786', 'i-0b5adaef26df1187d', 'i-08584b017f57195b0']
2023-09-14 09:03:12,931 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:03:12,931 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 2, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:13,971 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (7e76aa68-8d69-42a8-bead-7de1a50f9037): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 185 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 185 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 185 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-14 09:03:14,072 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 185 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:15,050 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 185 - Error in CreateFleet request (044cbd43-2925-4874-af52-40ca1240e179): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 185 - Releasing booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 186 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2']
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 186 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']:
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 186 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-2']
2023-09-14 09:03:15,152 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 186 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:16,162 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 186 - Error in CreateFleet request (f1829b1d-4426-4dfa-8f27-3cf306b784e1): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 186 - Releasing booked instances (x1) ["('q4', 'c4-1', EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 187 - The job nodes_resume list is (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 187 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']:
2023-09-14 09:03:16,280 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 187 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-5', EC2Instance(id='i-03bdd4851aa584786', private_ip='192.168.107.163', hostname='ip-192-168-107-163', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-6', EC2Instance(id='i-0b5adaef26df1187d', private_ip='192.168.106.37', hostname='ip-192-168-106-37', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-7', EC2Instance(id='i-08584b017f57195b0', private_ip='192.168.110.115', hostname='ip-192-168-110-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:16,281 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Saving assigned hostnames in DynamoDB
2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Database update: COMPLETED
2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:03:16,652 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - DNS records update: COMPLETED
2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 187 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:_terminate_unassigned_launched_instances] - INFO - Terminating unassigned launched instances: {'q4': {'c4-1': [EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None)]}}
2023-09-14 09:03:16,662 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x4) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x6) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-2-2', 'q4-dy-c4-2-1', 'q4-dy-c4-1-2']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-2', 'q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-14 09:03:17,149 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-2'] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
2023-09-14 09:03:17,169 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```

`all_or_nothing_batch = false`
Expected nodes running at the end of the resume call: (x7) q4-dy-c4-1-*

resume log:
```
2023-09-14 09:08:09,554 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-14 09:08:09,555 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-14 09:08:09,556 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=False, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7fed57aa1d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-14 09:08:09,557 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 188, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 189, 'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 190, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[8-10]', 'nodes_resume': 'q4-dy-c4-1-[8-10]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2]'}
2023-09-14 09:08:09,561 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:07:27.471205+00:00
2023-09-14 09:08:09,561 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2]
2023-09-14 09:08:09,616 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-8', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-9', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-10', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-14 09:08:09,643 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-14 09:08:09,683 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:09,683 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:12,914 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-0250d4e661b9d86eb', 'i-0d23930fc5b09fd33', 'i-07dad6e5f1eed664d', 'i-0ad5528556d13495b', 'i-0365529c953588fab', 'i-03a19e86c0d73e84b', 'i-05b6109e7c0940a9c']
2023-09-14 09:08:12,915 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:08:12,915 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:14,152 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (af6b0eb4-086f-46ad-b58b-6c5f811d8280): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 188 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 188 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 188 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-14 09:08:14,254 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 188 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:15,274 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 188 - Error in CreateFleet request (ff2ac807-49a8-41b4-8af9-2dcea2ed6dfb): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 188 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-0250d4e661b9d86eb', private_ip='192.168.111.231', hostname='ip-192-168-111-231', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-0d23930fc5b09fd33', private_ip='192.168.110.38', hostname='ip-192-168-110-38', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-07dad6e5f1eed664d', private_ip='192.168.104.249', hostname='ip-192-168-104-249', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Database update: COMPLETED
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:15,743 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - DNS records update: COMPLETED
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 188 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 189 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 189 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']:
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 189 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 189 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:16,696 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 189 - Error in CreateFleet request (63180bc8-cad1-4754-a1c6-0b93fdd36461): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 189 - Nodes are now configured with instances: (x1) ["('q4-dy-c4-1-4', EC2Instance(id='i-0ad5528556d13495b', private_ip='192.168.104.152', hostname='ip-192-168-104-152', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,821 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Database update: COMPLETED
2023-09-14 09:08:16,822 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:16,950 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - DNS records update: COMPLETED
2023-09-14 09:08:16,951 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 189 - Successful launched partial instances for nodes (x1) ['q4-dy-c4-1-4']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 190 - The job nodes_resume list is (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 190 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']:
2023-09-14 09:08:16,969 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 190 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-8', EC2Instance(id='i-0365529c953588fab', private_ip='192.168.108.102', hostname='ip-192-168-108-102', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-9', EC2Instance(id='i-03a19e86c0d73e84b', private_ip='192.168.105.222', hostname='ip-192-168-105-222', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-10', EC2Instance(id='i-05b6109e7c0940a9c', private_ip='192.168.111.72', hostname='ip-192-168-111-72', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,970 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Database update: COMPLETED
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:17,141 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - DNS records update: COMPLETED
2023-09-14 09:08:17,142 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 190 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,142 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-14 09:08:17,162 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```
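
Comparing the two runs: under all-or-nothing, jobs 185 and 186 release their already-booked c5.4xlarge instances as soon as the p4d.24xlarge launch fails, so only the all-c5 job 187 comes up; under best-effort, jobs 188 and 189 keep their partial c5.4xlarge capacity and only the p4d nodes are set DOWN. A hedged sketch of that per-job decision, with hypothetical helper names (not the plugin's actual API):

```python
# Hypothetical sketch of the per-job assignment decision, inferred from the
# two resume logs above; helper names and node pairing are illustrative only.
def configure_and_assign(nodes, instances):
    print(f"assigning {len(instances)} instances to nodes {nodes}")

def release(instances):
    print(f"releasing {len(instances)} instances (terminated if left unassigned)")

def assign_nodes_for_job(nodes_resume, acquired, all_or_nothing_batch):
    if len(acquired) == len(nodes_resume):
        # "Successful launched all instances for nodes ..."
        configure_and_assign(nodes_resume, acquired)
    elif all_or_nothing_batch:
        # "Releasing booked instances ..." -- the job gets nothing, and all
        # of its nodes are later set to DOWN.
        release(acquired)
    else:
        # "Successful launched partial instances for nodes ..." -- only the
        # nodes left without an instance are set to DOWN.
        configure_and_assign(nodes_resume[: len(acquired)], acquired)
```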

Signed-off-by: Luca Carrogu <[email protected]>
lukeseawalker committed Sep 15, 2023
1 parent c7621fb commit dd9c3dc
Showing 6 changed files with 1,070 additions and 150 deletions.
src/slurm_plugin/fleet_manager.py (7 additions & 1 deletion)
```diff
@@ -19,6 +19,7 @@
 from botocore.exceptions import ClientError
 from common.ec2_utils import get_private_ip_address_and_dns_name
 from common.utils import setup_logging_filter
+from slurm_plugin.common import print_with_count
 
 logger = logging.getLogger(__name__)
@@ -172,7 +173,12 @@ def launch_ec2_instances(self, count, job_id=None):
 
         launch_params = self._evaluate_launch_params(count)
         assigned_nodes = self._launch_instances(launch_params)
-        logger.debug("Launched the following instances: %s", assigned_nodes.get("Instances"))
+        if len(assigned_nodes.get("Instances")) > 0:
+            logger.info(
+                "Launched the following instances %s",
+                print_with_count([instance.get("InstanceId", "") for instance in assigned_nodes.get("Instances")]),
+            )
+        logger.debug("Full launched instances information: %s", assigned_nodes.get("Instances"))
 
         return [EC2Instance.from_describe_instance_data(instance_info) for instance_info in assigned_nodes["Instances"]]
```
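
The new INFO message uses `print_with_count` from `slurm_plugin.common`. Judging from log lines such as `(x7) ['i-09ba3d3b0753ddc33', ...]`, it renders a list alongside its element count; a sketch consistent with that output (the real helper may differ):

```python
# Sketch of print_with_count, inferred from log output such as
# "(x7) ['i-09ba3d3b0753ddc33', ...]"; the actual implementation may differ.
def print_with_count(resource_list):
    resource_list = [str(elem) for elem in resource_list]
    return f"(x{len(resource_list)}) {resource_list}"
```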
