[develop] Add best-effort launch strategy for job-level scaling #560

Merged: 1 commit merged into develop from wip/nodeSharingJLS on Sep 15, 2023

Conversation

@lukeseawalker (Contributor) commented Sep 14, 2023

Description of changes

  • Add best-effort launch strategy for job-level scaling.
  • All-or-nothing is now the default; when `all_or_nothing_batch` is set to "False", a best-effort launch is performed instead (see the illustrative sketch below).
  • Small refactoring of log message strings.
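
For reviewers, a minimal sketch of what the flag changes at the EC2 CreateFleet level, based on the parameters visible in the resume logs below. The function name and structure are illustrative only, not the actual slurm_plugin code:

```
# Illustrative sketch only; names are hypothetical, not the actual slurm_plugin code.
def target_capacity_options(node_count: int, all_or_nothing_batch: bool) -> dict:
    """Map the launch strategy onto the CreateFleet capacity parameters."""
    return {
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": node_count,
            "DefaultTargetCapacityType": "on-demand",
        },
        "OnDemandOptions": {
            # all-or-nothing: the whole batch must be available or the request fails;
            # best-effort: accept anything from a single instance up to node_count.
            "MinTargetCapacity": node_count if all_or_nothing_batch else 1,
        },
    }

assert target_capacity_options(7, True)["OnDemandOptions"]["MinTargetCapacity"] == 7
assert target_capacity_options(7, False)["OnDemandOptions"]["MinTargetCapacity"] == 1
```

This matches the two runs below: the all-or-nothing run requests MinTargetCapacity=7 for the seven c5.4xlarge nodes, while the best-effort run requests MinTargetCapacity=1 for the same batch.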

Tests

  • Unit tests added
  • Manual tests performed on a running cluster, given the following submission command:

```
sbatch --wrap "sleep 10" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4 --exclusive; sbatch --wrap "sleep 10" -N 2 --constraint="[(c5.4xlarge)*1&(p4d.24xlarge)*1]" -p q4 --exclusive; sbatch --wrap "sleep 10" -N 3 --constraint="[(c5.4xlarge)*3]" -p q4 --exclusive
```

In a setup where there is capacity for c5.4xlarge but not for p4d.24xlarge, the two scaling strategies were tested:

all_or_nothing_batch = true
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*

resume log:

```
2023-09-14 09:03:09,530 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-14 09:03:09,531 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-14 09:03:09,533 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=True, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f75379b6d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-14 09:03:09,533 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 185, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 186, 'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 187, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[5-7]', 'nodes_resume': 'q4-dy-c4-1-[5-7]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2]'}
2023-09-14 09:03:09,537 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:02:27.308559+00:00
2023-09-14 09:03:09,538 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-7],q4-dy-c4-2-[1-2]
2023-09-14 09:03:09,594 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-5', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-6', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-7', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-14 09:03:09,620 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-14 09:03:09,660 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:09,661 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 7, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:12,930 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa', 'i-03bdd4851aa584786', 'i-0b5adaef26df1187d', 'i-08584b017f57195b0']
2023-09-14 09:03:12,931 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:03:12,931 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 2, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:13,971 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (7e76aa68-8d69-42a8-bead-7de1a50f9037): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 185 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 185 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-14 09:03:14,072 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 185 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-14 09:03:14,072 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 185 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:15,050 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 185 - Error in CreateFleet request (044cbd43-2925-4874-af52-40ca1240e179): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 185 - Releasing booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 186 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2']
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 186 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']:
2023-09-14 09:03:15,151 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 186 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-2']
2023-09-14 09:03:15,152 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 186 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:03:16,162 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 186 - Error in CreateFleet request (f1829b1d-4426-4dfa-8f27-3cf306b784e1): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 186 - Releasing booked instances (x1) ["('q4', 'c4-1', EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 187 - The job nodes_resume list is (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:16,262 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 187 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']:
2023-09-14 09:03:16,280 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 187 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-5', EC2Instance(id='i-03bdd4851aa584786', private_ip='192.168.107.163', hostname='ip-192-168-107-163', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-6', EC2Instance(id='i-0b5adaef26df1187d', private_ip='192.168.106.37', hostname='ip-192-168-106-37', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-7', EC2Instance(id='i-08584b017f57195b0', private_ip='192.168.110.115', hostname='ip-192-168-110-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:03:16,281 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Saving assigned hostnames in DynamoDB
2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 187 - Database update: COMPLETED
2023-09-14 09:03:16,327 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:03:16,652 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 187 - DNS records update: COMPLETED
2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 187 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:16,653 - [slurm_plugin.instance_manager:_terminate_unassigned_launched_instances] - INFO - Terminating unassigned launched instances: {'q4': {'c4-1': [EC2Instance(id='i-09ba3d3b0753ddc33', private_ip='192.168.109.64', hostname='ip-192-168-109-64', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-095c89ec9f1e389d8', private_ip='192.168.107.253', hostname='ip-192-168-107-253', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-0414b54e1cfb7f5b8', private_ip='192.168.111.135', hostname='ip-192-168-111-135', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None), EC2Instance(id='i-01ac20db646a75ffa', private_ip='192.168.108.115', hostname='ip-192-168-108-115', launch_time=datetime.datetime(2023, 9, 14, 9, 3, 11, tzinfo=tzlocal()), slurm_node=None)]}}
2023-09-14 09:03:16,662 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x4) ['i-09ba3d3b0753ddc33', 'i-095c89ec9f1e389d8', 'i-0414b54e1cfb7f5b8', 'i-01ac20db646a75ffa']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x6) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-2-2', 'q4-dy-c4-2-1', 'q4-dy-c4-1-2']
2023-09-14 09:03:17,131 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-2', 'q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-14 09:03:17,149 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-2'] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
2023-09-14 09:03:17,169 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```
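
To help read the log above: instances are first launched for the whole batch, then booked job by job; with all-or-nothing, a job either gets its full node list or releases everything it booked, and the leftover instances are terminated at the end. Below is a simplified, hypothetical model of that flow, not the actual instance_manager code, with capacity numbers taken from this test:

```
# Toy model of the all-or-nothing assignment shown in the log above.
# Everything is hypothetical and simplified; it is not the instance_manager code.
AVAILABLE = {"c5.4xlarge": 7, "p4d.24xlarge": 0}  # capacity observed in this test

def launch(instance_type, count, all_or_nothing):
    """Toy CreateFleet call: MinTargetCapacity == count when all_or_nothing is True."""
    got = min(count, AVAILABLE.get(instance_type, 0))
    if all_or_nothing and got < count:
        return 0                      # InsufficientInstanceCapacity: nothing is launched
    AVAILABLE[instance_type] -= got
    return got

def assign_job_all_or_nothing(job):
    """Return the node names the job keeps; [] means all its nodes go DOWN."""
    booked = {t: launch(t, len(nodes), all_or_nothing=True) for t, nodes in job.items()}
    if sum(booked.values()) < sum(len(nodes) for nodes in job.values()):
        for t, count in booked.items():    # "Releasing booked instances": give capacity back
            AVAILABLE[t] += count          # (the real plugin later terminates unassigned instances)
        return []
    return [n for t, nodes in job.items() for n in nodes[: booked[t]]]

jobs = {
    185: {"c5.4xlarge": ["q4-dy-c4-1-1", "q4-dy-c4-1-2", "q4-dy-c4-1-3"], "p4d.24xlarge": ["q4-dy-c4-2-1"]},
    186: {"c5.4xlarge": ["q4-dy-c4-1-4"], "p4d.24xlarge": ["q4-dy-c4-2-2"]},
    187: {"c5.4xlarge": ["q4-dy-c4-1-5", "q4-dy-c4-1-6", "q4-dy-c4-1-7"]},
}
for job_id, job in jobs.items():
    print(job_id, assign_job_all_or_nothing(job))
# 185 [] / 186 [] / 187 ['q4-dy-c4-1-5', 'q4-dy-c4-1-6', 'q4-dy-c4-1-7'],
# matching the "(x3) q4-dy-c4-1-*" outcome above.
```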

all_or_nothing_batch = false
expected nodes running at the end of the resume call: (x7) q4-dy-c4-1-*

resume log:

```
2023-09-14 09:08:09,554 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-14 09:08:09,555 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-14 09:08:09,556 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=False, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7fed57aa1d60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-14 09:08:09,557 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 188, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 189, 'features': '[(c5.4xlarge)*1&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'nodes_resume': 'q4-dy-c4-1-4,q4-dy-c4-2-2', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 190, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[8-10]', 'nodes_resume': 'q4-dy-c4-1-[8-10]', 'oversubscribe': 'NO', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2]'}
2023-09-14 09:08:09,561 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-14 09:07:27.471205+00:00
2023-09-14 09:08:09,561 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-4,8-10],q4-dy-c4-2-[1-2]
2023-09-14 09:08:09,616 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-4', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-8', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-9', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-10', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-2', 'ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-14 09:08:09,643 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-14 09:08:09,683 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:09,683 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 7, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:12,914 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x7) ['i-0250d4e661b9d86eb', 'i-0d23930fc5b09fd33', 'i-07dad6e5f1eed664d', 'i-0ad5528556d13495b', 'i-0365529c953588fab', 'i-03a19e86c0d73e84b', 'i-05b6109e7c0940a9c']
2023-09-14 09:08:12,915 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:08:12,915 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 2, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:14,152 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (af6b0eb4-086f-46ad-b58b-6c5f811d8280): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 188 - The job nodes_resume list is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 188 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-14 09:08:14,253 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 188 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-14 09:08:14,254 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 188 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:15,274 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 188 - Error in CreateFleet request (ff2ac807-49a8-41b4-8af9-2dcea2ed6dfb): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 188 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-0250d4e661b9d86eb', private_ip='192.168.111.231', hostname='ip-192-168-111-231', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-0d23930fc5b09fd33', private_ip='192.168.110.38', hostname='ip-192-168-110-38', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-07dad6e5f1eed664d', private_ip='192.168.104.249', hostname='ip-192-168-104-249', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Database update: COMPLETED
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:15,743 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - DNS records update: COMPLETED
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 188 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 189 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 189 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']:
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 189 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 189 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:16,696 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 189 - Error in CreateFleet request (63180bc8-cad1-4754-a1c6-0b93fdd36461): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 189 - Nodes are now configured with instances: (x1) ["('q4-dy-c4-1-4', EC2Instance(id='i-0ad5528556d13495b', private_ip='192.168.104.152', hostname='ip-192-168-104-152', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,821 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Database update: COMPLETED
2023-09-14 09:08:16,822 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:16,950 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - DNS records update: COMPLETED
2023-09-14 09:08:16,951 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 189 - Successful launched partial instances for nodes (x1) ['q4-dy-c4-1-4']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 190 - The job nodes_resume list is (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 190 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']:
2023-09-14 09:08:16,969 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 190 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-8', EC2Instance(id='i-0365529c953588fab', private_ip='192.168.108.102', hostname='ip-192-168-108-102', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-9', EC2Instance(id='i-03a19e86c0d73e84b', private_ip='192.168.105.222', hostname='ip-192-168-105-222', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-10', EC2Instance(id='i-05b6109e7c0940a9c', private_ip='192.168.111.72', hostname='ip-192-168-111-72', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,970 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Database update: COMPLETED
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:17,141 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - DNS records update: COMPLETED
2023-09-14 09:08:17,142 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 190 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,142 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-14 09:08:17,162 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```
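
The best-effort counterpart is more permissive: a job keeps whatever subset of its nodes could be backed by launched instances, and only the nodes that could not be launched are set to DOWN. A tiny hypothetical illustration (again not the actual code), using JobID 188 from the log above:

```
# Toy best-effort assignment; hypothetical and simplified, not the instance_manager code.
def best_effort_assign(nodes_resume, launched_nodes):
    """Split a job's resume list into nodes that got an instance and nodes that go DOWN."""
    assigned = [n for n in nodes_resume if n in launched_nodes]
    failed = [n for n in nodes_resume if n not in launched_nodes]
    return assigned, failed

# JobID 188: three c5.4xlarge nodes got instances, the p4d.24xlarge node did not.
assigned, failed = best_effort_assign(
    ["q4-dy-c4-1-1", "q4-dy-c4-1-2", "q4-dy-c4-1-3", "q4-dy-c4-2-1"],
    launched_nodes={"q4-dy-c4-1-1", "q4-dy-c4-1-2", "q4-dy-c4-1-3"},
)
assert assigned == ["q4-dy-c4-1-1", "q4-dy-c4-1-2", "q4-dy-c4-1-3"]   # kept by the job
assert failed == ["q4-dy-c4-2-1"]   # set DOWN with InsufficientInstanceCapacity
```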

References

n/a

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov bot commented Sep 14, 2023

Codecov Report

Patch coverage: 96.42% and project coverage change: +0.17% 🎉

Comparison: base (c7621fb) 89.53% vs head (fd1a374) 89.70%.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           develop     #560      +/-   ##
===========================================
+ Coverage    89.53%   89.70%   +0.17%     
===========================================
  Files           16       16              
  Lines         2656     2681      +25     
===========================================
+ Hits          2378     2405      +27     
+ Misses         278      276       -2
```
| Flag | Coverage Δ |
| --- | --- |
| unittests | 89.70% <96.42%> (+0.17%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
| --- | --- |
| src/slurm_plugin/resume.py | 76.47% <0.00%> (ø) |
| src/slurm_plugin/fleet_manager.py | 92.41% <100.00%> (+0.10%) ⬆️ |
| src/slurm_plugin/instance_manager.py | 100.00% <100.00%> (+0.49%) ⬆️ |


@lukeseawalker force-pushed the wip/nodeSharingJLS branch 3 times, most recently from 963c387 to bb6aaf3 on September 15, 2023 09:53
@lukeseawalker marked this pull request as ready for review on September 15, 2023 09:58
@lukeseawalker requested review from a team as code owners on September 15, 2023 09:58
NSsirena previously approved these changes Sep 15, 2023
src/slurm_plugin/instance_manager.py: review comment (outdated, resolved)
2023-09-14 09:08:15,409 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 188 - Database update: COMPLETED
2023-09-14 09:08:15,447 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:15,743 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 188 - DNS records update: COMPLETED
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 188 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 189 - The job nodes_resume list is (x2) ['q4-dy-c4-1-4', 'q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 189 - Booking already launched instances for nodes (x1) ['q4-dy-c4-1-4']:
2023-09-14 09:08:15,744 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 189 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-2']
2023-09-14 09:08:15,744 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 189 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-14 09:08:16,696 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 189 - Error in CreateFleet request (63180bc8-cad1-4754-a1c6-0b93fdd36461): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 189 - Nodes are now configured with instances: (x1) ["('q4-dy-c4-1-4', EC2Instance(id='i-0ad5528556d13495b', private_ip='192.168.104.152', hostname='ip-192-168-104-152', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,814 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,821 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 189 - Database update: COMPLETED
2023-09-14 09:08:16,822 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:16,950 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 189 - DNS records update: COMPLETED
2023-09-14 09:08:16,951 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 189 - Successful launched partial instances for nodes (x1) ['q4-dy-c4-1-4']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_add_instances_for_job] - INFO - JobID 190 - The job nodes_resume list is (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:16,952 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 190 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']:
2023-09-14 09:08:16,969 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 190 - Nodes are now configured with instances: (x3) ["('q4-dy-c4-1-8', EC2Instance(id='i-0365529c953588fab', private_ip='192.168.108.102', hostname='ip-192-168-108-102', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-9', EC2Instance(id='i-03a19e86c0d73e84b', private_ip='192.168.105.222', hostname='ip-192-168-105-222', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-10', EC2Instance(id='i-05b6109e7c0940a9c', private_ip='192.168.111.72', hostname='ip-192-168-111-72', launch_time=datetime.datetime(2023, 9, 14, 9, 8, 11, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-14 09:08:16,970 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Saving assigned hostnames in DynamoDB
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 190 - Database update: COMPLETED
2023-09-14 09:08:16,980 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-14 09:08:17,141 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 190 - DNS records update: COMPLETED
2023-09-14 09:08:17,142 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 190 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,142 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x7) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-1-4', 'q4-dy-c4-1-8', 'q4-dy-c4-1-9', 'q4-dy-c4-1-10']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2']
2023-09-14 09:08:17,143 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x2) ['q4-dy-c4-2-1', 'q4-dy-c4-2-2'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-14 09:08:17,162 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```
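
For reviewers who want the behavior in one place: below is a minimal, self-contained sketch (not the actual `slurm_plugin` code; the helper name and parameters are illustrative) of how a best-effort launch pass can call the EC2 `create_fleet` API with `Type='instant'` and `MinTargetCapacity=1`, keep whatever capacity was actually granted, and surface per-pool errors such as `InsufficientInstanceCapacity` instead of failing the whole request, which is what the log above shows for the p4d.24xlarge pool.

```python
# Hypothetical best-effort launch helper, sketched for illustration only.
# It mirrors the create_fleet parameters visible in the resume log, but the
# function name, arguments, and return shape are assumptions, not the
# slurm_plugin implementation.
import boto3


def best_effort_launch(launch_template_name, instance_type, subnet_id, node_count):
    ec2 = boto3.client("ec2")
    response = ec2.create_fleet(
        LaunchTemplateConfigs=[
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": instance_type, "SubnetId": subnet_id}],
            }
        ],
        TargetCapacitySpecification={
            "TotalTargetCapacity": node_count,
            "DefaultTargetCapacityType": "on-demand",
        },
        Type="instant",
        OnDemandOptions={
            "AllocationStrategy": "lowest-price",
            "SingleInstanceType": True,
            "SingleAvailabilityZone": True,
            # MinTargetCapacity=1 lets the fleet return partial capacity;
            # an all-or-nothing variant would typically set this to node_count
            # so a partial grant is rejected up front.
            "MinTargetCapacity": 1,
        },
    )
    # 'instant' fleets return launched instance IDs and per-pool errors in one call.
    launched = [
        instance_id
        for pool in response.get("Instances", [])
        for instance_id in pool["InstanceIds"]
    ]
    errors = [
        (err.get("ErrorCode"), err.get("ErrorMessage"))
        for err in response.get("Errors", [])
    ]
    return launched, errors
```

With this shape, the caller can assign the launched instances to as many of the requested nodes as possible and set only the unfulfilled nodes to DOWN, which matches the "Successfully launched nodes (x7)" / "Failed to launch following nodes (x2)" outcome in the log.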

Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker lukeseawalker merged commit dd9c3dc into aws:develop Sep 15, 2023
12 checks passed