Amazon SSA not working #670

agrare · 2020-11-12T20:01:44Z

Hello,

Customer is trying to run SmartState Analysis on an instance of AWS running CentOS.

In the user tasks this analysis fails with: "job timed out after 3060.770851409 seconds of inactivity. Inactivity threshold [3000 seconds]"

This is Infrastructure Management running as virtual appliance on VMWare.

In the provided logs (TS004217864/2020-09-27/ dir on ecurep) I can see:

{"@timestamp":"2020-10-07T14:04:52.176002 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb296b6e48","level":"debug","message":"[Aws::SQS::Client 400 0.057859 0 retries] get_queue_url(queue_name:\"smartstate_extract_reply-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::SQS::Errors::NonExistentQueue The specified queue does not exist for this wsdl version."}

{"@timestamp":"2020-10-07T14:05:02.577138 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb257d795c","level":"debug","message":"[Aws::S3::Client 404 0.20764 0 retries] head_bucket(bucket:\"smartstate-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::S3::Errors::NotFound "}

--> process agent coordinator running

12190 root 21 1 395768 204608 14740 R 15.4 0.8 0:09.24 MIQ: Amazon::CloudManager::EventCatcher id: 76, queue: ems_3

12188 root 30 10 361160 202108 13100 R 15.3 0.8 0:09.19 MIQ: Amazon::AgentCoordinatorWorker id: 75, queue: ems_agent_coordinator

12194 root 23 3 361408 195428 13128 R 14.7 0.8 0:08.85 MIQ: Amazon::CloudManager::MetricsCollectorWorker id: 78, queue: amazon

--> The related task info with GUID

[----] I, [2020-09-27T15:03:16.993192 #2228:2abfbf325970] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker#start) Worker started: ID [75], PID [12188], GUID [5b493b88-02ff-4570-8697-6e393122260b]

--> I have looked at ews.log for GUID but, the only timeout error I found is the following:

[----] I, [2020-09-27T15:59:18.098238 #12188:2aeeb017595c] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner#do_work) Alive agents in EMS(guid=1c9078f4-1231-4f90-9b63-29fbc205ce72): [].

.....

and then:

[----] I, [2020-09-27T15:59:23.405619 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5879], Delivering...

[----] W, [2020-09-27T15:59:23.409676 #2684:2ad3ca51b968] WARN -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#timeout!) Job: guid: [17167f79-8f9d-428f-a382-4cb29ee8e498], job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds], aborting

[----] I, [2020-09-27T15:59:23.416496 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5882], id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"]

[----] I, [2020-09-27T15:59:23.416693 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5879], State: [ok], Delivered in [0.01107655] seconds

[----] I, [2020-09-27T15:59:23.422359 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5881], MiqWorker id: [64], Zone: [], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqQueue.check_for_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [3.302457898] seconds

[----] I, [2020-09-27T15:59:23.422617 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5881], Delivering...

[----] I, [2020-09-27T15:59:23.423784 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5881], State: [ok], Delivered in [0.001177885] seconds

[----] I, [2020-09-27T15:59:28.470684 #2228:2abfbf325970] INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 1 miq_queue rows for queue_name=generic, wcount=4, priority=200

[----] I, [2020-09-27T15:59:29.506630 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5882], MiqWorker id: [64], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"], Dequeued in: [6.094388538] seconds

[----] I, [2020-09-27T15:59:29.506939 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5882], Delivering...

[----] I, [2020-09-27T15:59:29.656524 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5883], id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Amazon::CloudManager::Vm", :object_id=>34, :attrs=>{:event_type=>"vm_scan_abort", "VmOrTemplate::vm"=>34, :vm_id=>34, :host=>nil, "MiqEvent::miq_event"=>308, :miq_event_id=>308, "EventStream::event_stream"=>308, :event_stream_id=>308}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>1, :tenant_id=>1, :automate_message=>nil}]

[----] E, [2020-09-27T15:59:29.656639 #2684:2ad3ca51b968] ERROR -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#process_abort) job aborting, job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]

[----] E, [2020-09-27T15:59:29.662790 #2684:2ad3ca51b968] ERROR -- : MIQ(MiqQueue#deliver) Message id: [5882], Error: [finish is not permitted at state aborting]

[----] E, [2020-09-27T15:59:29.662909 #2684:2ad3ca51b968] ERROR -- : [RuntimeError]: finish is not permitted at state aborting Method:[block (2 levels) in class:LogProxy]

[----] E, [2020-09-27T15:59:29.662994 #2684:2ad3ca51b968] ERROR -- : /var/www/miq/vmdb/app/models/job/state_machine.rb:50:in `signal'

Not sure if this is related to the issue.

The text was updated successfully, but these errors were encountered:

agrare · 2020-11-12T20:12:42Z

cc @jerryk55

chessbyte · 2021-12-21T17:58:06Z

@agrare was this ever addressed? I recall you making some fixes for SmartState Analysis, but cannot figure out if this is still an issue

agrare · 2021-12-21T19:34:22Z

There was an issue with the agent_coordinator worker that we fixed, not sure if that fixes this or not though

miq-bot · 2023-02-27T00:12:32Z

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

agrare added the bug label Nov 12, 2020

agrare added the help wanted label Nov 19, 2020

miq-bot added the stale label Feb 27, 2023

Fryguy removed the stale label Mar 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Amazon SSA not working #670

Amazon SSA not working #670

agrare commented Nov 12, 2020 •

edited

Loading

agrare commented Nov 12, 2020

chessbyte commented Dec 21, 2021

agrare commented Dec 21, 2021

miq-bot commented Feb 27, 2023

Amazon SSA not working #670

Amazon SSA not working #670

Comments

agrare commented Nov 12, 2020 • edited Loading

agrare commented Nov 12, 2020

chessbyte commented Dec 21, 2021

agrare commented Dec 21, 2021

miq-bot commented Feb 27, 2023

agrare commented Nov 12, 2020 •

edited

Loading