Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amazon SSA not working #670

Open
agrare opened this issue Nov 12, 2020 · 4 comments
Open

Amazon SSA not working #670

agrare opened this issue Nov 12, 2020 · 4 comments

Comments

@agrare
Copy link
Member

agrare commented Nov 12, 2020

Hello,

Customer is trying to run SmartState Analysis on an instance of AWS running CentOS.

In the user tasks this analysis fails with: "job timed out after 3060.770851409 seconds of inactivity. Inactivity threshold [3000 seconds]"

This is Infrastructure Management running as virtual appliance on VMWare.

In the provided logs (TS004217864/2020-09-27/ dir on ecurep) I can see:

{"@timestamp":"2020-10-07T14:04:52.176002 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb296b6e48","level":"debug","message":"[Aws::SQS::Client 400 0.057859 0 retries] get_queue_url(queue_name:\"smartstate_extract_reply-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::SQS::Errors::NonExistentQueue The specified queue does not exist for this wsdl version."}
{"@timestamp":"2020-10-07T14:05:02.577138 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb257d795c","level":"debug","message":"[Aws::S3::Client 404 0.20764 0 retries] head_bucket(bucket:\"smartstate-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::S3::Errors::NotFound "}

--> process agent coordinator running

12190 root 21 1 395768 204608 14740 R 15.4 0.8 0:09.24 MIQ: Amazon::CloudManager::EventCatcher id: 76, queue: ems_3

12188 root 30 10 361160 202108 13100 R 15.3 0.8 0:09.19 MIQ: Amazon::AgentCoordinatorWorker id: 75, queue: ems_agent_coordinator

12194 root 23 3 361408 195428 13128 R 14.7 0.8 0:08.85 MIQ: Amazon::CloudManager::MetricsCollectorWorker id: 78, queue: amazon

--> The related task info with GUID

[----] I, [2020-09-27T15:03:16.993192 #2228:2abfbf325970] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker#start) Worker started: ID [75], PID [12188], GUID [5b493b88-02ff-4570-8697-6e393122260b]

--> I have looked at ews.log for GUID but, the only timeout error I found is the following:

[----] I, [2020-09-27T15:59:18.098238 #12188:2aeeb017595c] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner#do_work) Alive agents in EMS(guid=1c9078f4-1231-4f90-9b63-29fbc205ce72): [].

.....

and then:

[----] I, [2020-09-27T15:59:23.405619 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5879], Delivering...

[----] W, [2020-09-27T15:59:23.409676 #2684:2ad3ca51b968] WARN -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#timeout!) Job: guid: [17167f79-8f9d-428f-a382-4cb29ee8e498], job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds], aborting

[----] I, [2020-09-27T15:59:23.416496 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5882], id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"]

[----] I, [2020-09-27T15:59:23.416693 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5879], State: [ok], Delivered in [0.01107655] seconds

[----] I, [2020-09-27T15:59:23.422359 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5881], MiqWorker id: [64], Zone: [], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqQueue.check_for_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [3.302457898] seconds

[----] I, [2020-09-27T15:59:23.422617 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5881], Delivering...

[----] I, [2020-09-27T15:59:23.423784 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5881], State: [ok], Delivered in [0.001177885] seconds

[----] I, [2020-09-27T15:59:28.470684 #2228:2abfbf325970] INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 1 miq_queue rows for queue_name=generic, wcount=4, priority=200

[----] I, [2020-09-27T15:59:29.506630 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5882], MiqWorker id: [64], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"], Dequeued in: [6.094388538] seconds

[----] I, [2020-09-27T15:59:29.506939 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5882], Delivering...

[----] I, [2020-09-27T15:59:29.656524 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5883], id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Amazon::CloudManager::Vm", :object_id=>34, :attrs=>{:event_type=>"vm_scan_abort", "VmOrTemplate::vm"=>34, :vm_id=>34, :host=>nil, "MiqEvent::miq_event"=>308, :miq_event_id=>308, "EventStream::event_stream"=>308, :event_stream_id=>308}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>1, :tenant_id=>1, :automate_message=>nil}]

[----] E, [2020-09-27T15:59:29.656639 #2684:2ad3ca51b968] ERROR -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#process_abort) job aborting, job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]

[----] E, [2020-09-27T15:59:29.662790 #2684:2ad3ca51b968] ERROR -- : MIQ(MiqQueue#deliver) Message id: [5882], Error: [finish is not permitted at state aborting]

[----] E, [2020-09-27T15:59:29.662909 #2684:2ad3ca51b968] ERROR -- : [RuntimeError]: finish is not permitted at state aborting Method:[block (2 levels) in class:LogProxy]

[----] E, [2020-09-27T15:59:29.662994 #2684:2ad3ca51b968] ERROR -- : /var/www/miq/vmdb/app/models/job/state_machine.rb:50:in `signal'

Not sure if this is related to the issue.

@agrare agrare added the bug label Nov 12, 2020
@agrare
Copy link
Member Author

agrare commented Nov 12, 2020

cc @jerryk55

@chessbyte
Copy link
Member

@agrare was this ever addressed? I recall you making some fixes for SmartState Analysis, but cannot figure out if this is still an issue

@agrare
Copy link
Member Author

agrare commented Dec 21, 2021

There was an issue with the agent_coordinator worker that we fixed, not sure if that fixes this or not though

@miq-bot miq-bot added the stale label Feb 27, 2023
@miq-bot
Copy link
Member

miq-bot commented Feb 27, 2023

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@Fryguy Fryguy removed the stale label Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants