Skip to content
This repository has been archived by the owner on Jan 5, 2022. It is now read-only.

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

Open
willebil opened this issue Apr 29, 2014 · 18 comments
Open

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

willebil opened this issue Apr 29, 2014 · 18 comments

Comments

@willebil
Copy link

Hi,

I am building a solution that crawls some websites, and after a while the following messages are "dumped" in the log.

[2014-04-29 13:32:11] 6013 Workers SysV::put() Failed for call_id 2: Retrying. Error Code: 11
[29-Apr-2014 13:32:11 UTC] PHP Warning: msg_send(): msgsnd failed: Resource temporarily unavailable in /var/www/worker/includes/daemon/Core/Worker/Via/SysV.php on line 208 pid 6013

I am currently backtracing -and debugging the root cause, but also have tried to find an explanation of the Error code provided, so far I have not been able to find any documentation of the error code. Any ideas?

@shaneharter
Copy link
Owner

  1. Is it working? It says it's retrying, does it do so successfully or do you just see another error message? Just curious, obviously error messages are bad regardless.
  2. Try cleaning any errant IPC resources using: ipcrm -q (queue cleans message queues) from the commandline. Check out the manpag for ipcrm for more info on that.

@willebil
Copy link
Author

Thanks for the fast response!

Seems the code is locked, the Sysv php code does 3 retries if I am correct, and then gives up, but it does not halt the deamon. As said, I am debugging the current situation, and am not 100% sure what happens. I am trying to understand error code 11, I simply cannot find any reference to the meaning of this.

The log posted here is from the syslog, the deamon console only regististerd the Worker error. More information is not logged.

Will try your suggestion, and see what happens and get back to this.

@shaneharter
Copy link
Owner

A downside of using the SysV message queues and SHM in php is that it's little used and poorly documented. Didn't realize that when I built it. In the last big release (about a year ago) I refactored the Mediator code so the IPC is abstract, so now you can swap out SysV for another message queue (POSIX would be my general purpose choice, but server-based queues would also work). I just haven't done the work yet.

@willebil
Copy link
Author

Do I see a challenge coming up :-D I stick with debugging for now, but you never know :-)

@shaneharter
Copy link
Owner

Not sure if you've tried running your daemon with --debugworkers option set. It will let you step thru each call to the worker and possibly isolate problems. This error, though, is coming from the library.

This is the function PHP is using here:
http://www.lehman.cuny.edu/cgi-bin/man-cgi?msgsnd+2

If you look at error codes, there are constants. If you can find which constant is "11" it might give clues.

@shaneharter
Copy link
Owner

Ok, actually the daemon tries to handle this for you:

       case MSG_EAGAIN:    // Temporary Problem, Try Again
            usleep($this->mediator->backoff(20000, $try));
            return true;
            break;

So in this case, it's sleeping for 0.05 seconds, and then 0.10, and then 0.20 between each failure, trying again afterward as the error code indicates it should.

This is an OS (or stdlib) failure code, lmk if cleaning the IPC resources made any difference.

@willebil
Copy link
Author

Check. Will deepdive on all of this, will take some time since the error only occurs after the workers are running for some time. Will get back to this asap.

@shaneharter
Copy link
Owner

You could do this.. run with --debugworkers and type "help". You'll see the commands. There's a command you can use, skipfor, that will run the daemon normally (skipping breakpoints) for however many seconds you say. So you could say "skipfor 600" and 10 mins later it'll hit a breakpoint and you can step thru. What this will let you do is get one of these failures, then use ipcs to show current IPC resources. Maybe there are messages backing up on the queue and its hitting a resource limit?

@willebil
Copy link
Author

I went back to create a simple example with just one worker. The worker waits a random amount of seconds, and then finishes. Pretty straight forward, worker is initiated with the following:

    protected function setup_workers() {
        $that = $this;

        $via = new \Core_Worker_Via_SysV();
        $via->malloc(30 * 1024 * 1024);

        $this->worker('WorkerTasks', new WorkerTasks(), $via);
        $this->WorkerTasks->timeout(120);
        $this->WorkerTasks->workers(5);
    }

And is called in the execute method

    protected function execute() {
        $this->WorkerTasks->work();
    }

The actual code in the worker executed is the following:

    public function work () {
        $wait = rand(1, 60);
        $this->mediator->log("Tasks:work => waiting for $wait seconds");
        sleep ($wait);
    }

After a couple of iteration the exact error occurs. I must be doing something incredible simply wrong since the provided examples run without any problem. Debugging does not bring me anywhere. Any ideas?

@willebil
Copy link
Author

The error code is not really helpfull, see http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html and refers to Try Again ;-) The Daemon actually trying to do that 3 times, and then execution gets messed up.

@willebil
Copy link
Author

Indeed I have been overlooking something pretty elementary. Spend the whole day debugging and understanding what was going on, so the possitive side is that I have a good understanding of the codebase (an yes, I like the solution).

Problem is solved by validating if the worker was idle/busy, and only when idle I am executing the worker method. The problem was that to many worker processes got instantiated, and with that the solution ran out of resources pretty quickly. Thanks anyway for the proper help!

@shaneharter
Copy link
Owner

Well... that doesn't make me feel good. You shouldn't have to do that. Mind throwing up the exact code you were running (the simple example you mentioned) into a gist or something so I can run it?

What distro were you using?

Even if it's just the daemon using a more aggressive backoff -- slowing everything down in these instances -- it would be better than random cryptic errors.

@shaneharter
Copy link
Owner

Nevermind -- missed it upthread -- but would still like to know distro details.

@willebil
Copy link
Author

willebil commented May 1, 2014

Distro is Ubuntu 14.04. Let me know if you want the code example.

@brunnels
Copy link
Contributor

brunnels commented May 2, 2014

Here's one I've been getting occasionally that may be related:

Warning: shm_get_var(): variable key 2391 doesn't exist in PHP-Daemon/Core/Worer/Via/SysV.php on line 240

Call Stack:
    0.0004     225192   1. {main}() lib/daemon/run.php:0
    0.8631    6006160   2. Core_Daemon->run() lib/daemon/run.php:15
17949.1137   20455592   3. Core_Daemon->dispatch() PHP-Daemon/Core/Daemon.php:413
17949.1142   20458248   4. call_user_func_array() PHP-Daemon/Core/Daemon.php:510
17949.1142   20458416   5. Core_Worker_Mediator->run() PHP-Daemon/Core/Daemon.php:510
17949.1145   20460120   6. Core_Worker_Mediator->fork() PHP-Daemon/Core/Worker/Mediator.php:742
17949.1181   20460712   7. Core_Daemon->task() PHP-Daemon/Core/Worker/Mediator.php:636
17949.1386   20389312   8. call_user_func_array() PHP-Daemon/Core/Daemon.php:591
17949.1386   20389400   9. Core_Daemon->{closure:PHP-Daemon/Core/Daemon.php:555-559}() PHP-Daemon/Core/Daemon.php:591
17949.2726   11232776  10. call_user_func() PHP-Daemon/Core/Daemon.php:557
17949.2727   11233480  11. Core_Worker_Mediator->start() PHP-Daemon/Core/Daemon.php:557
17998.6530   67289328  12. Core_Worker_Via_SysV->get() PHP-Daemon/Core/Worker/Mediator.php:778
17998.6588   67291992  13. Core_Worker_Via_SysV->{closure:PHP-Daemon/Core/Worker/Via/SysV.php:239-245}() PHP-Daemon/Core/Worker/Via/SysV.php:276
17998.6588   67292040  14. shm_get_var() PHP-Daemon/Core/Worker/Via/SysV.php:240

Notice: Trying to get property of non-object in PHP-Daemon/Core/Worker/Via/SysV.php on line 241

Call Stack:
    0.0004     225192   1. {main}() lib/daemon/run.php:0
    0.8631    6006160   2. Core_Daemon->run() lib/daemon/run.php:15
17949.1137   20455592   3. Core_Daemon->dispatch() PHP-Daemon/Core/Daemon.php:413
17949.1142   20458248   4. call_user_func_array() PHP-Daemon/Core/Daemon.php:510
17949.1142   20458416   5. Core_Worker_Mediator->run() PHP-Daemon/Core/Daemon.php:510
17949.1145   20460120   6. Core_Worker_Mediator->fork() PHP-Daemon/Core/Worker/Mediator.php:742
17949.1181   20460712   7. Core_Daemon->task() PHP-Daemon/Core/Worker/Mediator.php:636
17949.1386   20389312   8. call_user_func_array() PHP-Daemon/Core/Daemon.php:591
17949.1386   20389400   9. Core_Daemon->{closure:PHP-Daemon/Core/Daemon.php:555-559}() PHP-Daemon/Core/Daemon.php:591
17949.2726   11232776  10. call_user_func() PHP-Daemon/Core/Daemon.php:557
17949.2727   11233480  11. Core_Worker_Mediator->start() PHP-Daemon/Core/Daemon.php:557
17998.6530   67289328  12. Core_Worker_Via_SysV->get() PHP-Daemon/Core/Worker/Mediator.php:778
17998.6588   67291992  13. Core_Worker_Via_SysV->{closure:PHP-Daemon/Core/Worker/Via/SysV.php:239-245}() PHP-Daemon/Core/Worker/Via/SysV.php:276


[2014-05-02 13:43:38] 12560               Exception Caught in Task: Core_Worker_Via_SysV::get Failed. Could Not Decode Message: Array
[2014-05-02 13:43:38] 12560               (
[2014-05-02 13:43:38] 12560                   [call_id] => 2391
[2014-05-02 13:43:38] 12560                   [status] => 0
[2014-05-02 13:43:38] 12560                   [microtime] => 1399052510.8313
[2014-05-02 13:43:38] 12560                   [pid] => 21996
[2014-05-02 13:43:38] 12560               )

@StevenMapes
Copy link

so what was the actual fix for this? I'm seeing it occur a lot and am having to kill the daemon, wait, then kill the worker processes and start up the daemon again via a cron just to try to limit the problem. The auto_restart_interval sadly isn't enough to clean it up and it's not a limit of physical resources either.

@defusiondev
Copy link

Also getting this quite a lot particularly when the task the daemon is performing has timeout issues. Have needed to kill the daemon and restart it to get it to work.

@cwhsu1984
Copy link

cwhsu1984 commented Sep 12, 2017

I really hope that someday someone could somehow fix this. This project is great but it really needs to get some bugs fixed, especially like this one. If you use this project, you are likely to see the errors like this:

[2017-09-12 21:09:11] 17597 QueueReceive  SysV::put() Failed for call_id 552: Retrying. Error Code: 11
PHP Warning:  msg_send(): msgsnd failed: Resource temporarily unavailable in /path/to/your/workdir/app/vendor/shaneharter/php-daemon/Core/Worker/Via/SysV.php on line 208

Like @willebil said, you don't really need a complex worker to make this happen. After inspecting the codes and performing some tests, I think it is caused by the message queue overflow. Run "ipcs -q" as soon as you start the daemon, and run it again when seeing "Resource temporarily unavailable" message. You will find that the used-bytes of queue is 16376.

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
...
0x5101058b 3080200    ubuntu     666        16376        170 

To be more specific, you only see the errors when your workers idle for some time, and the time for errors to occur depends on how many workers you start. Anyway, if the idling workers keep sending messages to the queue, the queue will be full eventually.

So, in order to fix this, we will need to know why we should send message to the queue even if the worker is idling. I don't have time to trace this bug now, but I will just leave a clue for anyone who is interested in it.

Anyway, @brunnels provides another modified version here #39. Too bad, it still has the same issue. So you are still going to see the errors as long as your workers idle for a few minutes.

I do some experiments on "default max size of queue" by using sysctl like this:

sudo sysctl -w kernel.msgmnb=65536

By increasing your message queue size, it really deferred the issue but still won't solve it. It does prove that this is where the issue comes from. Still, one thing that I don't understand is that when seeing the following message:

[2017-09-13 11:21:49] 23677 QueueReceive  Dropped Call. Requeuing Call 86 To `listen`
[2017-09-13 11:21:49] 23677 QueueReceive  Dropped Call. Requeuing Call 87 To `listen`

The requesting call might be dropped, but the message is not removed from the message queue. Hence, dropping call does not decrease the used size of message queue. If dropping call would removed the useless message in the message queue, we might have a workaround by just increasing the size of message queue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants