SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

willebil · 2014-04-29T14:21:25Z

Hi,

I am building a solution that crawls some websites, and after a while the following messages are "dumped" in the log.

[2014-04-29 13:32:11] 6013 Workers SysV::put() Failed for call_id 2: Retrying. Error Code: 11
[29-Apr-2014 13:32:11 UTC] PHP Warning: msg_send(): msgsnd failed: Resource temporarily unavailable in /var/www/worker/includes/daemon/Core/Worker/Via/SysV.php on line 208 pid 6013

I am currently backtracing -and debugging the root cause, but also have tried to find an explanation of the Error code provided, so far I have not been able to find any documentation of the error code. Any ideas?

shaneharter · 2014-04-29T16:34:42Z

Is it working? It says it's retrying, does it do so successfully or do you just see another error message? Just curious, obviously error messages are bad regardless.
Try cleaning any errant IPC resources using: ipcrm -q (queue cleans message queues) from the commandline. Check out the manpag for ipcrm for more info on that.

willebil · 2014-04-29T16:37:37Z

Thanks for the fast response!

Seems the code is locked, the Sysv php code does 3 retries if I am correct, and then gives up, but it does not halt the deamon. As said, I am debugging the current situation, and am not 100% sure what happens. I am trying to understand error code 11, I simply cannot find any reference to the meaning of this.

The log posted here is from the syslog, the deamon console only regististerd the Worker error. More information is not logged.

Will try your suggestion, and see what happens and get back to this.

shaneharter · 2014-04-29T16:44:20Z

A downside of using the SysV message queues and SHM in php is that it's little used and poorly documented. Didn't realize that when I built it. In the last big release (about a year ago) I refactored the Mediator code so the IPC is abstract, so now you can swap out SysV for another message queue (POSIX would be my general purpose choice, but server-based queues would also work). I just haven't done the work yet.

willebil · 2014-04-29T16:47:14Z

Do I see a challenge coming up :-D I stick with debugging for now, but you never know :-)

shaneharter · 2014-04-29T16:53:49Z

Not sure if you've tried running your daemon with --debugworkers option set. It will let you step thru each call to the worker and possibly isolate problems. This error, though, is coming from the library.

This is the function PHP is using here:
http://www.lehman.cuny.edu/cgi-bin/man-cgi?msgsnd+2

If you look at error codes, there are constants. If you can find which constant is "11" it might give clues.

shaneharter · 2014-04-29T16:57:25Z

Ok, actually the daemon tries to handle this for you:

       case MSG_EAGAIN:    // Temporary Problem, Try Again
            usleep($this->mediator->backoff(20000, $try));
            return true;
            break;

So in this case, it's sleeping for 0.05 seconds, and then 0.10, and then 0.20 between each failure, trying again afterward as the error code indicates it should.

This is an OS (or stdlib) failure code, lmk if cleaning the IPC resources made any difference.

willebil · 2014-04-29T16:59:18Z

Check. Will deepdive on all of this, will take some time since the error only occurs after the workers are running for some time. Will get back to this asap.

shaneharter · 2014-04-29T17:58:04Z

You could do this.. run with --debugworkers and type "help". You'll see the commands. There's a command you can use, skipfor, that will run the daemon normally (skipping breakpoints) for however many seconds you say. So you could say "skipfor 600" and 10 mins later it'll hit a breakpoint and you can step thru. What this will let you do is get one of these failures, then use ipcs to show current IPC resources. Maybe there are messages backing up on the queue and its hitting a resource limit?

willebil · 2014-04-30T10:45:55Z

I went back to create a simple example with just one worker. The worker waits a random amount of seconds, and then finishes. Pretty straight forward, worker is initiated with the following:

    protected function setup_workers() {
        $that = $this;

        $via = new \Core_Worker_Via_SysV();
        $via->malloc(30 * 1024 * 1024);

        $this->worker('WorkerTasks', new WorkerTasks(), $via);
        $this->WorkerTasks->timeout(120);
        $this->WorkerTasks->workers(5);
    }

And is called in the execute method

    protected function execute() {
        $this->WorkerTasks->work();
    }

The actual code in the worker executed is the following:

    public function work () {
        $wait = rand(1, 60);
        $this->mediator->log("Tasks:work => waiting for $wait seconds");
        sleep ($wait);
    }

After a couple of iteration the exact error occurs. I must be doing something incredible simply wrong since the provided examples run without any problem. Debugging does not bring me anywhere. Any ideas?

willebil · 2014-04-30T11:25:41Z

The error code is not really helpfull, see http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html and refers to Try Again ;-) The Daemon actually trying to do that 3 times, and then execution gets messed up.

willebil · 2014-04-30T16:28:04Z

Indeed I have been overlooking something pretty elementary. Spend the whole day debugging and understanding what was going on, so the possitive side is that I have a good understanding of the codebase (an yes, I like the solution).

Problem is solved by validating if the worker was idle/busy, and only when idle I am executing the worker method. The problem was that to many worker processes got instantiated, and with that the solution ran out of resources pretty quickly. Thanks anyway for the proper help!

shaneharter · 2014-05-01T02:48:10Z

Well... that doesn't make me feel good. You shouldn't have to do that. Mind throwing up the exact code you were running (the simple example you mentioned) into a gist or something so I can run it?

What distro were you using?

Even if it's just the daemon using a more aggressive backoff -- slowing everything down in these instances -- it would be better than random cryptic errors.

shaneharter · 2014-05-01T02:50:07Z

Nevermind -- missed it upthread -- but would still like to know distro details.

willebil · 2014-05-01T13:08:08Z

Distro is Ubuntu 14.04. Let me know if you want the code example.

brunnels · 2014-05-02T17:51:38Z

Here's one I've been getting occasionally that may be related:

Warning: shm_get_var(): variable key 2391 doesn't exist in PHP-Daemon/Core/Worer/Via/SysV.php on line 240

Call Stack:
    0.0004     225192   1. {main}() lib/daemon/run.php:0
    0.8631    6006160   2. Core_Daemon->run() lib/daemon/run.php:15
17949.1137   20455592   3. Core_Daemon->dispatch() PHP-Daemon/Core/Daemon.php:413
17949.1142   20458248   4. call_user_func_array() PHP-Daemon/Core/Daemon.php:510
17949.1142   20458416   5. Core_Worker_Mediator->run() PHP-Daemon/Core/Daemon.php:510
17949.1145   20460120   6. Core_Worker_Mediator->fork() PHP-Daemon/Core/Worker/Mediator.php:742
17949.1181   20460712   7. Core_Daemon->task() PHP-Daemon/Core/Worker/Mediator.php:636
17949.1386   20389312   8. call_user_func_array() PHP-Daemon/Core/Daemon.php:591
17949.1386   20389400   9. Core_Daemon->{closure:PHP-Daemon/Core/Daemon.php:555-559}() PHP-Daemon/Core/Daemon.php:591
17949.2726   11232776  10. call_user_func() PHP-Daemon/Core/Daemon.php:557
17949.2727   11233480  11. Core_Worker_Mediator->start() PHP-Daemon/Core/Daemon.php:557
17998.6530   67289328  12. Core_Worker_Via_SysV->get() PHP-Daemon/Core/Worker/Mediator.php:778
17998.6588   67291992  13. Core_Worker_Via_SysV->{closure:PHP-Daemon/Core/Worker/Via/SysV.php:239-245}() PHP-Daemon/Core/Worker/Via/SysV.php:276
17998.6588   67292040  14. shm_get_var() PHP-Daemon/Core/Worker/Via/SysV.php:240

Notice: Trying to get property of non-object in PHP-Daemon/Core/Worker/Via/SysV.php on line 241

Call Stack:
    0.0004     225192   1. {main}() lib/daemon/run.php:0
    0.8631    6006160   2. Core_Daemon->run() lib/daemon/run.php:15
17949.1137   20455592   3. Core_Daemon->dispatch() PHP-Daemon/Core/Daemon.php:413
17949.1142   20458248   4. call_user_func_array() PHP-Daemon/Core/Daemon.php:510
17949.1142   20458416   5. Core_Worker_Mediator->run() PHP-Daemon/Core/Daemon.php:510
17949.1145   20460120   6. Core_Worker_Mediator->fork() PHP-Daemon/Core/Worker/Mediator.php:742
17949.1181   20460712   7. Core_Daemon->task() PHP-Daemon/Core/Worker/Mediator.php:636
17949.1386   20389312   8. call_user_func_array() PHP-Daemon/Core/Daemon.php:591
17949.1386   20389400   9. Core_Daemon->{closure:PHP-Daemon/Core/Daemon.php:555-559}() PHP-Daemon/Core/Daemon.php:591
17949.2726   11232776  10. call_user_func() PHP-Daemon/Core/Daemon.php:557
17949.2727   11233480  11. Core_Worker_Mediator->start() PHP-Daemon/Core/Daemon.php:557
17998.6530   67289328  12. Core_Worker_Via_SysV->get() PHP-Daemon/Core/Worker/Mediator.php:778
17998.6588   67291992  13. Core_Worker_Via_SysV->{closure:PHP-Daemon/Core/Worker/Via/SysV.php:239-245}() PHP-Daemon/Core/Worker/Via/SysV.php:276


[2014-05-02 13:43:38] 12560               Exception Caught in Task: Core_Worker_Via_SysV::get Failed. Could Not Decode Message: Array
[2014-05-02 13:43:38] 12560               (
[2014-05-02 13:43:38] 12560                   [call_id] => 2391
[2014-05-02 13:43:38] 12560                   [status] => 0
[2014-05-02 13:43:38] 12560                   [microtime] => 1399052510.8313
[2014-05-02 13:43:38] 12560                   [pid] => 21996
[2014-05-02 13:43:38] 12560               )

StevenMapes · 2015-10-20T14:20:05Z

so what was the actual fix for this? I'm seeing it occur a lot and am having to kill the daemon, wait, then kill the worker processes and start up the daemon again via a cron just to try to limit the problem. The auto_restart_interval sadly isn't enough to clean it up and it's not a limit of physical resources either.

defusiondev · 2015-11-17T08:46:58Z

Also getting this quite a lot particularly when the task the daemon is performing has timeout issues. Have needed to kill the daemon and restart it to get it to work.

cwhsu1984 · 2017-09-12T13:43:53Z

I really hope that someday someone could somehow fix this. This project is great but it really needs to get some bugs fixed, especially like this one. If you use this project, you are likely to see the errors like this:

[2017-09-12 21:09:11] 17597 QueueReceive  SysV::put() Failed for call_id 552: Retrying. Error Code: 11
PHP Warning:  msg_send(): msgsnd failed: Resource temporarily unavailable in /path/to/your/workdir/app/vendor/shaneharter/php-daemon/Core/Worker/Via/SysV.php on line 208

Like @willebil said, you don't really need a complex worker to make this happen. After inspecting the codes and performing some tests, I think it is caused by the message queue overflow. Run "ipcs -q" as soon as you start the daemon, and run it again when seeing "Resource temporarily unavailable" message. You will find that the used-bytes of queue is 16376.

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
...
0x5101058b 3080200    ubuntu     666        16376        170

To be more specific, you only see the errors when your workers idle for some time, and the time for errors to occur depends on how many workers you start. Anyway, if the idling workers keep sending messages to the queue, the queue will be full eventually.

So, in order to fix this, we will need to know why we should send message to the queue even if the worker is idling. I don't have time to trace this bug now, but I will just leave a clue for anyone who is interested in it.

Anyway, @brunnels provides another modified version here #39. Too bad, it still has the same issue. So you are still going to see the errors as long as your workers idle for a few minutes.

I do some experiments on "default max size of queue" by using sysctl like this:

sudo sysctl -w kernel.msgmnb=65536

By increasing your message queue size, it really deferred the issue but still won't solve it. It does prove that this is where the issue comes from. Still, one thing that I don't understand is that when seeing the following message:

[2017-09-13 11:21:49] 23677 QueueReceive  Dropped Call. Requeuing Call 86 To `listen`
[2017-09-13 11:21:49] 23677 QueueReceive  Dropped Call. Requeuing Call 87 To `listen`

The requesting call might be dropped, but the message is not removed from the message queue. Hence, dropping call does not decrease the used size of message queue. If dropping call would removed the useless message in the message queue, we might have a workaround by just increasing the size of message queue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 30, 2014

willebil commented Apr 30, 2014

willebil commented Apr 30, 2014

shaneharter commented May 1, 2014

shaneharter commented May 1, 2014

willebil commented May 1, 2014

brunnels commented May 2, 2014

StevenMapes commented Oct 20, 2015

defusiondev commented Nov 17, 2015

cwhsu1984 commented Sep 12, 2017 •

edited

Loading

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40

Comments

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 29, 2014

shaneharter commented Apr 29, 2014

willebil commented Apr 30, 2014

willebil commented Apr 30, 2014

willebil commented Apr 30, 2014

shaneharter commented May 1, 2014

shaneharter commented May 1, 2014

willebil commented May 1, 2014

brunnels commented May 2, 2014

StevenMapes commented Oct 20, 2015

defusiondev commented Nov 17, 2015

cwhsu1984 commented Sep 12, 2017 • edited Loading

cwhsu1984 commented Sep 12, 2017 •

edited

Loading