SysV::put() Failed for call_id 2: Retrying. Error Code: 11 #40
Comments
Thanks for the fast response! It seems the code is locked: the SysV PHP code does 3 retries, if I am correct, and then gives up, but it does not halt the daemon. As I said, I am still debugging the current situation and am not 100% sure what happens. I am trying to understand error code 11, but I simply cannot find any reference to its meaning. The log posted here is from the syslog; the daemon console only registered the worker error, and nothing more is logged. I will try your suggestion, see what happens, and get back to you. |
A downside of using the SysV message queues and SHM in php is that it's little used and poorly documented. Didn't realize that when I built it. In the last big release (about a year ago) I refactored the Mediator code so the IPC is abstract, so now you can swap out SysV for another message queue (POSIX would be my general purpose choice, but server-based queues would also work). I just haven't done the work yet. |
Do I see a challenge coming up? :-D I'll stick with debugging for now, but you never know :-) |
Not sure if you've tried running your daemon with the --debugworkers option set. It will let you step thru each call to the worker and possibly isolate problems. This error, though, is coming from the library. This is the function PHP is using here: If you look at the error codes, there are constants. If you can find which constant is "11", it might give clues. |
Ok, actually the daemon tries to handle this for you:
So in this case it sleeps for 0.05 seconds, then 0.10, then 0.20 between failures, retrying after each sleep as the error code indicates it should. This is an OS (or stdlib) failure code. Let me know if cleaning the IPC resources made any difference. |
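The retry schedule described above can be sketched in a few lines. This is a minimal Python illustration, not the daemon's PHP code: `put_with_backoff`, its parameters, and the injected `send` and `sleep` callables are hypothetical names. The only things taken from the thread are the three retries and the 0.05 / 0.10 / 0.20 second delays.

```python
import errno
import time


def put_with_backoff(send, retries=3, base_delay=0.05, sleep=time.sleep):
    """Call send() and retry on EAGAIN, doubling the delay after each failure.

    send  -- hypothetical callable that raises OSError(EAGAIN) when the queue is full
    sleep -- injectable so the schedule can be inspected without really waiting
    """
    for attempt in range(retries + 1):
        try:
            return send()
        except OSError as exc:
            # Only "try again" errors are retried; anything else is re-raised,
            # and so is the final EAGAIN once the retries are used up.
            if exc.errno != errno.EAGAIN or attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.05, 0.10, 0.20 by default
```

With the defaults, a send that keeps failing is attempted four times in total with three sleeps in between, and then the error propagates to the caller. Note that backing off only helps if something drains the queue in the meantime.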
Check. I will do a deep dive on all of this. It will take some time, since the error only occurs after the workers have been running for a while. Will get back to this ASAP. |
You could do this: run with --debugworkers and type "help". You'll see the commands. There's a command you can use, skipfor, that will run the daemon normally (skipping breakpoints) for however many seconds you say. So you could say "skipfor 600" and 10 mins later it'll hit a breakpoint and you can step thru. What this will let you do is get one of these failures, then use ipcs to show the current IPC resources. Maybe there are messages backing up on the queue and it's hitting a resource limit? |
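To watch for messages backing up, the `ipcs -q` output can be polled and parsed. The following is a rough Python sketch that assumes the usual util-linux column layout (key, msqid, owner, perms, used-bytes, messages). The function names and the sample text are illustrative only and are not taken from a real run.

```python
import subprocess

# Illustrative sample in the util-linux layout; the values are made up.
SAMPLE = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x000004d2 0          www-data   666        16376        178
"""


def parse_ipcs_queues(text):
    """Return one dict per message queue row found in `ipcs -q` output."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key;
        # the banner and the header row do not.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append({
                "key": fields[0],
                "msqid": int(fields[1]),
                "owner": fields[2],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


def current_queues():
    """Run `ipcs -q` on this machine and parse its output."""
    out = subprocess.run(["ipcs", "-q"], capture_output=True,
                         text=True, check=True).stdout
    return parse_ipcs_queues(out)
```

Calling `current_queues()` once right after the daemon starts and again when a failure shows up makes it easy to compare the used-bytes and messages columns between the two points.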
I went back and created a simple example with just one worker. The worker waits a random number of seconds and then finishes. Pretty straightforward. The worker is initialized with the following:
And is called in the execute method
The actual code in the worker executed is the following:
After a couple of iterations the exact same error occurs. I must be getting something incredibly simple wrong, since the provided examples run without any problem. Debugging has not gotten me anywhere. Any ideas? |
The error code is not really helpful; see http://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html. It refers to "Try again" ;-) The daemon actually tries to do that 3 times, and then execution gets messed up. |
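The raw number can also be looked up locally instead of in a table. Here is a small Python sketch using only the standard `errno` and `os` modules; `describe_errno` is a hypothetical helper name. On Linux, code 11 is EAGAIN, but the numeric value differs on some other platforms, so the lookup is done on the machine that produced the log.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, C library message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")  # reverse lookup, platform specific
    return name, os.strerror(code)


if __name__ == "__main__":
    # 11 is the value from the log; on Linux this resolves to EAGAIN.
    print(describe_errno(11))
    # Looking up the symbol instead of the number is portable.
    print(describe_errno(errno.EAGAIN))
```

The message text for EAGAIN is the same "Resource temporarily unavailable" string that appears in the PHP warning for msg_send().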
Indeed, I had been overlooking something pretty elementary. I spent the whole day debugging and working out what was going on, so the positive side is that I now have a good understanding of the codebase (and yes, I like the solution). The problem is solved by checking whether the worker is idle or busy, and executing the worker method only when it is idle. The problem was that too many worker processes got instantiated, so the solution ran out of resources pretty quickly. Thanks anyway for the proper help! |
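The idle check described above amounts to refusing new calls while every worker slot is taken. Below is a minimal Python sketch of that guard. `WorkerPool` and its methods are hypothetical stand-ins for illustration and do not reflect the daemon's actual worker API.

```python
class WorkerPool:
    """Hypothetical pool with a fixed number of worker slots."""

    def __init__(self, size):
        self.size = size
        self.busy = 0

    def is_idle(self):
        """True while at least one slot is free."""
        return self.busy < self.size

    def dispatch(self, job):
        """Start the job only if a worker is idle; report whether it started."""
        if not self.is_idle():
            return False  # skip this iteration instead of queueing more work
        self.busy += 1
        return True

    def finish(self):
        """Mark one running job as done, freeing its slot."""
        self.busy = max(0, self.busy - 1)
```

In the main loop, a call is only issued when `dispatch` returns True, so the number of outstanding calls never exceeds the number of workers.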
Well... that doesn't make me feel good. You shouldn't have to do that. Mind throwing up the exact code you were running (the simple example you mentioned) into a gist or something so I can run it? What distro were you using? Even if it's just the daemon using a more aggressive backoff -- slowing everything down in these instances -- it would be better than random cryptic errors. |
Nevermind -- missed it upthread -- but would still like to know distro details. |
Distro is Ubuntu 14.04. Let me know if you want the code example. |
Here's one I've been getting occasionally that may be related:
|
So what was the actual fix for this? I'm seeing it occur a lot and am having to kill the daemon, wait, then kill the worker processes and start the daemon up again via cron just to try to limit the problem. The auto_restart_interval sadly isn't enough to clean it up, and it's not a limit of physical resources either. |
I'm also getting this quite a lot, particularly when the task the daemon is performing has timeout issues. I have needed to kill the daemon and restart it to get it to work. |
I really hope that someday someone can fix this. This project is great, but it really needs some bugs fixed, especially ones like this. If you use this project, you are likely to see errors like this:
Like @willebil said, you don't need a complex worker to make this happen. After inspecting the code and performing some tests, I think it is caused by message queue overflow. Run "ipcs -q" as soon as you start the daemon, and run it again when you see the "Resource temporarily unavailable" message. You will find that the queue's used-bytes value is 16376, just under the Linux default per-queue limit of 16384 bytes (kernel.msgmnb).
To be more specific, you only see the errors when your workers idle for some time, and how long it takes for the errors to occur depends on how many workers you start. If the idling workers keep sending messages to the queue, the queue will eventually fill up. So, to fix this, we need to know why messages are sent to the queue even while a worker is idling. I don't have time to trace this bug now, so I will just leave this clue for anyone who is interested. @brunnels provides another modified version in #39. Unfortunately, it has the same issue, so you will still see the errors once your workers idle for a few minutes. I ran some experiments on the default maximum queue size using sysctl, like this:
Increasing the message queue size deferred the issue but did not solve it. It does prove that this is where the issue comes from. Still, one thing I don't understand is what happens when the following message appears:
The requesting call might be dropped, but its message is not removed from the message queue. Hence, dropping a call does not decrease the used size of the message queue. If dropping a call also removed the now-useless message from the queue, simply increasing the queue size might be a workable workaround. |
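The overflow argument can be checked with a toy model: a queue with a byte limit that is only ever written to fills up after a fixed number of sends, and a larger limit only moves that point further out. This Python sketch is not the kernel's accounting. `BoundedQueue` and `sends_until_full` are hypothetical names, and the 92-byte message size is an assumption chosen so the numbers line up with the 16376 used bytes reported above.

```python
import errno


class BoundedQueue:
    """Toy model of a byte-limited message queue that nobody reads from."""

    def __init__(self, max_bytes=16384):  # 16384 is the Linux default msgmnb
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.messages = 0

    def put(self, size):
        """Add a message of `size` bytes, or fail like a non-blocking msgsnd."""
        if self.used_bytes + size > self.max_bytes:
            raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
        self.used_bytes += size
        self.messages += 1


def sends_until_full(max_bytes, message_size):
    """Return (messages accepted, bytes used) at the first EAGAIN."""
    queue = BoundedQueue(max_bytes)
    try:
        while True:
            queue.put(message_size)
    except OSError as exc:
        if exc.errno != errno.EAGAIN:
            raise
        return queue.messages, queue.used_bytes


if __name__ == "__main__":
    print(sends_until_full(16384, 92))  # (178, 16376)
    print(sends_until_full(65536, 92))  # (712, 65504)
```

Quadrupling the limit lets roughly four times as many messages through before the first failure, which matches the observation that a bigger queue defers the error without removing it. Only consuming or discarding the stale messages stops the growth.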
Hi,
I am building a solution that crawls some websites, and after a while the following messages are "dumped" in the log.
[2014-04-29 13:32:11] 6013 Workers SysV::put() Failed for call_id 2: Retrying. Error Code: 11
[29-Apr-2014 13:32:11 UTC] PHP Warning: msg_send(): msgsnd failed: Resource temporarily unavailable in /var/www/worker/includes/daemon/Core/Worker/Via/SysV.php on line 208 pid 6013
I am currently backtracing and debugging to find the root cause, and I have also tried to find an explanation of the error code, but so far I have not been able to find any documentation for it. Any ideas?