Timeout issues - finding their way back to the application #6
This is the type of message the queue worker gets (sometimes) when the queue connection drops.
I think that message is handed over to Laravel as a queue message, which Laravel then attempts to run as a job (which it shouldn't, but Laravel's lack of validation of queue messages is another bugbear of mine). This locks up the queue worker, which can only be killed from the Linux command line. I'm not sure what it's doing - blocking while waiting for its non-existent child worker to finish, I suspect.

Anyway, I need to do some more tests of long-running connections, but if what I think is happening is correct, then inspecting the messages that the Azure queue library gives to this connector, and looking for some text that indicates the connection has dropped, would allow it to reconnect and not pass the message back to Laravel as a job.

The message format `"Fail:\nCode: %s\nValue: %s\ndetails (if any): %s."` is `AZURE_ERROR_MSG`, the template used by `ServiceException`. I guess "details" is the raw XML data returned by the API, and "Value" is the human-readable response phrase.
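If that turns out to be the right diagnosis, the check could be as simple as a substring match on the exception text before the connector hands anything back to Laravel. A minimal sketch - `isDroppedConnection` is a made-up name, and the marker list (beyond the 500 timeout text quoted later in this thread) is guesswork that would need confirming against real captures:

```php
<?php

use MicrosoftAzure\Storage\Common\Exceptions\ServiceException;

// Hypothetical guard for the connector's polling path: inspect the
// ServiceException text before anything is handed to Laravel as a "job".
function isDroppedConnection(ServiceException $e): bool
{
    // First entry is grounded in the 500 error seen in this thread;
    // the others are guesses at common dropped-connection wording.
    $markers = [
        'Operation could not be completed within the specified time',
        'Connection timed out',
        'Connection refused',
    ];

    foreach ($markers as $marker) {
        if (strpos($e->getMessage(), $marker) !== false) {
            return true;
        }
    }

    return false;
}
```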
Thanks for the report - I'll take a look into it this weekend.
Let me know what I can do to help. There are a lot of pins set up in the chain here, and some things just have to be left to run for a few hours before the problem is observed. So if there are specific points where it would be helpful for me to capture data, I can do that.
I also think this is something that Laravel should take on board to a certain extent - perhaps allowing the connector to throw a "connection lost" exception so that Laravel can ask for a reconnection before trying again. Or something like that.
Noticed the Azure PHP storage libraries have all been bumped by a point release in the last day. Not sure if the new version tackles any of the timeout/auto-reconnect issues.
I've had a look into this, and from what you are seeing it appears there may be two different scenarios in play.
The Laravel workers operate on a polling frequency rather than maintaining open connections; however, cURL could potentially be holding connections open under the hood, and the default poll of every 3 seconds could be sufficient to keep a connection alive (a way to probe this is sketched just after the list below). Laravel already tries to re-attempt what it understands to be lost connections for queues, but this doesn't cover this specific scenario. The options I see are:

1. Catch the timeout inside this connector and reconnect transparently before handing anything back to Laravel.
2. Get the Azure timeout messages added to Laravel's own lost-connection detection, so the framework handles the recovery.
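One way to test the held-open-connection theory independently of the SDK is a bare Guzzle loop that polls at the worker's cadence while you count sockets on the box (with `ss -t` or `netstat`). A sketch, assuming Guzzle 6 and a made-up account/queue name:

```php
<?php

require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Poll the way the worker does (every 3 seconds). Run once as-is, then
// once with the Connection: close header removed, and compare the number
// of established sockets to the queue endpoint in each case.
$client = new Client(['base_uri' => 'https://myaccount.queue.core.windows.net/']);

for ($i = 0; $i < 10; $i++) {
    $client->get('myqueue/messages', [
        'headers'     => ['Connection' => 'close'], // force the socket shut each poll
        'http_errors' => false, // unauthenticated requests will 4xx; we only care about sockets
    ]);
    sleep(3);
}
```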
I'm not sure either will fix your first error, as that seems like a standard connection error. This could be Azure refusing the connection due to too many concurrent requests, or just an occasional network issue.
Also, I am not 100% sure whether the latter HTTP 500 error is actually a formal timeout / connection lost, so we may get some pushback from the Laravel maintainers if it is just a transient error where the Azure service itself timed out and failed.
Thanks, that's some great analysis - very much appreciated. I think you may be right about connections being held open. If Laravel is polling the queue every few seconds, and it is opening a new connection each time, then if connections are not closed properly we would certainly have a problem building up. Without restarting the queue worker, it would take anywhere between four and six hours before it effectively froze. I could kill the workers (I have three of them) from the command line.

The long delays I was seeing in pushing to the queues could very well be a symptom of the same problem. If the queue workers have taken all available connection slots, then opening a connection to push a message could be blocking (somewhere in the route to the queue). Then we end up with a kind of deadlock. If this is what is happening, then restarting the queue workers every ten minutes, which I am doing now, is probably the best workaround for the time being.

So, the options. I personally have an issue with Laravel's single list of "connection lost" messages for database connections, which is hard-coded and has to cater for a wide range of databases. IMO those lists should live in the individual database connectors, and should be extendable, to provide easy fixes for specific cases and the ability to add messages in other languages (I cannot fathom how the non-English locales are coping with lost connections).

So the queue workers also support lost connection handling? I wasn't aware of that. If so, then keeping it in this queue connector makes more sense to me (option 1). This connector knows about Azure, and should be telling the Laravel framework what to do :-)

I'm not an Azure expert, but I can get someone else to monitor our connection pool. If the connections are growing every few seconds, then that would give us some clues for what needs to be addressed in the longer term.

Looking at the Azure code last week, it seemed that just about any failure to talk to the remote queue resulted in a 500 exception. I suppose it is reasonable to always rethrow this as "connection lost", as a restart of the queue worker is likely to clear out any crap that has been building up by restarting the process.
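For what it's worth, the ten-minute restart workaround can be driven from Laravel's own scheduler rather than raw cron (this assumes Supervisor or similar is configured to relaunch the workers after `queue:restart` asks them to finish their current job and exit):

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Ask every running queue worker to exit after its current job;
        // the process supervisor then starts fresh workers with fresh
        // connections, clearing out any leaked sockets.
        $schedule->command('queue:restart')->everyTenMinutes();
    }
}
```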
Just looking at the lost connection detection in https://github.com/illuminate/queue/blob/9c063c804f6ab0596c68c41830496029fbca31f7/Worker.php#L9 - that's a bit WTF, TBH. It may be relevant for the database queue connector, and even then only for error messages that it knows about (it lacks many), but it is completely inappropriate for non-database connectors. This is why there will be push-back on adding to that list - it's a list of database errors.
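For anyone following along, the detection in question boils down to something like this - a paraphrase of the `Illuminate\Database\DetectsLostConnections` trait that the worker pulls in (the real list is longer, but every entry is a database driver message):

```php
<?php

use Illuminate\Support\Str;

// Paraphrase of Laravel's DetectsLostConnections trait: a hard-coded
// substring scan over the exception message. Nothing in the real list
// knows anything about Azure or HTTP-based queues.
function causedByLostConnection(\Exception $e): bool
{
    return Str::contains($e->getMessage(), [
        'server has gone away',
        'no connection to the server',
        'Lost connection',
        'Error while sending',
        'is dead or not enabled',
    ]);
}
```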
I had the same problem and want to know how to solve it. Same 500 timeout error: "Fail: Code: 500, Value: Operation could not be completed within the specified time. details (if any):". Error file:
Summary of issue
If the queue connection times out, due to being connected for some time without being used, then no attempt is made to reconnect when sending a message fails. With Laravel, if a database connection times out, Laravel attempts to detect this, reconnects, then retries the query one more time.
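For comparison, the database behaviour being referenced works roughly like this - a loose paraphrase of `Illuminate\Database\Connection::run()`, with the moving parts reduced to callables:

```php
<?php

// Loose paraphrase of how Laravel's database layer recovers: if the
// failure looks like a lost connection, reconnect and retry exactly once.
// A second failure propagates normally. The ask in this issue is for the
// queue connection to receive the same treatment.
function runWithReconnect(callable $operation, callable $causedByLostConnection, callable $reconnect)
{
    try {
        return $operation();
    } catch (\Exception $e) {
        if (!$causedByLostConnection($e)) {
            throw $e;
        }

        $reconnect();        // e.g. rebuild the PDO handle, or an Azure REST proxy
        return $operation(); // single retry on the fresh connection
    }
}
```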
The complete error message, including file & line numbers
I have a model observer that watches for certain events, and dispatches a job to handle certain types of event. Occasionally, updating the model will throw this error (returned by `eloquent->save()` when the observer is triggered):

**Copy of your config/queue.php file**
Software versions
This Package: 5.5.2
Laravel: 5.6
PHP: 7.2
Operating System: Linux
Any other information
While the connections are up, everything works fine; it is the timeouts that are the problem, both when sending on a timed-out connection (expecting an auto-reconnect) and when listening on a long-running process.
My queue workers are restarted every 10 minutes to try to avoid the queue listener timing out. When it does time out, the MS Azure library either throws an exception or passes back a message containing the timeout error details (not sure which, but I think it is the latter). When this happens, the listener is left listening on a dead connection, and so the queues are not served. That timeout should also be caught if possible and a reconnect done. The queue sender can afford to let the connection drop, so long as it can restart it when needed (like Laravel database connections), but the queue listener must always make sure its connection is up, running, and open.
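To illustrate the listener side of that requirement, here is one possible shape for a self-healing poll loop. This is a sketch rather than the package's actual code: `$buildProxy` stands in for however the connector constructs its Azure `QueueRestProxy`, and `$handle` for the job dispatch.

```php
<?php

use MicrosoftAzure\Storage\Common\Exceptions\ServiceException;

// Sketch of a listener that treats any ServiceException as a suspect
// connection: rebuild the Azure queue proxy and carry on polling, instead
// of letting timeout text flow back to Laravel as if it were a job payload.
function listenForever(callable $buildProxy, string $queueName, callable $handle)
{
    $proxy = $buildProxy();

    while (true) {
        try {
            $messages = $proxy->listMessages($queueName)->getQueueMessages();
        } catch (ServiceException $e) {
            $proxy = $buildProxy(); // connection is suspect: start fresh
            continue;
        }

        foreach ($messages as $message) {
            $handle($message);
            $proxy->deleteMessage(
                $queueName,
                $message->getMessageId(),
                $message->getPopReceipt()
            );
        }

        sleep(3); // mirror the worker's default polling cadence
    }
}
```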