-
Notifications
You must be signed in to change notification settings - Fork 717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
candump: Prevent file I/O blocking socket reading #381
Comments
Hi @Ipiano , |
Unfortunately, I think our solution will need to introduce pthread to do logging off the main thread. The larger issue for us is that if file I/O blocks for an extended period, it will prevent reading from the socket and messages are dropped. The potential to lose messages when rotating logs with something like We will evaluate starting with #268, finishing that, and then adding the logging thread in a fork from there. |
Thanks! Looking forward to your updated patch. |
I suspect we could "solve" the issue by increasing the buffer queue size, but an auto-resizing queue is preferable because we don't know how long it'll end up blocking. No, we're writing to the disk on custom hardware that we produce. It's a known issue with our hardware + BSP combo |
Hm, did i missed some thing? What are "auto-resizing queue"? As far as i know, we do never wait until the queue is full to read it. |
With the pthread you are moving the queue to the user space. Introducing more complicity and more code. If IO will be even longer, you will start to drop packet in user space or in user space and kernel. In the kernel case, you will have diagnostic only one level - kernel interface drop counters. In case of double queue in the app, you will need to introduce extra diags. |
Using (and increasing) the existing socket buffer between the kernel and user space is preferred than introducing another buffer. That said, if you introduce a user space buffer, an auto-resizing buffer without bounds is IMHO a bad idea, better use a sensible sized buffer as default, and add a command line switch to set the size. Instead of using threads, you can switch both file descriptions into non blocking mode and hang them into an |
The goal here is to avoid dropping any messages. Messages are dropped when we are unable to write to the file at the same speed that data is produced on the socket. Dropped messages are a huge issue for us because embedded systems don't write log files, so they're the only data we have to diagnose a lot of issues. If we increase the size of the socket queue at startup, it probably helps in most cases; but we don't have an upper bound right now for how long disk I/O can block and there'd likely still be some drops; maybe we could find a number that's sufficient for 90% of cases and call it a day though. This is already supported, and would require no changes. If we auto-resize the existing socket buffer, I'm not aware of a way to know that it needs to be resized until after we have already dropped a message; in which case a few messages will be dropped at the start of logs, but eventually it will (probably) stabilize until you reboot. If we use a user-space buffer, we can resize it as needed in order to guarantee a message is not dropped because we can discover that the buffer is not an appropriate size before dropping the message. I agree that not having an upper bound on it is probably not a great solution; but if we 're able to define an upper bound on the length of time a system might block up, then we can solve this with option 1 - just resizing the socket buffer. Assuming a non-blocking file descriptor would return Regardless, I think the answer to my initial question is "no, we don't want your 'fix' for your janky-ass hardware issues in candump"; and I'm fine with that being the answer; that's what I was initially expected when I asked. |
From my point of view a sensible sized buffer should be the default. Add a command line switch to overwrite the size and an additional switch to create the bottomless buffer you need. Add extra warnings to the help output like "use at your own risk, may eat all your memory". :) |
tl;dr - I have some changes, but we're not sure if they're appropriate to upstream because they're a bit specific to our issues. Please advise
I'm working with some systems that are particularly I/O bound and we're seeing a lot of dropped can messages in the logs we're taking with
candump
.We run
candump
on startup, logging to a file. Every hour, we kill it, rotate the file, and start it again. We've opted to use this approach over something likelogrotate
because an external tool can't guarantee we won't rotate in the middle of a write to the file (unless the kernel can?)Regardless of if this is an appropriate rotation strategy, we're seeing a lot of random dropped messages in the middle of logs. We've determined that it's due to the fact that we occasionally saturate the disk I/O on the system -
candump
blocks writing to the file for a couple seconds, can't read from the socket, and the socket overflows and drops frames.We're looking to solve this by spawning a new thread and doing all logging from there, buffering internally in an auto-expanding circular queue. This approach fixes the dropped frames issue; but currently is unbounded on the amount of memory that
candump
will use at runtime - in practice it's bounded by the amount of time that disk I/O blocks; once the buffer has expanded to a size that accommodates our average bus load for a fairly long block, it's pretty stable.At the same time, we've moved file rotation into
candump
in that side thread; this guarantees we won't drop any frames between ending one log and starting a new one. We could also accomplish this by starting the new log before stopping the old one and generating a brief period of overlap.So my question is really just should we put in a PR for some or all of these changes? I could see arguments against pulling in
pthread
as a dependency or makingcandump
a bit heavier-weight tool. The approach of adding file rotation internally also breaks thesingle-tool, single-function
philosophy, but I don't know how much this project strives to accomplish that.The text was updated successfully, but these errors were encountered: