Compare performance vs std::fs::read_dir #120
I can reproduce this:
walkdir does have some overhead. You can see all the different branching going on in its iterator implementation.
Notably, that last link has a subtle but important difference: it builds a `PathBuf` for every entry, just as walkdir does. That is, the `read_dir` counting program becomes:

```rust
fn main() {
    let dir = &std::env::args().nth(1).expect("Usage: countdir dirname");
    let count = std::fs::read_dir(dir)
        .expect("read_dir")
        .map(|dent| dent.unwrap().path())
        .count();
    println!("{}: {} files", dir, count);
}
```

Then the performance difference almost evaporates.
This is close enough that the rest of it might just be a result of the remaining overhead in walkdir's abstractions. It would in theory be possible to specialize to the particular use case of a shallow listing like this one. With all that said, I've been thinking about adapting some of the ideas from the `ignore` crate's parallel traverser. So I'll leave this issue open to track that work, but I otherwise don't think there is much low hanging fruit here.
I've made some progress on attempting to fix this (and other bugs, such as #23), but I've kind of burned myself out on it. In particular, it essentially requires re-rolling all of the platform specific directory traversal APIs (completely avoiding std's `fs::read_dir`). My WIP is on my `ag/sys` branch, with the meat of it in the `sys` module. I'll get back around to this eventually, but I need to work on something else first. :-)
This is absolutely fantastic!
So they say, but when loading a directory with 1M files it ends up being ~30% faster (950ms -> 600ms). I'd say that's absolutely what I'm interested in ;). The API in your WIP code is exactly what I was looking for. It exposes the raw system call data without any extra allocation or expensive conversions and verification, as far as I can see. I'm only worried about this part:

```rust
/// Create a new cursor with the specified capacity. The capacity given
/// should be in bytes, and must be a multiple of the alignment of a raw
/// directory entry.
fn with_capacity(capacity: usize) -> DirEntryCursor {
    // TODO: It would be nice to expose a way to control the capacity to
    // the caller, but we'd really like the capacity to be a multiple of
    // the alignment. (Technically, the only restriction is that
    // the capacity and the alignment have a least common multiple that
    // doesn't overflow `usize::MAX`. But requiring the size to be a
    // multiple of alignment just seems like good sense in this case.)
    // ...
}
```

I'm a bit worried about the fact that the buffer size is not part of the public API. When testing on my system (running hunter) it's actually faster with a 4KB buffer than with a 16KB buffer (if my calculation is correct). With the 16KB buffer it's just a bit faster than using readdir. This is mostly because I copy the buffer and process the dirents in another thread, while the main thread does another blocking call to getdents64. That doesn't work if the buffer is so large that all dirents fit into it and only one system call is done.

I think exposing the capacity in terms of a "maximum, but not guaranteed, number of dirents" makes the most sense, since that's what the syscall does, too. As you wrote in the comment, the actual size of the dirents is unpredictable and the best you can do is "guess" the average length. So something like `fn read_dirents(&mut self, max: usize)`.
@rabite0 Thanks for the thoughts. But yeah, that detail is probably what I am least worried about. I actually managed to burn myself out on this for a second time a few months ago. Getting the API correct in a way that both minimizes mistakes and provides the best performance possible is extremely challenging. Certainly, I have a new appreciation for why std's platform abstractions are the way they are. Most/all of the low level stuff is done, including almost completely re-rolling `std::fs::read_dir` for each platform. Overall, I've reached a point where I am not only writing a significant amount of new code, but much of that new code involves `unsafe`.
I can imagine that writing a cross platform API that's fast, correct, and easy to use, especially with Windows in the mix, is quite the effort. From what I can see in your code, even on *nix systems it's more of a mess than I'd have thought, and it's a mess that has to be abstracted away somehow; just researching and testing all that crap sounds like quite the horror.

It's true that this herculean task, just to see the numbers go down under ridiculous conditions, might be seen as a vanity project of sorts. I mean, millions of files in a directory is not a realistic scenario other than creating it just to test performance. On the other hand it's a bit like chasing the dragon: seeing the numbers go down is quite the dopamine hit.

I wonder if it would make sense to release the low-level APIs as a separate crate? I think that would be quite useful by itself, and if it's mostly done you'd have some low hanging fruit to push out.

This might be cheesy, but thank you for your hard work, you're a legend, don't stress out over this too much. I can understand if you don't think it's worth all the hours you have to put in. There's only so much time in a day. But the dragon is quite attractive indeed. :)
Indeed. That's pretty much what drives me forward. :-)
My thinking was to release it as part of the `walkdir` crate itself. Certainly, I wouldn't stop someone else from taking my work so far and moving forward with it. I have no idea when I'll finish it. :-)
Thanks. :-) <3
Sounds like a plan. I already "adopted" a few crates along the way, and I was actually thinking about maybe just stealing your code for the Linux getdents implementation as-is (it looks really great), instead of writing my own. I'll probably get into it when I start adding high speed code paths for BSD and other OSs.
Hope I'm not being too naive with this, but is io_uring on Linux a consideration? I've seen some benchmarks that are hard to ignore. No idea how well its model fits in though; I haven't used it.
I don't really know much about io_uring. Doesn't it require a fairly recent version of Linux?
Yes, true, it requires Linux 5 at least I think, and some features require even more recent versions. So yes, you'd need a second impl for older kernels. Walkdir operates in a predictable way, so I assume the async aspect of io_uring could be leveraged to "queue up" subsequent iteration entries while the user's application logic is running. That would add quite a bit of complexity though, and I have no idea if it's worth it. :) End rambling
@tbillington That can already be done with threads. I don't think async is necessary for that. I suspect that's a large reason why the parallel traverser in the `ignore` crate is so fast.
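For illustration, a minimal sketch of the threaded version (hypothetical code, not walkdir's or `ignore`'s actual implementation): a producer thread performs the blocking directory reads and queues paths over a bounded channel while the consumer runs the application logic.

```rust
use std::fs;
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let dir = std::env::args().nth(1).expect("Usage: prefetch dirname");
    let (tx, rx) = mpsc::sync_channel(1024); // bounded queue of prefetched entries

    // Producer: performs the blocking readdir syscalls ahead of the consumer.
    let producer = thread::spawn(move || -> std::io::Result<()> {
        for dent in fs::read_dir(dir)? {
            if tx.send(dent?.path()).is_err() {
                break; // consumer hung up
            }
        }
        Ok(())
    });

    // Consumer: application logic runs while the producer keeps reading.
    let mut count = 0u64;
    for _path in rx {
        count += 1;
    }
    producer.join().expect("producer panicked")?;
    println!("{} entries", count);
    Ok(())
}
```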
Seeing as what appears to be blocking this (as well as #23) seems to be the need to re-write everything without using `std::fs::read_dir`, could the `openat` crate help here? Though, it's totally possible that I just misunderstand #120 (comment) and think your issue is actually something else :)
I don't see how that crate helps, sorry. It doesn't really do any of the hard parts. There is some code overlap, but it's pretty small and uninteresting. Basically, all openat could do is help with the lowest level platform calls. If you want to talk more about this, I would encourage you to review the code on my branch: https://github.com/BurntSushi/walkdir/tree/ag/sys/src
I don't know if this helps at all with complexity, but personally, I would have no objections whatsoever if "limit number of open file descriptors" was simply incompatible with the fastest traversal algorithm. Would eliminating that requirement substantially simplify this algorithm?
Marginally, maybe. But it should be supported somewhere.
(Quite offtopic, but since it was brought up in this thread: io_uring is for async, yes, but if you're thinking of threads it would already be worth using here. The point is that the kernel manages some background threads for you, and your single main thread avoids the syscalls, so less overhead = faster IO. It's generally recommended to have one ring per thread, if you really want multiple threads on top of that.)

Anyway, I came here for openat and... wow, cross platform is hard. I have a few opinions on an interface I'd like to use on my own (exposing directory FDs and just the last component's d_name in a callback, plus per-directory user state if callers want to build a full path up to there themselves), but that's not something that would be practical for general use anyway, and after reading most of this I don't think it's worth bothering.
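For a concrete picture of that model, here is a minimal single-read sketch using the `io-uring` crate (an assumed dependency; note io_uring has no directory-listing opcode, so this only illustrates the submit/complete flow on a regular file, with an assumed filename):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?; // queue depth 8
    let file = File::open("README.md")?; // assumed to exist
    let mut buf = vec![0u8; 4096];

    // Describe a read; the kernel performs it in the background.
    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(0x42);
    unsafe {
        // Safety: `buf` outlives the submission.
        ring.submission().push(&read_e).expect("submission queue is full");
    }

    // A single syscall submits the request and waits for one completion.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion queue is empty");
    println!("read returned {}", cqe.result());
    Ok(())
}
```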
I had a quick look here because I happened to stumble upon this issue while looking for a performant solution to iterate directories. Anyway, one thing does come to mind: is there any reason you can't use NtQueryDirectoryFile for the Windows implementation? Kind of curious; I'm new-ish to Rust, but I was surprised it isn't used. It's the de-facto standard for querying the file system in the fastest way possible on Windows. It returns multiple items at once at various detail levels (specified by the FILE_INFORMATION_CLASS argument).
Std uses `FindFirstFileW`/`FindNextFileW` there. Probably more relevant to this issue is that there is a change along these lines in the works.
Yeah, I'm pretty excited about that.
I recently wrote a short test program using walkdir to count 6M files in a single directory. I then wrote the same test program using `std::fs::read_dir`, and found a surprisingly large performance difference.

To create the test data:
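A sketch of one way to generate that kind of test data (the directory name and file naming here are illustrative assumptions, not the reporter's actual setup):

```rust
use std::fs::{self, File};

fn main() -> std::io::Result<()> {
    // Hypothetical setup: 6M empty files in a single directory, as in the report.
    let dir = "testdata";
    fs::create_dir_all(dir)?;
    for i in 0..6_000_000u32 {
        File::create(format!("{}/file{}", dir, i))?;
    }
    Ok(())
}
```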
Using walkdir:
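A walkdir-based counter in the spirit of the one described (a sketch, not the reporter's exact program; `min_depth(1)` skips the root directory itself so the count matches `read_dir`'s entries):

```rust
use walkdir::WalkDir;

fn main() {
    let dir = std::env::args().nth(1).expect("Usage: countdir dirname");
    let count = WalkDir::new(&dir)
        .min_depth(1)
        .into_iter()
        .map(|dent| dent.unwrap().path().to_path_buf())
        .count();
    println!("{}: {} files", dir, count);
}
```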
Performance:
Using `std::fs::read_dir`:
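And a sketch of the plain `read_dir` counter (again illustrative; note that, unlike the variant discussed in the first comment above, it never builds a `PathBuf` per entry):

```rust
fn main() {
    let dir = std::env::args().nth(1).expect("Usage: countdir dirname");
    let count = std::fs::read_dir(&dir)
        .expect("read_dir")
        .map(|dent| dent.unwrap()) // no `.path()`: no per-entry allocation
        .count();
    println!("{}: {} files", dir, count);
}
```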
Note that the two programs use almost the same system time, and strace shows them making almost exactly the same syscalls other than an extra `lstat` (which makes sense, as walkdir uses `std::fs::read_dir` underneath). However, walkdir uses a lot more user time.

This seems worth investigating, to try to reduce the overhead.