Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parent directory appears out of order with contents_first(true) in certain circumstances #163

Open
danielparks opened this issue Aug 9, 2022 · 6 comments · May be fixed by #165
Open

Parent directory appears out of order with contents_first(true) in certain circumstances #163

danielparks opened this issue Aug 9, 2022 · 6 comments · May be fixed by #165

Comments

@danielparks
Copy link

I’m not sure what’s going on here. I‘m happy to spend some time sorting this out, but I figured I’d report it first to see if it was obvious to you.

This is easier to understand by example than by explanation. Nevertheless, I’ll try to define it first.

Explanation

When:

  1. A certain directory structure is required. I’m not sure what the criteria are.
  2. Walkdir.new() is passed a symlink to the directory structure.
  3. contents_first(true)

and walkdir is ascending out of a certain subdirectory, it will return another entry in the parent directory before the subdirectory itself.

Example

use walkdir::WalkDir;

fn main() {
    let _ = std::fs::create_dir_all("dir/mdir/ndir");
    let _ = std::fs::File::create("dir/afile");
    let _ = std::os::unix::fs::symlink("dir", "link");

    walk("dir");
    println!("");
    walk("link");
}

fn walk(root: &str) {
    let mut last_depth: usize = 0;

    println!("walking {}", root);
    for entry in WalkDir::new(root).contents_first(true) {
        let entry = entry.unwrap();
        let depth = entry.depth();

        println!("{}", entry.path().display());
        if last_depth > depth && !entry.file_type().is_dir() {
            eprintln!("    Left directory but next entry was not directory itself");
        }

        last_depth = depth;
    }
}

Output:

walking dir
dir/mdir/ndir
dir/mdir
dir/afile
dir

walking link
link
link/mdir/ndir
link/afile
    Left directory but next entry was not directory itself
link/mdir

I would expect walking the symlink to either behave the same as walking the directory, or just refuse to traverse the symlink (since follow_links defaults to false).

@danielparks
Copy link
Author

Actually, if you set follow_links(true) they both work. So probably the bug is that it follows the first symlink, which violates some assumption.

@danielparks
Copy link
Author

Well, I have a fix, but I’m not quite sure why it works. The problem seems to be related to not adding the root symlink to deferred_dirs.

In other words, change

walkdir/src/lib.rs

Lines 844 to 846 in 461f4a8

if md.file_type().is_dir() {
itry!(self.push(&dent));
}
to:

            if md.file_type().is_dir() {
                itry!(self.push(&dent));
                if self.opts.contents_first {
                    self.deferred_dirs.push(dent);
                    return None;
                }
            }

(I suspect that doesn’t match your style, so this is more to make things clear.)

@BurntSushi
Copy link
Owner

Thanks for the diagnosis and coming up with a fix!

Would you mind submitting a PR? And in particular, it would be good to come up with a comment explaining that branch and why it fixes the bug.

I've had a quick look at the code, but unfortunately the contents_first option really ties the code up in knots and it is difficult to follow. So I'm not able to re-contextualize quickly.

@danielparks
Copy link
Author

Sure!

Yeah, definitely a little hard to follow. The price of flexibility, I suppose.

It will probably be a few days at least — I’d like a better understanding so I can be sure I'm fixing it right.

danielparks added a commit to danielparks/walkdir that referenced this issue Aug 10, 2022
This fixes two incorrect behaviors when `contents_first` is on and the
root path is a symlink:

  1. The root path was returned first in the iterator instead of last.
  2. Some subdirectories were returned out of order.

The issue was that root symlinks were returned immediately rather than
being pushed onto the `deferred_dirs` vec. That lead to `deferred_dirs`
and `depth` being out of sync, which lead to deferred directories being
processed one ascent too late.

TO DO:
  [ ] Better tests (the current ones are not stable)
  [ ] Remove FIXME
  [ ] Consider commenting changed code; it seems somewhat hard to follow
      to my sleep deprived brain.

Fixes: BurntSushi#163
@danielparks
Copy link
Author

Looks like there’s another bug when both same_file_system and contents_first are true and a subdirectory is a mount. I’ll fix that and write tests for it too.

I’m pretty sure I’m not breaking anything on Windows, but I don’t have recent Windows license to check (and even if I did I know nothing about reparse points beyond what I gleaned from rust-lang/rust#46484).

@danielparks
Copy link
Author

I’m pretty sure I’m wrong about there being a bug related to same_file_system. Still working on my understanding to be sure I get it right.

danielparks added a commit to danielparks/directory-count that referenced this issue Aug 13, 2022
danielparks added a commit to danielparks/walkdir that referenced this issue Aug 14, 2022
This fixes two incorrect behaviors when `contents_first` is on and the
root path is a symlink:

  1. The root path was returned first in the iterator instead of last.
  2. Some subdirectories were returned out of order.

The issue was that root symlinks were returned immediately rather than
being pushed onto the `deferred_dirs` vec. That lead to `deferred_dirs`
and `depth` being out of sync, which lead to deferred directories being
processed one ascent too late.

This also changes the internal handling of the `same_file_system` option
slightly. Previously, if a directory was found to be on a different file
system it would _not_ be pushed to the stack, but it _would_ be pushed
to the deferred list. This did not matter because in those cases
`handle_entry()` would return to `next()`, which would immediately pop
the directory off the deferred list and return it. This ensures that the
directory is returned immediately, and is not pushed to either the stack
or the deferred list. (I think this makes the code clearer.)

Fixes: BurntSushi#163
danielparks added a commit to danielparks/walkdir that referenced this issue Aug 14, 2022
This fixes two incorrect behaviors when `contents_first` is on and the
root path is a symlink:

  1. The root path was returned first in the iterator instead of last.
  2. Some subdirectories were returned out of order.

The issue was that root symlinks were returned immediately rather than
being pushed onto the `deferred_dirs` vec. That lead to `deferred_dirs`
and `depth` being out of sync, which lead to deferred directories being
processed one ascent too late.

TO DO:
  [ ] Remove FIXME
  [ ] Consider commenting changed code; it seems somewhat hard to follow
      to my sleep deprived brain.

Fixes: BurntSushi#163
@danielparks danielparks linked a pull request Aug 14, 2022 that will close this issue
danielparks added a commit to danielparks/walkdir that referenced this issue May 29, 2023
This fixes two incorrect behaviors when `contents_first` is on and the
root path is a symlink:

  1. The root path was returned first in the iterator instead of last.
  2. Some subdirectories were returned out of order.

The issue was that root symlinks were returned immediately rather than
being pushed onto the `deferred_dirs` vec. That lead to `deferred_dirs`
and `depth` being out of sync, which lead to deferred directories being
processed one ascent too late.

This also changes the internal handling of the `same_file_system` option
slightly. Previously, if a directory was found to be on a different file
system it would _not_ be pushed to the stack, but it _would_ be pushed
to the deferred list. This did not matter because in those cases
`handle_entry()` would return to `next()`, which would immediately pop
the directory off the deferred list and return it. This ensures that the
directory is returned immediately, and is not pushed to either the stack
or the deferred list. (I think this makes the code clearer.)

Fixes: BurntSushi#163
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants