
Add module parameter to disable prefetch in zfs_readdir #16318

Open · wants to merge 1 commit into base: master
Conversation

clhedrick

Add parameter zfs_readdir_dnode_prefetch_enabled, defaulting to 1, to control whether zfs_readdir prefetches metadata for all objects it looks at when reading a directory.

Setting it to 0 can be important for NFS servers with directories containing many subdirectories.

Motivation and Context

This improves #16254, although a more radical redesign is possible. We have NFS servers with directories containing 200,000 to 500,000 subdirectories; these arise in AI training scenarios. Our NFS servers can have 20 or more threads running at 100% CPU for hours because of this.

The problem has been found to be an interaction between the NFS code and zfs_readdir. If information about a directory is cached on the client but not the server, then when the server tries to validate the NFS file handle, it adds an entry to the dnode cache for the directory. Because the new dnode is not connected to the rest of the directory hierarchy, the NFS (actually exportfs) code walks up the tree trying to connect it. That ends up calling zfs_readdir to find the directory in its parent. Because zfs_readdir reads the parent looking for the directory, the time it takes is proportional to the size of the parent; if every directory is processed, the total time is N**2 in the size of the parent. Reading a directory would have only moderate overhead, except that zfs_readdir does a prefetch on every item as it looks at it, even if the item is already in the ARC. Of course, if it's in the ARC no disk I/O is done, but a substantial amount of code still runs.

We've found a factor of 10 to 20 speedup by skipping the prefetches. This is enough to move us from a painful situation to one we can live with.

Of course a linear speedup in an N**2 problem isn't a full solution, but it's enough to survive the directory sizes we see in practice. A full solution would almost certainly require help from the kernel developers, which is going to be hard to get, since directory searching is fast enough for other Linux file systems that the problem only comes up with ZFS.

Description

This change adds a module parameter, zfs_readdir_dnode_prefetch_enabled, normally 1, which can be set to 0 to disable the prefetch that happens in zfs_readdir.
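For reference, a hypothetical usage sketch (requires root and the patched module; note that later in this PR the parameter is reworked into zfs_readdir_dnode_prefetch_limit, so adjust the name to the version you are running):

```shell
# Persistently, via modprobe configuration, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_readdir_dnode_prefetch_enabled=0

# At runtime, via sysfs:
echo 0 > /sys/module/zfs/parameters/zfs_readdir_dnode_prefetch_enabled
cat /sys/module/zfs/parameters/zfs_readdir_dnode_prefetch_enabled
```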

How Has This Been Tested?

We have done testing with a directory containing 200,000 subdirectories, with the results discussed above. It has been running in production for several weeks.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@stuartthebruce

Would it make sense to change this from a boolean to a scalar threshold value on the directory link count to determine whether or not to prefetch?

@clhedrick
Author

Yes, it probably does make sense. I'm willing to try. (I didn't actually write the code, but I can probably figure that out.)

However, before doing that it's worth asking whether we can solve the underlying problem. The problem occurs because NFS wants to know the name that the directory has in its parent directory. There's a function to get the name of an inode. It's specific to the file system type. ZFS doesn't have it, so the fallback is used. The fallback ends up using zfs_readdir to search the directory looking for the inode.

So the obvious question is: can we create a zfs_getname that does it better? Is there somewhere in the ZFS data structures where we could cache the filename? That way, when zfs_readdir (or our replacement) looks through the parent directory, it can cache the name of every inode as it goes by. Then when we come to look up that node, the name will already be there, and zfs_getname won't need to do a search at all.

Since I'm not a ZFS developer, I don't know the data structures. Is there a place the name can be put, and obviously cleared if there's a rename or delete? With a normal file, it can be in more than one directory, so a single entry wouldn't work. But for a directory, there's only one name that it has in its parent (i.e., ".."), so we don't have that ambiguity.

If someone could point me to what needs to be done, I'm willing to code it and test it. I do have experience doing kernel code. (I was one of the original Linux developers when Linus was first writing Linux.)

@clhedrick
Author

clhedrick commented Jul 4, 2024

For clarity, the actual function is

int get_name(const struct path *path, char *name, struct dentry *child);

  • get_name - default export_operations->get_name function
  • @path: the directory in which to find a name
  • @name: a pointer to a %NAME_MAX+1 char buffer to store the name
  • @child: the dentry for the child directory.

The problem occurs when data is in cache on the client, but the dentry for the child directory isn't in the dentry cache anymore. So the data can't be stored in the dentry, since it may not exist for some of the directories whose names we'd like to cache.

man/man4/zfs.4 Outdated
have a very large number of subdirectories.
A reasonable value would be 20000.


To be pedantic, please consider specifying what the units are.

@clhedrick
Author

I've changed it to have a limit rather than a boolean. If the limit is non-zero, prefetch is disabled for any directory larger than the limit.

Note that it fails checkstyle. Checkstyle demands that I use a POSIX type, replacing ulong with ulong_t, but the module_param macro doesn't support ulong_t. There are examples in other parts of the code using ulong.

I have tested this with 0, a value smaller than my directory size, and a value larger than my directory size. The performance is sufficiently different that I can tell it works as intended.

@clhedrick
Author

There's no unit. It's a count: the directory size. Unlike other file systems, in ZFS the size of a directory is a count of the files in it.

@stuartthebruce

stuartthebruce commented Jul 5, 2024

there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it.

Perhaps too pedantic, but in general directories have a size in bytes as well as a size in link count. However, it looks like ZFS sets the size equal to the link count.

@clhedrick
Author

I looked a bit further into writing an alternative zfs_get_name. It looks like one could use zap_value_search. However, the basic functions for iterating through the ZAP seem to be the same. I'm sure it would be faster, but it's not clear whether it would be enough to justify the extra code.

@tonyhutter
Contributor

I can take a look at this. Would you mind squashing the commits and taking a look at the checkstyle errors?

@clhedrick
Author

clhedrick commented Jul 12, 2024 via email

@clhedrick
Author

clhedrick commented Jul 12, 2024

The complaint about the non-POSIX definition was hard to understand. All the POSIX alternatives generate compile errors, and I found plenty of uses of ulong in other files.

Here's my attempt to use ulong_t

‘param_ops_ulong_t’ undeclared here

Here's another file with the same type of declaration not using POSIX.

in ./module/os/linux/zfs/zvol_os.c
module_param(zvol_max_discard_blocks, ulong, 0444);

I was unable to figure out what to do.

@tonyhutter
Contributor

I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a ulong_t for now in module_param() as a workaround.

To squash commits run git rebase -i and squash all your commits into your first commit. After you've successfully squashed them, you can do a force push (git push --force) to update this PR with the single commit.

@clhedrick
Author

clhedrick commented Jul 12, 2024

ulong is the one that compiles but fails style
ulong_t is the one that passes style but doesn't compile.
I'm leaving ulong.

I haven't been able to figure out why mine fails but it works in other files. I suspect there's an include file I need.

I think I've squashed it correctly.

@clhedrick
Author

clhedrick commented Jul 12, 2024 via email

@tonyhutter
Copy link
Contributor

@clhedrick looks like a /* CSTYLED */ will suppress the cstyle error:

diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c
index 11f665d22..efe2371fa 100644
--- a/module/os/linux/zfs/zfs_vnops_os.c
+++ b/module/os/linux/zfs/zfs_vnops_os.c
@@ -4257,6 +4257,7 @@ EXPORT_SYMBOL(zfs_map);
 module_param(zfs_delete_blocks, ulong, 0644);
 MODULE_PARM_DESC(zfs_delete_blocks, "Delete files larger than N blocks async");
 
+/* CSTYLED */
 module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
 MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_limit,
        "No zfs_readdir prefetch if non-zero and size > this");

Would you mind adding that in? Then we should be pretty much good to go.

@clhedrick
Author

OK. I just tested on a 6.8 kernel with the limit 0, 20000 and 400000, with a directory of size 200000. I got the expected results. 4 min with 0 and 400000, 20 sec with 20000.

@clhedrick
Author

This is tricky to test. The problem occurs when files are cached on the client, but metadata is not, and the dnode is not cached on the server. This is realistic for us, because our clients have very large memory. Data can stay cached for days if not weeks, but the attribute cache times out after something like 90 seconds.

I believe the problem occurs when the client does a getattr to see if the data is still fresh. The getattr uses what is in effect an inode number. That triggers the reconnect code in the NFS server, which searches the parent directory for the directory, in order to find its name.

So to duplicate the problem, clear the cache on the server but not the client, and wait at least 120 seconds on the client for the attribute cache to time out. Clearing the cache on the client would cause it to do full lookups on all files, which would be fast.
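The steps above could be sketched roughly as follows (hedged: the sysctl path and the acregmax/acdirmax defaults are standard Linux/NFS, but the mount point and directory are placeholders, and the exact client timeout depends on mount options):

```shell
# On the NFS *server* (as root): drop cached inodes/dentries so the
# dnodes go cold.
echo 3 > /proc/sys/vm/drop_caches

# On the *client*: do NOT drop caches; just wait past the attribute
# cache timeout (acregmax/acdirmax, commonly 60-90s).
sleep 120

# Then access the still-data-cached tree on the client.  Each getattr
# now takes the server through
#   reconnect_path -> exportfs_get_name -> get_name -> zfs_readdir.
ls -l /mnt/nfs/bigdir > /dev/null
```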

@tonyhutter
Contributor

Just so I'm clear, zfs_readdir_dnode_prefetch_limit means "stop prefetching file metadata after zfs_readdir_dnode_prefetch_limit files have been pre-fetched in a directory"? If so, could you update your man page entry with something like this:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..588bf9339 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1781,9 +1781,11 @@ When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with more entries than that value.  For example, if set to 1000, then stop
+prefetching file metadata after 1000 file's metadata have been prefetched for
+the directory.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
 A reasonable value would be 20000.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int

@clhedrick
Author

The documentation is correct. I compare the directory size with the limit. What you describe could be implemented; I would use a counter in the loop. It would take a lot of testing in lots of different situations to know whether that is better or not. It's not obvious that it is.

A more complex fix would be to implement a zfs-specific get_name. That would basically disable prefetch only in this specific NFS operation. Whether that's worth doing depends partly on whether the prefetch is a good idea in the first place for very large directories. I'm not so sure it is. Consider "ls -l": if information on all files is already in the ARC, the prefetches are wasted. If not, it will queue up a bunch of prefetches, in the order they occur in the directory traversal. But ls sorts the files alphabetically, so I think it's just as likely that the first file will have to wait for a bunch of prefetches to finish before getting its data. A prefetch isn't any faster than a demand fetch; it's useful only if the demand comes later, so that by the time it arrives the data is in memory, where it would otherwise have had to wait. That probably doesn't matter for typical file access. "ls -l" seems like the best hope for the prefetch, but I'm skeptical because of the ordering issue.

You'd need some pretty careful testing in quite a lot of situations. But I'm skeptical about the usefulness of the prefetch. It's certainly not useful for us, where metadata is always on NVMe.

If you can suggest a way to get a directory's name without doing a traversal of the parent at all, that would be worth doing. I'd be willing to code it. But I don't see any obvious way to do that given the data structures.

@tonyhutter
Contributor

there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it.

I compare the directory size with the limit

OK, I misunderstood. So it literally is the 'size' from stat() on the directory. (https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/What_is_directory_size)

May I suggest the following change?:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..083ffeaa4 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,18 @@ that ends up not being needed, so it can't hurt performance.
 .
 .It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
 Disable prefetches in readdir for large directories.
-(Normally off)
 When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.  Directory size in this case is the
+size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
 A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
 Disable QAT hardware acceleration for SHA256 checksums.

@clhedrick
Author

clhedrick commented Jul 12, 2024

To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself. Obviously I don't actually call stat, although I'm prepared to believe that this is the number that stat uses.

I just checked a few directories, and it seems close, off by one maybe.

I found code incrementing or decrementing it for actions on a single file.

@tonyhutter
Contributor

To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself.

I double checked and that appears to be correct. On ZFS it's basically the number of files/subdirs in the directory, but on other filesystems it's like the metadata size needed for the directory. So it's not strictly defined 🤷‍♂️

How about:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..a6a00e107 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,20 @@ that ends up not being needed, so it can't hurt performance.
 .
 .It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
 Disable prefetches in readdir for large directories.
-(Normally off)
 When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
+Directory size in this case is the size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+On ZFS, this directory size value is approximately the number of files
+and subdirectories in the directory.
 A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
 Disable QAT hardware acceleration for SHA256 checksums.

@tonyhutter
Contributor

Can you edit the commit message to fix the commitcheck error?:

error: commit message body contains line over 72 characters

Other than that, it should be good to go.

@clhedrick
Author

clhedrick commented Jul 15, 2024

OK. I also noted in the documentation that it only applies to Linux. I have no reason to think BSD needs it. It's needed because of details of the NFS implementation, and I would hope that BSD hasn't chosen the same approach.

At any rate, the BSD readdir doesn't prefetch.

@clhedrick
Author

What's up with this? Test failures can't be due to this patch, as it obviously has no effect if not enabled.

Is this going to be merged?

@clhedrick
Author

What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup. But I don't have the name. Given the data structures I don't see how you could do a lookup without the name or searching.

@behlendorf
Contributor

behlendorf commented Aug 13, 2024

What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup. But I don't have the name.

Indeed, sorry, my initial reading of this was a bit too hasty. You're right, the only way we have to do this lookup is by iterating over the entire directory. That said, I still think it may be worthwhile to add our own get_name() callback. There's a lot of extra work, in addition to the prefetching, that ends up happening in the fallback zfs_readdir case. That could all be eliminated, which I imagine would speed things up further.

@clhedrick
Author

I think the place to look is zap_value_search. But there's another thing to think about: would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found.

@behlendorf
Contributor

I think the place to look is zap_value_search.

This should work. The directory ZAP stores the dataset object number as the first integer, and we use this same object number when setting the inode number. The virtual .zfs snapshot directories don't appear in the directory ZAPs and use virtual inode numbers, so there might be a minor corner case there we need to handle.

would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found.

If we've gone through all the expense of prefetching the directory it makes sense to me to populate it and mark it connected.

@clhedrick
Author

My feeling is that if we make prefetch connect the dnode, performance of this loop isn't so critical, so we don't need our own get_name. As long as the directory dnode is connected, none of this code will be called.

This is for realistic cases. In principle, I think the limit to prefetch is needed independent of the NFS issue described here. ZFS claims to support huge directories. But if you actually tried to use them, even if you aren't using NFS and thus don't have the issue described in this request, the first time you do an "ls" of a directory with a zillion files the prefetches will thrash the ARC and possibly fill task queues and other things. I can't say whether this would ever happen in reality.

I note that the BSD code for readdir doesn't prefetch.

Anyway, connecting the dnode would be interesting. zfs_readdir triggers an async operation that ends with a prefetch-done callback, which doesn't know what is being prefetched. So we'd have to use the prefetch finish callback, I assume.

This is getting beyond what I think I can do. For the moment we'll stick with this patch. It's hard to know whether anyone else is seeing this issue, since it wouldn't show any symptom other than more CPU usage than you'd like. A lot of our folks do AI; the training datasets seem to have a structure that triggers this. I doubt we're the only place doing this.

@clhedrick
Author

It's actually not so clear that we need the prefetch. There may be enough info in the inode. But I don't understand what is going on in the reconnect code, so I can't tell what would be needed to actually do what I propose, and whether it would be expensive enough to cause other issues.

@amotin
Member

amotin commented Sep 6, 2024

I note that the BSD code for readdir doesn't prefetch.

As far as I can see, the prefetch code is identical for Linux and FreeBSD: the first zfs_readdir() call always prefetches, while later calls do it only if the client is actually trying to access the returned entries via zfs_dirlook(). I guess your workload in some way tries to get more than just an entry name and inode number for the entries, and that triggers the following prefetches, which may be legitimate in that case (unless those accesses are done by unrelated tasks). The only question is that you may not need them once all the dnodes are already in the ARC.

To address that last case of unneeded prefetches when the data are already in the ARC, a mechanism comes to mind that I implemented some time ago for the speculative prefetcher in dmu_zfetch() (see the missed argument): do not issue data prefetches (only prefetch indirects) for an I/O stream until we get at least one cache miss on a demand read access. It gave a significant CPU usage reduction there.

@youzhongyang
Contributor

@[
    osq_lock+1
    __mutex_lock_slowpath+27
    mutex_lock+64
    dnode_block_freed+124
    dbuf_prefetch_impl+280
    dbuf_prefetch+43
    dmu_prefetch+208
    zfs_readdir+817
    zpl_iterate+105
    iterate_dir+178
    get_name+348
    exportfs_get_name+116
    reconnect_path+264
    exportfs_decode_fh_raw+289
    nfsd_set_fh_dentry+585
    fh_verify+494
    nfsd3_proc_getattr+134
    nfsd_dispatch+374
    svc_process_common+1205
    svc_process+201
    nfsd+257
    kthread+327
    ret_from_fork+31
]: 84231614

We have this issue on our server. In 60 seconds, this osq_lock() stack trace occurred 84231614 times. The nfsd processes were 90%+ busy.

@behlendorf added the label "Status: Revision Needed" (changes are required for the PR to be accepted) on Nov 23, 2024
@clhedrick
Author

What revision is needed? I'm using the patch with 2.2.6. The diff doesn't apply automatically to master, but it's obvious what is needed.

If it's going to be accepted, I'll fix it to work with master, but so far it doesn't seem like people are ready to take it.

Obviously we'll continue to use it, and it sounds like at least one other site needs it. Others may need it as well without realizing it.

My assessment is that the obvious alternative will be only marginally better. I'd rather rewrite things so a directory traversal isn't needed at all, but I'm not convinced that can be done without changing non-zfs kernel code.

@amotin
Member

amotin commented Nov 25, 2024

@clhedrick I don't know what Brian meant by that; maybe just the merge conflict. But on my side I am not happy with just disabling prefetch because it is not needed in some cases. I don't like users turning random knobs. I'd prefer some automatic detection, similar to what should be there already, maybe with some improvements; see my later comment.

@clhedrick
Author

clhedrick commented Nov 25, 2024

Anything that's worth doing is beyond what I can do. One reasonable approach would be to prefetch each item only once. But the prefetch is async, and I don't see any place to tell whether one has already been scheduled.

Our cache hit rates for metadata are so high that I doubt there's any advantage to prefetching for us.

For myself, I'd prefer just to comment out

	if (prefetch)
		dmu_prefetch_dnode(os, objnum, ZIO_PRIORITY_SYNC_READ);

If someone else has this issue, that might be easiest. (I haven't found a way to get a diff for this change from GitHub.)

@youzhongyang
Contributor

Our directories do not have many entries, so the locking contention during the prefetch was the problem.

Contributor

@snajpa snajpa left a comment


Let's merge this and if it proves to be a performance problem I'll try to address it while it's still only in master - I mean, assuming this was blocked by my "smartass" comment, which wasn't really the intention at all

@github-actions bot removed the label "Status: Revision Needed" on Nov 29, 2024
@clhedrick
Author

I have resolved the conflict, and added advice on when to set the parameter in the man page.

Add parameter zfs_readdir_dnode_prefetch_limit, defaulting to 0, to
control whether zfs_readdir prefetches metadata for objects it looks
at when reading a directory. If zero, metadata is prefetched for all
directory entries. If non-zero, metadata is prefetched only if the
directory has fewer entries than this.

Setting it to a non-zero value can be important for NFS servers with
directories containing many subdirectories.

Signed-off-by: Charles Hedrick <[email protected]>
Co-authored-by: Chris Siebenmann <[email protected]>
@clhedrick clhedrick reopened this Nov 29, 2024
@clhedrick
Author

clhedrick commented Nov 29, 2024

rebased

I've built it and booted it. Simple tests look OK.

There was a comment about impact on performance. Note that it has no effect unless explicitly enabled.

Member

@amotin amotin left a comment


I still object to this change, since the limit on directory size it introduces makes no sense. Large directories have the same or maybe even higher chance that the dnodes they reference are not in cache, and so benefit from prefetch. For smaller directories it might just be less of a problem, being hidden by network and NFS stack latency.

Anything that's worth doing is beyond what I can do.

That is not true if you think for a moment. All we need is to limit prefetch to only the first directory read. We just need another flag in znode_t (or to replace/modify z_zn_prefetch to save space) to record that the directory was read to the end and there is nothing to prefetch next time. Or we could save the position up to which the directory was previously read; the complication there, though, is that as I understand it the serialized ZAP cursor is not monotonically increasing, so we could not compare it for bigger/smaller, but we could save just zc_hash, which should be enough for a quick and cheap comparison. And/or we could store the time of the last full read, and issue prefetches only if enough time has passed since the previous one to justify another prefetch.

Any of that would be pretty trivial to implement, and would be more logical than what is proposed here. BTW, please keep the code in sync between OSes. The FreeBSD code is similar enough to update it as well.

@amotin added the label "Status: Revision Needed" on Nov 30, 2024
@clhedrick
Author

The reason I said doing more is beyond me is that I'm not a ZFS developer. One implication is that I don't know the lifetime of various data structures, so it's not clear to me where I can put information and how long it will live.

@amotin
Member

amotin commented Nov 30, 2024

One implication is that I don't know the lifetme of various data structures, so it's not clear to me where I can put information and how long it will live.

A znode in most cases lives until the kernel, under memory pressure, decides to evict inodes (which evicts the respective znodes also), or until it is deleted. So I'd say it lives long enough. Just search the sources for z_zn_prefetch: it is used in only 5 places on Linux and 4 on FreeBSD, and its logic is pretty simple. Just extend it.

@clhedrick
Author

The ideal solution would be to provide a better way to go from inode to name. Is there someplace to stash the name for a file when readdir (or an equivalent in getname) reads past it, so it could be translated without a directory search?

@amotin
Member

amotin commented Nov 30, 2024

The ideal solution would be to provide a better way to go from inode to name. Is there someplace to stash the name for a file when readdir (or an equivalent in getname) reads past it, so it could be translated without a directory search?

I don't see how it is related to this topic of prefetch optimization. An inode might have many names due to hard links, or no name at all if all of them were deleted while it was open. It was discussed a lot at the recent OpenZFS developer summit, and even in the last few weeks somewhere here on GitHub.

@clhedrick
Author

The problem is not, as far as I know, triggered by people doing "ls". You wouldn't expect someone to do "ls" on the same directory enough to cause this performance disaster. Rather, it's triggered by NFS. If an existing file is being accessed, NFS sends a file handle from client to server. For Linux, the file handle is effectively file system plus inode number. The server wants to validate the file handle. As part of doing this, the NFS server calls a function, get_name, to go from inode number to file name. It is able to find the file's containing directory fairly easily, but to find the filename it calls readdir to read the directory until it finds the file with that inode number. You may not think this is a sensible approach, but it's what the Linux NFS server does.

Our clients have a large amount of memory, enough to keep the working set cached in memory, sometimes for days. It is common for them to access a file where the client has it cached but it's gone out of the server's cache. That causes the client to send an inode number for a file that is no longer in the server's dnode cache. In that case, if the inode is a directory, the code is triggered that reads the containing directory to find the inode number. Some of our AI training jobs have directories with 200,000 - 500,000 subdirectories. If the client tries to access all of them, each access of a subdirectory will cause a read of, on average, half the containing directory to map its inode number to its name. Hence the total number of prefetches will be something like 200,000 x 200,000 / 2. This isn't a performance optimization. It's saving us from a disaster.

It is possible to supply a zfs-specific implementation of getname. If there's a place to cache the name, then the first time you read the containing directory, the inode to name mapping could be cached. That could turn the problem from 200,000 squared to 200,000. I realize that the same inode number could appear more than once in a directory, but the problem occurs only for subdirectories, and I don't believe that's possible for a directory. Or if it is, the NFS code doesn't care.

However youzhongyang may well have a different situation where that approach wouldn't help. He may indeed need to disable the prefetch.

Before doing more coding I'm going to do some experiments to see if the prefetch is useful at all. I conjecture that it is not, in which case doing more coding to allow it in some cases would be unnecessary. I propose to try "ls -lU" in large directories with the cache cleared. That seems like the best case for the prefetch to have an effect.

Labels: Status: Design Review Needed (architecture or design is under discussion); Status: Revision Needed (changes are required for the PR to be accepted)

7 participants