Add module parameter to disable prefetch in zfs_readdir #16318
base: master
Conversation
Would it make sense to change this from a boolean to a scalar threshold value on the directory link count to determine whether or not to prefetch? |
Yes, it probably does make sense. I'm willing to try. (I didn't actually write the code, but I can probably figure that out.) However, before doing that it's worth asking whether we can solve the underlying problem. The problem occurs because NFS wants to know the name that the directory has in its parent directory. There's a function to get the name of an inode; it's specific to the file system type. ZFS doesn't have it, so the fallback is used, and the fallback ends up using zfs_readdir to search the directory looking for the inode. So the obvious question is: can we create a zfs_getname that does it better?

Is there somewhere in the ZFS data structures that we could cache the filename? That way, when zfs_readdir (or our replacement) looks through the parent directory, it could cache the name of every inode as it goes by. Then when we come to look up that node, the name would already be there, and zfs_getname wouldn't need to do a search at all. Since I'm not a ZFS developer, I don't know the data structures. Is there a place the name can be put, and obviously cleared if there's a rename or delete? With a normal file, it can be in more than one directory, so a single entry wouldn't work. But for a directory, there's only one name that it has in its parent (i.e. ".."), so we don't have that ambiguity.

If someone could point me to what needs to be done, I'm willing to code it and test it. I do have experience doing kernel code. (I was one of the original Linux developers when Linus was first writing Linux.)
For clarity, the actual function is: int get_name(const struct path *path, char *name, struct dentry *child);
The problem occurs when data is in cache on the client, but the dentry for the child directory isn't in the dentry cache anymore. So the data can't be stored in the dentry, since it may not exist for some of the directories whose names we'd like to cache.
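For reference, the per-filesystem hook being discussed is the get_name member of export_operations (the exportfs hook takes the parent dentry rather than a path, if I'm reading exportfs.h correctly). A ZFS-specific version would plug in roughly as in the sketch below; zpl_get_name is a hypothetical name, and the existing members of zpl_export_operations are elided:

```c
/* Sketch only: where a ZFS-specific reverse name lookup would be wired in
 * (module/os/linux/zfs/zpl_export.c on Linux). */
static int
zpl_get_name(struct dentry *parent, char *name, struct dentry *child)
{
	/* map the child's inode number back to its name in the parent
	 * directory, instead of letting exportfs readdir the whole thing */
	return (-EOPNOTSUPP);		/* placeholder body */
}

const struct export_operations zpl_export_operations = {
	.get_name	= zpl_get_name,
	/* existing .fh_to_dentry, .get_parent, ... members unchanged */
};
```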
man/man4/zfs.4
Outdated
have a very large number of subdirectories.
A reasonable value would be 20000.
To be pedantic, please consider specifying what the units are.
I've changed it to have a limit rather than a boolean. If non-zero, prefetch is disabled for any directory larger than the limit. Note that it fails checkstyle: checkstyle demands that I use a POSIX type, replacing ulong with ulong_t, but the macro doesn't support ulong_t. There are examples in other parts of the code using ulong. I have tested this with 0, a value smaller than my directory size, and a value larger than my directory size. The performance is sufficiently different that I can tell it works as intended.
There's no unit; it's a count, the directory size. Unlike other file systems, in ZFS the size of a directory is a count of the files in it.
Perhaps too pedantic, but in general directories have a size in bytes as well as a size in link count. However, it looks like ZFS sets the size to equal the link count.
I looked a bit further into writing an alternative zfs_get_name. It looks like one could use zap_value_search. However, the basic functions for iterating through the ZAP seem to be the same. I'm sure it would be faster, but it's not clear whether it would be enough to justify the extra code.
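To make the zap_value_search idea concrete, here is a minimal sketch that fills in the body of the hypothetical zpl_get_name from the earlier sketch. It reuses the pattern zfs_obj_to_path() already uses (zap_value_search with ZFS_DIRENT_OBJ(-1ULL) as the mask, since directory ZAP values carry the entry type in their high bits). This is untested and glosses over locking, teardown checks, and error mapping:

```c
/*
 * Sketch only (untested): reverse-map the child's object number to its name
 * in the parent directory by scanning the parent's ZAP.  This still iterates
 * the ZAP, but skips the readdir/filldir machinery and its per-entry dnode
 * prefetches.
 */
static int
zpl_get_name(struct dentry *parent, char *name, struct dentry *child)
{
	znode_t		*dzp = ITOZ(parent->d_inode);
	uint64_t	child_obj = child->d_inode->i_ino;
	int		error;

	/* directory ZAP values carry the entry type in the high bits */
	error = zap_value_search(ZTOZSB(dzp)->z_os, dzp->z_id, child_obj,
	    ZFS_DIRENT_OBJ(-1ULL), name);

	return (error ? -error : 0);	/* ZFS errors are positive errnos */
}
```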
I can take a look at this. Would you mind squashing the commits and take a look at the checkstyle errors? |
How do I squash?
The complaint about the non-POSIX definition was hard to understand. All the POSIX alternatives generate compile errors, and I found plenty of uses of ulong in other files. Here's my attempt to use ulong_t; the compile fails with "param_ops_ulong_t' undeclared here". Here's another file with the same type of declaration not using POSIX: ./module/os/linux/zfs/zvol_os.c. I was unable to figure out what to do.
I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a ulong_t for now in module_param() as a workaround. To squash commits run git rebase -i and squash all your commits into your first commit. After you've successfully squashed them, you can do a force push (git push --force) to update this PR with the single commit.
ulong is the one that compiles but fails style. I haven't been able to figure out why mine fails but it works in other files; I suspect there's an include file I need. I think I've squashed it correctly.
The original version, which allows one to disable all prefetch, is running in production. I tried the current version on a test server. I'll retry on test tomorrow to make sure nothing has broken in the update, though I don't see how that could have happened, since the only thing that changed was spacing.
The first chance to reboot production is mid August.
@clhedrick looks like a /* CSTYLED */ comment takes care of the checkstyle error:

diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c
index 11f665d22..efe2371fa 100644
--- a/module/os/linux/zfs/zfs_vnops_os.c
+++ b/module/os/linux/zfs/zfs_vnops_os.c
@@ -4257,6 +4257,7 @@ EXPORT_SYMBOL(zfs_map);
module_param(zfs_delete_blocks, ulong, 0644);
MODULE_PARM_DESC(zfs_delete_blocks, "Delete files larger than N blocks async");
+/* CSTYLED */
module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_limit,
"No zfs_readdir prefetch if non-zero and size > this"); Would you mind adding that in? Then we should be pretty much good to go. |
OK. I just tested on a 6.8 kernel with the limit set to 0, 20000, and 400000, against a directory of size 200000. I got the expected results: 4 min with 0 and 400000, 20 sec with 20000.
This is tricky to test. The problem occurs when files are cached on the client, but metadata is not, and the dnode is not cached on the server. This is realistic for us, because our clients have very large memory. Data can stay cached for days if not weeks, but the attribute cache times out after something like 90 sec. I believe the problem occurs when the client does a getattr to see if the data is still fresh. The getattr uses what is in effect an inode number. That triggers the reconnect code in the NFS server, which searches the parent directory for the directory, in order to find its name. So to duplicate the problem, clear cache on the server but not the client, and wait at least 120 sec on the client for the attribute cache to time out. Clearing cache on the client would cause it to do full lookups on all files, which would be fast.
Just so I'm clear, is this how it behaves?

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..588bf9339 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1781,9 +1781,11 @@ When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with more entries than that value. For example, if set to 1000, then stop
+prefetching file metadata after 1000 file's metadata have been prefetched for
+the directory.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int |
The documentation is correct: I compare the directory size with the limit. What you describe could be implemented; I would use a counter in the loop. It would take a lot of work across lots of different situations to know whether that is better or not, and it's not obvious that it is.

A more complex fix would be to implement a ZFS-specific get_name. That would basically disable prefetch only in this specific NFS operation. Whether that's worth doing depends partly upon whether the prefetch is a good idea in the first place for very large directories. I'm not so sure it is. Consider "ls -l". If information on all files is already in the ARC, the prefetches are wasted. If not, it will queue up a bunch of prefetches, in the order that they occur in the directory traversal. The best case seems to be "ls -l", but ls will sort the files alphabetically, so I think it's just as likely that the first file will have to wait for a bunch of prefetches to finish before it gets its data. Prefetch won't be any faster than a demand fetch; it's useful only if the demand comes later, so that by the time it comes the data is in memory, where it would otherwise have had to wait. But it probably doesn't matter for typical file access. The "ls -l" case seems like the best hope, but I'm skeptical because of the ordering issue. You'd need some pretty careful testing in quite a bunch of situations. But I'm skeptical about the usefulness of the prefetch, certainly for us, where metadata is always on NVMe.

If you can suggest a way to get a directory's name without doing a traversal of the parent at all, that would be worth doing, and I'd be willing to code it. But I don't see any obvious way to do that given the data structures.
OK, I misunderstood. So it literally is the 'size' from stat() on the directory. (https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/What_is_directory_size) May I suggest the following change?

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..083ffeaa4 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,18 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value. Directory size in this case is the
+size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself. Obviously I don't actually call stat, although I'm prepared to believe that this is the number that stat uses. I just checked a few directories, and it seems close, off by one maybe. I found code incrementing or decrementing it for actions with a single file.
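A quick userspace check of what that number is (not part of the patch; on ZFS a directory's st_size tracks its entry count as described above, give or take the off-by-one mentioned):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Print st_size and st_nlink for a directory.  On ZFS st_size is
 * approximately the number of entries, which is the quantity the proposed
 * zfs_readdir_dnode_prefetch_limit is compared against. */
int
main(int argc, char **argv)
{
	struct stat sb;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return (1);
	}
	if (stat(argv[1], &sb) != 0) {
		perror("stat");
		return (1);
	}
	printf("%s: st_size=%lld st_nlink=%lld\n", argv[1],
	    (long long)sb.st_size, (long long)sb.st_nlink);
	return (0);
}
```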
I double checked and that appears to be correct. On ZFS it's basically the number of files/subdirs in the directory, but on other filesystems it's like the metadata size needed for the directory. So it's not strictly defined 🤷♂️ How about: diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..a6a00e107 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,20 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
+Directory size in this case is the size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+On ZFS, this directory size value is approximately the number of files
+and subdirectories in the directory.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
Can you edit the commit message to fix the commitcheck error?:
Other than that, it should be good to go.
OK. I also noted in the documentation that it only applies to Linux. I have no reason to think the BSD needs it. It's needed because of details of the NFS implementation, and I would hope that BSD hasn't chosen the same approach. At any rate, the BSD readdir doesn't prefetch.
What's up with this? The test failures can't be due to this patch, as it obviously has no effect if not enabled. Is this going to be merged?
What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup, but I don't have the name. Given the data structures, I don't see how you could do a lookup without either the name or a search.
Indeed, sorry, my initial reading of this was a bit too hasty. You're right, the only way we have to do this lookup is by iterating over the entire directory. That said, I still think it may be worthwhile to add our own get_name.
I think the place to look is zap_value_search. But there's another thing to think about: would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found. |
This should work. The directory ZAP stores the dataset object number as the first integer, and we use this same object number when setting the inode number. The virtual
If we've gone through all the expense of prefetching the directory it makes sense to me to populate it and mark it connected. |
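To sketch what "populate it and mark it connected" might look like (purely an idea sketch, not part of this PR and untested; it assumes it is safe to touch the dcache from the readdir path, which is exactly the kind of detail that would need sorting out; local names follow zfs_readdir in zfs_vnops_os.c but may differ slightly by version):

```c
/*
 * Idea sketch only: while zfs_readdir walks the parent, give any child
 * directory whose inode is already cached a connected dentry, so a later
 * NFS filehandle reconnect never has to search this directory for a name.
 * "zap", "objnum" and "ip" are the existing locals in zfs_readdir();
 * locking and lifetime questions are ignored here.
 */
if (ZFS_DIRENT_TYPE(zap.za_first_integer) == DT_DIR) {
	struct inode *cip = ilookup(ip->i_sb, objnum);

	if (cip != NULL) {
		struct dentry *parent = d_find_alias(ip);

		if (parent != NULL) {
			struct qstr q = QSTR_INIT(zap.za_name,
			    strlen(zap.za_name));
			struct dentry *d = d_alloc(parent, &q);

			if (d != NULL) {
				/* d_splice_alias() consumes the cip reference
				 * and reuses a disconnected alias if one
				 * already exists */
				struct dentry *old = d_splice_alias(cip, d);

				if (!IS_ERR_OR_NULL(old))
					dput(old);
				dput(d);
				cip = NULL;
			}
			dput(parent);
		}
		if (cip != NULL)
			iput(cip);
	}
}
```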
My feeling is that if we make prefetch connect the dnode, performance of this loop isn't so critical, so we don't need our own get_name. As long as the directory dnode is connected, none of this code will be called. This is for realistic cases.

In principle, I think the limit on prefetch is needed independent of the NFS issue described here. ZFS claims to support huge directories. But if you actually tried to use them, even if you aren't using NFS and thus don't have the issue described in this request, the first time you do an "ls" of a directory with a zillion files the prefetches will thrash the ARC and possibly fill task queues and other things. I can't say whether this would ever happen in reality. I note that the BSD code for readdir doesn't prefetch.

Anyway, connecting the dnode during prefetch is getting beyond what I think I can do. For the moment we'll stick with this patch. It's hard to know whether anyone else is seeing this issue, since it wouldn't show any symptom other than more CPU usage than you'd like. A lot of our folks do AI; the training datasets seem to have a structure that triggers this. I doubt we're the only place doing this.
It's actually not so clear that we need the prefetch. There may be enough info in the inode. But I don't understand what is going on in the reconnect code, so I can't tell what would be needed to actually do what I propose, and whether it would be expensive enough to cause other issues.
As far as I can see, the prefetch code is identical for Linux and FreeBSD -- first To address the last case of unneeded prefetches, when the data are already in ARC, what comes to mind is a mechanism I implemented some time ago for the speculative prefetcher in
We have this issue on our server. In 60 seconds, osq_lock() appeared 84231614 times in stack traces. The nfsd processes were 90%+ busy.
What revision is needed? I'm using the patch with 2.2.6. The diff doesn't apply automatically to master, but it's obvious what is needed. If it's going to be accepted, I'll fix it to work with master, but so far it doesn't seem like people are ready to take it. Obviously we'll continue to use it, and it sounds like at least one other site needs it. Others might as well but not realize it. My assessment is that the obvious alternative would be only marginally better. I'd rather rewrite things so a directory traversal isn't needed at all, but I'm not convinced that can be done without changing non-ZFS kernel code.
@clhedrick I don't know what Brian meant by that, maybe just the merge conflict. But on my side I am not happy about just disabling prefetch because it is not needed in some cases. I don't like users turning random knobs. I'd prefer some automatic detection, similar to what should be there already, maybe with some improvements; see my later comment.
Anything that's worth doing is beyond what I can do. One reasonable approach would be to prefetch each item only once. But the prefetch is async, and I don't see any place to tell whether one has already been scheduled. Our cache hit rates for metadata are so high that I doubt there's any advantage to prefetching for us. For myself, I'd prefer just to comment out the prefetch. If someone else has this issue, that might be easiest. (I haven't found a way to get a diff for this change from GitHub.)
Our directories do not have many entries, so the locking contention during the prefetch was the problem.
Let's merge this, and if it proves to be a performance problem I'll try to address it while it's still only in master. I mean, assuming this was blocked by my "smartass" comment, which wasn't really the intention at all.
I have resolved the conflict, and added advice on when to set the parameter in the man page. |
Add parameter zfs_readdir_dnode_prefetch_limit, defaulting to 0, to control whether zfs_readdir prefetches metadata for the objects it looks at when reading a directory. If zero, metadata is prefetched for all directory entries. If non-zero, metadata is prefetched only if the directory has fewer entries than this. Setting it to a non-zero value can be important for NFS servers with directories containing many subdirectories.

Signed-off-by: Charles Hedrick <[email protected]>
Co-authored-by: Chris Siebenmann <[email protected]>
Rebased. I've built it and booted it; simple tests look OK. There was a comment about impact on performance; note that it has no effect unless explicitly enabled.
I still object to this change, since the limit on directory size it introduces makes no sense. Large directories have the same or maybe even higher chance that the dnodes they reference are not in cache and so would benefit from prefetch. For smaller directories it might just be less of a problem, hidden by the network and NFS stack latency.
Anything that's worth doing is beyond what I can do.
That is not true if you think for a moment. All we need is to limit prefetch to only the first directory read. We just need another flag in znode_t (or replace/modify z_zn_prefetch to save space) to represent that the directory was read to the end and there is nothing to prefetch next time. Or we could save the position up to which the directory was previously read; the complication there, though, is that as I understand it the serialized ZAP cursor is not monotonically increasing, so we could not compare it for bigger/smaller, but we could save just zc_hash, which should be enough for a quick and cheap comparison. And/or we could store the time of the last full read, and issue prefetches only if enough time has passed since the previous one to justify another prefetch.
Any of that would be pretty trivial to implement, and would be more logical than what is proposed here. BTW, please keep the code in sync between OS'es. FreeBSD code is similar enough to update it also.
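To make the first of those options concrete, here is a rough sketch (the z_readdir_complete field name is invented, the surrounding code is simplified, and this is not a tested patch) of limiting the prefetch to the first complete read of a directory:

```c
/*
 * Sketch only: stop readdir prefetching once the directory has been read
 * all the way to the end.  z_readdir_complete is an invented field; it
 * could equally be folded into the existing z_zn_prefetch handling.
 */

/* new field in znode_t: */
	boolean_t	z_readdir_complete;	/* scanned to the end once */

/* in zfs_readdir(), when deciding whether to prefetch: */
	prefetch = zp->z_zn_prefetch && !zp->z_readdir_complete;

/* in zfs_readdir(), when zap_cursor_retrieve() returns ENOENT, i.e. the
 * end of the directory was reached: */
	zp->z_readdir_complete = B_TRUE;

/* and clear it again whenever an entry is added or removed, e.g. in
 * zfs_link_create() / zfs_link_destroy():
 *	dzp->z_readdir_complete = B_FALSE;
 */
```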
The reason I said doing more is beyond me is that I'm not a ZFS developer. One implication is that I don't know the lifetime of various data structures, so it's not clear to me where I can put information and how long it will live.
A znode in most cases lives until the kernel, under memory pressure, decides to evict inodes (which evicts the respective znodes as well), or until it is deleted. So I'd say it lives long enough. Just search the sources for z_zn_prefetch. It is used in only 5 places on Linux and 4 on FreeBSD, and its logic is pretty simple. Just extend it.
The ideal solution would be to provide a better way to go from inode to name. Is there someplace to stash the name for a file when readdir (or an equivalent in getname) reads past it, so it could be translated without a directory search? |
I don't see how that is related to this topic of prefetch optimization. An inode might have many names due to hard links, or no name at all if all of them were deleted while it was open. This was discussed a lot at the recent OpenZFS developer summit and even in the last few weeks somewhere here on GitHub.
The problem is not, as far as I know, triggered by people doing "ls". You wouldn't expect someone to do "ls" on the same directory enough times to cause this performance disaster. Rather, it's triggered by NFS. If an existing file is being accessed, NFS sends a file handle from client to server. For Linux, the file handle is effectively the file system and inode number. The server wants to validate the file handle. As part of doing this, the NFS server calls a function, get_name, to go from inode number to file name. It is able to find the file's containing directory fairly easily, but to find the filename it calls readdir to read the directory until it finds the file with that inode number. You may not think this is a sensible approach, but it's what the Linux NFS server does.

Our clients have a large amount of memory, enough to keep the working set cached in memory, sometimes for days. It is common for them to access a file where the client has it cached but it's gone out of the server's cache. That causes the client to send an inode number for a file that is no longer in the server's dnode cache. In that case, if the inode is a directory, the code is triggered that reads the containing directory to find the inode number. Some of our AI training jobs have directories with 200,000 - 500,000 subdirectories. If the client tries to access all of them, each access of a subdirectory will cause a read of, on average, half the containing directory to map its inode number to its name. Hence the total number of prefetches will be something like 200,000 x 200,000 / 2. This isn't a performance optimization; it's saving us from a disaster.

It is possible to supply a ZFS-specific implementation of get_name. If there's a place to cache the name, then the first time you read the containing directory, the inode-to-name mapping could be cached. That could turn the problem from 200,000 squared to 200,000. I realize that the same inode number could appear more than once in a directory, but the problem occurs only for subdirectories, and I don't believe that's possible for a directory. Or if it is, the NFS code doesn't care. However, youzhongyang may well have a different situation where that approach wouldn't help. He may indeed need to disable the prefetch.

Before doing more coding I'm going to do some experiments to see if the prefetch is useful at all. I conjecture that it is not, in which case doing more coding to allow it in some cases would be unnecessary. I propose to try "ls -lU" in large directories with the cache cleared. That seems like the best case for prefetch to have an effect.
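For anyone trying to follow that path in the source, the call chain being described is roughly the following (paraphrased from memory of fs/exportfs/expfs.c and the ZPL export code; details vary by kernel version):

```c
/*
 * nfsd receives a filehandle for a directory that is not in the dcache
 *   -> exportfs_decode_fh()
 *        -> zpl_fh_to_dentry()       ZFS returns a DISCONNECTED dentry
 *        -> reconnect_path()         walks upward until a connected ancestor
 *             -> ->get_parent()      cheap on ZFS (the parent object is
 *                                    stored with the znode)
 *             -> exportfs_get_name() ZFS supplies no ->get_name, so the
 *                                    generic fallback readdirs the whole
 *                                    parent, which is where zfs_readdir's
 *                                    per-entry prefetches pile up
 *             -> lookup_one_len()    splices the recovered name into the
 *                                    dcache, "connecting" the dentry
 */
```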
Add parameter zfs_readdir_dnode_prefetch_enabled, defaulting to 1, to control whether zfs_readdir prefetches metadata for all objects it looks at when reading a directory.
Setting it to 0 can be important for NFS servers with directories containing many subdirectories.
Motivation and Context
This improves #16254, although a more radical redesign is possible. We have NFS servers with directories containing 200,000 to 500,000 subdirectories. These arise in AI training scenarios. Our NFS servers can have 20 or more threads running at 100% for hours due to this.
The problem has been found to be an interaction between the NFS code and zfs_readdir. If information about a directory is cached on the client but not the server, when the server tries to validate the NFS file id, it will add an entry in the dnode cache for the directory. Because the new dnode is not connected to the rest of the directory hierarchy, the NFS (actually exportfs) code goes up the tree trying to connect it. This ends up calling zfs_readdir to find the directory in its parent. Because zfs_readdir reads the parent looking for the directory, the amount of time it takes is proportional to the size of the parent. If every directory is processed, the amount of time will be N**2 in the size of the parent. Reading a directory would have only moderate overhead, except that zfs_readdir does a prefetch on every item as it looks at it, even if the item is already in the ARC. Of course, if it's in the ARC, no disk I/O will be done, but there's still a substantial amount of code involved.
We've found a factor of 10 to 20 speedup by skipping the prefetches. This is enough to move us from a painful situation to one we can live with.
Of course a linear speedup in an N**2 problem isn't a full solution, but it's enough to survive the size directories we see in practice. A full solution would almost certainly require help from the kernel developers, which is going to be hard to get, since directory searching is fast enough for other Linux file systems that the problem only comes up with ZFS.
Description
This change adds a module parameter, zfs_readdir_dnode_prefetch_enabled, normally 1, which can be changed to 0 to disable the prefetch that happens in zfs_readdir.
How Has This Been Tested?
We have done testing with a directory containing 200,000 subdirectories, with the results discussed above. It has been running in production for several weeks.