GNU parallel vs xargs #6

Open
hot007 opened this issue Feb 6, 2023 · 8 comments
hot007 commented Feb 6, 2023

I've never been an xargs user (it confuses me), but here's an example of doing an rsync with xargs instead of parallel, just documenting this here for reference (h/t @dsroberts).
The following copies the contents of the current directory in 8 parallel streams, using xargs as a sort of metascheduler.

printf '%s\n' * | xargs -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/
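
For comparison, the rough GNU parallel equivalent (untested here, assuming the same flags and destination placeholder) would be something like:

parallel -j 8 rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/ ::: *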

hot007 commented Feb 6, 2023

He observed a copy rate of about 1.2 GB/s within NCI, which is about what we'd expect from our parallel tests; it appears to be CPU-limited.

@dsroberts

Hi all. I had a bit more of a think about this, and I came up with the following:

xargs -a <( find ! -type d ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This launches a different rsync process for every file, so it's probably too much overhead when transferring small files or lots of symlinks etc. However, if you're transferring lots of large files (my test is 5.6 TB across 273 files), this gets around login node CPU time limits, as none of the individual rsyncs hit the CPU limit. It also has the benefit of neatly balancing transfers across top-level directories of varying sizes.

The find command probably needs refinement; I'm only transferring files, so I didn't need to think too hard about it.
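
One possible refinement (a sketch only, not tested here): make the list null-delimited so spaces and odd characters survive, and batch a handful of files per rsync to cut down the per-process overhead, e.g.

find . ! -type d -print0 | xargs -0 -P 8 -n 20 sh -c 'rsync --verbose --links --times --partial --relative "$@" /path/to/destination/' _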

@Thomas-Moore-Creative

Thanks @hot007 & @dsroberts for documenting this here. It's been a little while since I really tested the parallel approach, but hopefully it's still useful as a template that can offer some chunkier performance.

@dsroberts

Just resurrecting this post: I've been using this to move lots of data around, and I've found that with files of varying size you can wind up with a 'long tail' problem, whereby a large file ends up towards the end of the file list and the whole command takes much longer to run. I propose the following:

xargs -a <( find ! -type d -ls | sort -h -k7 -r | awk '{print $11}' ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This sorts the output of find by file size, meaning the largest files are always transferred first. As above, the find needs refinement, as this will fall over for filenames with spaces.
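
A space-tolerant variant of the same idea (a sketch, assuming GNU find/sort/xargs, and still not newline-safe) would be to print size and path explicitly, sort numerically, strip the size column, and feed xargs newline-delimited input:

find . ! -type d -printf '%s\t%p\n' | sort -rn | cut -f2- | xargs -d '\n' -P 8 -I{} rsync --verbose --links --times --partial --relative {} /path/to/destination/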

@Thomas-Moore-Creative

Just resurrecting this post, I've been using this to move lots of data around and I've found ...

Thanks for that advice and experience, @dsroberts. I haven't needed to move lots of data around recently, but it's great that you're using and now "tuning" this.

Q: In your opinion does this still beat the "new" offerings via Globus?

@dsroberts

I'm moving data between file systems on Gadi, so not really in a place where I can compare it with Globus.


hot007 commented Nov 22, 2023

That is some utterly arcane bash!! That said, that's a good idea, thank you.
I haven't used Globus enough recently to make a meaningful comment, but as far as I'm aware neither Globus nor rclone (which is also great, especially for cloud transfers) does anything smart about ordering transfers by file size. I've got a feeling that Globus might have the capability to split files, which would help, but my guess is that the above is a good solution for command-line data transfers. The downside with Globus is that, in my experience, it tends to take a few days to get storage locations added to endpoints, so if your storage issue is urgent you may need to initiate a transfer like this to chug away while setting up something more professional for future use...
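
For reference, rclone manages its own parallelism with the --transfers flag; a minimal sketch, assuming an already-configured remote hypothetically named "remote":

rclone copy --transfers 8 --progress /path/to/source remote:destination/path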

@dsroberts

Splitting the files and transferring the chunks in parallel would negate the need to sort by file size, but that's well beyond anything that can be done sensibly in bash.
