GNU parallel vs xargs #6

Open
hot007 opened this issue Feb 6, 2023 · 8 comments
hot007 commented Feb 6, 2023

I've never been an xargs user (it confuses me), but here's an example of doing an rsync with xargs instead of parallel, just documenting this here for reference (h/t @dsroberts).
The following copies the contents of the current directory in 8 parallel streams, using xargs as a sort of metascheduler.

printf '%s\n' * | xargs -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/
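
For comparison, the rough GNU parallel equivalent (untested here, assuming the same flags and destination placeholder) would be something like:

parallel -j 8 rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links {} /path/to/destination/ ::: *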

hot007 commented Feb 6, 2023

He observed a copy rate of about 1.2 GB/s within NCI, which is about what we'd expect from our parallel tests; it appears to be CPU-limited.

@dsroberts

Hi all. I had a bit more of a think about this, and I came up with the following:

xargs -a <( find ! -type d ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This launches a different rsync process for every file, so it's probably too much overhead when transferring small files or lots of symlinks etc. However, if you're transferring lots of large files (my test is 5.6 TB across 273 files), this gets around login node CPU time limits, as none of the individual rsyncs hit the CPU limit. It also has the benefit of neatly balancing transfers across top-level directories of varying sizes.

The find command probably needs refinement; I'm only transferring files, so I didn't need to think too hard about it.
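
One possible refinement (a sketch only, not tested here): make the list null-delimited so spaces and odd characters survive, and batch a handful of files per rsync to cut down the per-process overhead, e.g.

find . ! -type d -print0 | xargs -0 -P 8 -n 20 sh -c 'rsync --verbose --links --times --partial --relative "$@" /path/to/destination/' _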

@Thomas-Moore-Creative

Thanks @hot007 & @dsroberts for documenting this here. It's been a little while since I really tested the parallel approach, but hopefully it's still useful as a template that can offer some chunkier performance.

@dsroberts

Just resurrecting this post: I've been using this to move lots of data around, and I've found that with files of varying size you can wind up with a 'long tail' problem, whereby a large file ends up towards the end of the file list and the whole command takes much longer to run. I propose the following:

xargs -a <( find ! -type d -ls | sort -h -k7 -r | awk '{print $11}' ) -P 8 -n 1 -I{} rsync --verbose --recursive --links --times --specials --partial --progress --one-file-system --hard-links --relative {} /path/to/destination

This sorts the output of find by file size, meaning the largest files are always transferred first. As above, the find needs refinement, as this will fall over for filenames with spaces.
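
A space-tolerant variant of the same idea (a sketch, assuming GNU find/sort/xargs, and still not newline-safe) would be to print size and path explicitly, sort numerically, strip the size column, and feed xargs newline-delimited input:

find . ! -type d -printf '%s\t%p\n' | sort -rn | cut -f2- | xargs -d '\n' -P 8 -I{} rsync --verbose --links --times --partial --relative {} /path/to/destination/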

@Thomas-Moore-Creative

Just resurrecting this post, I've been using this to move lots of data around and I've found ...

Thanks for that advice and experience, @dsroberts. I haven't needed to move lots of data around recently, but it's great that you're using and now "tuning" this.

Q: In your opinion does this still beat the "new" offerings via Globus?

@dsroberts

I'm moving data between file systems on Gadi, so not really in a place where I can compare it with Globus.


hot007 commented Nov 22, 2023

That is some utterly arcane bash!! That said, that's a good idea, thank you.
I haven't used Globus enough recently to make a meaningful comment, but as far as I'm aware neither Globus nor rclone (which is also great, especially for cloud transfers) does anything smart about ordering transfers by file size. I've got a feeling that Globus might have the capability to split files, which would help, but my guess is that the above is a good solution for command-line data transfers. The downside with Globus is that, in my experience, it tends to take a few days to get storage locations added to endpoints, so if your storage issue is urgent you may need to initiate a transfer like this to chug away while setting up something more professional for future use...
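
For reference, rclone manages its own parallelism with the --transfers flag; a minimal sketch, assuming an already-configured remote hypothetically named "remote":

rclone copy --transfers 8 --progress /path/to/source remote:destination/path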

@dsroberts

Splitting the files and transferring the chunks in parallel would negate the need to sort by file size, but that's well beyond anything that can be done sensibly in bash.
