Flagging and recovering from parallel rsync failures #3

Thomas-Moore-Creative opened this issue Sep 18, 2020 · 0 comments

Flagging and recovering from parallel rsync failures?

Overview

The approach of using parallel rsync to transfer large datasets from NCI to CSIRO has yielded speeds at least an order of magnitude greater than our previous experience.

However, the method relies on numerous streams, each with its own rsync process that can fail independently.

How can we confidently alert the user to these failures and then recover from them, bearing in mind that for this use case the transferred files are subsequently moved to tape?

Example: a 97-file, 11 TB transfer with failures

command:

```bash
time cat /datastore/d/dcfp/NCI_file_lists/cut_f6_2012_filelist.txt | parallel -j 10 --results /datastore/d/dcfp/logs/ 'rsync -ailPW --log-file="/datastore/d/dcfp/logs/f6_2012_rsync.log.$(date +%Y%m%d%H%m%S)" -e "ssh -T -c aes128-ctr" [email protected]:/scratch/v14/$USER/tar_tmp/f6.WIP.c5-d60-pX-f6-20121101.20200831_153624/{} /datastore/d/dcfp/CAFE/forecasts/f6/'
```

(Note: the `date` format string repeats `%m` (month) where `%M` (minute) was presumably intended, which is why the minute field in the log filenames below always reads `09`.)

`--results /datastore/d/dcfp/logs/` saves a directory structure of log files, as described in the GNU parallel documentation: https://www.gnu.org/software/parallel/

```
cd /datastore/d/dcfp/logs/1
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20111101.top_level.20200831_165650.tar> ls
seq  stderr  stdout
```
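
Since every job gets its own `seq`, `stdout`, and `stderr` files under the `--results` tree, one mechanical way to flag failures (rather than eyeballing terminal output) might be to look for jobs whose `stderr` file is non-empty. A minimal sketch, assuming the directory layout shown above:

```bash
#!/bin/bash
# Sketch: flag any parallel job that wrote anything to stderr.
# Assumes the --results layout above: one directory per job, each
# containing seq, stdout, and stderr files.
find /datastore/d/dcfp/logs/ -name stderr -size +0c | while read -r f; do
    echo "possible failure in job: $(dirname "$f")"
done
```

This should catch jobs that logged rsync errors (like mem085 and mem086 below), but as noted further down it may miss jobs that failed without writing anything, so it is probably best combined with the exit-code check sketched after the grep example.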

How do we know there's been a failure?

We happen to see it in the command line output (this is not robust):
```
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem079.20200831_153624.tar
120,233,226,240 100%   41.75MB/s    0:45:46 (xfr#1, to-chk=0/1)
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
receiving incremental file list
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem082.20200831_153624.tar
120,198,225,920 100%   30.29MB/s    1:03:04 (xfr#1, to-chk=0/1)
receiving incremental file list
```
After the fact, we compare against the source file list and find differences:
```
/datastore/d/dcfp/checks> find /datastore/d/dcfp/CAFE/forecasts/f6/*2012*.tar -type f > check_2012.txt
cut -c 37- check_2012.txt > cut_check_2012.txt
sed -i -e 's#^#./#' cut_check_2012.txt
diff cut_check_2012.txt ../NCI_file_lists/f6_2012_filelist.txt
```

```
51a52
> ./f6.WIP.c5-d60-pX-f6-20121101.mem052.20200831_153624.tar
61a63
> ./f6.WIP.c5-d60-pX-f6-20121101.mem063.20200831_153624.tar
82a85,86
> ./f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar
> ./f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar
```
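
This manual comparison could be collapsed into one command that emits a ready-to-retry list of the missing files. A sketch, assuming the source list contains one `./filename` entry per tar file; the output name `f6_2012_missing.txt` is hypothetical:

```bash
#!/bin/bash
# Sketch: list files named in the source file list but absent from the
# destination. comm -13 prints lines unique to the second (sorted) input.
comm -13 \
    <(find /datastore/d/dcfp/CAFE/forecasts/f6/ -name '*2012*.tar' -type f -printf './%f\n' | sort) \
    <(sort /datastore/d/dcfp/NCI_file_lists/f6_2012_filelist.txt) \
    > /datastore/d/dcfp/NCI_file_lists/f6_2012_missing.txt
```

Because the output has the same format as the original list, it could be piped straight back into the parallel rsync command above to retry only the missing transfers.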
The errors are also captured in some of the many per-job stderr files and rsync logs:
```
/datastore/d/dcfp/logs> grep -rnw '/datastore/d/dcfp/logs/' -e 'error'
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160905:2:2020/09/14 16:07:51 [3334036] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160948:2:2020/09/14 16:07:51 [3339499] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
```
NB: mem052 and mem063 are missing at the destination yet do not appear in `grep -rnw '/datastore/d/dcfp/logs/' -e 'error'`, so grepping the logs for 'error' is not sufficient to catch every failed transfer.
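
A more robust detector might be GNU parallel's `--joblog` option, which records every job's exit status whether or not anything was written to the logs. A sketch (the joblog path is hypothetical, and the `--log-file` rsync option is omitted for brevity):

```bash
#!/bin/bash
# Sketch: record each job's exit status in a joblog alongside --results.
cat /datastore/d/dcfp/NCI_file_lists/cut_f6_2012_filelist.txt \
    | parallel -j 10 --results /datastore/d/dcfp/logs/ \
        --joblog /datastore/d/dcfp/logs/f6_2012.joblog \
        'rsync -ailPW -e "ssh -T -c aes128-ctr" [email protected]:/scratch/v14/$USER/tar_tmp/f6.WIP.c5-d60-pX-f6-20121101.20200831_153624/{} /datastore/d/dcfp/CAFE/forecasts/f6/'

# The joblog is tab-separated; column 7 is Exitval.
# Print the header plus every job that exited non-zero.
awk -F'\t' 'NR == 1 || $7 != 0' /datastore/d/dcfp/logs/f6_2012.joblog
```

Since this keys on exit codes rather than log text, it should also flag failures like mem052 and mem063 that never wrote an 'error' line.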

ToDo:

  • best approach to catch failures and alert users?
  • best approach to recover from failures, restart, and finish the rsync task confidently? (one candidate is sketched below)
  • solve the additional issue of how rsync assesses files that have already been moved to tape
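
On the recovery question, one candidate (untested here) is GNU parallel's `--retry-failed`, which re-runs only the jobs whose joblog entries show a non-zero exit status, assuming the transfer was started with the hypothetical joblog from the sketch above:

```bash
#!/bin/bash
# Sketch: re-run only the failed jobs. --retry-failed reads the commands
# to re-run from the joblog itself, so no rebuilt file list is needed.
parallel --retry-failed --joblog /datastore/d/dcfp/logs/f6_2012.joblog
```

On the tape question, an `rsync --dry-run --itemize-changes` pass over a completed transfer might verify it without re-copying anything, but how rsync's size-and-mtime quick check behaves against files already migrated to tape is exactly the open issue, so that would need testing against the datastore's tape system first.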