Flagging and recovering from parallel rsync failures #3

Thomas-Moore-Creative opened this issue Sep 18, 2020 · 0 comments

Flagging and recovering from parallel rsync failures?

Overview

The approach of using parallel rsync to transfer large datasets from NCI to CSIRO has yielded speeds at least an order of magnitude greater than our previous experience.

However, the method relies on numerous streams, each with its own rsync process that can fail independently.

How can we confidently alert the user to these failures and then recover from them, bearing in mind that for this use case the transferred files are subsequently moved to tape?

Example: a 97-file, 11 TB transfer with failures

command:

```bash
time cat /datastore/d/dcfp/NCI_file_lists/cut_f6_2012_filelist.txt | parallel -j 10 --results /datastore/d/dcfp/logs/ 'rsync -ailPW --log-file="/datastore/d/dcfp/logs/f6_2012_rsync.log.$(date +%Y%m%d%H%m%S)" -e "ssh -T -c aes128-ctr" [email protected]:/scratch/v14/$USER/tar_tmp/f6.WIP.c5-d60-pX-f6-20121101.20200831_153624/{} /datastore/d/dcfp/CAFE/forecasts/f6/'
```

(Note: the `date` format string repeats `%m` (month) where `%M` (minute) was presumably intended, which is why the minute field in the log filenames below always reads `09`.)

`--results /datastore/d/dcfp/logs/` saves a directory structure of log files, as described in the GNU parallel documentation: https://www.gnu.org/software/parallel/

```
cd /datastore/d/dcfp/logs/1
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20111101.top_level.20200831_165650.tar> ls
seq  stderr  stdout
```
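
Since every job gets its own `seq`, `stdout`, and `stderr` files under the `--results` tree, one mechanical way to flag failures (rather than eyeballing terminal output) might be to look for jobs whose `stderr` file is non-empty. A minimal sketch, assuming the directory layout shown above:

```bash
#!/bin/bash
# Sketch: flag any parallel job that wrote anything to stderr.
# Assumes the --results layout above: one directory per job, each
# containing seq, stdout, and stderr files.
find /datastore/d/dcfp/logs/ -name stderr -size +0c | while read -r f; do
    echo "possible failure in job: $(dirname "$f")"
done
```

This should catch jobs that logged rsync errors (like mem085 and mem086 below), but as noted further down it may miss jobs that failed without writing anything, so it is probably best combined with the exit-code check sketched after the grep example.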

How do we know there's been a failure?

We happen to see it in the command line output (this is not robust):
```
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem079.20200831_153624.tar
120,233,226,240 100%   41.75MB/s    0:45:46 (xfr#1, to-chk=0/1)
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
Connection closed by 192.43.239.112
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
receiving incremental file list
f+++++++++ f6.WIP.c5-d60-pX-f6-20121101.mem082.20200831_153624.tar
120,198,225,920 100%   30.29MB/s    1:03:04 (xfr#1, to-chk=0/1)
receiving incremental file list
```
After the fact, we compare against the source file list and find differences:
```
/datastore/d/dcfp/checks> find /datastore/d/dcfp/CAFE/forecasts/f6/*2012*.tar -type f > check_2012.txt
cut -c 37- check_2012.txt > cut_check_2012.txt
sed -i -e 's#^#./#' cut_check_2012.txt
diff cut_check_2012.txt ../NCI_file_lists/f6_2012_filelist.txt
```

```
51a52
> ./f6.WIP.c5-d60-pX-f6-20121101.mem052.20200831_153624.tar
61a63
> ./f6.WIP.c5-d60-pX-f6-20121101.mem063.20200831_153624.tar
82a85,86
> ./f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar
> ./f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar
```
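
This manual comparison could be collapsed into one command that emits a ready-to-retry list of the missing files. A sketch, assuming the source list contains one `./filename` entry per tar file; the output name `f6_2012_missing.txt` is hypothetical:

```bash
#!/bin/bash
# Sketch: list files named in the source file list but absent from the
# destination. comm -13 prints lines unique to the second (sorted) input.
comm -13 \
    <(find /datastore/d/dcfp/CAFE/forecasts/f6/ -name '*2012*.tar' -type f -printf './%f\n' | sort) \
    <(sort /datastore/d/dcfp/NCI_file_lists/f6_2012_filelist.txt) \
    > /datastore/d/dcfp/NCI_file_lists/f6_2012_missing.txt
```

Because the output has the same format as the original list, it could be piped straight back into the parallel rsync command above to retry only the missing transfers.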
The errors are also captured in some of the many per-job stderr files and rsync logs:
```
/datastore/d/dcfp/logs> grep -rnw '/datastore/d/dcfp/logs/' -e 'error'
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem085.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/1/f6.WIP.c5-d60-pX-f6-20121101.mem086.20200831_153624.tar/stderr:3:rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160905:2:2020/09/14 16:07:51 [3334036] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
/datastore/d/dcfp/logs/f6_2012_rsync.log.20200914160948:2:2020/09/14 16:07:51 [3339499] rsync error: unexplained error (code 255) at io.c(228) [Receiver=3.2.3]
```
NB: mem052 and mem063 are missing at the destination yet do not appear in `grep -rnw '/datastore/d/dcfp/logs/' -e 'error'`, so grepping the logs for 'error' is not sufficient to catch every failed transfer.
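
A more robust detector might be GNU parallel's `--joblog` option, which records every job's exit status whether or not anything was written to the logs. A sketch (the joblog path is hypothetical, and the `--log-file` rsync option is omitted for brevity):

```bash
#!/bin/bash
# Sketch: record each job's exit status in a joblog alongside --results.
cat /datastore/d/dcfp/NCI_file_lists/cut_f6_2012_filelist.txt \
    | parallel -j 10 --results /datastore/d/dcfp/logs/ \
        --joblog /datastore/d/dcfp/logs/f6_2012.joblog \
        'rsync -ailPW -e "ssh -T -c aes128-ctr" [email protected]:/scratch/v14/$USER/tar_tmp/f6.WIP.c5-d60-pX-f6-20121101.20200831_153624/{} /datastore/d/dcfp/CAFE/forecasts/f6/'

# The joblog is tab-separated; column 7 is Exitval.
# Print the header plus every job that exited non-zero.
awk -F'\t' 'NR == 1 || $7 != 0' /datastore/d/dcfp/logs/f6_2012.joblog
```

Since this keys on exit codes rather than log text, it should also flag failures like mem052 and mem063 that never wrote an 'error' line.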

ToDo:

  • best approach to catch failures and alert users?
  • best approach to recover from failures, restart, and finish the rsync task confidently? (one candidate is sketched below)
  • solve the additional issue of how rsync assesses files that have already been moved to tape
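
On the recovery question, one candidate (untested here) is GNU parallel's `--retry-failed`, which re-runs only the jobs whose joblog entries show a non-zero exit status, assuming the transfer was started with the hypothetical joblog from the sketch above:

```bash
#!/bin/bash
# Sketch: re-run only the failed jobs. --retry-failed reads the commands
# to re-run from the joblog itself, so no rebuilt file list is needed.
parallel --retry-failed --joblog /datastore/d/dcfp/logs/f6_2012.joblog
```

On the tape question, an `rsync --dry-run --itemize-changes` pass over a completed transfer might verify it without re-copying anything, but how rsync's size-and-mtime quick check behaves against files already migrated to tape is exactly the open issue, so that would need testing against the datastore's tape system first.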