File lost due to late removal #7613
Comments
Hi @chaen, if I may suggest, please urgently ask Adrien to keep this xrootd door's access log file for 2024-07-06, covering 08:52:38.584+02 to 09:30:16.004+02. Access log files are often rotated (and deleted) relatively quickly, so being prompt with this request would be wise. Cheers, |
Hi, On the dCache side, the xrootd queues on the pools were full of active requests, with hundreds of xrootd transfers queued. Adrien |
The worrying part is, as @ageorget says, that there's no mention in the xrootd door's access log file of either the failed upload attempt (at ~08:52:38) or the problematic delete (at ~09:30:16). The access log file is recording at INFO level (which is good), so it should capture all successful and failed client requests; I therefore do not understand why billing reports door activity for which the door records no corresponding client activity. Does the domain log file (for the domain hosting the xrootd door) indicate any errors/problems at around this time? |
A transfer is identified by a session-id string looking like "door:Xrootd01-fndcadoor01@xrootd01-fndcadoor01Domain:AAYc3hPdkMg" (for instance). To me it looks like the client issues a remove command, TBH. It would be great to have the client-side log (of the failure, of course; I would like to see whether they log the removal). |
That output in the form of a table from @ageorget looks like a query of the billing DB; can it include the initiator and transaction? |
Regarding the read pattern, there is a job running on that file which tries to download the file locally:
2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData INFO: Preliminary checks OK : now downloading /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER
2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData VERBOSE: Attempting to download file /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER:
Trying to download root://[email protected]//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
1 records inserted to MonitoringDB
Failure in plugin to perform getFile Plugin: GFAL2_XROOT lfn: /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst error GError('Error on XrdCl::CopyProcess::Run(): [FATAL] Redirect limit has been reached: (source)', 40)
Trying to download https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
2024-07-10 14:29:03 UTC None/[904754301]DownloadInputData VERBOSE: File successfully downloaded locally (/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/
So it fails with xrootd and then succeeds with https. The timestamp matches the HTTP read. What is interesting is that the failed xroot logs on the dCache side are timestamped exactly 300 s later, so indeed just as if things had kept going, while the client definitely received an error and moved on. |
I am afraid it is not logged at the standard level. Running in debug isn't an option either, as it's way too verbose 😬 |
As a practical step I would consider changing […] to something like […]. This needs to be done on the pools. I would try one. |
I just opened an RT ticket with additional logs (xrootd/pool domain/access logs).
Ok, I did it on one pool and restarted it. Is there something particular I should check then? |
Checking the movers on the pool, I see all of these duplicated (x17) xrootd movers transferring nothing (bytes=0) until they reach the timeout. |
I think these are reads in a similar situation to your writes - the mover started but the client hasn't connected. |
Just one thing - do you see more errors on that pool, or is the number of errors the same as before? If you do not see an increase of these, then I would use this setting on all your pools. And if this helps with the file deletion, we have breathing room to address the root cause, or at least understand it better. |
The xrootd door should only remove a failed upload if the door knows the file's PNFS-ID, which it obtains when the transfer is started. It also includes this PNFS-ID when sending a message to the namespace requesting that the entry be deleted. The namespace should only remove the file if it exists and has the expected PNFS-ID; if the file exists but has a different PNFS-ID, then the namespace should ignore the door's deletion request. In this case, the file was (seemingly) successfully uploaded via WebDAV. That transfer would have overwritten the incomplete file (from the xrootd transfer), so the WebDAV-uploaded file would have the same name but a different PNFS-ID. Therefore, the failed xrootd transfer should not have deleted the successful WebDAV upload. An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload. |
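For illustration, the deletion guard described above can be sketched as follows. This is plain Python, not dCache's actual (Java) code, and all names are hypothetical; it only models the rule as stated: the namespace honours a door-initiated cleanup only while the entry still carries the PNFS-ID the door captured when the failed upload started.
```python
# Illustrative sketch of the rule described above, not dCache source code.
namespace = {}  # path -> PNFS-ID of the current entry (hypothetical model)

def upload_started(path, pnfsid):
    namespace[path] = pnfsid

def delete_failed_upload(path, pnfsid_from_door):
    """Remove the entry only if it exists and still has the expected PNFS-ID."""
    current = namespace.get(path)
    if current is None or current != pnfsid_from_door:
        return False        # entry gone or replaced: ignore the door's request
    del namespace[path]
    return True

# Timeline from this issue, schematically:
upload_started("/lhcb/.../file.dst", "pnfsid-A")   # failed xrootd upload
namespace["/lhcb/.../file.dst"] = "pnfsid-B"       # successful WebDAV re-upload
assert delete_failed_upload("/lhcb/.../file.dst", "pnfsid-A") is False
```
Under this rule the late removal should have been a no-op, which is why the alternative explanation (a client-issued delete) is worth checking.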
That is definitely the case. |
That's what I thought from the get-go, but @chaen claimed that xrootd and https transfers are sequential.
It may tell us something |
OK, here's my best guess at what happened.
This explanation seems plausible to me, but it lacks direct evidence for the key points; it's mainly guesswork. I agree with @DmitryLitvintsev that getting the complete billing records for the selected entries may help us gain a better understanding. |
The complete billing record was sent in the RT ticket, along with the xrootd door and pool logs. |
What you describe, @paulmillar, is exactly my gut feeling about what's happening, in particular since I have seen such occurrences at other sites (not dCache), but with SRM, which was asynchronous by nature. |
In my opinion the client should not call remove after failure. That should solve this, no? |
Exactly. This is particularly true, given that the xrootd protocol has the persist-on-successful-close (POSC) flag, which (I think) should give the desired behaviour. |
Thanks. If you do not have a tape backend, you can lower the xrootd lastAccess timeout from 24 hours to, say, 900 seconds like the other protocols. I suggest examining whether raising the default […] to a higher number helps. Or run multiple xrootd doors behind HAProxy (we do that at Fermilab). On the xrootd door, run the "connections" command and count how many connections you have during peaks. |
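For concreteness, such a change would go in the pool's layout/properties. The property name below is purely an assumption (the snippet showing the actual default did not survive in this copy); only the direction of the change, 24 hours down to 900 seconds, comes from the comment above.
```
# Hypothetical property name -- check your dCache version's defaults before use.
# Only the 24 h -> 900 s idea is taken from the comment above.
pool.mover.xrootd.timeout.idle = 900
pool.mover.xrootd.timeout.idle.unit = SECONDS
```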
Hi, We have 2 xrootd doors behind a load balancer for each VO.
I will try to increase it a bit. |
Hi,
Lately, we have put the IN2P3 storage (dCache 9.2.6) under a lot of stress, leading to some instabilities. In practice, some xroot upload attempts from jobs running on worker nodes ended up hitting a timeout. When that happens, the job retries the same upload using another protocol (https). An example of a job upload log would be the following: […]
These are UTC times. If I convert to CET for easier comparison later on, we have: […]. (The successful https upload itself is not in the logs; we don't log successes :-) )
Uploads are done via the gfal2 python API, in a completely sequential way (no thread, no process). It boils down to a plain gfal2 copy call, and this is called once per protocol: so first dest_url is xroot, then it's https if it fails.
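For reference, a minimal sketch of what such a sequential gfal2-python upload loop typically looks like. This is not the actual DIRAC code (that snippet did not survive in this copy); the URLs, function name and timeout value are made up for illustration.
```python
# Hedged sketch only: not the DIRAC code; URLs and values are illustrative.
import gfal2

def upload(src_url, dest_urls):
    """Try each destination URL in turn, fully sequentially (no threads)."""
    ctx = gfal2.creat_context()
    last_error = None
    for dest_url in dest_urls:              # first the xroot URL, then https
        params = ctx.transfer_parameters()
        params.overwrite = True
        params.timeout = 300                # illustrative value only
        try:
            ctx.filecopy(params, src_url, dest_url)
            return dest_url                 # success: stop trying protocols
        except gfal2.GError as err:
            # gfal2 also has a cleanup step that can remove the destination
            # after a failed copy -- the removal discussed in this issue
            last_error = err
    raise last_error if last_error else RuntimeError("no destination URLs")

# Hypothetical call, for illustration:
# upload("file:///tmp/00231199_00003326_7.RXcHad.Strip.dst",
#        ["root://ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/...",
#         "https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/..."])
```
The important property for this issue is that the call returns (or raises) before the next protocol is tried, i.e. nothing client-side should still be running by the time the https upload starts.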
However, the file does not exist on the storage. Asking Adrien (@ageorget) for the dCache logs, this is what he has: […]
There are multiple things to note:
- What does Transfer interrupted mean? And if the transfer is interrupted, why do we still get No connection from client after 300 s?
- The https request at 09:25:32 corresponds to what we see (ok, interpret) in the DIRAC logs.
- There is an xrootd removal request at 09:30:16, coming from the very same host. At this stage, the job is already finished (we have the exit timestamp at 09:25:53), and even the pilot job has finished its duty (exit timestamp 09:26:01). We do not have explicit removal code, but gfal has a cleanup mechanism which triggers upon a failed upload. However, this is absolutely synchronous from the client side, and we would not have received the xroot error (09:25:20) if the delete call had not returned.
How can that asynchronous removal happen? Can it be that the removal request reaches the headnode, but is only executed later?
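To make the ordering concrete, here is a small illustrative snippet that simply lines up the local (CET) timestamps quoted above; the point is that the removal arrives several minutes after the client has already re-uploaded the file over https and exited. The event labels are paraphrases, the timestamps are the ones quoted in this report.
```python
from datetime import datetime

# CET timestamps quoted in this report (door/billing and DIRAC logs)
events = {
    "xroot upload attempt (door/billing)": "08:52:38",
    "client receives the xroot error":     "09:25:20",
    "https upload seen by dCache":         "09:25:32",
    "job exit":                            "09:25:53",
    "pilot exit":                          "09:26:01",
    "xrootd removal request (door)":       "09:30:16",
}

t = {label: datetime.strptime(ts, "%H:%M:%S") for label, ts in events.items()}

# The door-side removal lands ~4 min 44 s after the successful https upload,
# long after the client stopped doing anything at all.
print(t["xrootd removal request (door)"] - t["https upload seen by dCache"])
# -> 0:04:44
```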
This is extremely bad, as we are convinced that the file has been successfully written, but in fact it is not. Effectively, this is data loss that has been biting us badly for the past couple of weeks.
Thanks a lot for your help !