-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARCRecord entered inconsistent state for some ARC files #40
base: master
Are you sure you want to change the base?
Conversation
…arser could skip into next record, invalidating further iteration of the ARC stream
…, but ":" is legal according to RFC2616 Test: Extended unit test to check for extracted status
This 'off spec' hook looks like an ARC format variation edge case to me (see https://github.com/iipc/webarchive-commons/pull/40/files#diff-1a20f5668b210061c0267c7f079c8830L600), so I'd really like @nlevitt or @kngenie to take a look at this and see if they are happy with the change. |
With the check for empty line (i.e. no more HTTP-response lines), keeping the check for colon just means that individual records are wrongly processed instead of the rest of the ARC file. But I do find it curious to discard well-formed status lines. If the colon-check it kept, maybe a rationale could be given as a comment in the code? |
@tokee And, just to check, the removal of that colon-test is the only material change you're making right? All the other stuff is in the /test/ area, right? |
Removing of colon and adding of newline-check are the only real changes, yes. We are in the process of testing with 95 known problem-ARCs (~100MB each) locally and I am happy to say that the first 9 has passed without ARC-parse-showstoppers with the new ARCReader. |
Out test run finished with 94/95 previously partly-skipped ARC-files being fully processed. The last one seems to fail due to the HTTP-header-parser being too greedy (I'll update issue #41 with a description). @thomasegense informs me that about 3% of our ARC files gets marked as problematic when indexing without the patch, so it seems that colon is not uncommon in HTTP-status lines. |
Seems to me that this change is all about supporting real arcs which actually conforms to the spec. But since it is possible for this change to let through records to OpenWayback which today would fail, I propose it to be included in a minor release and not the next bugfix release. |
If no status could be found, ARCReader#readHttpHeader read forward until next status in the originating ARC-Stream, thereby reading past the current record and into the next one. This was combined with a faulty exclusion of status lines containing colons.
This pull request contains a sample ARC file that triggered the two bugs as well as unit tests for the bugs.