Recurring crawl errors! #1244

tidoust · 2024-06-03T17:31:23Z

The crawl is resilient, happily reuses previous extracts and hides the errors (see #1131), but it has been a while since Reffy managed to crawl all specs without any error.

The following git command can be used to track changes to the line that reports the number of errors in ed/index.json:

git log -L 650,653:ed/index.json

Looking at the result, the last time there were 0 errors was on 18 April 2024. Since then, ed/index.json has typically reported about 20 crawl errors. There are variations, but most errors are server errors (internal errors or rejections of requests) and timeouts. Looking at today's last crawl, I see 27 errors, including:

HTTP status 429: 4 errors, W3C specs (but only one /TR, the rest being Patent Policy, Process, GIF89a)
HTTP status 500: 5 errors, drafts.fxtf.org specs
HTTP status 503: 5 errors, 4 /TR specs, 1 w3c.github.io/reporting/
HTTP status 504: 8 errors, 7 Houdini specs, 1 for css-viewport-1
Network timeout: 4 errors, 2 for drafts.fxtf.org specs, 1 for https://w3c.github.io/aria/, 1 for the SVG draft
ReSpec generation timeout: 1 error for Gamepad Extensions (this one is easy to reproduce, need to investigate)
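
As a quick sanity check on the breakdown above, the per-category counts can be tallied to confirm that they add up to the 27 reported errors and to see how many are HTTP status errors versus timeouts. This is only a sketch: the dictionary is copied by hand from the list, not read from ed/index.json.

```python
# Error counts copied by hand from the breakdown above.
errors = {
    "HTTP 429": 4,
    "HTTP 500": 5,
    "HTTP 503": 5,
    "HTTP 504": 8,
    "Network timeout": 4,
    "ReSpec generation timeout": 1,
}

total = sum(errors.values())
http_errors = sum(n for kind, n in errors.items() if kind.startswith("HTTP"))
timeouts = total - http_errors

print(f"total: {total}")
print(f"HTTP status errors: {http_errors}")
print(f"timeouts: {timeouts}")
```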

These errors seem representative of other crawl results. I don't get these errors when I run a crawl locally, except for the one on Gamepad Extensions... and for a 429 on https://drafts.fxtf.org/geometry-1/ which does not appear in Webref's data.

I'm creating this issue to explore possible workarounds that could get us back to normal. We also have recurring build failures with similar errors in browser-specs.


tidoust · 2024-06-04T12:55:47Z

The crawler currently processes the list 4 specs at a time. Most of the time, it just fetches the core URL with appropriate HTTP cache headers, gets a 304, reuses the previous crawl results, and moves on to the next spec. This allows us to crawl things faster. From a server perspective, though, this might be interpreted as the crawler sending many requests at once.

To be seen as a nicer bot, the crawler could perhaps:

1. Process the list 2 or 3 specs at a time. Serializing things completely would probably make the crawl too slow.
2. Sort specs initially to "spread origins", so that the crawler needs to process a few specs before it gets back to sending another request to a given origin.
3. Add something like a 1-2s delay between requests sent to a given origin, to avoid reaching the 180 requests/minute limit for W3C servers.
4. Block requests to CSS stylesheets and other known resources we don't need, such as fixup.js for /TR specs. But since the crawler already caches responses to these resources in practice, it's not obvious that we would gain anything.
5. Schedule browser-specs builds and Webref crawls further apart. They don't run at the same time but within the same hour for now, and rate limits probably get reset after an hour or so.
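
To illustrate the second idea in the list above (sorting specs to "spread origins"), here is a minimal sketch that interleaves URLs round-robin across hosts. It is not Reffy's actual code: the function name `spread_origins` is hypothetical, and using the URL hostname as the "origin" is an assumption made for the example.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def spread_origins(urls):
    """Reorder URLs round-robin across hosts so that consecutive
    entries target different servers whenever possible."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    # Take one URL from each host in turn; hosts that run out of URLs
    # contribute None, which is filtered out below.
    rounds = zip_longest(*by_host.values())
    return [url for rnd in rounds for url in rnd if url is not None]
```

With three www.w3.org URLs and two drafts.csswg.org URLs as input, the result alternates between the two hosts until the shorter group is exhausted, so back-to-back requests to one server only occur at the tail of the list.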

dontcallmedom · 2024-06-04T13:27:59Z

I think I'd start with 2 & 3 as the likely biggest bang for the buck

The crawler has a hard time crawling all specs nowadays due to more stringent restrictions on servers that lead to network timeouts and errors. See: w3c/webref#1244

The goal of this update is to reduce the load of the crawler onto servers. Two changes:

1. The list of specs to crawl gets sorted to distribute origins. This should help with diluting requests sent to a specific server at once. The notion of "origin" used in the code is loose and more meant to identify the server that serves the resource than the actual origin.
2. Requests sent to a given origin are serialized, and sent 2 seconds minimum after the last request was sent (and processed). The crawler still processes the list 4 specs at a time otherwise (provided the specs are to be retrieved from different origins).

The consequence of 1. is that the specs are no longer processed in order, so logs will make the crawler look a bit drunk, processing specs seemingly randomly, as in:

```
1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
8/610 - https://compat.spec.whatwg.org/ - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
8/610 - https://compat.spec.whatwg.org/ - done
66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```
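
The second change described above (serialize requests per origin, wait at least 2 seconds after the previous request was processed, still handle up to 4 specs at a time) can be sketched with asyncio as follows. This is an illustration under assumptions, not the code from the pull request: `OriginThrottle`, `crawl` and the `fetch` callback are hypothetical names, and the URL hostname stands in for the looser notion of "origin" used by the crawler.

```python
import asyncio
import time
from urllib.parse import urlsplit

MIN_DELAY = 2.0    # seconds between requests to one origin (from the text)
MAX_PARALLEL = 4   # specs processed at a time (from the text)


class OriginThrottle:
    """Serializes requests per origin and spaces them by a minimum delay."""

    def __init__(self, min_delay=MIN_DELAY):
        self.min_delay = min_delay
        self._locks = {}      # origin -> asyncio.Lock
        self._last_done = {}  # origin -> time the last request finished

    async def run(self, url, fetch):
        origin = urlsplit(url).hostname  # assumption: hostname as "origin"
        lock = self._locks.setdefault(origin, asyncio.Lock())
        async with lock:  # only one in-flight request per origin
            last = self._last_done.get(origin)
            if last is not None:
                wait = self.min_delay - (time.monotonic() - last)
                if wait > 0:
                    await asyncio.sleep(wait)
            try:
                return await fetch(url)
            finally:
                # The delay is counted from the end of processing.
                self._last_done[origin] = time.monotonic()


async def crawl(urls, fetch, throttle):
    """Process URLs with at most MAX_PARALLEL of them active at once."""
    slots = asyncio.Semaphore(MAX_PARALLEL)

    async def one(url):
        async with slots:
            return await throttle.run(url, fetch)

    return await asyncio.gather(*(one(u) for u in urls))
```

A call such as `asyncio.run(crawl(urls, fetch, OriginThrottle()))` returns results in input order. Note that in this sketch a task waiting for its origin keeps its slot, so the 4-way parallelism only pays off when neighbouring URLs point to different origins, which is what sorting the list to distribute origins is meant to provide.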

tidoust · 2024-06-08T12:00:16Z

The new throttling logic seems to work fine: no crawl error since yesterday. The rules are now:

The crawler only sends one request at a time to a given origin
The crawler sleeps 2s in between requests sent to the csswg.org server (down to 1s for www.w3.org, and 100ms for other origins)
The crawler avoids loading associated resources (CSS stylesheets, images including SVG, some scripts that we know we do not need)
csswg.org, fxtf.org and css-houdini.org are considered to be the same origin
All xxx.github.io URLs are considered to be the same origin
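
The origin grouping and per-origin delays listed above can be expressed as two small functions. This is a sketch under assumptions rather than the crawler's actual implementation: the names `throttle_key` and `delay_seconds` are hypothetical, and it assumes that the 2s delay applies to the whole csswg.org / fxtf.org / css-houdini.org group since the three are treated as one origin.

```python
from urllib.parse import urlsplit

# Hosts under these domains are treated as one server (from the rules above).
CSS_DOMAINS = ("csswg.org", "fxtf.org", "css-houdini.org")


def _in_domain(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def throttle_key(url):
    """Map a URL to the loose "origin" used for throttling."""
    host = (urlsplit(url).hostname or "").lower()
    if any(_in_domain(host, d) for d in CSS_DOMAINS):
        return "csswg.org"
    if _in_domain(host, "github.io"):
        return "github.io"
    return host


def delay_seconds(key):
    """Minimum pause between two requests sent to the same throttle key."""
    if key == "csswg.org":
        return 2.0
    if key == "www.w3.org":
        return 1.0
    return 0.1  # 100ms for every other origin
```

For example, `throttle_key("https://drafts.fxtf.org/geometry-1/")` and `throttle_key("https://drafts.css-houdini.org/css-typed-om-2/")` both yield `"csswg.org"`, so requests to these two hosts would be serialized together and spaced 2 seconds apart.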

A crawl now takes longer (16-20 minutes for a full crawl, 5-6 minutes when most specs can be skipped because they did not change), but it does not have to be fast and that remains reasonable.

For documentation purposes, here are the known usage limits that have been put in place on servers:

For the CSS server, 1 request per second, with temporary bans in case of excessive usage, see the comment on "Add throttling logic per origin" (browser-specs#1356)
For W3C resources, 180 requests/minute
No precise usage limit for github.io URLs, but usage needs to remain "reasonable", see usage limits documentation
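
As a rough check, the delays used by the crawler can be compared with the documented limits. Because requests to a given origin are serialized, the delay alone gives an upper bound on the request rate; the real rate is lower since the delay starts after each request has been processed. The snippet below is only a worked calculation, with the 1 request per second limit for the CSS server converted to 60 requests per minute.

```python
# Documented limits, in requests per minute (1 req/s = 60 req/min).
limits_per_minute = {"csswg.org": 60, "www.w3.org": 180}
# Delays between requests used by the crawler, in seconds.
delays = {"csswg.org": 2.0, "www.w3.org": 1.0}

# Upper bound on the request rate implied by each delay.
max_rates = {origin: 60 / delay for origin, delay in delays.items()}

for origin, limit in limits_per_minute.items():
    print(f"{origin}: at most {max_rates[origin]:.0f} req/min, limit {limit} req/min")
```

Under these figures the crawler stays at or below half of each documented limit: at most 30 requests per minute against 60 for the CSS server, and at most 60 against 180 for www.w3.org.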

tidoust mentioned this issue Jun 3, 2024

Make multipage flag specifically target release and/or nightly w3c/browser-specs#1345

Merged

tidoust mentioned this issue Jun 5, 2024

Reduce crawler load on servers w3c/reffy#1581

Merged

tidoust closed this as completed Jun 8, 2024
