
Extending the API to expose crawl events as an RSS/Atom feed #28

Open
anjackson opened this issue Apr 1, 2022 · 2 comments

Comments


anjackson commented Apr 1, 2022

At the level of individual URLs, expose CDX information as an RSS feed of crawl events, allowing users to be notified if a particularly interesting page has changed, e.g.

/api/mementos/rss?url=http://example.com/

where (by default) only changes, i.e. crawls with different hashes, are reported.
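As a rough illustration of the filtering this implies, the sketch below collapses a CDX capture list into change events by keeping only captures whose digest differs from the previous one. The endpoint URL, the output=json response with a leading header row, and the field names are assumptions for the sake of the example, not this project's actual API.

```python
# Sketch: reduce a CDX capture list to "change events" by keeping only
# captures whose content digest differs from the previous capture.
# The endpoint, parameters and field layout are illustrative assumptions.
import requests

CDX_API = "https://example.org/cdx"  # hypothetical CDX endpoint


def change_events(url):
    resp = requests.get(CDX_API, params={"url": url, "output": "json"}, timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    header, captures = rows[0], rows[1:]  # assume IA-style JSON: header row first
    ts_i, digest_i = header.index("timestamp"), header.index("digest")
    last_digest = None
    for row in captures:
        if row[digest_i] != last_digest:  # content changed since the last crawl
            yield {"timestamp": row[ts_i], "digest": row[digest_i]}
            last_digest = row[digest_i]


# Each change event would become one <item>/<entry> in the RSS/Atom feed response.
for event in change_events("http://example.com/"):
    print(event["timestamp"], event["digest"])
```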

anjackson changed the title from "Extending the API to expose crawl events of a URL (or site?!) as an RSS/Atom feed for integration into IFTTT/Zapier/etc." to "Extending the API to expose crawl events as an RSS/Atom feed" on Apr 1, 2022

anjackson commented Apr 1, 2022

Looking quickly at the existing experimental URL-scanning tools to see what the timings are like. First, scanning up to 100 million entries from the Guardian, for all time... and it takes a while: just under 7 minutes!

Now, if we restrict it to the last few months, does that help, or is it all scanning time? ...oh no: java.lang.IllegalArgumentException: from={timestamp} and to={timestamp} are currently only implemented for exact matches

Ho hum, this means querying CDX for changes via an API is likely not going to work, as clients will time out. Caching is possible, but would basically mean deriving millions of results from full table scans.
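For reference, the time-restricted scan attempted above would look something like the following. The endpoint and parameter names follow common CDX server conventions (matchType, from, to) and are assumptions here, not a confirmed part of this API.

```python
# Sketch of the restricted scan attempted above: a whole-domain match combined
# with from/to timestamps. Endpoint and parameter names are illustrative.
import requests

resp = requests.get(
    "https://example.org/cdx",          # hypothetical CDX endpoint
    params={
        "url": "theguardian.com",
        "matchType": "domain",          # scan the whole site, not a single URL
        "from": "20220101000000",       # 14-digit CDX timestamps
        "to": "20220401000000",
        "limit": 1000,
    },
    timeout=60,
)
# Per the comment above, this from/to + non-exact-match combination currently
# fails server-side with:
#   java.lang.IllegalArgumentException: from={timestamp} and to={timestamp}
#   are currently only implemented for exact matches
print(resp.status_code, resp.text[:200])
```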

@anjackson
Contributor Author

So, host-level change feeds would need to run off one of:

  • the Elasticsearch crawl-log index (which I'd rather not hook into the access system)
  • the full-text Solr index (but this would mean indexing crawl failures too, which might place too much strain on an already overloaded component)
  • a crawl result database that can be time-range limited, e.g. Apache Iceberg may be performant enough in this case (see the sketch after this list)
  • a recent-stats database that gets populated from raw sources over time, with entries for every host. This is quite complex, and we'd want to be able to report the individual URLs that have changed, not just stats.
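To make the Iceberg option concrete, the time-range limiting would look roughly like this, assuming a Spark session with an Iceberg catalog and a hypothetical crawl_db.crawl_log table with host, url, timestamp and digest columns:

```python
# Sketch: time-range-limited change query against a hypothetical Iceberg table
# of crawl results. Catalog, table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recent-changes").getOrCreate()

recent_changes = spark.sql("""
    SELECT url, timestamp, digest
    FROM crawl_db.crawl_log
    WHERE host = 'theguardian.com'
      AND timestamp BETWEEN '2022-01-01' AND '2022-04-01'
    ORDER BY url, timestamp
""")
# Iceberg's partition and metadata pruning is what would (hopefully) make this
# cheap enough to back a per-host change feed, unlike a full CDX scan.
recent_changes.show(20, truncate=False)
```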
