
Extending the API to expose crawl events as an RSS/Atom feed #28

Open
anjackson opened this issue Apr 1, 2022 · 2 comments

Comments


anjackson commented Apr 1, 2022

At the level of individual URLs, expose CDX information as an RSS feed of crawl events, allowing users to be notified if a particularly interesting page has changed, e.g.

/api/mementos/rss?url=http://example.com/

where (by default) only changes, i.e. crawls with different hashes, are reported.
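As a rough illustration of the filtering this implies, the sketch below collapses a CDX capture list into change events by keeping only captures whose digest differs from the previous one. The endpoint URL, the output=json response with a leading header row, and the field names are assumptions for the sake of the example, not this project's actual API.

```python
# Sketch: reduce a CDX capture list to "change events" by keeping only
# captures whose content digest differs from the previous capture.
# The endpoint, parameters and field layout are illustrative assumptions.
import requests

CDX_API = "https://example.org/cdx"  # hypothetical CDX endpoint


def change_events(url):
    resp = requests.get(CDX_API, params={"url": url, "output": "json"}, timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    header, captures = rows[0], rows[1:]  # assume IA-style JSON: header row first
    ts_i, digest_i = header.index("timestamp"), header.index("digest")
    last_digest = None
    for row in captures:
        if row[digest_i] != last_digest:  # content changed since the last crawl
            yield {"timestamp": row[ts_i], "digest": row[digest_i]}
            last_digest = row[digest_i]


# Each change event would become one <item>/<entry> in the RSS/Atom feed response.
for event in change_events("http://example.com/"):
    print(event["timestamp"], event["digest"])
```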

anjackson changed the title from "Extending the API to expose crawl events of a URL (or site?!) as an RSS/Atom feed for integration into IFTTT/Zapier/etc." to "Extending the API to expose crawl events as an RSS/Atom feed" on Apr 1, 2022

anjackson commented Apr 1, 2022

Looking quickly at the existing experimental URL-scanning tools to see what the timings are like. First, scanning up to 100 million entries from the Guardian, for all time... and it takes a while: just under 7 minutes!

Now, if we restrict it to the last few months, does that help, or is it all scanning time? ...oh no: java.lang.IllegalArgumentException: from={timestamp} and to={timestamp} are currently only implemented for exact matches

Ho hum, this means querying CDX for changes via an API is likely not going to work, as clients will time out. Caching is possible, but would basically mean deriving millions of results from full table scans.
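For reference, the time-restricted scan attempted above would look something like the following. The endpoint and parameter names follow common CDX server conventions (matchType, from, to) and are assumptions here, not a confirmed part of this API.

```python
# Sketch of the restricted scan attempted above: a whole-domain match combined
# with from/to timestamps. Endpoint and parameter names are illustrative.
import requests

resp = requests.get(
    "https://example.org/cdx",          # hypothetical CDX endpoint
    params={
        "url": "theguardian.com",
        "matchType": "domain",          # scan the whole site, not a single URL
        "from": "20220101000000",       # 14-digit CDX timestamps
        "to": "20220401000000",
        "limit": 1000,
    },
    timeout=60,
)
# Per the comment above, this from/to + non-exact-match combination currently
# fails server-side with:
#   java.lang.IllegalArgumentException: from={timestamp} and to={timestamp}
#   are currently only implemented for exact matches
print(resp.status_code, resp.text[:200])
```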

@anjackson
Contributor Author

So, host-level change feeds would need to run off one of:

  • the Elasticsearch crawl-log index (which I'd rather not hook into the access system)
  • the full-text Solr index (but this would mean indexing crawl failures too, which might place too much strain on an already overloaded component)
  • a crawl result database that can be time-range limited, e.g. Apache Iceberg may be performant enough in this case (see the sketch after this list)
  • a recent-stats database that gets populated from raw sources over time, with entries for every host. This is quite complex, and we'd want to be able to report the individual URLs that have changed, not just stats.
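To make the Iceberg option concrete, the time-range limiting would look roughly like this, assuming a Spark session with an Iceberg catalog and a hypothetical crawl_db.crawl_log table with host, url, timestamp and digest columns:

```python
# Sketch: time-range-limited change query against a hypothetical Iceberg table
# of crawl results. Catalog, table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recent-changes").getOrCreate()

recent_changes = spark.sql("""
    SELECT url, timestamp, digest
    FROM crawl_db.crawl_log
    WHERE host = 'theguardian.com'
      AND timestamp BETWEEN '2022-01-01' AND '2022-04-01'
    ORDER BY url, timestamp
""")
# Iceberg's partition and metadata pruning is what would (hopefully) make this
# cheap enough to back a per-host change feed, unlike a full CDX scan.
recent_changes.show(20, truncate=False)
```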
