Add crawl-spec support #16

Open
anjackson opened this issue Sep 29, 2021 · 0 comments
anjackson commented Sep 29, 2021

Look at supporting crawlspec in addition to the current UKWA format (see here while noting extensions here).

As part of this, consider whether and how to collapse separate Targets for the same Host into a single crawlspec. This needs investigation because some crawl settings are only meaningful at the host level, whereas the current configuration treats each Target as independent. For example, there could be one crawlspec for Twitter that sets parallel queues and similar host-level options but includes multiple seeds.
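
To make the collapsing idea concrete, here is a minimal sketch in Python. It is not the crawlspec format or the current UKWA format: the record fields (`seed`, `host_settings`) and the function name `collapse_targets_by_host` are hypothetical, chosen only for illustration. It groups Targets by the hostname of their seed and refuses to merge Targets whose host-level settings disagree, since that conflict is exactly the case that needs investigation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def collapse_targets_by_host(targets):
    """Merge per-Target records into one spec-like dict per host.

    Each target is a dict with a "seed" URL and an optional
    "host_settings" dict. Both field names are assumptions made
    for this sketch, not part of any real format.
    """
    grouped = defaultdict(lambda: {"seeds": [], "host_settings": {}})
    for target in targets:
        # urlsplit lower-cases the hostname; it is None if the seed has no scheme.
        host = urlsplit(target["seed"]).hostname
        entry = grouped[host]
        if target["seed"] not in entry["seeds"]:
            entry["seeds"].append(target["seed"])
        for key, value in target.get("host_settings", {}).items():
            # Keep the first value seen and fail loudly on a conflict.
            previous = entry["host_settings"].setdefault(key, value)
            if previous != value:
                raise ValueError(
                    f"conflicting {key!r} for host {host!r}: {previous!r} vs {value!r}"
                )
    return dict(grouped)


targets = [
    {"seed": "https://twitter.com/UKWebArchive", "host_settings": {"parallel_queues": 2}},
    {"seed": "https://twitter.com/britishlibrary", "host_settings": {"parallel_queues": 2}},
    {"seed": "https://example.org/"},
]
specs = collapse_targets_by_host(targets)
```

Raising on conflicting settings is only one possible policy; choosing the strictest value, or the most recently updated one, would be equally plausible and is part of what needs deciding.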

Also consider using a keyed, compacted Kafka topic and always reading the whole topic on startup, so that crawls stay consistent across restarts. See ukwa/crawl-streams#4
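
The startup replay could look roughly like the sketch below. It deliberately uses no Kafka client: `records` stands in for the (key, value) pairs a consumer would yield when reading the topic from the earliest offset, and the function name `replay_compacted_log` and the host-keyed example data are assumptions for illustration. It relies on two properties of compacted topics: the latest value for each key is retained, and a record with a null value is a tombstone that deletes the key.

```python
def replay_compacted_log(records):
    """Rebuild the current state from a keyed log read from the start.

    `records` is an iterable of (key, value) pairs in log order.
    A value of None is treated as a tombstone, mirroring how a
    compacted Kafka topic marks a key as deleted.
    """
    state = {}
    for key, value in records:
        if value is None:
            # Tombstone: forget this key if we have seen it.
            state.pop(key, None)
        else:
            # Later records for the same key overwrite earlier ones.
            state[key] = value
    return state


# Hypothetical log contents, keyed by host.
records = [
    ("twitter.com", {"parallel_queues": 2}),
    ("example.org", {"parallel_queues": 1}),
    ("twitter.com", {"parallel_queues": 4}),
    ("example.org", None),
]
state = replay_compacted_log(records)
```

Because compaction runs in the background, a reader can still see older records for a key before the latest one, so the last-write-wins loop is needed even on a compacted topic.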
