Look at supporting crawlspec in addition to the current UKWA format (see here while noting extensions here).
As part of this, consider whether and how to collapse separate Targets for the same Host into a single crawlspec. This needs investigation because some crawl settings are only meaningful at the host level, whereas the current configuration treats each Target as independent. For example, there could be one crawlspec for Twitter that sets parallel queues etc. and includes multiple seeds.
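To make the collapsing idea concrete, here is a minimal sketch. It assumes a hypothetical Target record with a `seeds` list and an optional `parallel_queues` setting. Neither field name comes from the crawlspec or UKWA formats. The merge rule (take the largest `parallel_queues` seen for a host) is also just an assumption, and deciding the real rule is part of the investigation.

```python
from urllib.parse import urlsplit


def collapse_targets_by_host(targets):
    """Merge per-Target records into one spec per host.

    `targets` is a list of dicts with hypothetical fields:
      - "seeds": list of seed URLs
      - "parallel_queues": optional int, a host-level setting
    """
    specs = {}
    for target in targets:
        for seed in target["seeds"]:
            # Seeds of one Target may span hosts, so key on each seed's host.
            host = urlsplit(seed).hostname
            spec = specs.setdefault(
                host, {"host": host, "seeds": [], "parallel_queues": 1}
            )
            if seed not in spec["seeds"]:
                spec["seeds"].append(seed)
            # Assumed conflict rule: keep the largest value seen for the host.
            spec["parallel_queues"] = max(
                spec["parallel_queues"], target.get("parallel_queues", 1)
            )
    return specs
```

With two Twitter Targets and one other Target as input, this would yield two specs, with the Twitter spec carrying both seeds and a single `parallel_queues` value.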
Also consider using a keyed, compacted Kafka topic and always reading all of it in on startup, so that crawls stay consistent across restarts. See ukwa/crawl-streams#4
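The startup replay could work roughly as follows. This sketch does not use a Kafka client. It models the topic as an in-memory list of `(key, value)` pairs in offset order, and the function name and the choice of host as the key are assumptions. It relies only on standard compacted-topic semantics: the latest value per key wins, and a null value is a tombstone that deletes the key.

```python
def replay_compacted_topic(records):
    """Rebuild the current crawl configuration from a full topic replay.

    `records` is an iterable of (key, value) pairs in offset order,
    standing in for messages read from the beginning of the topic.
    """
    state = {}
    for key, value in records:
        if value is None:
            # Tombstone: the key was deleted, so drop any earlier value.
            state.pop(key, None)
        else:
            # A later record for the same key replaces the earlier one.
            state[key] = value
    return state


# Hypothetical log keyed by host.
log = [
    ("twitter.com", {"parallel_queues": 1}),
    ("example.org", {"parallel_queues": 1}),
    ("twitter.com", {"parallel_queues": 4}),
    ("example.org", None),
]
config = replay_compacted_topic(log)
```

Because the result depends only on the log contents, replaying from the start after a restart gives the same configuration whether or not the broker has compacted older records yet.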