A concurrent crawler and site downloader to make a local copy of a website. This is used by GOV.UK to populate mirrors hosted by AWS S3 and GCP Storage.
Configuration is handled through environment variables as listed below:
- SITE: Specifies the starting URL for the crawler.
  - Example: `SITE=https://www.gov.uk`
- ALLOWED_DOMAINS: A comma-separated list of hostnames permitted to be crawled.
  - Example: `ALLOWED_DOMAINS=domain1.com,domain2.com`
- USER_AGENT: Customizes the user agent for requests. Defaults to `govukbot` if not specified.
  - Example: `USER_AGENT=custom-user-agent`
- HEADERS: A comma-separated list of Name:Value pairs sent as custom headers with each request.
  - Example: `HEADERS=Rate-Limit-Token:ABC123,X-Header:X-Value`
- CONCURRENCY: Controls the number of concurrent requests, useful for controlling the request rate.
  - Example: `CONCURRENCY=10`
- URL_RULES: A comma-separated list of regex patterns matching URLs that the crawler should crawl. All other URLs will be avoided.
  - Example: `URL_RULES=https://www.gov.uk/.*`
- DISALLOWED_URL_RULES: A comma-separated list of regex patterns matching URLs that the crawler should avoid.
  - Example: `DISALLOWED_URL_RULES=/search/.*,/government/.*\.atom`
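Putting the variables above together, a complete configuration might look like the following sketch. The binary name `govuk-mirror` and the `ALLOWED_DOMAINS` value are assumptions for illustration, not taken from this document:

```shell
# Mirror www.gov.uk with modest concurrency, skipping search pages and Atom feeds.
export SITE=https://www.gov.uk
export ALLOWED_DOMAINS=www.gov.uk          # assumed value; list every hostname to crawl
export USER_AGENT=govukbot                 # matches the documented default
export CONCURRENCY=10
export URL_RULES='https://www.gov.uk/.*'
export DISALLOWED_URL_RULES='/search/.*,/government/.*\.atom'

./govuk-mirror                             # binary name assumed; build/run per the project's instructions
```

Single-quoting the regex values stops the shell from glob-expanding `*` or interpreting `\` before the crawler sees them.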