GOV.UK Mirror

A concurrent crawler and site downloader that makes a local copy of a website. GOV.UK uses it to populate mirrors hosted on AWS S3 and GCP Storage.

Usage

Configuration is handled through environment variables as listed below:

SITE: Specifies the starting URL for the crawler.
- Example: SITE=https://www.gov.uk
ALLOWED_DOMAINS: A comma-separated list of hostnames permitted to be crawled.
- Example: ALLOWED_DOMAINS=domain1.com,domain2.com
USER_AGENT: Customizes the user agent for requests. Defaults to govukbot if not specified.
- Example: USER_AGENT=custom-user-agent
HEADERS: A comma-separated list of custom headers to send with requests, each written as Name:Value.
- Example: HEADERS=Rate-Limit-Token:ABC123,X-Header:X-Value
CONCURRENCY: Sets the number of concurrent requests, which is useful for limiting the request rate.
- Example: CONCURRENCY=10
URL_RULES: A comma-separated list of regex patterns matching URLs that the crawler should crawl. URLs that match none of the patterns are skipped.
- Example: URL_RULES=https://www.gov.uk/.*
DISALLOWED_URL_RULES: A comma-separated list of regex patterns matching URLs that the crawler should avoid.
- Example: DISALLOWED_URL_RULES=/search/.*,/government/.*\.atom
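
To make the list formats above concrete, here is a minimal Go sketch of how values shaped like HEADERS, URL_RULES and DISALLOWED_URL_RULES could be parsed and applied. It is not taken from this repository's code: the helper names (envOr, parseHeaders, compileRules, shouldCrawl) are hypothetical, and two behaviours are assumptions made for illustration, namely that disallow rules take precedence over allow rules and that an empty URL_RULES permits every URL.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// envOr returns the value of the environment variable key, or fallback
// when the variable is unset or empty. (Hypothetical helper.)
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// parseHeaders turns "Name:Value,Other:Value" into a map.
// Only the first colon in each pair separates name from value.
func parseHeaders(raw string) (map[string]string, error) {
	headers := make(map[string]string)
	if raw == "" {
		return headers, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		name, value, ok := strings.Cut(pair, ":")
		if !ok {
			return nil, fmt.Errorf("malformed header %q: want Name:Value", pair)
		}
		headers[strings.TrimSpace(name)] = strings.TrimSpace(value)
	}
	return headers, nil
}

// compileRules compiles a comma-separated list of regex patterns.
// Because the list is split on commas, a pattern that itself contains
// a comma (such as a{1,3}) cannot be expressed in this sketch.
func compileRules(raw string) ([]*regexp.Regexp, error) {
	var rules []*regexp.Regexp
	if raw == "" {
		return rules, nil
	}
	for _, pattern := range strings.Split(raw, ",") {
		re, err := regexp.Compile(strings.TrimSpace(pattern))
		if err != nil {
			return nil, fmt.Errorf("invalid rule %q: %w", pattern, err)
		}
		rules = append(rules, re)
	}
	return rules, nil
}

// shouldCrawl reports whether url passes the rules.
// Assumption: a disallow match always wins, and an empty allow list
// permits everything that is not disallowed.
func shouldCrawl(url string, allow, deny []*regexp.Regexp) bool {
	for _, re := range deny {
		if re.MatchString(url) {
			return false
		}
	}
	if len(allow) == 0 {
		return true
	}
	for _, re := range allow {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	// Fallback values are the examples from the list above.
	userAgent := envOr("USER_AGENT", "govukbot")
	headers, err := parseHeaders(envOr("HEADERS", "Rate-Limit-Token:ABC123,X-Header:X-Value"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	allow, err := compileRules(envOr("URL_RULES", "https://www.gov.uk/.*"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	deny, err := compileRules(envOr("DISALLOWED_URL_RULES", `/search/.*,/government/.*\.atom`))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	fmt.Println("user agent:", userAgent)
	fmt.Println("header count:", len(headers))
	for _, u := range []string{
		"https://www.gov.uk/browse",
		"https://www.gov.uk/search/all",
		"https://www.gov.uk/government/feed.atom",
	} {
		fmt.Println(u, shouldCrawl(u, allow, deny))
	}
}
```

With the example values from the list, the sketch accepts https://www.gov.uk/browse and rejects the /search/ and .atom URLs. The real crawler may resolve conflicts between the two rule lists differently.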
