A Go package providing utilities for processing Wikipedia and Wikidata dumps.
Features:
- Supports Wikidata entities JSON dumps.
- Supports Wikimedia Enterprise HTML dumps.
- Supports Wikimedia Commons entities dumps.
- Supports SQL dumps (database layout).
- Decompression and JSON decoding are parallelized for maximum throughput on a single machine.
- Parses into idiomatic Go structs, with no loss of information.
- Can download and process a dump at the same time.
- Can cache downloaded files locally.
- Supports GZIP and BZIP2.
- Supports data in JSON arrays, NDJSON, and SQL.
This is a Go package. You can add it to your project using go get:
go get gitlab.com/tozd/go/mediawiki
It requires Go 1.23 or newer.
See full package documentation on pkg.go.dev.
There is also a read-only GitHub mirror available, if you need to fork the project there.