All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.2.0).
- Upgrade to browsertrix crawler 1.4.0-beta.0 (#434)
- Upgrade to browsertrix crawler 1.3.5 (#426)
- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)
- Upgrade to browsertrix crawler 1.3.3 (#411)
- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)
- Fix help (#393)
- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g. PDFs) is failing to proceed" - #380)
- Add support for uncompressed tar archive in --warcs (#369)
- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)
- Stream files downloads to not exhaust memory (#373)
- Fix documentation on `--diskUtilization` setting (#375)
- Add `--custom-behaviors` argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with Youtube player (#330)
- Add `--warcs` option to directly process WARC files (#301)
- Make it clear that `--profile` argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)
- Sort WARC directories found by modification time (#366)
- Upgraded Browsertrix Crawler to 1.2.6
- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3
- Upgraded Browsertrix Crawler to 1.2.4 (fixes "retrieve automatically the assets present in a data-xxx tag" - #316)
- Upgraded Browsertrix Crawler to 1.2.0 (fixes Youtube videos issue #323)
- Upgrade dependencies (mainly warc2zim 2.0.2)
- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)
- Crawler is not correctly checking disk size / usage (#305)
- New `--version` flag to display Zimit version (#234)
- New `--logging` flag to adjust Browsertrix Crawler logging (#273)
- Use new `--scraper-suffix` flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New `--noMobileDevice` CLI argument
- Publish Docker image for `linux/arm64` (in addition to `linux/amd64`) (#178)
- Use `warc2zim` version 2, which no longer requires a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit, they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- `--userAgent` CLI argument again overrides the `--userAgentSuffix` and `--adminEmail` values
- `--userAgent` CLI argument is not mandatory anymore
- Fix support for Youtube videos (#291)
- Fix crawler `--waitUntil` values (#289)
- Adapt to new `warc2zim` code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5
- New `--build` parameter (optional) to specify the directory holding Browsertrix files; if not set, `--output` directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; subdir is kept only if `--keep` is set.
- `--collection` parameter was not working (#252)
- Using browsertrix-crawler 0.12.3
- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)
- Using browsertrix-crawler 0.12.1
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
- Using browsertrix-crawler 0.11.2
- Using browsertrix-crawler 0.11.1
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
- `--long-description` param
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)
- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil` from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)
- `--newContext` from crawler's update
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
- Initial URL check normalizes homepage redirects to standard ports – 80/443 (#137)
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout (12s to connect, 27s to read) instead of 10s
- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter
- using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- default WARC location after crawl changed from `collections/capture-*/archive/` to `collections/crawl-*/archive/`
- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)
- using crawler 0.3.2 and warc2zim 1.3.6
- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- Warc to zim now written to `{temp_root_dir}/collections/capture-*/archive/` where `capture-*` is dynamic and includes the datetime. (from browsertrix-crawler)
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not
- added support for --custom-css
- added domains block list (default)
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3