We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don't just store old web pages; they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. This collection of notebooks is intended to help historians, and other researchers, frame those questions by revealing what sort of data is available, how to get it, and what you can do with it.
Web archives share systems and standards, which makes it much easier for researchers to get their hands on useful data. These notebooks focus on four particular web archives: the UK Web Archive, the Australian Web Archive (National Library of Australia), the New Zealand Web Archive (National Library of New Zealand), and the Internet Archive. However, the tools and approaches here could be easily extended to other web archives.
Web archives are huge, and access is often limited for legal reasons. These notebooks focus on data that is readily accessible and able to be used without the need for special equipment. They use existing APIs to get data in manageable chunks. But many of the examples demonstrated can also be scaled up to build substantial datasets for analysis – you just have to be patient!
These notebooks are a starting point that I hope will encourage researchers to investigate the possibilities of web archives in more detail. They're intended to complement the fabulous work being done by projects such as Archives Unleashed to open web archives to new research uses.
The development of these notebooks was supported by the International Internet Preservation Consortium's Discretionary Funding Programme 2019-2020, with the participation of the British Library, the National Library of Australia, and the National Library of New Zealand. Thanks all!
See the web archives section of the GLAM Workbench for more information.
- Timegates, Timemaps, and Mementos – explore how the Memento protocol helps you get machine-readable data about web archive captures
- Exploring the Internet Archive's CDX API – some web archives provide indexes of the web pages they've archived through an API, this notebook looks in detail at the data provided by the Internet Archive's CDX API
- Comparing CDX APIs – this notebook documents differences between the Internet Archive's Wayback CDX API and the PyWb CDX API (used by AWA and UKWA)
- Timemaps vs CDX APIs – both Timemaps and CDX APIs can give us a list of captures from a particular web page, this notebook compares the results
- Get the archived version of a page closest to a particular date – the Memento API enables us to get the archived version of a page closest to a particular date; the functions in this notebook smooth out some of the variations across repositories
- Find all the archived versions of a web page – you can get all the captures of an archived page using either Timemaps or the CDX API, this notebook demonstrates both
- Harvesting collections of text from archived web pages – create a dataset from the text contents of a single page across time, or multiple pages
- Harvesting data about a domain using the IA CDX API – extract information about a whole domain using `prefix` and `domain` queries
- Find and explore Powerpoint presentations from a specific domain – a complete workflow from web archive to Powerpoints to PDFs to images and text, and explore it all in Datasette
- Exploring subdomains in the whole of gov.au – scale up your harvesting to assemble a complete set of subdomains over time, and visualise the results as a dendrogram
- Compare two versions of an archived web page – demonstrates a number of different ways that versions of an archived web page can be compared, from metadata to screenshots
- Observing change in a web page over time – getting and visualising information about all the captures of a single page over time
- Create and compare full page screenshots from archived web pages – generate full page screenshots of archived web pages, compare pages, captures, even repositories
- Using screenshots to visualise change in a page over time – create a time series of screenshots, one for each year, compiled into a single image
- Display changes in the text of an archived web page over time – work through, capture by capture, showing how the text content of an archived web page has changed
- Find when a piece of text appears in an archived web page – look for the first or last occurrence of a text string in an archived web page, or just find every occurrence and chart the frequency
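
To give a rough sense of the kind of data the CDX notebooks listed above work with, here's a minimal sketch in Python. It isn't taken from the notebooks themselves. The function names (`build_cdx_query`, `cdx_rows_to_dicts`, `memento_url`) are hypothetical, the endpoint and the `url`, `output`, `matchType`, and `limit` parameters reflect my understanding of the Internet Archive's CDX API, and the sample row is made up for illustration rather than being a real capture. No network request is made; the sketch only builds a query url and reshapes a JSON-style response.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Internet Archive's CDX API.
IA_CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, match_type="exact", limit=None):
    """Build a CDX query url (hypothetical helper, makes no request)."""
    params = {"url": url, "output": "json"}
    # 'prefix' and 'domain' queries widen the search beyond a single page.
    if match_type != "exact":
        params["matchType"] = match_type
    if limit is not None:
        params["limit"] = limit
    return f"{IA_CDX_ENDPOINT}?{urlencode(params)}"


def cdx_rows_to_dicts(rows):
    """Convert JSON output, where the first row is a header, to a list of dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def memento_url(capture):
    """Build the url of an archived capture from its timestamp and original url."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


# Made-up sample in the shape of the CDX API's JSON output (not real data).
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "ABC123", "1000"],
]

query = build_cdx_query("example.com", match_type="prefix", limit=5)
captures = cdx_rows_to_dicts(sample_rows)
first_capture_url = memento_url(captures[0])
```

To run this against a live archive you would fetch `query` with something like `requests.get(query).json()` and pass the result to `cdx_rows_to_dicts`. Other repositories use different endpoints and slightly different parameters, which is what the CDX comparison notebook documents.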
See the GLAM Workbench or Zenodo for up-to-date citation details.
This repository is part of the GLAM Workbench.
If you think this project is worthwhile, you might like to sponsor me on GitHub.