Implement new crawler based on wpull #81

Merged 9 commits on Nov 2, 2023

chosak commented Nov 2, 2023

This PR adds an alternate method of crawling a website based on wpull.

The current approach takes two steps:

  1. Use wget to crawl a website, generating a WARC file
  2. Run a Django management command (warc_to_db) to convert the WARC to a queryable SQLite database

The new approach uses only a single step:

  1. Use wpull plus a custom plugin to crawl a website directly into a queryable SQLite database
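
To illustrate the single-step idea, here is a minimal sketch of the kind of SQLite sink a crawler plugin could write pages into as they are fetched. It uses only the standard-library `sqlite3` module. The `page` table, its columns, and the `SqlitePageSink` class are hypothetical names chosen for illustration; the PR's actual schema and wpull plugin hooks are not shown here.

```python
import sqlite3

# Hypothetical schema: the PR does not show the real table layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    url TEXT PRIMARY KEY,
    status_code INTEGER NOT NULL,
    title TEXT,
    html TEXT
)
"""


class SqlitePageSink:
    """Stores one row per crawled page as soon as it is fetched."""

    def __init__(self, db_filename):
        self.conn = sqlite3.connect(db_filename)
        self.conn.execute(SCHEMA)

    def save_page(self, url, status_code, title, html):
        # INSERT OR REPLACE keeps a single row per URL if a page is re-fetched.
        # Using the connection as a context manager commits on success.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO page (url, status_code, title, html) "
                "VALUES (?, ?, ?, ?)",
                (url, status_code, title, html),
            )

    def close(self):
        self.conn.close()


def demo():
    # An in-memory database stands in for DB_FILENAME.
    sink = SqlitePageSink(":memory:")
    sink.save_page("https://example.com/", 200, "Home", "<html>home</html>")
    sink.save_page("https://example.com/about/", 200, "About", "<html>about</html>")
    rows = sink.conn.execute("SELECT url, title FROM page ORDER BY url").fetchall()
    sink.close()
    return rows
```

Because each page is committed as it is fetched, the database can be queried as soon as the crawl finishes, with no separate WARC conversion step.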

This can be done using a new Django management command:

% ./manage.py crawl --help
Usage: manage.py crawl [OPTIONS] START_URL DB_FILENAME

  Crawl a website to a SQLite database.

Options:
  --max-pages INTEGER            Maximum number of pages to crawl
  --depth INTEGER                Maximum crawl depth
  --recreate                     Overwrite SQLite database if it already
                                 exists  [default: False]
  --resume
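
For readers who want to experiment with the interface outside Django, the options above can be approximated with the standard-library `argparse` module. This is a stand-in sketch, not the PR's implementation: `build_parser` is a hypothetical helper, and treating `--recreate` and `--resume` as boolean flags is an assumption based on the help output.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="manage.py crawl",
        description="Crawl a website to a SQLite database.",
    )
    parser.add_argument("start_url", metavar="START_URL")
    parser.add_argument("db_filename", metavar="DB_FILENAME")
    parser.add_argument(
        "--max-pages", type=int, help="Maximum number of pages to crawl"
    )
    parser.add_argument("--depth", type=int, help="Maximum crawl depth")
    # Assumption: --recreate is a boolean flag, matching "[default: False]".
    parser.add_argument(
        "--recreate",
        action="store_true",
        help="Overwrite SQLite database if it already exists",
    )
    # The help text for --resume is cut off in the output above,
    # so no description is given here. Assumed to be a boolean flag.
    parser.add_argument("--resume", action="store_true")
    return parser


# Example invocation with a placeholder URL and database filename.
args = build_parser().parse_args(
    ["--max-pages", "100", "--depth", "2", "https://example.com/", "crawl.sqlite3"]
)
```

With the real command, the equivalent call would pass the same options and positional arguments to `./manage.py crawl`.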

Because wpull does not support Python versions newer than 3.6 (ArchiveTeam/wpull#426), this new approach requires downgrading this repo's runtime to Python 3.6. That in turn requires downgrading Django from 4.x to 3.2.

Unfortunately wpull only supports Python 3.6, see

- ArchiveTeam/wpull#404
- ArchiveTeam/wpull#451

Django 4.0 dropped support for Python 3.6, see

https://docs.djangoproject.com/en/4.2/releases/4.0/#python-compatibility

In order to integrate wpull with the viewer application, we need to
downgrade the viewer Django version from 4.0 to 3.2.

Unfortunately wpull only supports Python 3.6, see

- ArchiveTeam/wpull#404
- ArchiveTeam/wpull#451

In order to integrate wpull with the viewer application, we need to
downgrade Python from 3.8 to 3.6.

This change adds a new management command (manage.py crawl) that crawls
a website directly into a SQLite database, using the wpull package:

https://github.com/ArchiveTeam/wpull

Usage: manage.py crawl [OPTIONS] START_URL DB_FILENAME

chosak merged commit bcd66f0 into main on Nov 2, 2023
4 checks passed
chosak deleted the feature/wpull branch on November 2, 2023 at 13:22