Store multiple crawls in a single database #105

chosak · 2024-09-12T19:29:41Z

This PR significantly changes how this package uses its database. Instead of storing each website crawl in a separate SQLite database file, all crawls are now stored in the same Django database, which can be configured to use any database backend supported by Django. Database tables are now managed by Django migrations, and a new Crawl model keeps track of the status of past crawls, including whether they succeeded or failed.
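To make the "many crawls, one database" idea concrete, here is a minimal sketch using only the standard-library `sqlite3` module. It is not the package's code: the PR uses Django models and migrations, and the table names, column names, and status values below are hypothetical, chosen only to illustrate a crawl record that tracks success or failure and owns its pages.

```python
import sqlite3

# Hypothetical schema: names and status values are illustrative assumptions,
# not the tables that the package's Django migrations create.
SCHEMA = """
CREATE TABLE crawl (
    id INTEGER PRIMARY KEY,
    started TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('started', 'finished', 'failed'))
);
CREATE TABLE page (
    id INTEGER PRIMARY KEY,
    crawl_id INTEGER NOT NULL REFERENCES crawl(id) ON DELETE CASCADE,
    url TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open one shared database that holds every crawl."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce page -> crawl links
    conn.executescript(SCHEMA)
    return conn


def start_crawl(conn, started):
    """Record a new crawl and return its id."""
    cur = conn.execute(
        "INSERT INTO crawl (started, status) VALUES (?, 'started')", (started,)
    )
    return cur.lastrowid


def add_page(conn, crawl_id, url):
    """Attach a crawled page to the crawl it belongs to."""
    conn.execute("INSERT INTO page (crawl_id, url) VALUES (?, ?)", (crawl_id, url))


def finish_crawl(conn, crawl_id, succeeded):
    """Mark a crawl as finished or failed."""
    status = "finished" if succeeded else "failed"
    conn.execute("UPDATE crawl SET status = ? WHERE id = ?", (status, crawl_id))


def latest_successful_crawl(conn):
    """Return the id of the most recent successful crawl, or None."""
    row = conn.execute(
        "SELECT id FROM crawl WHERE status = 'finished' "
        "ORDER BY started DESC, id DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

With a layout like this, a failed crawl leaves the previous successful crawl's pages untouched, so a viewer can keep serving the latest successful crawl.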

This PR also adds Python tests for 100% of testable Python code (excluding only the plugin to the wpull crawler, which is difficult to test without running a real crawl). This package has been migrated to pytest and pytest-cov for simpler testing and coverage checks. Moving forward, PR checks will fail if Python coverage drops below 100%.

(As a TODO, a future PR will need to add a management command to clean up old crawls so that the database doesn't grow indefinitely.)
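As a rough illustration of what such a cleanup could decide, here is a sketch of one possible retention rule in plain Python. The rule itself (keep the N most recent crawls, and never delete the most recent successful one) is an assumption for illustration, as are the function name, the tuple layout, and the `"finished"` status value; none of this is taken from the package.

```python
from datetime import datetime


def crawls_to_delete(crawls, keep=3):
    """Return the ids of crawls that a cleanup could remove.

    `crawls` is a list of (crawl_id, started, status) tuples, where
    `started` is a datetime. Hypothetical rule: keep the `keep` most
    recent crawls, plus the most recent successful crawl.
    """
    # Newest first.
    ordered = sorted(crawls, key=lambda c: c[1], reverse=True)

    # Always keep the most recent `keep` crawls.
    protected = {crawl_id for crawl_id, _started, _status in ordered[:keep]}

    # Also keep the newest successful crawl, however old it is.
    for crawl_id, _started, status in ordered:
        if status == "finished":
            protected.add(crawl_id)
            break

    return [crawl_id for crawl_id, _s, _st in ordered if crawl_id not in protected]
```

Protecting the newest successful crawl means a run of failed crawls cannot push the last good data out of the database.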

chosak · 2024-09-12T21:11:34Z

@willbarton in 5bf3d93 I added a workaround to handle missing SVG icons in the Python tests when the frontend build hasn't been run. As on cf.gov, the frontend build copies the CFPB Design System SVG icons from node_modules to a location where Django can find them and load them inline during template rendering. We don't want to have to run the frontend install and build steps just to run the Python tests successfully.

To get around this, I've added a simple template loader that ignores missing SVGs. cf.gov has custom code that handles this differently (by inserting a placeholder SVG), which I'd prefer not to copy here. Thanks to @anselmbradford for counsel on the different options.
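The general idea of a loader that ignores missing SVGs can be sketched without Django as a chain of loaders with a last-resort fallback. This is a framework-free illustration, not the loader added in 5bf3d93 and not Django's template loader API; every name below (`TemplateNotFound`, `make_dict_loader`, `ignore_missing_svgs`, `load_template`) is hypothetical.

```python
class TemplateNotFound(Exception):
    """Raised when no loader can supply the requested template."""


def make_dict_loader(templates):
    """Build a loader that serves templates from an in-memory dict."""

    def load(name):
        try:
            return templates[name]
        except KeyError:
            raise TemplateNotFound(name) from None

    return load


def ignore_missing_svgs(name):
    """Fallback loader: treat any missing .svg as an empty template."""
    if name.endswith(".svg"):
        return ""
    raise TemplateNotFound(name)


def load_template(name, loaders):
    """Try each loader in order and return the first template found."""
    for loader in loaders:
        try:
            return loader(name)
        except TemplateNotFound:
            continue
    raise TemplateNotFound(name)
```

Because the fallback sits last in the chain, real SVG files still win whenever the frontend build has produced them, and other missing templates still raise an error.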

Currently this project has only a single settings file, which doesn't distinguish between testing, development, and production. As part of future work I'll split that out so that this new loader isn't running in production, but this seems like a reasonable path forward for now.


chosak added 3 commits September 11, 2024 09:35

Migrate Python tests to pytest

1ab76d4

Create Python test fixture from sample database

e8cf7f5

Store multiple crawls in single DB, + tests

4f95390

chosak requested a review from willbarton September 12, 2024 19:29

chosak added 2 commits September 12, 2024 15:32

Make JSON fixture compliant with Prettier

b1ebbdd

Exclude Django fixture file from prettier

12903eb

Don't fail Python tests for missing SVG icons

5bf3d93

We don't want to have to run the frontend build to run Python tests.

chosak force-pushed the feature/multi-crawl branch from 844f245 to 5bf3d93 Compare September 12, 2024 21:14

Update fabfile for persistent database

0610913

chosak merged commit 54d1451 into main Sep 13, 2024
3 checks passed

chosak deleted the feature/multi-crawl branch September 13, 2024 20:31

chosak mentioned this pull request Sep 16, 2024

Add management command to manage crawls in the database #106

Merged
