`add_license_url` DAG is inefficient and fails due to timeout #1270
Labels
💻 aspect: code
Concerns the software code in the repository
🧰 goal: internal improvement
Improvement that benefits maintainers, not users
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Problem
The DAG runs a `SELECT` query for each (license, license version) pair. Each such query takes at least 30 minutes, while the update query runs for about 2-3 minutes.
Description
Instead of running one `SELECT` query per license pair, it's better to select all items (identifier, license and license_version) with `NULL` in `meta_data`, then group them by license in Python and run the corresponding `UPDATE` queries.
Alternatives
We could also, in the first step, select all items where `meta_data` is `NULL` (identifier, license and license_version), then add the `license_url` to each row using Python, and in the next step run the update queries for each row individually.
I'm not sure if running so many individual queries would be faster than the current batch approach.
Additional context
Details on the run of the previous version of the DAG are here: WordPress/openverse-catalog#1005 (comment)