Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_license_url DAG is inefficient and fails due to timeout #1270

Closed
obulat opened this issue Mar 13, 2023 · 0 comments · Fixed by WordPress/openverse-catalog#1045
Closed
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Copy link
Contributor

obulat commented Mar 13, 2023

Problem

The DAG runs a SELECT query for each (license, license version) pair. Each such query lasts at least 30 minutes, while the update query runs for about 2-3 minutes.

Description

Instead of running one SELECT query per license pair, it's better to select all items (identifier, license and license_version) with NULL in meta_data, and then group them by license in Python and run the corresponding UPDATE queries.

Alternatives

We could also in the first step select all items with meta_data is NULL (identifier, license and license_version). Then, add the license_url to each row using Python, and then in the next step run the update queries for each row individually:

UPDATE image
SET meta_data=Json({license_url: get_license_info_from_license_pair(license, license_version)})
WHERE identifier=identifier::UUID

I'm not sure if running so many individual queries would be faster than the current batch approach.

Additional context

Details on the run of the previous version of the DAG are here: WordPress/openverse-catalog#1005 (comment)

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Mar 13, 2023
@obulat obulat self-assigned this Mar 13, 2023
@obulat obulat added 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed ✨ goal: improvement Improvement to an existing user-facing feature labels Mar 13, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat moved this from 📋 Backlog to ✅ Done in Openverse Backlog Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant