Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture upsert strategy information in media properties doc #4521

Open
AetherUnbound opened this issue Jun 19, 2024 · 2 comments
Open

Capture upsert strategy information in media properties doc #4521

AetherUnbound opened this issue Jun 19, 2024 · 2 comments
Assignees
Labels
📄 aspect: text Concerns the textual material in the repository ✨ goal: improvement Improvement to an existing user-facing feature help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation 🐍 tech: python Involves Python

Comments

@AetherUnbound
Copy link
Collaborator

Description

When updating certain JSON fields in the catalog, we use a "merge" upsert strategy:

def _merge_jsonb_arrays(column: str) -> str:
return f"""{column} = COALESCE(
(
SELECT jsonb_agg(DISTINCT x)
FROM jsonb_array_elements(old.{column} || EXCLUDED.{column}) t(x)
),
EXCLUDED.{column},
old.{column}
)"""

This came up in a recent discussion about tags: #4475 (comment)

The catalog media properties expansion (#4366) added DB Column Type to the description of each field for easier referencing. We also add upsert_strategy to the table that gets generated at the top of the doc:

if self.upsert_strategy is None:
if self.python_type == "JSONColumn":
self.upsert_strategy = "merge_jsonb_objects"
elif self.python_type == "ArrayColumn":
self.upsert_strategy = "merge_array"
else:
self.upsert_strategy = "newest_non_null"

We should add an Upsert Strategy auto-generated field for each column description (similar to DB Column Type) which carries this information down to each field. It would also be helpful to have a section in the preamble/postamble which describes each update strategy and links to the code for each strategy in the columns.py file. The auto-generated field documentation can also then link to this description.

Alternatives

This was prompted by a recent mention about only new tags being added (and tags never being removed during ingestion). We could add a note explicitly to the tags column documentation instead, but it seems more prudent to do this programmatically not only because we can but because it would be more resilient going forward.

@AetherUnbound AetherUnbound added help wanted Open to participation from the community ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 📄 aspect: text Concerns the textual material in the repository 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation labels Jun 19, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jun 19, 2024
@obulat
Copy link
Contributor

obulat commented Jun 25, 2024

The upsert strategy is in the Python Column column of the table, because that is where we save this information:

Screenshot 2024-06-25 at 4 56 23 PM

Do you think it would be clearer to move the upsert strategy somewhere else?
I do agree with the idea of adding explanations for the strategies in a postamble.

@AetherUnbound
Copy link
Collaborator Author

Right! I mean also surfacing it in the individual descriptions, just like media types and DB column type:

image

@AetherUnbound AetherUnbound self-assigned this Oct 28, 2024
@openverse-bot openverse-bot moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository ✨ goal: improvement Improvement to an existing user-facing feature help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation 🐍 tech: python Involves Python
Projects
Status: 📅 To Do
Development

No branches or pull requests

2 participants