Add additional logging to the cleanup process #861

Closed
krysal opened this issue Mar 7, 2023 · 0 comments · Fixed by #904
Labels
🤖 aspect: dx Concerns developers' experience with the codebase ✨ goal: improvement Improvement to an existing user-facing feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: ingestion server Related to the ingestion/data refresh server

krysal commented Mar 7, 2023

Problem

Currently, the cleanup process records only the final number of cleaned rows. We want to know how many rows each type of cleanup affects (for example, fixing malformed URLs or filtering tags) so that we can develop a strategy for applying these changes to the upstream database and removing the cleanup steps from the ingestion server.
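
As a rough illustration of the kind of logging requested here, the following is a minimal sketch, not the ingestion server's actual code. The function names (`fix_url`, `filter_tags`, `clean_rows`), the row shape, and the tag denylist are all hypothetical; the only point is to count rows per cleanup type and log each count instead of a single total.

```python
import logging
from collections import Counter

log = logging.getLogger("cleanup_sketch")

# Hypothetical denylist, for illustration only.
TAG_DENYLIST = frozenset({"cc0"})


def fix_url(url):
    """Hypothetical URL cleanup: add a scheme when it is missing."""
    if url and not url.startswith(("http://", "https://")):
        return "https://" + url
    return url


def filter_tags(tags):
    """Hypothetical tag cleanup: drop tags that appear in the denylist."""
    return [tag for tag in tags if tag.lower() not in TAG_DENYLIST]


def clean_rows(rows):
    """Clean rows and count how many rows each cleanup type changed."""
    counts = Counter()
    cleaned = []
    for row in rows:
        new_row = dict(row)

        url = fix_url(row.get("url"))
        if url != row.get("url"):
            new_row["url"] = url
            counts["url_fixed"] += 1

        old_tags = row.get("tags") or []
        tags = filter_tags(old_tags)
        if tags != old_tags:
            new_row["tags"] = tags
            counts["tags_filtered"] += 1

        if new_row != row:
            counts["rows_changed"] += 1
        cleaned.append(new_row)

    # One log line per cleanup type, plus the overall total.
    for kind, number in sorted(counts.items()):
        log.info("cleanup type=%s rows=%d", kind, number)
    return cleaned, counts
```

Counts broken down this way would show which cleanup steps touch the most rows, which is the information needed to decide what to fix in the upstream database first.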

Additional context

Previous attempt at WordPress/openverse-api#1126

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 🤖 aspect: dx Concerns developers' experience with the codebase 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Mar 7, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Mar 7, 2023
@obulat obulat self-assigned this Mar 8, 2023
@obulat obulat moved this from 📋 Backlog to 🏗 In progress in Openverse Backlog Mar 15, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Openverse Backlog Mar 28, 2023
dhruvkb pushed a commit that referenced this issue Apr 14, 2023
* Retired module commoncrawl and retired the commoncrawl_utils test

* updated DAGs.md and test_dag_parsing.py as suggested in #861

* Remove ETL test module, additional documentation cleanup

* Delete more unused test files

* Remove unused testing buckets

* Update README.md

Co-authored-by: Olga Bulat
Co-authored-by: Meet Parekh