Save cleaned up data during the cleanup step #904
Conversation
Full-stack documentation: Ready (https://WordPress.github.io/openverse/_preview/904). Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
This appears to have worked perfectly for me locally using the sample images file you shared.
Should we switch to using that sample images file permanently to ensure this feature is tested on a regular basis?
Here are the sample output logs I found to show it working:
openverse-ingestion_server-1 | 2023-03-13 22:32:11,922 INFO cleanup.py:258 - TLS cache: {'www.flickr.com': True, 'commons.wikimedia.org': True, 'https://www.eol.org/': True, '.geograph.org.uk': True, '.eol.org': True, '.digitaltmuseum.org': True, 'www.geograph.org.uk': True, 'www.eol.org': True}
openverse-ingestion_server-1 | 2023-03-13 22:32:11,922 INFO cleanup.py:259 - Worker committing changes...
openverse-ingestion_server-1 | 2023-03-13 22:32:11,923 INFO cleanup.py:265 - Worker finished batch in 3.2522239685058594
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/M965.199.10008.jpg:403
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:243 - Updated url for 74454cfd-489d-4c7a-bdda-d7eef06d6d2b from '{dirty_value}' to '{clean}'
openverse-ingestion_server-1 | 2023-03-13 22:32:14,007 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/5344.jpg:403
openverse-ingestion_server-1 | 2023-03-13 22:32:14,007 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca
Are there any useful unit tests for us to add for this change? I'm not requesting changes because I'm not certain that testing log output is 100% necessary. However, if we're going to rely on it for analysis or something similar, it might be good to add a unit test to reinforce the expected format.
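For illustration, a minimal sketch of what pinning the log format down in a test could look like; the pattern and test below are assumptions for the sake of the example, not code from this PR:

```python
import re

# Hypothetical pattern matching the "Updated url for ..." lines quoted above;
# a test like this would only matter if we decide to rely on the log format.
UPDATED_URL_PATTERN = re.compile(
    r"Updated url for (?P<identifier>[0-9a-f-]{36}) from '(?P<old>.*)' to '(?P<new>.*)'"
)


def test_updated_url_log_format():
    sample = (
        "Updated url for 74454cfd-489d-4c7a-bdda-d7eef06d6d2b "
        "from '{dirty_value}' to '{clean}'"
    )
    match = UPDATED_URL_PATTERN.fullmatch(sample)
    assert match is not None
    assert match["identifier"] == "74454cfd-489d-4c7a-bdda-d7eef06d6d2b"
```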
# We know that flickr and wikimedia support TLS, so we can add them here
TLS_CACHE = {
    "www.flickr.com": True,
    "commons.wikimedia.org": True,
    "https://www.eol.org/": True,
    ".geograph.org.uk": True,
    ".eol.org": True,
    ".digitaltmuseum.org": True,
    "www.geograph.org.uk": True,
}
How did the others get added? Are they similar to Flickr and Wikimedia in that we just know that they do support TLS?
If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?
Would moving this into Redis make sense at all (as an entirely separate issue) or is there a different future change that would persist this TLS support status?
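(Purely as an illustration of the Redis idea, not something this PR implements: with redis-py, persisting the per-domain status could look roughly like this, assuming a Redis instance the ingestion server can reach. The key name and connection details are placeholders.)

```python
import redis

# Hypothetical persistence of the TLS support cache in a Redis hash, so the
# status would survive across cleanup runs.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def get_tls_support(domain: str) -> bool | None:
    value = r.hget("tls_support", domain)
    return None if value is None else value == "1"


def set_tls_support(domain: str, supported: bool) -> None:
    r.hset("tls_support", domain, "1" if supported else "0")
```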
I've added these manually, by looking through the logs and adding the ones that were being tested. Your logs suggest that `.musee-mccord.qc.ca` should have also been added :)
If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?
In short, I think that the TLS support check, the way it's done right now, should go away after we finish the cleanup step (1-2 refreshes to get the updated TSV and 1-2 update DAG runs could be enough).
We do not test the URLs that have an insecure `http` scheme for TLS support. The main reason we were testing for TLS support was to add a best scheme to the URLs that don't have one (to convert URLs like `www.flickr.com/image/path` to `https://www.flickr.com/image/path`). There are not so many such rows in the database, mainly the ones that were ingested before the `ImageStore` improvements in the catalog and were not re-ingested. When we use the TSV from the cleanup run to update the catalog, all of the URLs will have a scheme, whether it is `http` or `https`, so the URL cleanup function will be parsing the URL and not running the TLS support checks because the scheme will not be `""`.

Do you think we should monitor for TLS support status? If so, I think this should be a separate issue. We could add a cleanup function to test domains with `http` for TLS support and report all of them, and then test them and update the URLs if the support changes.
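To make the behaviour described above concrete, here is a rough sketch of the scheme-fixing idea. It is not the actual cleanup.py code; the HEAD probe, timeout, and function names are assumptions for illustration only.

```python
import requests
from urllib.parse import urlparse

# Domains we already know support TLS; probed domains are added as we go.
TLS_CACHE = {"www.flickr.com": True, "commons.wikimedia.org": True}


def supports_tls(domain: str) -> bool:
    # Probe each unknown domain once over HTTPS and remember the result.
    if domain not in TLS_CACHE:
        try:
            requests.head(f"https://{domain}", timeout=2)
            TLS_CACHE[domain] = True
        except requests.RequestException:
            TLS_CACHE[domain] = False
    return TLS_CACHE[domain]


def add_scheme(url: str) -> str:
    # URLs that already have http/https are left for normal URL parsing;
    # only schemeless URLs need the TLS probe.
    if urlparse(url).scheme != "":
        return url
    domain = urlparse(f"//{url}").netloc
    scheme = "https" if supports_tls(domain) else "http"
    return f"{scheme}://{url}"
```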
Thanks for the explanation and motivation for this feature.
Do you think we should monitor for TLS support status?
If we're hand-maintaining the list then yes, I think we should revisit it periodically or else the cleanup step here will apply the incorrect transformations.
I agree that that is out of scope of this issue, sort of, but if we're expanding the list of sites we're automatically applying the transformation to, then it does make the matter slightly more pressing as the area of effect is slightly wider. The added providers are small though, I think, so it's negligible. In any case, I agree it's a separate issue. I wanted to mention it in case it needs to be explicitly documented as such in a GitHub issue.
If it's something that will go away soon though, due to some other mechanism that will render this step unnecessary, then we can ignore it altogether and just document in the code that it's a temporary hold-over.
Oh, one requested change. I just noticed my git state was messy after reviewing this PR. Can we add the files to .gitignore so they don't appear locally as changes?
I'm not sure. The end goal for these changes is to remove the cleanup step, or at least to remove the functions that we can remove. So, optimally, we would not have any data in the catalog with schemeless URLs or with denylisted or badly-formed tags. Then, this sample data would be wrong. Does it make sense to add these rows to the sample data until we update the catalog, and remove them after we're done?
Perfect! Then, do you have a plan for the produced files here?
No, actually that's what I need help with. What's the best way of getting these TSV files from the Ingestion server to somewhere where the catalog can use them? I assume we should upload them to S3. Would it be more practical to somehow do it manually to avoid managing secrets here? @krysal? @AetherUnbound?
We might be able to set the permissions of the EC2 boxes to allow them to upload to a specific S3 bucket without needing credentials (I think). https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/
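For reference, a minimal sketch of what that could look like with boto3. With no explicit credentials passed, boto3 falls back to the EC2 instance profile, so this only works if the box's IAM role allows s3:PutObject on the bucket; the file, bucket, and key names below are placeholders, not the real ones.

```python
import boto3

# No access keys are configured here on purpose: on an EC2 box with an
# attached instance role, boto3 picks up temporary credentials automatically.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/cleaned_url.tsv",     # TSV produced by the cleanup step (placeholder path)
    Bucket="openverse-example-bucket",   # placeholder bucket name
    Key="cleanup/cleaned_url.tsv",       # placeholder object key
)
```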
I am going to merge this PR as is, and we can download the files from the box and upload them to S3 manually. Hopefully, we will remove this process soon after we clean up the data, so this is an acceptable solution for now.
@obulat Sounds good. The unit tests for the ingestion server failed; I've re-run them to see if they pass. Is there an issue for tracking the follow-up work?
Size Change: 0 B · Total Size: 882 kB
This can certainly be merged as-is, but I think it would be pretty straightforward to upload the files to S3! As Sara mentions, I don't think it'd require any explicit permissions management, perhaps besides some IAM/role changes. We could even make a new …
Fixes
Fixes #861 by @krysal
Fixes #654 by @obulat
Description
This PR adds more logging for the data refresh, but its main goal is to be a proof-of-concept of saving the data during weekly data refresh as a preparation step for data normalization.
Data refresh image cleanup steps:
This PR also adds a Wikimedia title cleanup step that removes the File: prefix and the file extension suffix from the image title. This step was added because in the Openverse Inserter PR it was specifically pointed out that those titles are bad for UX. The Wikimedia title cleanup step can be added later, during the second run of this PR in prod. There is also a step that we need to add to the cleanup process for incorrect utf-8 tags, but I think we should add it in a later refresh (gist with the implementation) so that the cleanup step does not become much longer.
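As an illustration of that step (not the PR's actual implementation), the title cleanup could look roughly like this; the set of extensions stripped is an assumption:

```python
import re

def clean_wikimedia_title(title: str) -> str:
    # Strip a leading "File:" prefix and a trailing image file extension.
    title = re.sub(r"^File:", "", title)
    title = re.sub(r"\.(jpg|jpeg|png|gif|svg|tiff?)$", "", title, flags=re.IGNORECASE)
    return title

assert clean_wikimedia_title("File:Black-headed gull.jpg") == "Black-headed gull"
```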
This PR saves one file per cleaned field in TSV format. The files contain the image identifier and the cleaned data. I don't know where the best place to save them is - all suggestions welcome!
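A minimal sketch of the "one TSV file per cleaned field" idea described above; the file naming and exact columns are assumptions, not the PR's actual output paths:

```python
import csv

def save_cleaned_rows(field: str, rows: list[tuple[str, str]]) -> None:
    # Append one row per cleaned value: the image identifier and the new value.
    with open(f"{field}.tsv", "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for identifier, cleaned_value in rows:
            writer.writerow([identifier, cleaned_value])

save_cleaned_rows(
    "url",
    [("74454cfd-489d-4c7a-bdda-d7eef06d6d2b", "https://example.com/image.jpg")],
)
```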
Testing Instructions
Replace `sample_data/sample_images.csv` with the file in this gist (https://gist.github.com/obulat/b31e43b131352b8f6cd66a2dd87061d8), and run `just recreate` (or `just start` -> `just init`, if you haven't run the API before). You should see the TSV files recreated and logging about the cleaned fields. Also, check the logs about updated fields/values and TLS_CACHE.
Checklist
- My pull request has a descriptive title (not a vague title like Update index.md).
- My pull request targets the default branch of the repository (main) or a parent feature branch.
- I tried running the project locally and verified that there are no visible errors.
Developer Certificate of Origin