Save cleaned up data during the cleanup step #904
Conversation
Full-stack documentation: Ready (https://WordPress.github.io/openverse/_preview/904). Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
This appears to have worked perfectly for me locally using the sample images file you shared.
Should we switch to using that sample images file permanently to ensure this feature is tested on a regular basis?
Here are the sample output logs I found to show it working:
openverse-ingestion_server-1 | 2023-03-13 22:32:11,922 INFO cleanup.py:258 - TLS cache: {'www.flickr.com': True, 'commons.wikimedia.org': True, 'https://www.eol.org/': True, '.geograph.org.uk': True, '.eol.org': True, '.digitaltmuseum.org': True, 'www.geograph.org.uk': True, 'www.eol.org': True}
openverse-ingestion_server-1 | 2023-03-13 22:32:11,922 INFO cleanup.py:259 - Worker committing changes...
openverse-ingestion_server-1 | 2023-03-13 22:32:11,923 INFO cleanup.py:265 - Worker finished batch in 3.2522239685058594
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/M965.199.10008.jpg:403
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca
openverse-ingestion_server-1 | 2023-03-13 22:32:14,006 INFO cleanup.py:243 - Updated url for 74454cfd-489d-4c7a-bdda-d7eef06d6d2b from '{dirty_value}' to '{clean}'
openverse-ingestion_server-1 | 2023-03-13 22:32:14,007 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/5344.jpg:403
openverse-ingestion_server-1 | 2023-03-13 22:32:14,007 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca
Are there any useful unit tests for us to add for this change? I'm not requesting changes because I'm not certain that testing log output is 100% necessary. However, if we're going to rely on it for analysis or something similar, it might be good to add a unit test to reinforce the expected format.
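For illustration, a minimal sketch of what pinning the log format down in a test could look like; the pattern and test below are assumptions for the sake of the example, not code from this PR:

```python
import re

# Hypothetical pattern matching the "Updated url for ..." lines quoted above;
# a test like this would only matter if we decide to rely on the log format.
UPDATED_URL_PATTERN = re.compile(
    r"Updated url for (?P<identifier>[0-9a-f-]{36}) from '(?P<old>.*)' to '(?P<new>.*)'"
)


def test_updated_url_log_format():
    sample = (
        "Updated url for 74454cfd-489d-4c7a-bdda-d7eef06d6d2b "
        "from '{dirty_value}' to '{clean}'"
    )
    match = UPDATED_URL_PATTERN.fullmatch(sample)
    assert match is not None
    assert match["identifier"] == "74454cfd-489d-4c7a-bdda-d7eef06d6d2b"
```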
# We know that flickr and wikimedia support TLS, so we can add them here
TLS_CACHE = {
    "www.flickr.com": True,
    "commons.wikimedia.org": True,
    "https://www.eol.org/": True,
    ".geograph.org.uk": True,
    ".eol.org": True,
    ".digitaltmuseum.org": True,
    "www.geograph.org.uk": True,
}
How did the others get added? Are they similar to Flickr and Wikimedia in that we just know that they do support TLS?
If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?
Would moving this into Redis make sense at all (as an entirely separate issue) or is there a different future change that would persist this TLS support status?
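(Purely as an illustration of the Redis idea, not something this PR implements: with redis-py, persisting the per-domain status could look roughly like this, assuming a Redis instance the ingestion server can reach. The key name and connection details are placeholders.)

```python
import redis

# Hypothetical persistence of the TLS support cache in a Redis hash, so the
# status would survive across cleanup runs.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def get_tls_support(domain: str) -> bool | None:
    value = r.hget("tls_support", domain)
    return None if value is None else value == "1"


def set_tls_support(domain: str, supported: bool) -> None:
    r.hset("tls_support", domain, "1" if supported else "0")
```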
I've added these manually, by looking through the logs and adding the ones that were being tested. Your logs suggest that `.musee-mccord.qc.ca` should have also been added :)
If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?
In short, I think that the TLS support check, the way it's done right now, should go away after we finish the cleanup step (1-2 refreshes to get the updated TSV and 1-2 update DAG runs could be enough).
We do not test the URLs that have an insecure `http` scheme for TLS support. The main reason we were testing for TLS support was to add a best scheme to the URLs that don't have one (to convert URLs like `www.flickr.com/image/path` to `https://www.flickr.com/image/path`). There are not so many such rows in the database, mainly the ones that were ingested before the `ImageStore` improvements in the catalog and were not re-ingested. When we use the TSV from the cleanup run to update the catalog, all of the URLs will have a scheme, whether it is `http` or `https`, so the URL cleanup function will be parsing the URL and not running the TLS support checks because the scheme will not be `""`.

Do you think we should monitor for TLS support status? If so, I think this should be a separate issue. We could add a cleanup function to test domains with `http` for TLS support and report all of them, and then test them and update the URLs if the support changes.
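To make the behaviour described above concrete, here is a rough sketch of the scheme-fixing idea. It is not the actual cleanup.py code; the HEAD probe, timeout, and function names are assumptions for illustration only.

```python
import requests
from urllib.parse import urlparse

# Domains we already know support TLS; probed domains are added as we go.
TLS_CACHE = {"www.flickr.com": True, "commons.wikimedia.org": True}


def supports_tls(domain: str) -> bool:
    # Probe each unknown domain once over HTTPS and remember the result.
    if domain not in TLS_CACHE:
        try:
            requests.head(f"https://{domain}", timeout=2)
            TLS_CACHE[domain] = True
        except requests.RequestException:
            TLS_CACHE[domain] = False
    return TLS_CACHE[domain]


def add_scheme(url: str) -> str:
    # URLs that already have http/https are left for normal URL parsing;
    # only schemeless URLs need the TLS probe.
    if urlparse(url).scheme != "":
        return url
    domain = urlparse(f"//{url}").netloc
    scheme = "https" if supports_tls(domain) else "http"
    return f"{scheme}://{url}"
```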
Thanks for the explanation and motivation for this feature.
Do you think we should monitor for TLS support status?
If we're hand-maintaining the list then yes, I think we should revisit it periodically or else the cleanup step here will apply the incorrect transformations.
I agree that that is out of scope of this issue, sort of, but if we're expanding the list of sites we're automatically applying the transformation to, then it does make the matter slightly more pressing as the area of effect is slightly wider. The added providers are small though, I think, so it's negligible. In any case, I agree it's a separate issue. I wanted to mention it in case it needs to be explicitly documented as such in a GitHub issue.
If it's something that will go away soon though, due to some other mechanism that will render this step unnecessary, then we can ignore it altogether and just document in the code that it's a temporary hold-over.
Oh, one requested change. I just noticed my git state was messy after reviewing this PR. Can we add the files to .gitignore so they don't appear locally as changes?
I'm not sure. The end goal for these changes is to remove the cleanup step, or at least to remove the functions that we can remove. So, optimally, we would not have any data in the catalog with schemeless URLs or with denylisted or badly-formed tags. Then, this sample data would be wrong. Does it make sense to add these rows to the sample data until we update the catalog, and remove them after we're done?
Perfect! Then, do you have a plan for the produced files here?
No, actually that's what I need help with. What's the best way of getting these TSV files from the Ingestion server to somewhere where the catalog can use them? I assume we should upload them to S3. Would it be more practical to somehow do it manually to avoid managing secrets here? @krysal? @AetherUnbound?
We might be able to set the permissions of the EC2 boxes to allow them to upload to a specific S3 bucket without needing credentials (I think). https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/
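For reference, a minimal sketch of what that could look like with boto3. With no explicit credentials passed, boto3 falls back to the EC2 instance profile, so this only works if the box's IAM role allows s3:PutObject on the bucket; the file, bucket, and key names below are placeholders, not the real ones.

```python
import boto3

# No access keys are configured here on purpose: on an EC2 box with an
# attached instance role, boto3 picks up temporary credentials automatically.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/cleaned_url.tsv",     # TSV produced by the cleanup step (placeholder path)
    Bucket="openverse-example-bucket",   # placeholder bucket name
    Key="cleanup/cleaned_url.tsv",       # placeholder object key
)
```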
I am going to merge this PR as is, and we can download the files from the box and upload them to S3 manually. Hopefully, we will remove this process soon after we clean up the data, so this is an acceptable solution for now.
@obulat Sounds good. The unit tests for the ingestion server failed; I've re-run them to see if they pass. Is there an issue for tracking the follow-up work?
Size Change: 0 B · Total Size: 882 kB
This can certainly be merged as-is, but I think it would be pretty straightforward to upload the files to S3! As Sara mentions, I don't think it'd require any explicit permissions management, perhaps besides some IAM/role changes. We could even make a new …
Fixes
Fixes #861 by @krysal
Fixes #654 by @obulat
Description
This PR adds more logging for the data refresh, but its main goal is to be a proof-of-concept of saving the data during weekly data refresh as a preparation step for data normalization.
Data refresh image cleanup steps:
This PR also adds a Wikimedia title cleanup step that removes the File: prefix and the file extension suffix from the image title. This step was added because in the Openverse Inserter PR it was specifically pointed out that those titles are bad for UX. The Wikimedia title cleanup step can be added later, during the second run of this PR in prod. There is also a step that we need to add to the cleanup process for incorrect utf-8 tags, but I think we should add it in a later refresh (gist with the implementation) so that the cleanup step does not become much longer.
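As an illustration of that step (not the PR's actual implementation), the title cleanup could look roughly like this; the set of extensions stripped is an assumption:

```python
import re

def clean_wikimedia_title(title: str) -> str:
    # Strip a leading "File:" prefix and a trailing image file extension.
    title = re.sub(r"^File:", "", title)
    title = re.sub(r"\.(jpg|jpeg|png|gif|svg|tiff?)$", "", title, flags=re.IGNORECASE)
    return title

assert clean_wikimedia_title("File:Black-headed gull.jpg") == "Black-headed gull"
```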
This PR saves one file per cleaned field in TSV format. The files contain the image identifier and the cleaned data. I don't know where the best place to save them is - all suggestions welcome!
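A minimal sketch of the "one TSV file per cleaned field" idea described above; the file naming and exact columns are assumptions, not the PR's actual output paths:

```python
import csv

def save_cleaned_rows(field: str, rows: list[tuple[str, str]]) -> None:
    # Append one row per cleaned value: the image identifier and the new value.
    with open(f"{field}.tsv", "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for identifier, cleaned_value in rows:
            writer.writerow([identifier, cleaned_value])

save_cleaned_rows(
    "url",
    [("74454cfd-489d-4c7a-bdda-d7eef06d6d2b", "https://example.com/image.jpg")],
)
```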
Testing Instructions
Replace `sample_data/sample_images.csv` with the file in this gist (https://gist.github.com/obulat/b31e43b131352b8f6cd66a2dd87061d8), and run `just recreate` (or `just start` -> `just init`, if you haven't run the API before). You should see the TSV files recreated and logging about the cleaned fields. Also, check the logs about updated fields/values and TLS_CACHE.
Checklist
- My pull request has a descriptive title (not a vague title like Update index.md).
- My pull request targets the default branch of the repository (main) or a parent feature branch.
- I tried running the project locally and verified that there are no visible errors.
Developer Certificate of Origin