Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CV2-5370] media text split #460

Merged
merged 14 commits into from
Oct 14, 2024
Merged

[CV2-5370] media text split #460

merged 14 commits into from
Oct 14, 2024

Conversation

DGaffney
Copy link
Contributor

@DGaffney DGaffney commented Oct 10, 2024

Description

Small tweaks to make full end-to-end tiplines work locally

  • Make all env_files consistent with mock s3 variables,
  • Move the sync endpoint for text to use the blocking requests instead of legacy alegre-based fingerprinting,
  • Making some minor refactors in elastic_crud.py so that we return the necessary data when running through pending fingerprinting tasks,
  • Bypassing fields that we should never check against when querying objects (since they are either purely metadata or globally unique),
  • Adding the paraphrase model,
  • Structuring sync requests so that they can be properly immediately searched on,
  • Clean up the response objects in return_sources to dis-embed the keys we need to send back to Check API,
  • Test changes to accommodate all the above

Reference: CV2-5370

How has this been tested?

Tested extensively locally and confirmed to work

Have you considered secure coding practices when writing this code?

None

@skyemeedan
Copy link
Contributor

If you can give some more detail in the description about what the changes are intended to do, it would help me with reviewing more quickly. Like it looks there is some simple naming refactoring, plus moving some calls to async blocking, adding paraphrase, and calling a separate endpoint for text based queries (I'm guessing this is the "media text split") in the title. is that right?

.env_file.test Show resolved Hide resolved
app/main/lib/elasticsearch.py Show resolved Hide resolved
app/main/lib/similarity.py Show resolved Hide resolved
app/main/lib/text_similarity.py Outdated Show resolved Hide resolved
app/main/lib/text_similarity.py Show resolved Hide resolved
@DGaffney
Copy link
Contributor Author

If you can give some more detail in the description about what the changes are intended to do, it would help me with reviewing more quickly. Like it looks there is some simple naming refactoring, plus moving some calls to async blocking, adding paraphrase, and calling a separate endpoint for text based queries (I'm guessing this is the "media text split") in the title. is that right?

That's basically right - I would summarize this as:

  • Make all env_files consistent with mock s3 variables,
  • Move the sync endpoint for text to use the blocking requests instead of legacy alegre-based fingerprinting,
  • Making some minor refactors in elastic_crud.py so that we return the necessary data when running through pending fingerprinting tasks,
  • Bypassing fields that we should never check against when querying objects (since they are either purely metadata or globally unique),
  • Adding the paraphrase model,
  • Structuring sync requests so that they can be properly immediately searched on,
  • Clean up the response objects in return_sources to dis-embed the keys we need to send back to Check API,
  • Test changes to accommodate all the above

Roughly in order of reading through the diff page.

@DGaffney DGaffney requested a review from skyemeedan October 12, 2024 15:26
@skyemeedan skyemeedan changed the title Cv2 5370 media text split [CV2-5370] media text split Oct 14, 2024
.env_file.test Show resolved Hide resolved
app/main/lib/elasticsearch.py Show resolved Hide resolved
@DGaffney DGaffney merged commit ce9b5c7 into develop Oct 14, 2024
4 checks passed
@DGaffney DGaffney deleted the cv2-5370-media-text-split branch October 14, 2024 15:08
Copy link

sentry-io bot commented Oct 18, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='aos-0e790625b530-irilyx4... api.similarity_similarity_resource View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants