CV2-4953: Storing ClassyCat data in Alegre #105
base: master
Conversation
# Conflicts:
#	lib/model/classycat_classify.py
Did this work when running locally with Alegre (i.e. a manual integration test)?
- I think it needs the key to specify which model will be used for vectorization
Comment:
I'm still not entirely sure why we are storing this again in Alegre. I believe the text, the associated vector, the cluster, and the labels are already stored in Alegre or the calling systems by the time we call ClassyCat. Have we ruled out appending an additional 'context' to existing items to store the labels? Or having a separate "label propagation" function in ClassyCat that accepts an item and a list of items with associated labels (assumed to be in a cluster), and returns the propagated label, or a new one if none is available?
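The second alternative could be sketched roughly like this. Note this is purely illustrative: the `propagate_label` name, the `label` field, and the majority-vote rule are assumptions, not anything specified in this PR.

```python
from collections import Counter

# Hypothetical sketch of the "label propagation" idea: given an item and
# its cluster-mates (each possibly carrying a label), return the most
# common existing label, or None so the caller knows to classify afresh.
def propagate_label(item, cluster_items):
    labels = [ci['label'] for ci in cluster_items if ci.get('label')]
    if not labels:
        return None  # no labels in the cluster yet; classify from scratch
    return Counter(labels).most_common(1)[0][0]
```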
```python
{'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
 'content': items[i]['text'],
 'context': {
     'input_id': items[i]['id'],
```
I think it would be helpful to put a `'source': 'ClassyCat'` here, so if we later want to query these objects there is a key to do it.
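For illustration, a document entry carrying that key might look like the following. This is a minimal sketch: the `items`, `final_results`, and `schema_id` values are stand-ins for the real variables in `classycat_classify.py`.

```python
import uuid

# Stand-in inputs mirroring the shapes used in the snippet above.
items = [{'id': 'item-1', 'text': 'some content'}]
final_results = [{'labels': ['label-a']}]
schema_id = 'schema-123'

doc = {
    'doc_id': str(uuid.uuid4()),  # unique id, independent of the input id
    'content': items[0]['text'],
    'context': {
        'source': 'ClassyCat',  # lets later queries filter on context.source
        'input_id': items[0]['id'],
        'labels': final_results[0]['labels'],
        'schema_id': schema_id,
    },
}
```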
```python
# call alegre endpoint to store the results: /text/bulk_similarity/
alegre_url = os.getenv('ALEGRE_URL')
httpx.post(alegre_url + '/text/bulk_similarity/', json=final_results_to_be_stored_in_alegre)
```
ooh, does this insert endpoint work? maybe I can switch timpani to use it as well when we convert to batch mode!
We think so, but @DGaffney is currently moving all vectorization to Presto and will need to ensure this endpoint continues to work after that migration 😅
Makes sense, Scott; I added a comment on Devin's ticket to make sure he sees it.
@skyemeedan as of this moment the endpoint works fine locally for me, feel free to test it out!
I'm not sure we want to support this after we move to Presto vectorization. This is a blocking, bulk query, which breaks the expectations of Presto-based vectorization in two ways (Presto calls are non-blocking, with a single query per item). This is, to my knowledge, the only place in Check (aside from the re-index job in Check-API) that uses bulk_similarity, and if we could get away from this pattern, that would be vastly preferable. Is there any way we can move off of it? Otherwise we'll be signing up for complicated new measures to support this long term.
- If bulk_similarity is not going to be supported, then I think we can't build the ClassyCat feature on top of it, and we will need to find another approach for this PR?
- My memory is that we decided everything would be 'default bulk', so we probably need to adjust the design of new endpoints to support bulk?
I don't think it is reasonable to take on making async endpoints default bulk until after we have made text work on Presto. We've shunted in secondary features to this refactor before and the net effect is that they have greatly complicated our existing migration plan.
If supporting bulk requires some refactoring of the Presto payload structures, wouldn't it be easier to do that before we support full text processing in live?
```python
'input_id': items[i]['id'],
'labels': final_results[i]['labels'],
'schema_id': schema_id,
'model_name': self.llm_client.model_name}}
```
I think this needs to be the model name for the text vectorization? i.e. "paraphrase-multilingual-mpnet-base-v2"?
These are the parameters timpani used:
https://github.com/meedan/timpani/blob/6021d4fcb251d83ae48ed2c1566a16fad6971450/timpani/model_service/alegre_wrapper_service.py#L122
Within `context`, `model_name` (or any parameter) can be anything. We will, however, need to specify the `models` parameter for Alegre itself to know how to do vectorization. That parameter is currently missing.
Good catch, Scott. I did not know that we should specify the model, and locally the documents are being indexed only by Elasticsearch (full-text search). I will update the code and redo the tests to make sure vectorization works locally.
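As a rough sketch, the bulk payload with the vectorization model named might look like the following. The top-level `models` key and the model name are assumptions based on this thread (and the timpani parameters linked above); the actual parameter name should be checked against Alegre's `/text/bulk_similarity/` handler.

```python
import uuid

# Stand-in input mirroring the shapes used in the PR.
items = [{'id': 'item-1', 'text': 'some content'}]

payload = {
    'documents': [
        {
            'doc_id': str(uuid.uuid4()),
            'content': item['text'],
            'context': {'input_id': item['id']},
        }
        for item in items
    ],
    # Assumed parameter: without naming a model, documents are only
    # full-text indexed by Elasticsearch and never vectorized.
    'models': ['paraphrase-multilingual-mpnet-base-v2'],
}
```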
```python
# content is text, doc_id is the item's unique id, and context is input id, labels, schema_id, and model name
final_results_to_be_stored_in_alegre = {'documents': [
    {'doc_id': str(uuid.uuid4()), # adding a unique id for each item to not rely on the input id for uniqueness
     'content': items[i]['text'],
```
For the `/text/similarity` endpoint this parameter is `text` and not `content`. Let's double check this is the correct name for the bulk endpoint.
At least for `/text/bulk_similarity/` we replace `text` with `content`:

```python
# Rename "text" to "content" if present
if 'text' in params:
    params['content'] = params.get('text')
    del params["text"]
```

We can use both `content` and `text` for bulk, but it ultimately gets renamed to `content` in Alegre.
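That rename can be exercised in isolation; here is a minimal standalone snippet mirroring the Alegre logic quoted above (the sample `params` values are made up):

```python
# Mirror of the rename Alegre performs for the bulk endpoint: a
# caller-supplied 'text' key ends up stored under 'content'.
params = {'text': 'hello', 'doc_id': 'abc'}
if 'text' in params:
    params['content'] = params.get('text')
    del params['text']
```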
We plan to use Alegre for KNN similarity queries; since we already have Open/Elasticsearch set up for that, using the service makes a lot of sense. I agree that there are optimizations to be made here, but I'd rather not concern ClassyCat with those at the moment; we are still trying to achieve basic functionality in this ticket.

Also, to address the issue of the items already being stored in Alegre prior to calling ClassyCat: I think it's a lot easier to make sure that Alegre updates existing records rather than adding duplicate ones, instead of adding additional endpoints and complexity to ClassyCat. We can do that by allowing an optional "doc_id" for each record submitted to ClassyCat. If that field exists in the call, I will pass it on to Alegre (as opposed to generating a new unique id) and ensure that Alegre does "upserts" rather than "inserts" into the DB. As mentioned before, I think this is work for another time and should be a separate ticket.
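The optional `doc_id` pass-through proposed above could be sketched like this. The `build_documents` helper name and the input shape are hypothetical, not part of the PR; the point is only the reuse-or-mint decision per item.

```python
import uuid

# Hypothetical sketch: if the caller already stored an item in Alegre,
# reuse its doc_id so the write becomes an upsert; otherwise mint a
# fresh uuid as the PR currently does for every item.
def build_documents(items):
    docs = []
    for item in items:
        docs.append({
            # reuse the caller's doc_id when provided, else generate one
            'doc_id': item.get('doc_id') or str(uuid.uuid4()),
            'content': item['text'],
        })
    return docs

docs = build_documents([
    {'doc_id': 'existing-123', 'text': 'already stored in Alegre'},
    {'text': 'brand new item'},
])
```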
Let's pretty pretty please not use bulk_similarity for this. I know we went through how to use it the other day, but I didn't fully appreciate that this was part of new functionality that would be coming through a PR.
Description
Store successful classification results of ClassyCat in Alegre
Reference: CV2-4953
How has this been tested?
Has been tested locally and in unit tests.
Are there any external dependencies?
New SSM parameters need to be added to Presto for the Alegre endpoint URL
Have you considered secure coding practices when writing this code?
None