
Refactoring parse_message() #106

Merged · 21 commits · Aug 26, 2024
Conversation

@ashkankzme (Contributor) commented Aug 13, 2024

Description

The goal of this ticket is to refactor the generic presto parse_message() function into model-level input validations, with separate parse_input_message() and parse_output_message() functions instead of a single parse_message() handling both input and output.

Reference: CV2-5001

How has this been tested?

Has it been tested locally? Are there automated tests?

Are there any external dependencies?

Are there changes required in sysops terraform for this feature or fix?

Have you considered secure coding practices when writing this code?

Please list any security concerns that may be relevant.

@skyemeedan (Contributor) left a comment:

Overall, I think the approach looks great and easy to implement for each model.

> i don't think we should have code that checks responses have specific fields in their bodies (callback_url, id, etc) bc i think that would be useful as a unit test, but we don't want to catch those errors in production.

Maybe a different design philosophy, but I would think we do want to validate precisely these fields, especially in production. In my experience, in an async context, we want to "fail fast" before the message is submitted, while we still have the request open and can still tell the caller that something is wrong. We want to avoid errors when things are pulled off the queue, because the calling system won't know about them and they are hard to debug. (I think this is why messaging systems like Kafka are so strict about schemas.) And it seems like these are fields the presto system itself won't work without.

For example:

  • if the id is missing, I think we are going to hit an error downstream, likely in async code, but there will be no id, so it will be really hard to know where the error came from
  • if the callback_url is missing, we won't know until the model has processed the message and it is pulled off the response queue and tries to do the callback. Presto will be logging errors, but the caller is just going to see it as silently failing to return data.
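A minimal sketch of that fail-fast idea (hypothetical code, not from this PR; the function and constant names are illustrative, and the field names are just the ones from the examples above):

```python
from typing import Dict

REQUIRED_FIELDS = ("id", "callback_url")  # illustrative, per the examples above

def validate_before_enqueue(body: Dict) -> None:
    # Reject the message while the caller's request is still open, so the
    # error surfaces synchronously instead of as a silent async failure.
    missing = [f for f in REQUIRED_FIELDS if not body.get(f)]
    if missing:
        raise ValueError(f"message is missing required fields: {missing}")
```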


```diff
 class Message(BaseModel):
     body: GenericItem
     model_name: str
     retry_count: int = 0

-def parse_message(message_data: Dict) -> Message:
+def parse_input_message(message_data: Dict) -> Message:
     body_data = message_data['body']
```
Contributor:

I guess we are implicitly validating that there is a 'body' and 'model_name' as it will error here if missing ;-)

Contributor (author):

correct, those are the system-level rules enforced at the presto level. I will add validation that those two fields exist before passing the message on to the model-level validator.
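Presumably something along these lines (a sketch only, reusing the Message model quoted above; the actual check in the PR may differ):

```python
from typing import Dict

def parse_input_message(message_data: Dict) -> Message:
    # Presto-level rule: both fields must exist before the message is
    # handed to the model-level validator.
    for field in ("body", "model_name"):
        if field not in message_data:
            raise ValueError(f"input message is missing required field '{field}'")
    body_data = message_data["body"]
    ...
```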

lib/schemas.py (outdated diff)

```python
elif event_type == 'schema_lookup' or event_type == 'schema_create':
    result_instance = ClassyCatSchemaResponse(**result_data)

modelClass = get_class('lib.model.', os.environ.get('MODEL_NAME'))
```
Contributor:

why does this need to be different than the model_name associated with the message? do the names not quite align?

Contributor (author):

that's a great question. I copied this from create() in lib/model/model.py, but I agree that it makes more sense to use the model_name mentioned in the input, and I will update the code.

this function also has some other not-so-great logic that needs to be refactored down the line, such as assuming that every model_name input that is not video or yake will return a MediaResponse. I have outlined my ideas for this work in a separate ticket: https://meedan.atlassian.net/browse/CV2-5093
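The suggested change would presumably be a one-liner along these lines (a sketch; get_class is the existing helper from the snippet above, and message_data stands for the parsed message dict):

```python
# Resolve the class from the model_name carried by the message itself,
# rather than from the MODEL_NAME environment variable, so the two can
# never disagree.
modelClass = get_class('lib.model.', message_data['model_name'])
```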


```python
if event_type == 'schema_lookup':
    return ClassyCatSchemaResponse(**result_data)
```
Contributor:

seems like this needs to check that either schema_id or schema_name is non-empty? It looks like ClassyCatSchemaResponse considers schema_id optional?
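For illustration, a hypothetical pydantic (v2-style) validator for that rule; this is not the PR's actual code, and the class here is a simplified stand-in:

```python
from typing import Optional
from pydantic import BaseModel, model_validator

class ClassyCatSchemaResponse(BaseModel):  # simplified stand-in for the real class
    schema_name: Optional[str] = None
    schema_id: Optional[str] = None

    @model_validator(mode="after")
    def require_name_or_id(self):
        # At least one identifier must be present and non-empty.
        if not (self.schema_name or self.schema_id):
            raise ValueError("either schema_name or schema_id must be set")
        return self
```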

Contributor (author):

the class-specific validator checks that those fields exist; see classycat_schema_lookup.py.

however, this function as implemented before only searches by name, not by both name and id. maybe we can file a ticket for that if it's necessary?

Contributor:

I thought there was already a function for lookup by schema name and a separate one for id? (I just wasn't sure which this was)

New file:

```markdown
# How to make a presto model
## Your go-to guide, one-stop shop for writing your model in presto
```
Contributor:

This is really helpful! :-)

Collaborator:

Yes this rules

Collaborator:

Very needed. Thank you!



```python
class ClassyCatBatchClassificationResponse(ClassyCatResponse):
    classification_results: Optional[List[dict]] = []
```
Contributor:

seems like this should be non-optional? (must at least return an empty list?)

Contributor (author):

well, I think I mostly agree with you. however, these types are only enforced upon creation of these objects, and the response object starts out as an empty object waiting to be populated by the process() method. if we drop the Optional, then we would have to populate the fields on creation or give them defaults. I agree that down the line we want to do better typing, but for now this doesn't fit our message-processing model without some major refactoring work.

Contributor:

seems like the current default of [] would work?

Contributor (author):

yes, it will work for this specific case; I have updated the code, and the defaults now live in classycat_response.py. but I think overall this is a subtle problem with our specific use of pydantic in presto.
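For reference, the non-Optional variant discussed above could look like this in pydantic (a sketch; the base class here is a stand-in for the real one):

```python
from typing import List
from pydantic import BaseModel, Field

class ClassyCatResponse(BaseModel):  # stand-in for the real base class
    pass

class ClassyCatBatchClassificationResponse(ClassyCatResponse):
    # Non-optional, but still constructible empty: pydantic creates a
    # fresh list per instance via default_factory.
    classification_results: List[dict] = Field(default_factory=list)
```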

lib/model/model.py (resolved thread)
@skyemeedan (Contributor):

Also, I think it is super helpful to have an early draft PR like this to be able to discuss!!

@skyemeedan (Contributor):

oh wait, I'm realizing I missed a crucial word:

> i don't think we should have code that checks **responses** have specific fields in their bodies (callback_url, id, etc)

I agree we don't need to validate these fields in responses (if these are missing, they are about to fail anyway), but I think we need to validate them on inputs.

@DGaffney (Collaborator) left a comment:

Ok you've done a great job here but I have two things:

  1. I hate how much we're calling parse_input_message. Is there any way we can find a way to refactor so we don't call it a billion different places?
  2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do for unit tests for each fingerprinter and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

```diff
@@ -27,3 +27,18 @@ def process(self, audio: schemas.Message) -> Dict[str, Union[str, List[int]]]:
         finally:
             os.remove(temp_file_name)
         return {"hash_value": hash_value}
+
+    @classmethod
+    def validate_input(cls, data: Dict) -> None:
```
Collaborator:

Should we just be passing?

Contributor (author):

I think passing is equivalent to what we have right now for most of these models (with the exception of classycat), not counting the validations that happen inside schemas.py.

I do agree that we should start implementing them, and I have created a ticket for that work (CV2-5093), but the current design is backward compatible, so there is no need to implement these right now. unless you think it's urgent we address it?



```python
    @classmethod
    def parse_input_message(cls, data: Dict) -> Any:
```
Collaborator:

Should we do nothing?

Collaborator:

Edit: ah, I see what the issue is after reading your Jira message. I think we should talk through how not to just stub all of these on our next ML call, but in the meantime I think looking at what we typically unit-test for each model, and doing type checking based on the types of responses we're testing for, would be appropriate.
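That first pass might look something like this for a model whose inputs carry a URL (a sketch; the class and field names are assumptions for illustration, not from the PR):

```python
from typing import Dict

class AudioFingerprinter:  # illustrative stand-in for a model class
    @classmethod
    def validate_input(cls, data: Dict) -> None:
        # Basic type checks of the kind the unit tests already assert.
        url = data.get("url")
        if not isinstance(url, str) or not url:
            raise TypeError("expected a non-empty 'url' string in the input")
```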


```diff
@@ -32,34 +31,42 @@ class GenericItem(BaseModel):
     text: Optional[str] = None
     raw: Optional[Dict] = {}
     parameters: Optional[Dict] = {}
-    result: Optional[Union[ErrorResponse, MediaResponse, VideoResponse, YakeKeywordsResponse, ClassyCatSchemaResponse, ClassyCatBatchClassificationResponse]] = None
+    result: Optional[Any] = None
```
Collaborator:

Is there any way to not just do an Any without it being a big headache?

Contributor (author):

I mean, I agree, Any is not the best. but do we want to keep updating the schemas.py file for every new model? the dependency on model names is a bit awkward; we don't want lower-tier presto infra (e.g. schemas.py) to have upward dependencies, imo. is there a way to dynamically load these types without the massive headache of manually pasting them into a config file for every model? hmmm, not sure, maybe, but I'm not convinced it's worth our effort at this time. happy to discuss more.
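One hypothetical middle ground, purely as a sketch: have each model module export its response class under an agreed attribute name, so schemas.py knows only the convention and never the concrete model names (RESPONSE_CLASS is an invented convention, not an existing presto attribute):

```python
import importlib
from typing import Optional, Type

def response_class_for(model_name: str) -> Optional[Type]:
    # Hypothetical convention: each model module defines RESPONSE_CLASS.
    module = importlib.import_module(f"lib.model.{model_name}")
    return getattr(module, "RESPONSE_CLASS", None)
```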

@ashkankzme (Contributor, author):

> oh wait, I'm realizing I missed a crucial word:
>
> > i don't think we should have code that checks **responses** have specific fields in their bodies (callback_url, id, etc)
>
> I agree we don't need to validate these fields in responses (if these are missing, they are about to fail anyway), but I think we need to validate them on inputs.

yes, exactly. the quoted checks on inputs are already being enforced by pydantic in schemas.py, as I mention in my other comments too.

@ashkankzme (Contributor, author):

> Ok you've done a great job here but I have two things:
>
> 1. I hate how much we're calling parse_input_message. Is there any way we can find a way to refactor so we don't call it a billion different places?
> 2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do for unit tests for each fingerprinter and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

thanks Devin! I agree with the suggested solutions on both issues you raised. let's definitely discuss before making a decision, on the ML call or one-on-one, but my current hope is to keep the scope of this refactor limited and have separate tickets for both reducing parse_input_message() usage and implementing model-specific validations.

@ashkankzme marked this pull request as ready for review on August 19, 2024 16:45.

Merge commit resolved conflicts in:
  • lib/model/generic_transformer.py
  • test/lib/model/test_generic.py
@ashkankzme (Contributor, author) commented Aug 19, 2024:

> Ok you've done a great job here but I have two things:
>
> 1. I hate how much we're calling parse_input_message. Is there any way we can find a way to refactor so we don't call it a billion different places?
> 2. I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do for unit tests for each fingerprinter and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.

Per our convo today, I have created two tickets per your suggestion @DGaffney to address these issues in future work. Those tickets are CV2-5093 and CV2-5102.

@DGaffney self-requested a review on August 26, 2024 19:28.
@DGaffney (Collaborator) left a comment:

:shipit:

@ashkankzme merged commit 0b8857c into master on Aug 26, 2024. 2 checks passed.