Refactoring parse_message() #106
Conversation
…ion for classycat
…d into appropriate model file. untested.
Overall, I think the approach looks great and easy to implement for each model.
I don't think we should have code that checks that responses have specific fields in their bodies (callback_url, id, etc.), because I think that would be useful as a unit test, but we don't want to catch those errors in production.
Maybe a different design philosophy, but I would think we do want to validate precisely these fields, especially in production. In my experience, in an async context, we want to "fail fast" before the message is submitted, while we still have the request open and so still have the ability to tell the caller that something is wrong. We want to avoid errors when things are pulled off the queue, because the calling system won't know, and it's hard to debug. (I think this is why messaging systems like Kafka etc. are so strict about schemas.) And it seems like these are fields where the presto system itself won't work if they are missing.
For example:
- If the `id` is missing, I think we are going to have an error downstream, likely in async code, but there will be no id, so it will be really hard to know where the error is coming from.
- If the `callback_url` is missing, we won't know until the model has processed and the message is pulled off the response queue and tries to do the callback. Presto will be logging errors, but the caller is just going to see it as silently failing to return data.
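For illustration, here is a minimal sketch of that fail-fast check, run while the caller's request is still open; the function name and the exact required fields are assumptions for the example, not the actual presto code:

```python
from typing import Dict

# Fields the async pipeline cannot recover from losing (assumed for illustration).
REQUIRED_FIELDS = ("id", "callback_url")

def validate_before_enqueue(body: Dict) -> None:
    """Reject a bad message while the caller can still see the error,
    rather than letting it fail silently after it is on the queue."""
    missing = [field for field in REQUIRED_FIELDS if not body.get(field)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
```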
```diff
 class Message(BaseModel):
     body: GenericItem
     model_name: str
     retry_count: int = 0

-def parse_message(message_data: Dict) -> Message:
+def parse_input_message(message_data: Dict) -> Message:
     body_data = message_data['body']
```
I guess we are implicitly validating that there is a 'body' and 'model_name' as it will error here if missing ;-)
Correct, those are the system-level rules enforced at the presto level. I will add validation that those two fields exist before passing it on to the model-level validator.
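A small helper along these lines would make that presto-level check explicit before the message reaches the model-level validator (names here are illustrative, not the final implementation):

```python
from typing import Dict

PRESTO_REQUIRED_KEYS = ("body", "model_name")

def check_presto_level_keys(message_data: Dict) -> None:
    # Enforced at the presto level, before any model-specific validation runs.
    for key in PRESTO_REQUIRED_KEYS:
        if key not in message_data:
            raise ValueError(f"message is missing required key '{key}'")
```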
lib/schemas.py
Outdated
```python
elif event_type == 'schema_lookup' or event_type == 'schema_create':
    result_instance = ClassyCatSchemaResponse(**result_data)

modelClass = get_class('lib.model.', os.environ.get('MODEL_NAME'))
```
Why does this need to be different than the `model_name` associated with the message? Do the names not quite align?
That's a great question. I copied this from create() in lib.model.model.py, but I agree that it makes more sense to use the model_name mentioned in the input, and I will update the code.
This function also has some other not-so-great logic that needs to be refactored down the line, such as assuming every model_name input that is not video or yake will return a MediaResponse. I have outlined my ideas for this work in a separate ticket: https://meedan.atlassian.net/browse/CV2-5093
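For context, the change being discussed amounts to roughly this; `get_class` is the helper already used in the diff above, and its import path here is an assumption:

```python
import os
from lib.helpers import get_class  # assumed import path for the existing helper

def resolve_model_class(message):
    # Before: the class was resolved from the deployment's environment variable:
    #   get_class('lib.model.', os.environ.get('MODEL_NAME'))
    # After: resolve it from the model_name carried by the message itself, so the
    # response type follows the input rather than the process environment.
    return get_class('lib.model.', message.model_name)
```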
```python
if event_type == 'schema_lookup':
    return ClassyCatSchemaResponse(**result_data)
```
Seems like this needs to check that either the schema_id or the schema_name is not empty? It looks like ClassyCatSchemaResponse considers schema_id as optional?
The class-specific validator checks that those fields exist; look at classycat_schema_lookup.py.
However, this function as implemented before only searches by name, not by both name and id. Maybe we can file a ticket for that if it's necessary?
I thought there was already a function for lookup by schema name and a separate one for id? (I just wasn't sure which this was)
```
@@ -0,0 +1,150 @@
# How to make a presto model
## Your go-to guide, one-stop shop for writing your model in presto
```
This is really helpful! :-)
Yes this rules
Very needed. Thank you!
lib/model/classycat.py
Outdated
```python
class ClassyCatBatchClassificationResponse(ClassyCatResponse):
    classification_results: Optional[List[dict]] = []
```
Seems like it should be non-optional? (must at least return an empty dict?)
Well, I think I mostly agree with you. However, these types are only enforced upon creation of these objects, when the response object is just an empty object waiting to be populated by the process() method. If we lift the Optional, then we would have to populate them on creation or have a default. I agree that down the line we want to do better typing, but for now this seems like it doesn't fit our message-processing model without some major refactoring work.
Seems like the current default of `[]` would work?
Yes, it will work for this specific case; I have updated, and now they are in classycat_response.py. But I think overall this is a subtle problem with our specific use of pydantic in presto.
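To make the tradeoff concrete, this is roughly how the Optional-with-default field behaves; a self-contained sketch, with BaseModel standing in for the actual ClassyCatResponse parent in classycat_response.py:

```python
from typing import List, Optional
from pydantic import BaseModel

class ClassyCatBatchClassificationResponse(BaseModel):
    # Optional with a default: the object can be created empty and filled in
    # later by process(), but nothing forces it to ever be filled in.
    classification_results: Optional[List[dict]] = []

response = ClassyCatBatchClassificationResponse()   # valid, defaults to []
response.classification_results = [{"text": "example", "labels": ["spam"]}]
```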
Also, I think it is super helpful to have an early draft PR like this to be able to discuss!!
Oh wait, I'm realizing I missed a crucial word: responses. I agree we don't need to validate these fields in responses (if these are missing, they are about to fail anyway), but I think we need to validate these on inputs.
Ok, you've done a great job here, but I have two things:
- I hate how much we're calling parse_input_message. Is there any way we can refactor so we don't call it in a billion different places?
- I know you're stubbing out the validations per fingerprinter, but I think we could probably just do basic type checking of the sort we already do in unit tests for each fingerprinter, and that would probably be sufficient as a first pass? We can talk more about that in the ML call next week if that's screwy or confusing.
```diff
@@ -27,3 +27,18 @@ def process(self, audio: schemas.Message) -> Dict[str, Union[str, List[int]]]:
         finally:
             os.remove(temp_file_name)
         return {"hash_value": hash_value}
+
+    @classmethod
+    def validate_input(cls, data: Dict) -> None:
```
Should we just be passing?
I think passing is equivalent to what we have right now for most of these models (with the exception of classycat), not counting the validations that happen inside schema.py.
I do agree that we should start implementing them, and I have created a ticket for that work (CV2-5093), but the current design is backward compatible and there is no need to implement these right now, unless you think it's urgent that we address it?
```python
    @classmethod
    def parse_input_message(cls, data: Dict) -> Any:
```
Should we do nothing?
Edit: ah, I see what the issue is after reading your Jira message. I think we should talk through how to not just stub all these out on our next ML call, but in the meantime I think looking at what we typically unit-test for, for each model, and just doing type-checking based on the types of responses we're testing for would be appropriate.
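As a rough picture of what that first pass could look like for a single model, mirroring the kind of assertions the unit tests already make (the field name checked here is illustrative, not the actual fixture):

```python
from typing import Dict

class ExampleAudioModel:
    @classmethod
    def validate_input(cls, data: Dict) -> None:
        # First-pass validation: only the basic type checks the unit tests
        # already rely on, nothing more model-specific than that.
        if not isinstance(data.get("url"), str):
            raise ValueError("audio input must include a string 'url'")
```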
```diff
@@ -32,34 +31,42 @@ class GenericItem(BaseModel):
     text: Optional[str] = None
     raw: Optional[Dict] = {}
     parameters: Optional[Dict] = {}
-    result: Optional[Union[ErrorResponse, MediaResponse, VideoResponse, YakeKeywordsResponse, ClassyCatSchemaResponse, ClassyCatBatchClassificationResponse]] = None
+    result: Optional[Any] = None
```
Is there any way to not just do an `Any` without it being a big headache?
I mean, I agree, Any is not the best. But do we want to keep updating the schema.py file for every new model? Because the dependency on model names is a bit not great; we don't want to tangle lower-tier presto infra (e.g. schema.py) with upward dependencies, imo. Is there a way to dynamically load these types without the massive headache of having to manually paste these types in a config file for every model? Hmm, not sure, maybe, but I'm not convinced it's worth our effort at this time; happy to discuss more.
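Purely as a hypothetical shape for that dynamic approach (none of these names exist in the repo), schemas.py could resolve a model's response type by convention instead of importing it directly:

```python
import importlib
from typing import Optional, Type

def resolve_response_type(model_name: str) -> Optional[Type]:
    # Hypothetical convention: each model module exposes a RESPONSE_TYPE attribute,
    # so lower-tier code like schemas.py never imports model classes by name.
    module = importlib.import_module(f"lib.model.{model_name}")
    return getattr(module, "RESPONSE_TYPE", None)
```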
Yes, exactly. The quoted checks on inputs are already being enforced by …
Thanks Devin! I agree with the suggested solutions on both issues you raised. Let's definitely discuss before making a decision, on the ML call or one-on-one, but my current hope is to keep the scope of this refactor limited, and have separate tickets for both reducing …
…y), still not green
…ention, now all tests pass :yay:
…tory field for classycatbatchresponse class.
# Conflicts: # lib/model/generic_transformer.py # test/lib/model/test_generic.py
Per our convo today, I have created two tickets per your suggestion @DGaffney to address these issues in future work. Those tickets are CV2-5093 and CV2-5102.
# Conflicts: # lib/model/generic_transformer.py # test/lib/model/test_generic.py
# Conflicts: # .gitignore
Description
The goal of this ticket is to refactor the generic presto parse_message() function into model-level input validations, with separate parse_input_message() and parse_output_message() functions instead of just one parse_message() for both input and output.
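In outline, the split described above looks something like this (signatures only, using the names from this PR; the return types are left as assumptions):

```python
from typing import Dict

# Before: a single parse_message() handled both directions.
# After: separate entry points, so model-level input validation can hook
# into the input path without touching the output path.

def parse_input_message(message_data: Dict) -> "Message":
    ...

def parse_output_message(message_data: Dict) -> "Message":
    ...
```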
Reference: CV2-5001
How has this been tested?
Has it been tested locally? Are there automated tests?
Are there any external dependencies?
Are there changes required in sysops terraform for this feature or fix?
Have you considered secure coding practices when writing this code?
Please list any security concerns that may be relevant.