-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
providers: add validate_record to validate record according to provider [+] #1140
providers: add validate_record to validate record according to provider [+] #1140
Conversation
# Create all PIDs specified on draft or which PIDs schemes which are | ||
# require | ||
# Determine schemes which are required, but not yet created. | ||
missing_required_schemes = required_schemes - record_schemes - draft_schemes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for readability
"""Constructor for RecordService.""" | ||
self._providers = providers | ||
self._required_schemes = required_schemes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's legitimate for the PIDManager to know which pid schemes are required.
success, val_errors = provider.validate(record=record, **pid) | ||
if not success: | ||
errors.append({"field": f"pids.{scheme}", "messages": val_errors}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reordering for easier reading of the file as a developer.
|
||
if raise_errors and errors: | ||
raise ValidationError(message=errors) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the crux of the feature. It's slightly different than the original idea of modifying the validate()
method mostly because it clarifies intent this way, removes the need to change the interface of that method (across many classes and tests) and removes the need to fiddle around to have providers know the scheme.
Note that errors
are a dict, so that ValidationError
end-up being processed by the error handler https://github.com/inveniosoftware/invenio-records-resources/blob/master/invenio_records_resources/resources/errors.py#L47 (really: https://github.com/inveniosoftware/invenio-records-resources/blob/a7595311b5e66e39e2c831786dbb07e6701f741c/invenio_records_resources/errors.py#L55) to return an array of {field: <field>, messages: [<msg1>, <msg2>, ...]}
as any regular draft validation error would.
Further note however, that the frontend doesn't associate the error with the field despite this. The frontend will display the top-level error at least:
and publication is prevented which are the important aspects.
The field specific error is not shown but this is the case as well for validate()
(even in local tests with this change of format - which IMHO is more correct).
The assignment of errors to fields is a wider frontend issue (e.g., inveniosoftware/invenio-app-rdm#1353 or inveniosoftware/react-invenio-deposit#403), so keeping it out-of-bounds of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the point of clear intention but now I think we have split the validation mechanism into 2 independent methods thus losing the self-contained way to validate and we moved the responsibility from the PID manager to the PID component. I think I would include the validation_record
as another step in the process of validate
and change slightly the interface of the method to
self.service.pids.pid_manager.validate(draft, record, raise_errors=True) |
and retrieve the draft_pids
from the draft itself. That said, isn't the self._validate_pids(pids, record, errors)
enough to be called with draft
instead of record
? That would simplify the amount of code needed I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be long, grab some tea 😉 🍵
isn't the self._validate_pids(pids, record, errors) enough to be called with draft instead of record?
I don't think so. My understanding is that https://github.com/inveniosoftware/invenio-rdm-records/blob/master/invenio_rdm_records/services/pids/providers/base.py#L153 checks if your draft submitted for publication is trying to change the current pid under validation to point to another record than its associated published one. So the associated published record needs to be passed (not just the draft) to compare it against.
And it just so happens that for drafts with no associated published record (completely new drafts), the PID will not exist in the DB in the first place, so PIDDoesNotExistError
is raised and caught, and validate()
... well... validates. An astute reader will realize that one could pass an existing PID via the API when creating a draft... and that will probably cause a 500. But solving that problem is also out of bounds of this PR.
This is all pretty hard to follow. That's why I think creating a separate method with clear intent is much better here. It allows us to isolate ourselves from this.
And this is not just nice clarity, it also has to do with the interface change. I literally started by putting the change in validate()
and changing the interface (specifically the errors
one): a lot of things needed to change in turn (breaking change). So I backed away. Mostly those breakages were just about fixing the interfaces elsewhere, but what really made me back away was how providers would have to use their pid_type
in the error
dict which made me question why RDM_PERSISTENT_IDENTIFIERS
and RDM_PERSISTENT_IDENTIFIERS
are the way they are:
If a provider already knows their pid scheme (also called pid_type
in the code base... not sure why...) why are RDM_PERSISTENT_IDENTIFIERS
and RDM_PERSISTENT_IDENTIFIER_PROVIDERS separate and not just one mapping? You can't mix-and-match scheme to providers unless you dynamically change the provider's pid_type (which we don't). And all the other info (e.g., "validator": idutils.is_doi
) can just be tied to the provider (because in reality they are tied to the pid_type). Hopefully, it wasn't just a case of someone having DRYinitis 😉 .
That being said, in all seriousness, as usual, I probably don't understand the real intent of the code. That means we probably can't rely on the provider.pid_type
which means we can't generate the errors correctly which means we can't place that code in the same method.
So we are back to code intent and back to separate methods 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard for me to agree that a second method validate_record
, added to the provider
interface, adds clarity.
On the contrary, the current validate
method is a hook clear enough to allow you to implement the validations that you need. The interface that was created is passing the record
as param exactly for this, otherwise draft_pids
was probably enough (we added the record for deduplication/integrity check
).
I am also not 100% sure why you are passing the draft and not the record: the component's method publish
is called after the draft has been "converted" to a record, and what will be sent to the provider is the record. Why not doing the validation on the record then?
If you are seeking for clarity, then you would have to rename validate
to something like validate_pid
and remove the record
param, then add the second method validate_metadata
or similar.
validate(..., record, ...)
and validate_record(draft, ...)
is, for me, more confusing.
The only thing that I have noticed in your changes that might be a sign that we need a separate method is the different return error type, which changes from a list to a dict, one error per field. This makes sense.
To summarize:
- I would keep one
validate
method only. The proposed 2 methods are for me confusing. - I would change the return error type as you propose. The errors triggered by the HTTP call could be converted to an Exception instead.
- I agree with the other changes, your solution looks clean and well done :)
Let's try to involve another reviewer to have an extra opinion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot to unpack here 😄 This is a good place to start:
The interface that was created is passing the record as param exactly for this, otherwise draft_pids was probably enough (we added the record for deduplication/integrity check). I am also not 100% sure why you are passing the draft and not the record: the component's method publish is called after the draft has been "converted" to a record, and what will be sent to the provider is the record. Why not doing the validation on the record then?
Validation on draft
was favored to sidestep the introduction of a subtle dependency on MetadataComponent
when it wasn't needed. To make sure we are on the same page: MetadataComponent
is the one that fills record.metadata
with draft.metadata
. Since draft
on its own is enough for validation purposes, by validating it we don't have to depend on MetadataComponent
having run before. This makes the design less fragile, because it loosens the constraints on the order of placement of PIDsComponent
in the components
array. It can also be argued that this plays well with the rest of the code in that component since the component looks at differences between draft
and record
(on the pids
field though) - but this argument is subjective versus loose coupling which is more concrete.
Back to the question of using a separate method validate_record
or placing the change in validate
. The impetus for a separate method is based on the open for addition, closed to modification principle: by adding a new separate method, there is no breaking of the previous interface and no modification of pre-existing code where I could introduce further problems. A better name for validate_record
would be welcome though if it helps clarify intent. The method itself doesn't care if a draft or a record is passed though, so any name is bound to be unclear unless we decide that a Draft (or a Record if the above paragraph wasn't convincing 😉 ) is what is passed to it OR we decide to use a name like recaft
or some such new word :laugh:.
That being said, I am happy to place the change in validate
😄 if:
- we are fine with the breaking change* (and the potential risk of me breaking other stuff 🐘 ) AND
- the uncertainty described in my previous comment can be addressed satisfyingly. To recap that uncertainty (I don't know that I can describe it differently than I did, but I will summarize it):
To generate correct errors, Providers need to know the pid scheme
(aka pid_type
) they are used for. I would like to use self.pid_type
BUT then why does RDM_PERSISTENT_IDENTIFIER_PROVIDERS
exist? Presumably to mix-and-match providers and scheme (although that probably wouldn't work currently anyway). In which case pid_type
won't be reliable in the future, then the separate method is probably a simpler, least-breaking, forward-looking change. Bringing someone else more familiar with that code will indeed be helpful here!
If 1) and 2) can be addressed, we can place the change in validate
and make it take draft
and record
to gain the advantage listed above (validate(draft, record, ...)
).
*: When I initially started work on this at the beginning of last week, I did introduce the change in validate
and I want to say 8 files or so needed additional changes due to the breaking interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validation on draft was favored to sidestep the introduction of a subtle dependency on MetadataComponent when it wasn't needed....
While I understand what you say, having components as a list, where order counts, was vastly discussed when it was designed, concerns were raised and discussed. It was decided to go ahead with this architecture.
What you are not taking into account is that there could be another component, added in-between in the future, that could change the record metadata. You cannot predict that.
The provider should validate the final metadata that will be used to register the PID, not another version of the record (aka the draft).
The impetus for a separate method is based on the open for addition, closed to modification principle...
You need to put design patterns in context. With the change you want to make, you assume that you cannot change anything else. Unfortunately, you are adding more complexity and more confusion to the code, again IMO. We maintain the datacite
plugin, we can make this change and document very well what is the change if someone else already implemented the PIDProvider
interface. At the end of the day, this is what semantic versioning is here for.
Adding a new method won't break things, correct, but developers might have/want to implement the new method: finding now 2 methods for validation, they will be confused on what they have to validate and where.
To generate correct errors, Providers need to know the pid scheme (aka pid_type) they are used for.
Why? A concrete instance of a provider knows what is handling. In fact, you added the error Missing publisher field required for DOI registration.
.
By the way, the pid_type is in the constructor of the PIDProvider
class.
change in validate and make it take draft and record to gain the advantage listed above (validate(draft, record, ...)).
I would not do that for the reason explained above. I see no need to validate metadata on the draft.
*: When I initially started work on this at the beginning of last week, I did introduce the change in validate and I want to say 8 files or so needed additional changes due to the breaking interface.
To me, the only change you need is the error handling/return type. Nothing more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok after a lot of work: it turns out we can't change the errors to be a dict because we get a list of errors from schema.load
and all other components (and other code) rely on that interface. But that means we can avoid breaking interfaces.
I am deprecating this PR in favor of this new one: #1143 . It should address the points above and solve a couple of other problems. See you over there!
from flask import current_app | ||
from flask_babelex import lazy_gettext as _ | ||
from invenio_pidstore.errors import PIDAlreadyExists, PIDDoesNotExistError | ||
from invenio_pidstore.models import PersistentIdentifier, PIDStatus | ||
|
||
|
||
class PIDProvider: | ||
class PIDProvider(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was very handy to spot the changes needed and will immediately raise the need to implement that method for any third-party provider implementers.
"publication_date": "2020-06-01", | ||
# because DATACITE_ENABLED is True, this field is required | ||
"publisher": "Acme Inc", | ||
"resource_type": {"id": "image-photo"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding publisher and reordering alphabetically for readability.
success, errors = datacite_provider.validate( | ||
record=record, identifier="10.1000/valid.1234", provider="datacite" | ||
) | ||
assert success | ||
assert errors == [] | ||
assert [] == errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(pet peeve: pytest shows the results of comparison tests like this with left-hand side as the expected result, so we should do the same to make reading test errors easier)
|
||
if raise_errors and errors: | ||
raise ValidationError(message=errors) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the point of clear intention but now I think we have split the validation mechanism into 2 independent methods thus losing the self-contained way to validate and we moved the responsibility from the PID manager to the PID component. I think I would include the validation_record
as another step in the process of validate
and change slightly the interface of the method to
self.service.pids.pid_manager.validate(draft, record, raise_errors=True) |
and retrieve the draft_pids
from the draft itself. That said, isn't the self._validate_pids(pids, record, errors)
enough to be called with draft
instead of record
? That would simplify the amount of code needed I think...
a88df89
to
d9bff5d
Compare
…er [+] - in turn this allows validation of publisher by DataciteProvider - closes inveniosoftware#1137
d9bff5d
to
de414af
Compare
Deprecated in favor of #1143 . See you over there! |
Replaces: