Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

providers: add validate_record to validate record according to provider [+] #1140

Conversation

fenekku
Copy link
Contributor

@fenekku fenekku commented Dec 8, 2022

# Create all PIDs specified on draft or which PIDs schemes which are
# require
# Determine schemes which are required, but not yet created.
missing_required_schemes = required_schemes - record_schemes - draft_schemes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for readability

"""Constructor for RecordService."""
self._providers = providers
self._required_schemes = required_schemes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's legitimate for the PIDManager to know which pid schemes are required.

success, val_errors = provider.validate(record=record, **pid)
if not success:
errors.append({"field": f"pids.{scheme}", "messages": val_errors})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just reordering for easier reading of the file as a developer.


if raise_errors and errors:
raise ValidationError(message=errors)

Copy link
Contributor Author

@fenekku fenekku Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the crux of the feature. It's slightly different than the original idea of modifying the validate() method mostly because it clarifies intent this way, removes the need to change the interface of that method (across many classes and tests) and removes the need to fiddle around to have providers know the scheme.

Note that errors are a dict, so that ValidationError end-up being processed by the error handler https://github.com/inveniosoftware/invenio-records-resources/blob/master/invenio_records_resources/resources/errors.py#L47 (really: https://github.com/inveniosoftware/invenio-records-resources/blob/a7595311b5e66e39e2c831786dbb07e6701f741c/invenio_records_resources/errors.py#L55) to return an array of {field: <field>, messages: [<msg1>, <msg2>, ...]} as any regular draft validation error would.

Further note however, that the frontend doesn't associate the error with the field despite this. The frontend will display the top-level error at least:

image

and publication is prevented which are the important aspects.
The field specific error is not shown but this is the case as well for validate() (even in local tests with this change of format - which IMHO is more correct).

The assignment of errors to fields is a wider frontend issue (e.g., inveniosoftware/invenio-app-rdm#1353 or inveniosoftware/react-invenio-deposit#403), so keeping it out-of-bounds of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the point of clear intention but now I think we have split the validation mechanism into 2 independent methods thus losing the self-contained way to validate and we moved the responsibility from the PID manager to the PID component. I think I would include the validation_record as another step in the process of validate and change slightly the interface of the method to

Suggested change
self.service.pids.pid_manager.validate(draft, record, raise_errors=True)

and retrieve the draft_pids from the draft itself. That said, isn't the self._validate_pids(pids, record, errors) enough to be called with draft instead of record? That would simplify the amount of code needed I think...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be long, grab some tea 😉 🍵

isn't the self._validate_pids(pids, record, errors) enough to be called with draft instead of record?

I don't think so. My understanding is that https://github.com/inveniosoftware/invenio-rdm-records/blob/master/invenio_rdm_records/services/pids/providers/base.py#L153 checks if your draft submitted for publication is trying to change the current pid under validation to point to another record than its associated published one. So the associated published record needs to be passed (not just the draft) to compare it against.

And it just so happens that for drafts with no associated published record (completely new drafts), the PID will not exist in the DB in the first place, so PIDDoesNotExistError is raised and caught, and validate()... well... validates. An astute reader will realize that one could pass an existing PID via the API when creating a draft... and that will probably cause a 500. But solving that problem is also out of bounds of this PR.

This is all pretty hard to follow. That's why I think creating a separate method with clear intent is much better here. It allows us to isolate ourselves from this.

And this is not just nice clarity, it also has to do with the interface change. I literally started by putting the change in validate() and changing the interface (specifically the errors one): a lot of things needed to change in turn (breaking change). So I backed away. Mostly those breakages were just about fixing the interfaces elsewhere, but what really made me back away was how providers would have to use their pid_type in the error dict which made me question why RDM_PERSISTENT_IDENTIFIERS and RDM_PERSISTENT_IDENTIFIERS are the way they are:

If a provider already knows their pid scheme (also called pid_type in the code base... not sure why...) why are RDM_PERSISTENT_IDENTIFIERS and RDM_PERSISTENT_IDENTIFIER_PROVIDERS separate and not just one mapping? You can't mix-and-match scheme to providers unless you dynamically change the provider's pid_type (which we don't). And all the other info (e.g., "validator": idutils.is_doi) can just be tied to the provider (because in reality they are tied to the pid_type). Hopefully, it wasn't just a case of someone having DRYinitis 😉 .

That being said, in all seriousness, as usual, I probably don't understand the real intent of the code. That means we probably can't rely on the provider.pid_type which means we can't generate the errors correctly which means we can't place that code in the same method.

So we are back to code intent and back to separate methods 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is hard for me to agree that a second method validate_record, added to the provider interface, adds clarity.
On the contrary, the current validate method is a hook clear enough to allow you to implement the validations that you need. The interface that was created is passing the record as param exactly for this, otherwise draft_pids was probably enough (we added the record for deduplication/integrity check).
I am also not 100% sure why you are passing the draft and not the record: the component's method publish is called after the draft has been "converted" to a record, and what will be sent to the provider is the record. Why not doing the validation on the record then?

If you are seeking for clarity, then you would have to rename validate to something like validate_pid and remove the record param, then add the second method validate_metadata or similar.
validate(..., record, ...) and validate_record(draft, ...) is, for me, more confusing.

The only thing that I have noticed in your changes that might be a sign that we need a separate method is the different return error type, which changes from a list to a dict, one error per field. This makes sense.

To summarize:

  • I would keep one validate method only. The proposed 2 methods are for me confusing.
  • I would change the return error type as you propose. The errors triggered by the HTTP call could be converted to an Exception instead.
  • I agree with the other changes, your solution looks clean and well done :)

Let's try to involve another reviewer to have an extra opinion :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot to unpack here 😄 This is a good place to start:

The interface that was created is passing the record as param exactly for this, otherwise draft_pids was probably enough (we added the record for deduplication/integrity check). I am also not 100% sure why you are passing the draft and not the record: the component's method publish is called after the draft has been "converted" to a record, and what will be sent to the provider is the record. Why not doing the validation on the record then?

Validation on draft was favored to sidestep the introduction of a subtle dependency on MetadataComponent when it wasn't needed. To make sure we are on the same page: MetadataComponent is the one that fills record.metadata with draft.metadata. Since draft on its own is enough for validation purposes, by validating it we don't have to depend on MetadataComponent having run before. This makes the design less fragile, because it loosens the constraints on the order of placement of PIDsComponent in the components array. It can also be argued that this plays well with the rest of the code in that component since the component looks at differences between draft and record (on the pids field though) - but this argument is subjective versus loose coupling which is more concrete.

Back to the question of using a separate method validate_record or placing the change in validate. The impetus for a separate method is based on the open for addition, closed to modification principle: by adding a new separate method, there is no breaking of the previous interface and no modification of pre-existing code where I could introduce further problems. A better name for validate_record would be welcome though if it helps clarify intent. The method itself doesn't care if a draft or a record is passed though, so any name is bound to be unclear unless we decide that a Draft (or a Record if the above paragraph wasn't convincing 😉 ) is what is passed to it OR we decide to use a name like recaft or some such new word :laugh:.

That being said, I am happy to place the change in validate 😄 if:

  1. we are fine with the breaking change* (and the potential risk of me breaking other stuff 🐘 ) AND
  2. the uncertainty described in my previous comment can be addressed satisfyingly. To recap that uncertainty (I don't know that I can describe it differently than I did, but I will summarize it):

To generate correct errors, Providers need to know the pid scheme (aka pid_type) they are used for. I would like to use self.pid_type BUT then why does RDM_PERSISTENT_IDENTIFIER_PROVIDERS exist? Presumably to mix-and-match providers and scheme (although that probably wouldn't work currently anyway). In which case pid_type won't be reliable in the future, then the separate method is probably a simpler, least-breaking, forward-looking change. Bringing someone else more familiar with that code will indeed be helpful here!

If 1) and 2) can be addressed, we can place the change in validate and make it take draft and record to gain the advantage listed above (validate(draft, record, ...)).

*: When I initially started work on this at the beginning of last week, I did introduce the change in validate and I want to say 8 files or so needed additional changes due to the breaking interface.

Copy link
Contributor

@ntarocco ntarocco Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation on draft was favored to sidestep the introduction of a subtle dependency on MetadataComponent when it wasn't needed....

While I understand what you say, having components as a list, where order counts, was vastly discussed when it was designed, concerns were raised and discussed. It was decided to go ahead with this architecture.
What you are not taking into account is that there could be another component, added in-between in the future, that could change the record metadata. You cannot predict that.

The provider should validate the final metadata that will be used to register the PID, not another version of the record (aka the draft).

The impetus for a separate method is based on the open for addition, closed to modification principle...

You need to put design patterns in context. With the change you want to make, you assume that you cannot change anything else. Unfortunately, you are adding more complexity and more confusion to the code, again IMO. We maintain the datacite plugin, we can make this change and document very well what is the change if someone else already implemented the PIDProvider interface. At the end of the day, this is what semantic versioning is here for.
Adding a new method won't break things, correct, but developers might have/want to implement the new method: finding now 2 methods for validation, they will be confused on what they have to validate and where.

To generate correct errors, Providers need to know the pid scheme (aka pid_type) they are used for.

Why? A concrete instance of a provider knows what is handling. In fact, you added the error Missing publisher field required for DOI registration..
By the way, the pid_type is in the constructor of the PIDProviderclass.

change in validate and make it take draft and record to gain the advantage listed above (validate(draft, record, ...)).

I would not do that for the reason explained above. I see no need to validate metadata on the draft.

*: When I initially started work on this at the beginning of last week, I did introduce the change in validate and I want to say 8 files or so needed additional changes due to the breaking interface.

To me, the only change you need is the error handling/return type. Nothing more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok after a lot of work: it turns out we can't change the errors to be a dict because we get a list of errors from schema.load and all other components (and other code) rely on that interface. But that means we can avoid breaking interfaces.

I am deprecating this PR in favor of this new one: #1143 . It should address the points above and solve a couple of other problems. See you over there!

from flask import current_app
from flask_babelex import lazy_gettext as _
from invenio_pidstore.errors import PIDAlreadyExists, PIDDoesNotExistError
from invenio_pidstore.models import PersistentIdentifier, PIDStatus


class PIDProvider:
class PIDProvider(ABC):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was very handy to spot the changes needed and will immediately raise the need to implement that method for any third-party provider implementers.

"publication_date": "2020-06-01",
# because DATACITE_ENABLED is True, this field is required
"publisher": "Acme Inc",
"resource_type": {"id": "image-photo"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding publisher and reordering alphabetically for readability.

success, errors = datacite_provider.validate(
record=record, identifier="10.1000/valid.1234", provider="datacite"
)
assert success
assert errors == []
assert [] == errors
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(pet peeve: pytest shows the results of comparison tests like this with left-hand side as the expected result, so we should do the same to make reading test errors easier)

@fenekku fenekku requested a review from lnielsen December 9, 2022 00:06
invenio_rdm_records/services/pids/providers/external.py Outdated Show resolved Hide resolved
invenio_rdm_records/services/pids/providers/datacite.py Outdated Show resolved Hide resolved

if raise_errors and errors:
raise ValidationError(message=errors)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the point of clear intention but now I think we have split the validation mechanism into 2 independent methods thus losing the self-contained way to validate and we moved the responsibility from the PID manager to the PID component. I think I would include the validation_record as another step in the process of validate and change slightly the interface of the method to

Suggested change
self.service.pids.pid_manager.validate(draft, record, raise_errors=True)

and retrieve the draft_pids from the draft itself. That said, isn't the self._validate_pids(pids, record, errors) enough to be called with draft instead of record? That would simplify the amount of code needed I think...

@fenekku fenekku force-pushed the 1137_validate_publisher_via_datacite_provider branch 2 times, most recently from a88df89 to d9bff5d Compare December 12, 2022 13:42
@fenekku fenekku requested a review from zzacharo December 12, 2022 13:42
…er [+]

- in turn this allows validation of publisher by DataciteProvider
- closes inveniosoftware#1137
@fenekku fenekku force-pushed the 1137_validate_publisher_via_datacite_provider branch from d9bff5d to de414af Compare December 12, 2022 14:06
@zzacharo zzacharo assigned kpsherva and fenekku and unassigned fenekku and kpsherva Dec 13, 2022
@fenekku
Copy link
Contributor Author

fenekku commented Dec 16, 2022

Deprecated in favor of #1143 . See you over there!

@fenekku fenekku closed this Dec 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate Publisher via DataCiteProvider publisher required for DataCite DOI registration
4 participants