-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds sanitization to document creation #262
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can have so many unwanted consequences. Could even lead to ill formed xml breaking the full processing chain and also obscures the issue from the ingester. You either do this on the Ui facing part (where each element is being rendered as a form field) or before actually persisting on the final form submit. imho this is not the place
@DiegoPino Thanks for the feedback. I'll look into approaching it that way instead. Thanks! |
@DiegoPino I was able to break the edit metadata page with a control characters in the description of other content models like large image. If not in forms where should this sanitization be done? Use this in the text fields for description(contains breaking control characters): |
Hi when you say you managed to break it, could you explain the process/user
interaction involved?
Thanks!
El El mar, 16 de abr. de 2019 a las 09:49, Don Richards <
[email protected]> escribió:
@DiegoPino <https://github.com/DiegoPino> I was able to break the edit
metadata page with a control characters in the description of other content
models like large image. If not in forms where should this sanitization be
done?
[image: Screen Shot 2019-04-16 at 9 44 08 AM]
<https://user-images.githubusercontent.com/2738244/56214626-3b3a1300-602c-11e9-91e1-475c2e57ac61.png>
*Use this in the text fields for description(contains breaking control
characters):*
The Fourier-Galerkin method is utilized to solve the microscale unit cell
problem, while total Lagrangian nite element is used to solve the
macroscopic boundary value problems.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#262 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGn85-ND35SWV4dBLOx7IYaly4dQgkgGks5vhdTYgaJpZM4cw0E3>
.
--
Diego Pino Navarro
Digital Repositories Developer
Metropolitan New York Library Council (METRO)
|
@DiegoPino Ingest an object (any content model) with this string in the description "The Fourier-Galerkin method is utilized to solve the microscale unit cell problem, while total Lagrangiannite element is used to solve the macroscopic boundary value problems. " This causes an error on the screen and will in some cases will cause the ingest to fail entirely. I'm using the term "crash" loosely. I'm describing a way to cause a failure during ingest or edit. |
@DonRichards, thanks for addressing my concerns. Look at this old pull i closed because of recommendations. As you can see this whole thing is not a new issue and in that time my incorrect approach (humble) was trying to dealing with the same fact that control characters permeate into lower layers and break XML (and well fedora because fedora objects are FOXML => XML) But the suggestion i got is correct, cleaning silently is a bad practice for Metadata. Better let the user know and deal with it than allow unexpected ingests to happen and only be noticeable once objects are are already inside. Troubleshooting after the fact can become a blind search. Basically the agreement there (and i feel it is still valid) is that we should do this type of checks (if the input is valid as an XML element value or property) at a higher level. Issue is of course if you allow ingest to happen through external APIs or harvesting from other sources. I would incline that we handle this in two different ways:
In both last cases, try to load/parse the XM, record errors and expose them to the user via watchlog and drupal messages, don't further process (XLST, etc) if not fully formed. Don't ingest neither. Simply break the chain and stop. That should happen in a lower layer but not too low, so we can intercept the issue before derivatives, cleanups, etc kick in. Any other @Islandora/7-x-1-x-committers wants to give this a review? |
@DiegoPino Although I agree check at the browser is ideal but I do have reservations with allowing potentially malicious code to be allow to pass through without validating. I'm calling it potentially malicious because this could be leveraged by a disgruntled user to pass values through the form that could interrupt and/or corrupt data going in. It seems to me that all user submitted data should be treated as dangerous until after sanitized and validated. Even an AJAX form validator will assume that browser didn't report falsely. |
@DonRichards i agree with the concept of " It seems to me that all user submitted data should be treated as dangerous until after sanitized and validated." The way how that should be managed is what i'm discussing. Doing it at that deep layer makes reporting, reacting and identifying the issue more complex. Sanitising v/s identifying and reacting is what the discussion i want to have because cleaning issues and then still ingesting (which means having an empty object?) does not solve the problem. I would really do sanitation as close to the final object update/Creation happen. But that is me. |
@DiegoPino Where do you suggest? This is trying to do it when the object is being created. And does the sanitize function return NULL when it isn't valid? |
@DonRichards you are right, the sanitize function, IMHO which is a bit expensive for an XML string via regex (other topic), only replaces things i guess and does not return null (can you test? i had no time sorry). But maybe i'm getting this wrong here: https://github.com/Islandora/islandora_xml_forms/pull/262/files#diff-3faa7b2aea945b6ca259133e26bf4e43R144 So you are running sanitize on XML document that is provided (means was stored) and if you follow the whole chain back this is what triggers it islandora_xml_forms/api/XMLFormDefinition.inc Line 388 in 0bc8de4
and the this
So is this assuming that the already stored XML could be not correct, valid XML and could have potentially issues/wrong? So is this dealing with stored XML? The one in api/XMLFormProcessor.inc makes sense, disallows direct input of strange characters. I declare myself not qualified to deal with this (probably the best), i'm probably missing something. Can unblock the pull if someone wants to go this a review @Islandora/7-x-1-x-committers |
@DonRichards So is there a way to check for bad characters in the form fields before saving the XML? Or is the Speaking practically, if the choice is between silently rewriting to make the characters valid or allowing bad input to mean we're storing bad XML, I'd rather have it silently rewritten. In an ideal world, we'd get warned of the bad character -- which field it's in and if possible what the character is or at least its immediate context -- and have the opportunity to correct it before submitting. But if alerting the user and making them fix before submitting is too complex/difficult/impossible, I'd be happy enough with the automatic sanitization. So much else that happens on the web is sanitized automatically anyway; why is this situation any different?
Sanitizing just cleans it up with regex... https://github.com/Islandora/islandora/blob/a46936cf7b5fae9002b34c81f180509ce5af8744/includes/utilities.inc#L878
I tested this pull with an ingest using invalid characters, and ended up with a valid document in which I couldn't see the difference... It's just working. |
Thanks @bondjimbond |
@DiegoPino So my tests return good results... Are you willing to remove your block on this? If so I can do a bit more of a thorough review and probably approve it. |
I don't believe the uses of the function here are safe, just sayin': the
... So... hook into the form definition stuff and where there's an |
@DiegoPino just pinging you on this. Would like to clear out as much as we can before the conference next month. |
@bondjimbond What are your thoughts on this PR? Can/should we move forwards with this? I'm not sure @DiegoPino is involved in Islandora anymore. I haven't heard anything about him or from him in a while, I assume that's why there is no response. Apologies in advance to Diego if I'm wrong. |
Hu. That is rude. Look around, i’m there but just on a single version (7).
But i’m not ok with this pull
El El jue, 24 de oct. de 2019 a la(s) 15:27, Don Richards <
[email protected]> escribió:
@bondjimbond <https://github.com/bondjimbond> What are your thoughts on
this PR? Can/should we move forwards with this? I'm not sure @DiegoPino
<https://github.com/DiegoPino> is involved in Islandora anymore. I
haven't heard anything about him or from him in a while, I assume that's
why there is no response. Apologies in advance to Diego if I'm wrong.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#262?email_source=notifications&email_token=ABU7ZZ6VTGMVRIG4QU2QZK3QQHZKDA5CNFSM4HGDIE32YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECGFFBI#issuecomment-546067077>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABU7ZZ7GXT2MR7X6GFSUYMDQQHZKDANCNFSM4HGDIE3Q>
.
--
Diego Pino Navarro
Digital Repositories Developer
Metropolitan New York Library Council (METRO)
|
@DonRichards i will try this again. I don't think that cleaning XML on XML load and save, via a sanitization script changes the fact that people are inputting things we can not allow in XML. Not saying the sanitization itself is wrong, but this will clean things out without the user knowing they did something wrong? Also, XML forms if were your only input, well, could make sense, but there are SO many other ingests methods. So, badly formed XML is something we need to avoid true. Then maybe the cleanup needs to happen exposed to the user, or break the ingest method if there are issues with an alert, but not, in a black box fashion, remove things the user will never be aware of. True also, i can not test this for all use cases. Also right now this method is only used in two places i'm aware of (scholar and PDF sp). So, why not check for correct values, if XML forms is where you see things break, on input, as an element validation thing, where the user can react before getting their chars replaced by ''? That is my concern. And @adam-vessey things similarly too |
I don't think there's an argument to be made against addressing the problem that exists with other ingest methods -- but what's wrong with addressing the problem with ingest forms here and now, and later addressing the other (probably more difficult) problem as well? 7.x development has slowed down a lot; I don't think we can take the path of discarding small fixes in favour of big fixes anymore, because we just don't have the bodies invested to make the big fixes. As 7.x approaches end-of-life, we need to accept the improvements we get rather than waiting for something bigger/better to come along.
Is that a problem? For the most part, the kinds of users who enter data into a GUI ingest form are not the kind who will do a lot of investigating to figure out that the
I can't read this regex, but it appears to be taking invalid characters and replacing them with equivalent valid ones. Can anyone point out specific instances where the characters in question being silently rewritten would cause problems? IF there are serious reasons why we couldn't silently rewrite: then we could add another couple of lines that compare the sanitized input to the original, and if there are differences, interrupt the process, and show a warning to the user, pointing out exactly where the bad characters are, giving them an opportunity to edit before submission. (But this takes extra work, and I'd like to know that it's necessary before we demand it.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. @DiegoPino we've discussed at the Committers' Call, willing to go ahead. Any concerns? Willing to remove your block in 24 hours?
I don't agree with this approach, so can't approve. But if committers are in agreement i'm dismissing my review.
I removed my review:
Same ones and i never got a real answer to them from the committer neither. But you all know better. Free for you to move forward |
@bondjimbond Would you be willing to look at this one more time? I've included the warning under the update/submit button. I added a div class name so that it can be styled by the system admin. |
@bondjimbond I attempted to find a solution to validate the form but from what I see it isn't possible. Mainly because these forms are dynamically created by form builder so I wouldn't be able to attach a validation trigger to it from here. I might be approaching this the wrong way but I can't afford to spend too much more time on this fix. This PR fixes the error and prevents it from triggering an error event even if the form is completely bypassed via batch ingest and contains invalid characters. |
Did we decide at the last call that this is OK? |
It was suggested to look into the form validation but it was overlooked that objects with existing bad characters or objects ingested via batch will continue to be an issue without this correction. |
As it's been weeks since approval and Diego removed his block, and since we have no better suggestions coming in, I'm going to merge this. If someone has a strong objection when it gets into the wild, we can talk about reverting, but having a fix that doesn't address all possible cases is better than not fixing at all. |
I mean... this is still an issue (with a reminder)... adding an undeclared dependency, since XML forms isn't dependent on |
@adam-vessey Right, we'd forgotten about that. Solution - new JIRA ticket to declare a dependency on Islandora and a quick PR; sound good? |
Hey @Islandora/7-x-1-x-committers, this PR is breaking XML forms. It's missing the file include for the new function call. Pinging @daniel-dgi @DonRichards @bondjimbond |
Address bug introduced in #262. Include the file containing the function
JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412
What does this Pull Request do?
Sanitize the XML input through a form.
What's new?
Filters xml
How should this be tested?
Ingest a few items. Use control characters from the ticket to test.
Additional Notes:
Related to Islandora Scholar PR-319
This is an update to be done in conjunction with or prior to Scholar PR-319
Please test thoroughly, this is my first time modifying xml forms and I am not completely sure if I've made a breaking change.
Interested parties
@Islandora/7-x-1-x-committers