Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds sanitization to document creation #262

Merged
merged 3 commits into from
Jan 22, 2020
Merged

Conversation

DonRichards
Copy link
Member

JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412

What does this Pull Request do?

Sanitize the XML input through a form.

What's new?

Filters xml

How should this be tested?

Ingest a few items. Use control characters from the ticket to test.

Additional Notes:

Related to Islandora Scholar PR-319
This is an update to be done in conjunction with or prior to Scholar PR-319

Please test thoroughly, this is my first time modifying xml forms and I am not completely sure if I've made a breaking change.

Interested parties

@Islandora/7-x-1-x-committers

Copy link
Contributor

@DiegoPino DiegoPino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can have so many unwanted consequences. Could even lead to ill formed xml breaking the full processing chain and also obscures the issue from the ingester. You either do this on the Ui facing part (where each element is being rendered as a form field) or before actually persisting on the final form submit. imho this is not the place

@DonRichards
Copy link
Member Author

@DiegoPino Thanks for the feedback. I'll look into approaching it that way instead. Thanks!

@DonRichards DonRichards deleted the DonRichards-patch-1 branch April 16, 2019 11:48
@DonRichards DonRichards restored the DonRichards-patch-1 branch April 16, 2019 13:43
@DonRichards
Copy link
Member Author

@DiegoPino I was able to break the edit metadata page with a control characters in the description of other content models like large image. If not in forms where should this sanitization be done?
Screen Shot 2019-04-16 at 9 44 08 AM

Use this in the text fields for description(contains breaking control characters):
The Fourier-Galerkin method is utilized to solve the microscale unit cell problem, while total Lagrangian nite element is used to solve the macroscopic boundary value problems.

@DonRichards DonRichards reopened this Apr 16, 2019
@DonRichards
Copy link
Member Author

Some interesting behaviors observed. Lg image have the breaks on edit and displays error during ingestion.
Screen Shot 2019-04-16 at 9 58 04 AM

Basic image break on ingest if the char is in the title and break on edit if in the description.

@DiegoPino
Copy link
Contributor

DiegoPino commented Apr 16, 2019 via email

@DonRichards
Copy link
Member Author

@DiegoPino Ingest an object (any content model) with this string in the description

"The Fourier-Galerkin method is utilized to solve the microscale unit cell problem, while total Lagrangian nite element is used to solve the macroscopic boundary value problems. "

This causes an error on the screen and will in some cases will cause the ingest to fail entirely. I'm using the term "crash" loosely. I'm describing a way to cause a failure during ingest or edit.

@DiegoPino
Copy link
Contributor

@DonRichards, thanks for addressing my concerns. Look at this old pull i closed because of recommendations. As you can see this whole thing is not a new issue and in that time my incorrect approach (humble) was trying to dealing with the same fact that control characters permeate into lower layers and break XML (and well fedora because fedora objects are FOXML => XML)

But the suggestion i got is correct, cleaning silently is a bad practice for Metadata. Better let the user know and deal with it than allow unexpected ingests to happen and only be noticeable once objects are are already inside. Troubleshooting after the fact can become a blind search.

Basically the agreement there (and i feel it is still valid) is that we should do this type of checks (if the input is valid as an XML element value or property) at a higher level. Issue is of course if you allow ingest to happen through external APIs or harvesting from other sources. I would incline that we handle this in two different ways:

  • Check on Form input: deal with a validation callback for every XML form element (seems to be the simplest way, you basically check before submission if the content is correct or not for XML use. This strategy borrowed from twig seems to be the safest, ping @adam-vessey, what do you think? I'm not saying sanitize, i'm saying "check" and inform the user. If you want to silently sanitize, then make "autoclean" an option people can select so they know what the consequences can be (like ending with an empty value...). Here are your elements, just add validation there. https://github.com/Islandora/islandora_xml_forms/blob/7.x/elements/xml_form_elements.module

  • Validate external sources (like when you harvest from some scholar external source) and

  • Validate the resulting XML before ingest.

In both last cases, try to load/parse the XM, record errors and expose them to the user via watchlog and drupal messages, don't further process (XLST, etc) if not fully formed. Don't ingest neither. Simply break the chain and stop. That should happen in a lower layer but not too low, so we can intercept the issue before derivatives, cleanups, etc kick in.

Any other @Islandora/7-x-1-x-committers wants to give this a review?

@DonRichards
Copy link
Member Author

@DiegoPino Although I agree check at the browser is ideal but I do have reservations with allowing potentially malicious code to be allow to pass through without validating. I'm calling it potentially malicious because this could be leveraged by a disgruntled user to pass values through the form that could interrupt and/or corrupt data going in. It seems to me that all user submitted data should be treated as dangerous until after sanitized and validated. Even an AJAX form validator will assume that browser didn't report falsely.

@DiegoPino
Copy link
Contributor

@DonRichards i agree with the concept of " It seems to me that all user submitted data should be treated as dangerous until after sanitized and validated." The way how that should be managed is what i'm discussing. Doing it at that deep layer makes reporting, reacting and identifying the issue more complex. Sanitising v/s identifying and reacting is what the discussion i want to have because cleaning issues and then still ingesting (which means having an empty object?) does not solve the problem. I would really do sanitation as close to the final object update/Creation happen. But that is me.

@DonRichards
Copy link
Member Author

@DiegoPino Where do you suggest? This is trying to do it when the object is being created. And does the sanitize function return NULL when it isn't valid?

@DiegoPino
Copy link
Contributor

@DonRichards you are right, the sanitize function, IMHO which is a bit expensive for an XML string via regex (other topic), only replaces things i guess and does not return null (can you test? i had no time sorry). But maybe i'm getting this wrong here: https://github.com/Islandora/islandora_xml_forms/pull/262/files#diff-3faa7b2aea945b6ca259133e26bf4e43R144

So you are running sanitize on XML document that is provided (means was stored) and if you follow the whole chain back this is what triggers it

public function createXMLDocument($xml = NULL) {

and the this

$document = $definition->createXMLDocument($xml);

So is this assuming that the already stored XML could be not correct, valid XML and could have potentially issues/wrong? So is this dealing with stored XML?
If so, why is sanitizing not happening on XML storage itself so it does not happen? I mean if stored XML has issues, transforms like MODS to DC, or cleanup XSLT would fail too right?

The one in api/XMLFormProcessor.inc makes sense, disallows direct input of strange characters.

I declare myself not qualified to deal with this (probably the best), i'm probably missing something. Can unblock the pull if someone wants to go this a review @Islandora/7-x-1-x-committers

@bondjimbond
Copy link
Contributor

bondjimbond commented Jul 8, 2019

@DonRichards So is there a way to check for bad characters in the form fields before saving the XML? Or is the islandora_sanitize_input_for_valid_xml() function the best we've got?

Speaking practically, if the choice is between silently rewriting to make the characters valid or allowing bad input to mean we're storing bad XML, I'd rather have it silently rewritten.

In an ideal world, we'd get warned of the bad character -- which field it's in and if possible what the character is or at least its immediate context -- and have the opportunity to correct it before submitting. But if alerting the user and making them fix before submitting is too complex/difficult/impossible, I'd be happy enough with the automatic sanitization. So much else that happens on the web is sanitized automatically anyway; why is this situation any different?

And does the sanitize function return NULL when it isn't valid?

Sanitizing just cleans it up with regex... https://github.com/Islandora/islandora/blob/a46936cf7b5fae9002b34c81f180509ce5af8744/includes/utilities.inc#L878

function islandora_sanitize_input_for_valid_xml($input, $replacement = '') {
  $input = preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $replacement, $input);
  return $input;
}

I tested this pull with an ingest using invalid characters, and ended up with a valid document in which I couldn't see the difference... It's just working.

@DonRichards
Copy link
Member Author

Thanks @bondjimbond

@bondjimbond
Copy link
Contributor

@DiegoPino So my tests return good results... Are you willing to remove your block on this? If so I can do a bit more of a thorough review and probably approve it.

@adam-vessey
Copy link
Contributor

I don't believe the uses of the function here are safe, just sayin': the xml_form_api module is not dependent on islandora, nor is it assured that the file containing it would've been loaded.

In an ideal world, we'd get warned of the bad character -- which field it's in and if possible what the character is [...]

... So... hook into the form definition stuff and where there's an update or create action, add in an #element_validate kind of thing?

@DonRichards
Copy link
Member Author

@DiegoPino just pinging you on this. Would like to clear out as much as we can before the conference next month.

@DonRichards
Copy link
Member Author

@bondjimbond What are your thoughts on this PR? Can/should we move forwards with this? I'm not sure @DiegoPino is involved in Islandora anymore. I haven't heard anything about him or from him in a while, I assume that's why there is no response. Apologies in advance to Diego if I'm wrong.

@DiegoPino
Copy link
Contributor

DiegoPino commented Oct 24, 2019 via email

@DiegoPino
Copy link
Contributor

@DonRichards i will try this again. I don't think that cleaning XML on XML load and save, via a sanitization script changes the fact that people are inputting things we can not allow in XML. Not saying the sanitization itself is wrong, but this will clean things out without the user knowing they did something wrong? Also, XML forms if were your only input, well, could make sense, but there are SO many other ingests methods. So, badly formed XML is something we need to avoid true. Then maybe the cleanup needs to happen exposed to the user, or break the ingest method if there are issues with an alert, but not, in a black box fashion, remove things the user will never be aware of. True also, i can not test this for all use cases. Also right now this method is only used in two places i'm aware of (scholar and PDF sp). So, why not check for correct values, if XML forms is where you see things break, on input, as an element validation thing, where the user can react before getting their chars replaced by ''? That is my concern. And @adam-vessey things similarly too

@bondjimbond
Copy link
Contributor

Also, XML forms if were your only input, well, could make sense, but there are SO many other ingests methods.

I don't think there's an argument to be made against addressing the problem that exists with other ingest methods -- but what's wrong with addressing the problem with ingest forms here and now, and later addressing the other (probably more difficult) problem as well?

7.x development has slowed down a lot; I don't think we can take the path of discarding small fixes in favour of big fixes anymore, because we just don't have the bodies invested to make the big fixes. As 7.x approaches end-of-life, we need to accept the improvements we get rather than waiting for something bigger/better to come along.

Not saying the sanitization itself is wrong, but this will clean things out without the user knowing they did something wrong?

Is that a problem? For the most part, the kinds of users who enter data into a GUI ingest form are not the kind who will do a lot of investigating to figure out that the that their OS automatically uses as a double quote has to be " in an ingest form.

$input = preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $replacement, $input);

I can't read this regex, but it appears to be taking invalid characters and replacing them with equivalent valid ones. Can anyone point out specific instances where the characters in question being silently rewritten would cause problems?

IF there are serious reasons why we couldn't silently rewrite: then we could add another couple of lines that compare the sanitized input to the original, and if there are differences, interrupt the process, and show a warning to the user, pointing out exactly where the bad characters are, giving them an opportunity to edit before submission. (But this takes extra work, and I'd like to know that it's necessary before we demand it.)

Copy link
Contributor

@bondjimbond bondjimbond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @DiegoPino we've discussed at the Committers' Call, willing to go ahead. Any concerns? Willing to remove your block in 24 hours?

@DiegoPino DiegoPino dismissed their stale review November 15, 2019 20:07

I don't agree with this approach, so can't approve. But if committers are in agreement i'm dismissing my review.

@DiegoPino
Copy link
Contributor

I removed my review:

Any concerns

Same ones and i never got a real answer to them from the committer neither. But you all know better. Free for you to move forward

@adam-vessey
Copy link
Contributor

This is still a thing, also...

@DonRichards
Copy link
Member Author

DonRichards commented Jan 17, 2020

@bondjimbond Would you be willing to look at this one more time? I've included the warning under the update/submit button. I added a div class name so that it can be styled by the system admin.
Warning Screenshot

@DonRichards
Copy link
Member Author

@bondjimbond I attempted to find a solution to validate the form but from what I see it isn't possible. Mainly because these forms are dynamically created by form builder so I wouldn't be able to attach a validation trigger to it from here. I might be approaching this the wrong way but I can't afford to spend too much more time on this fix. This PR fixes the error and prevents it from triggering an error event even if the form is completely bypassed via batch ingest and contains invalid characters.

@bondjimbond
Copy link
Contributor

Did we decide at the last call that this is OK?

@DonRichards
Copy link
Member Author

It was suggested to look into the form validation but it was overlooked that objects with existing bad characters or objects ingested via batch will continue to be an issue without this correction.

@bondjimbond bondjimbond merged commit 970401a into 7.x Jan 22, 2020
@bondjimbond
Copy link
Contributor

As it's been weeks since approval and Diego removed his block, and since we have no better suggestions coming in, I'm going to merge this. If someone has a strong objection when it gets into the wild, we can talk about reverting, but having a fix that doesn't address all possible cases is better than not fixing at all.

@bondjimbond bondjimbond deleted the DonRichards-patch-1 branch January 22, 2020 14:08
@adam-vessey
Copy link
Contributor

I mean... this is still an issue (with a reminder)... adding an undeclared dependency, since XML forms isn't dependent on islandora anywhere else... seems to have been completely ignored?

@bondjimbond
Copy link
Contributor

@adam-vessey Right, we'd forgotten about that. Solution - new JIRA ticket to declare a dependency on Islandora and a quick PR; sound good?

@nhart
Copy link
Contributor

nhart commented Feb 7, 2020

Hey @Islandora/7-x-1-x-committers, this PR is breaking XML forms. It's missing the file include for the new function call. Pinging @daniel-dgi @DonRichards @bondjimbond

nhart added a commit to nhart/islandora_xml_forms that referenced this pull request Feb 10, 2020
bondjimbond added a commit that referenced this pull request Feb 11, 2020
Address bug introduced in #262. Include the file containing the function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants