Removes control characters in the input/output of content #319

DonRichards · 2019-04-03T12:01:13Z

JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412

What does this Pull Request do?

Sanitize MODS data into Google Scholar submodule

This does not change the data being stored, it should only impact the data as it processes for Google Scholar meta tags.

What's new?

String replace for ASCII control characters

How should this be tested?

Please use the example in the JIRA ticket for a proof of concept and MODS file with a Control Character embedded into the abstract.

Try ingesting prior to testing this PR.

View the error during ingest and/or viewing the object

Pull this PR

Navigate to previous submission
- Error should be gone
Ingest a 2nd ETD to test the ingest process works without errors now.

Additional Notes:

N/A

Interested parties

@Islandora/7-x-1-x-committers

bryjbrown · 2019-04-03T12:28:33Z

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

manez · 2019-04-03T13:12:42Z

@bryjbrown are you logged in to JIRA? I think the ticket should be visible to you, but not to anon right now.

bondjimbond · 2019-04-03T13:37:00Z

@manez @DonRichards I'm logged in and also getting the "You can't view this" message.

jordandukart · 2019-04-03T13:59:12Z

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

If this is the case would this helper be more useful: https://github.com/Islandora/islandora/blob/7.x/includes/utilities.inc#L878

bondjimbond · 2019-04-03T14:24:06Z

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

So would these bad characters get into the MODS record in the first place if they were ingested via a form? (Can't see the ticket, so can't access the example to test.)

IF the form strips it, then the bad characters are perhaps getting in via batch ingest?

In general, it would be a good idea to stop bad XML by validating it at the start of the ingest process.

dannylamb · 2019-04-03T14:49:01Z

@bondjimbond If you can discern that the datastream you're ingesting in a batch is XML, it could conceivably get sanitized there. Still wouldn't stop someone who is manually jamminng in bad XML through Fedora's API, though.

DonRichards · 2019-04-03T15:18:52Z

@bryjbrown We allow students to submit directly through the GUI and this comes up about twice a semester for us. And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).
@jordandukart Thanks for that, that might be a good long term solution to integrate.
@bondjimbond The form isn't stripping this out currently.

DonRichards · 2019-04-03T15:24:17Z

Assuming Github doesn't strip this stuff out of postings I'm going to paste a string that breaks it here.
The control character is between the two "nn". Paste this into the title or abstract.

total Lagrangiannite

To check if Github filtered go to https://www.textmagic.com/free-tools/unicode-detector and paste it in. It should say "Character not present in GSM charset, forces to use Unicode encoding"

bryjbrown · 2019-04-03T15:52:26Z

@DonRichards

And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).

I am logged in, but I'm not part of the security response group. Is that what could be causing me not to be able to see it?

DonRichards · 2019-04-03T15:53:20Z

If you paste that directly into a input field this is the results.

DonRichards · 2019-04-03T15:55:28Z

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

DonRichards · 2019-04-03T16:07:57Z

The "islandora_sanitize_input_for_valid_xml" @jordandukart suggested is currently firing for Full_text generation but I don't see it being fired prior to original file is ingested.

manez · 2019-04-03T18:24:28Z

Apologies for the confusion @bryjbrown . The message on the ticket implies that folks in the Developers group should be able to see, but I forgot that our permissions are customized, so that's not accurate under the hood. My bad 😞

fuel37 · 2019-04-05T12:24:42Z

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

Out of curiosity, if this is a sensitive security issue being handled by the Islandora Security Group, are there not protocols in place to keep things private until a fix has been identified and released? I mean otherwise, having the exploit posted publically as it is here and on Jira puts the entire community at risk, no? I'm not privy to the protocols for handling security concerns but, if this is standard practice, I'd suggest the protocols be reviewed asap.

dannylamb · 2019-04-05T12:53:56Z

@fuel37 There are protocols in place. They just weren't followed. I'm trying to pull the conversation out of this and into private channels, although I see nothing wrong with taking the time to review how we handle these sorts of things.

DonRichards · 2019-04-05T17:43:42Z

@fuel37 Through a few conversations it became apparent to me that this may cause headaches for viewing, it isn't providing any obvious platform for a malicious user to leverage an exploit. This causes functionality to not render completely but because it originates from a user submission it was originally set as sensitive.

I worried when I submitted it that this might have a further reaching issue. It turns out that Google Scholar submodule is using the fields as strings to generate embedded tags and it doesn't handle them well. Drupal gives too many options for filtering and cleaning these strings.

DonRichards · 2019-04-08T13:43:23Z

ALthough we have a PR coming in to filter everything, this should be good to filter existing material.

adam-vessey · 2019-04-05T13:21:41Z

modules/islandora_google_scholar/islandora_google_scholar.module

@@ -24,6 +24,8 @@ function islandora_google_scholar_create_meta_tags($object) {
  else {
    $tags = array();
    $tags['citation_author'] = array();
+    // remove breaking control characters.
+    $object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);


A couple of issues here:

creating a new version of the datastream every time this runs

~~restricting to ASCII... nothing of UTF-8 is allowed?~~ Actually... not sure about this one...

It's not creating a new MODS datastream, it's just re-assigning it when islandora_google_scholar_create_meta_tags is fired (every page load). This is what I got after ingesting and reloading the page several times. Are you getting something different?

The content property of datastreams is "magic", which is to say: Setting it should result in another version of the datastream being created.

Indeed, it does appear to... You appear to be running that on something that would not be touched by the code changed here... a large image? The given module only runs the given function for objects of the ir:citationCModel or ir:thesisCModel models.

Changing
$object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
to something like
$filtered_mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
should be enough to get around this issue, right @adam-vessey?

@bryjbrown: Yes, assuming of course that it's this hypothetical $filtered_mods variable that gets used instead of the content pulled from the MODS datastream on the next line.

Another alternative to that undertaken here which should additionally have very little chance of messing with character encodings would be to more gracefully handle the error when parsing the original content... constructing the new SimpleXMLElement will throw an exception if it fails to parse... we could catch and log it (watchdog(_exception), drupal_set_message, more?)... so that people would have a more direct way of discovering that their XML is invalid for whatever reason. Similarly, warning/error reporting could be taken over while constructing the object (libxml_use_internal_errors() and libxml_get_errors(), kind of thing), so we can control how/if they propagate... my concerns with the present implementation... I'm always leery of these kind of runtime filtering approaches, and how they essentially just mask underlying issues. Additionally, if the MODS does happen to be parseable, there's no need to filter it... less processing is usually better, especially in code for generating displays.

... as for what to do when handling an error... my first thought is just to return FALSE; (as we do elsewhere in this function where we have nothing to embed)...

@adam-vessey You first comment ir:citationCModel or ir:thesisCModel model is the only one going to be impacted because this is sanitizing MODS data going through the Google Scholar submodule for Google Scholar tag display. Not other cmodels should ever see this error. There was a security patch in Islandora for forms to filter this type of stuff. This PR should only be useful for those currently in the repository (new content will get filtered during ingest). I can push a quick adjustment from that feedback.
@bryjbrown Thanks for the suggestion.
What do you guys think about changing the $mods line to keep it simple?
$mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);

@DonRichards I think you and @adam-vessey are in agreement. Only ir:citationCModels and ir:thesisCModels are affected, yet you posted a screenshot showing what appears to be a large image object. Since large images aren't affected by this function, they aren't a valid way to test whether the function creates new versions of the MODS datastream on page load or not. You'd only be able to tell by testing an ir:citationCModel and ir:thesisCModel object.

@bryjbrown Oh man, yep I did that and yes it's creating multiple versions. Changing it to $mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content); corrects that is does filter this appropriately. @adam-vessey Would you be good with this?

@DonRichards: It should address the issue of unintentionally creating versions; however, as mentioned, doesn't really facilitate the discovery of the invalid data such that it might eventually be fixed.

Looks like the same issue with parsing pops up, in the "Coins" display... and really, could pop up anywhere we attempt to parse the datastream? The issue could even go so far as Solr documents being incomplete due to exceptions in GSearch when parsing? Almost... seeming like a GIGO kind of thing... invalid data causing errors, so... really, whatever ingest mechanism allowing the invalid data should be fixed? Made to validate data, at least?... maybe a batch to fix things (though identifying what needs to be fixed could be a chore)...

In any case... threw together the exception-catching business, more as an example of what it might look like: 7.x...adam-vessey:7.x-alternative-error-handling

OK, I think I got it worked out. There is one place in addition I see needs attention in the islandora_xml_forms/api/XMLDocument.inc L144 $document->loadXML(islandora_sanitize_input_for_valid_xml($xml)) : for this to work.

DonRichards · 2019-04-16T13:43:04Z

I'm sorry but this isn't isolated to Scholar and should probably live further upstream. I just did a large image ingest with the same problem.

DonRichards · 2019-04-16T14:11:09Z

Maybe this shouldn't be closed. Not sure.

bondjimbond · 2020-01-22T14:19:48Z

@DonRichards Did you find a place "further upstream" to fix the display issue?

DonRichards · 2020-01-22T16:34:26Z

@bondjimbond No, I'm not sure where it would be.

DonRichards · 2020-04-14T15:30:29Z

I believe this this needs to be merged here and not upstream at this point.

adam-vessey reviewed Apr 8, 2019

View reviewed changes

DonRichards mentioned this pull request Apr 15, 2019

Adds sanitization to document creation Islandora/islandora_xml_forms#262

Merged

DonRichards closed this Apr 16, 2019

DonRichards reopened this Apr 18, 2019

DonRichards closed this Jan 28, 2020

DonRichards force-pushed the 7.x branch from c979e03 to cf5c326 Compare January 28, 2020 19:17

Add try catch with sanitization.

297f3ce

DonRichards reopened this Feb 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removes control characters in the input/output of content #319

Removes control characters in the input/output of content #319

DonRichards commented Apr 3, 2019 •

edited

Loading

bryjbrown commented Apr 3, 2019

manez commented Apr 3, 2019

bondjimbond commented Apr 3, 2019

jordandukart commented Apr 3, 2019

bondjimbond commented Apr 3, 2019

dannylamb commented Apr 3, 2019 •

edited

Loading

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

bryjbrown commented Apr 3, 2019

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

manez commented Apr 3, 2019

fuel37 commented Apr 5, 2019

dannylamb commented Apr 5, 2019

DonRichards commented Apr 5, 2019

DonRichards commented Apr 8, 2019

adam-vessey Apr 5, 2019 •

edited

Loading

DonRichards Apr 11, 2019

adam-vessey Apr 11, 2019

bryjbrown Apr 11, 2019

adam-vessey Apr 11, 2019

DonRichards Apr 12, 2019

bryjbrown Apr 12, 2019

DonRichards Apr 15, 2019

adam-vessey Apr 15, 2019

DonRichards Apr 15, 2019

DonRichards commented Apr 16, 2019

DonRichards commented Apr 16, 2019

bondjimbond commented Jan 22, 2020

DonRichards commented Jan 22, 2020

DonRichards commented Apr 14, 2020

Removes control characters in the input/output of content #319

Are you sure you want to change the base?

Removes control characters in the input/output of content #319

Conversation

DonRichards commented Apr 3, 2019 • edited Loading

What does this Pull Request do?

What's new?

How should this be tested?

Additional Notes:

Interested parties

bryjbrown commented Apr 3, 2019

manez commented Apr 3, 2019

bondjimbond commented Apr 3, 2019

jordandukart commented Apr 3, 2019

bondjimbond commented Apr 3, 2019

dannylamb commented Apr 3, 2019 • edited Loading

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

bryjbrown commented Apr 3, 2019

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

DonRichards commented Apr 3, 2019

manez commented Apr 3, 2019

fuel37 commented Apr 5, 2019

dannylamb commented Apr 5, 2019

DonRichards commented Apr 5, 2019

DonRichards commented Apr 8, 2019

adam-vessey Apr 5, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DonRichards commented Apr 16, 2019

DonRichards commented Apr 16, 2019

bondjimbond commented Jan 22, 2020

DonRichards commented Jan 22, 2020

DonRichards commented Apr 14, 2020

DonRichards commented Apr 3, 2019 •

edited

Loading

dannylamb commented Apr 3, 2019 •

edited

Loading

adam-vessey Apr 5, 2019 •

edited

Loading