-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removes control characters in the input/output of content #319
base: 7.x
Are you sure you want to change the base?
Conversation
@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this. Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice. |
@bryjbrown are you logged in to JIRA? I think the ticket should be visible to you, but not to anon right now. |
@manez @DonRichards I'm logged in and also getting the "You can't view this" message. |
If this is the case would this helper be more useful: https://github.com/Islandora/islandora/blob/7.x/includes/utilities.inc#L878 |
So would these bad characters get into the MODS record in the first place if they were ingested via a form? (Can't see the ticket, so can't access the example to test.) IF the form strips it, then the bad characters are perhaps getting in via batch ingest? In general, it would be a good idea to stop bad XML by validating it at the start of the ingest process. |
@bondjimbond If you can discern that the datastream you're ingesting in a batch is XML, it could conceivably get sanitized there. Still wouldn't stop someone who is manually jamminng in bad XML through Fedora's API, though. |
@bryjbrown We allow students to submit directly through the GUI and this comes up about twice a semester for us. And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be). |
Assuming Github doesn't strip this stuff out of postings I'm going to paste a string that breaks it here. total Lagrangiannite To check if Github filtered go to https://www.textmagic.com/free-tools/unicode-detector and paste it in. It should say "Character not present in GSM charset, forces to use Unicode encoding" |
I am logged in, but I'm not part of the security response group. Is that what could be causing me not to be able to see it? |
@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here. |
The "islandora_sanitize_input_for_valid_xml" @jordandukart suggested is currently firing for Full_text generation but I don't see it being fired prior to original file is ingested. |
Apologies for the confusion @bryjbrown . The message on the ticket implies that folks in the Developers group should be able to see, but I forgot that our permissions are customized, so that's not accurate under the hood. My bad 😞 |
Out of curiosity, if this is a sensitive security issue being handled by the Islandora Security Group, are there not protocols in place to keep things private until a fix has been identified and released? I mean otherwise, having the exploit posted publically as it is here and on Jira puts the entire community at risk, no? I'm not privy to the protocols for handling security concerns but, if this is standard practice, I'd suggest the protocols be reviewed asap. |
@fuel37 There are protocols in place. They just weren't followed. I'm trying to pull the conversation out of this and into private channels, although I see nothing wrong with taking the time to review how we handle these sorts of things. |
@fuel37 Through a few conversations it became apparent to me that this may cause headaches for viewing, it isn't providing any obvious platform for a malicious user to leverage an exploit. This causes functionality to not render completely but because it originates from a user submission it was originally set as sensitive. I worried when I submitted it that this might have a further reaching issue. It turns out that Google Scholar submodule is using the fields as strings to generate embedded tags and it doesn't handle them well. Drupal gives too many options for filtering and cleaning these strings. |
ALthough we have a PR coming in to filter everything, this should be good to filter existing material. |
@@ -24,6 +24,8 @@ function islandora_google_scholar_create_meta_tags($object) { | |||
else { | |||
$tags = array(); | |||
$tags['citation_author'] = array(); | |||
// remove breaking control characters. | |||
$object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of issues here:
- creating a new version of the datastream every time this runs
restricting to ASCII... nothing of UTF-8 is allowed?Actually... not sure about this one...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content
property of datastreams is "magic", which is to say: Setting it should result in another version of the datastream being created.
Indeed, it does appear to... You appear to be running that on something that would not be touched by the code changed here... a large image? The given module only runs the given function for objects of the ir:citationCModel
or ir:thesisCModel
models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing
$object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
to something like
$filtered_mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
should be enough to get around this issue, right @adam-vessey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryjbrown: Yes, assuming of course that it's this hypothetical $filtered_mods
variable that gets used instead of the content pulled from the MODS datastream on the next line.
Another alternative to that undertaken here which should additionally have very little chance of messing with character encodings would be to more gracefully handle the error when parsing the original content... constructing the new SimpleXMLElement
will throw an exception if it fails to parse... we could catch and log it (watchdog(_exception), drupal_set_message, more?)... so that people would have a more direct way of discovering that their XML is invalid for whatever reason. Similarly, warning/error reporting could be taken over while constructing the object (libxml_use_internal_errors()
and libxml_get_errors()
, kind of thing), so we can control how/if they propagate... my concerns with the present implementation... I'm always leery of these kind of runtime filtering approaches, and how they essentially just mask underlying issues. Additionally, if the MODS does happen to be parseable, there's no need to filter it... less processing is usually better, especially in code for generating displays.
... as for what to do when handling an error... my first thought is just to return FALSE;
(as we do elsewhere in this function where we have nothing to embed)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adam-vessey You first comment ir:citationCModel or ir:thesisCModel model is the only one going to be impacted because this is sanitizing MODS data going through the Google Scholar submodule for Google Scholar tag display. Not other cmodels should ever see this error. There was a security patch in Islandora for forms to filter this type of stuff. This PR should only be useful for those currently in the repository (new content will get filtered during ingest). I can push a quick adjustment from that feedback.
@bryjbrown Thanks for the suggestion.
What do you guys think about changing the $mods line to keep it simple?
$mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DonRichards I think you and @adam-vessey are in agreement. Only ir:citationCModels and ir:thesisCModels are affected, yet you posted a screenshot showing what appears to be a large image object. Since large images aren't affected by this function, they aren't a valid way to test whether the function creates new versions of the MODS datastream on page load or not. You'd only be able to tell by testing an ir:citationCModel and ir:thesisCModel object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryjbrown Oh man, yep I did that and yes it's creating multiple versions. Changing it to $mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
corrects that is does filter this appropriately. @adam-vessey Would you be good with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DonRichards: It should address the issue of unintentionally creating versions; however, as mentioned, doesn't really facilitate the discovery of the invalid data such that it might eventually be fixed.
Looks like the same issue with parsing pops up, in the "Coins" display... and really, could pop up anywhere we attempt to parse the datastream? The issue could even go so far as Solr documents being incomplete due to exceptions in GSearch when parsing? Almost... seeming like a GIGO kind of thing... invalid data causing errors, so... really, whatever ingest mechanism allowing the invalid data should be fixed? Made to validate data, at least?... maybe a batch to fix things (though identifying what needs to be fixed could be a chore)...
In any case... threw together the exception-catching business, more as an example of what it might look like: 7.x...adam-vessey:7.x-alternative-error-handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I got it worked out. There is one place in addition I see needs attention in the islandora_xml_forms/api/XMLDocument.inc L144 $document->loadXML(islandora_sanitize_input_for_valid_xml($xml)) :
for this to work.
I'm sorry but this isn't isolated to Scholar and should probably live further upstream. I just did a large image ingest with the same problem. |
Maybe this shouldn't be closed. Not sure. |
@DonRichards Did you find a place "further upstream" to fix the display issue? |
@bondjimbond No, I'm not sure where it would be. |
I believe this this needs to be merged here and not upstream at this point. |
JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412
What does this Pull Request do?
Sanitize MODS data into Google Scholar submodule
This does not change the data being stored, it should only impact the data as it processes for Google Scholar meta tags.
What's new?
String replace for ASCII control characters
How should this be tested?
Please use the example in the JIRA ticket for a proof of concept and MODS file with a Control Character embedded into the abstract.
Try ingesting prior to testing this PR.
Pull this PR
Additional Notes:
N/A
Interested parties
@Islandora/7-x-1-x-committers