URL fragment identifiers containing colons are stripped even when relative URLs are allowed #35

Closed
chrisalley opened this issue May 16, 2011 · 3 comments

@chrisalley

Using the Sanitize gem, I'm cleaning some HTML. In the href attribute of my anchor tags, I wish to parse the following:

<a href="#fn:1">1</a>

This is required for implementing footnotes using the Kramdown gem.

However, Sanitize doesn't appear to like the colon inside the href attribute. It simply outputs <a>1</a> instead, skipping the href attribute altogether.

My sanitize code looks like this:

# Setup whitelist of html elements, attributes, and protocols that are allowed.
allowed_elements = ['h2', 'a', 'img', 'p', 'ul', 'ol', 'li', 'strong', 'em', 'cite', 
  'blockquote', 'code', 'pre', 'dl', 'dt', 'dd', 'br', 'hr', 'sup', 'div']
allowed_attributes = {'a' => ['href', 'rel', 'rev'], 'img' => ['src', 'alt'], 
  'sup' => ['id'], 'div' => ['class'], 'li' => ['id']}
allowed_protocols = {'a' => {'href' => ['http', 'https', 'mailto', :relative]}}

# Clean text of any unwanted html tags.
html = Sanitize.clean(html, :elements => allowed_elements, :attributes => allowed_attributes, 
  :protocols => allowed_protocols)

Is there a way to get Sanitize to accept a colon in the href attribute?

This issue is a duplicate of this Stack Overflow question.

@rgrove
Owner

rgrove commented May 16, 2011

Answered on Stack Overflow. Repeating here for posterity.

This is Sanitize doing the safest thing by default. It assumes that the portion of the URL before the : is a protocol (or a scheme in the terminology of RFC 1738), and since #fn isn't in the protocol whitelist, the entire href attribute is removed.

You can allow URLs like this by adding #fn to the protocol whitelist:

allowed_protocols = {'a' => {'href' => ['#fn', 'http', 'https', 'mailto', :relative]}}
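To make the explanation above concrete, here is a minimal sketch of that kind of check. It is a simplified model of the behavior described in this comment, not Sanitize's actual source, and the method name `naive_href_allowed?` is made up for illustration.

```ruby
# Illustrative only: a simplified model of the check described above,
# not Sanitize's real implementation. `naive_href_allowed?` is a
# hypothetical name.
def naive_href_allowed?(href, allowed_protocols)
  if href.include?(':')
    # Everything before the first colon is treated as the protocol.
    protocol = href.split(':', 2).first.downcase
    allowed_protocols.include?(protocol)
  else
    # No colon at all, so the URL is treated as relative.
    allowed_protocols.include?(:relative)
  end
end

default_list  = ['http', 'https', 'mailto', :relative]
extended_list = ['#fn'] + default_list

naive_href_allowed?('#fn:1', default_list)   # => false ("#fn" is not whitelisted)
naive_href_allowed?('#fn:1', extended_list)  # => true  ("#fn" is now whitelisted)
naive_href_allowed?('#top', default_list)    # => true  (no colon, so relative)
```

Under this model, adding `'#fn'` to the list only helps for fragments that start with exactly `#fn:`; any other prefix before the colon would need its own entry.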

@rgrove rgrove closed this as completed May 16, 2011
@trevordevore

I found this while troubleshooting an issue where Sanitize strips out the : character in the href attribute. I have a document with a bookmark that contains the : character and an href that points to it (e.g. href="#my:id"). Seeing as : is a valid character for id in HTML5, would it be safe for Sanitize to leave the : in place for links that begin with a # character?
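One way to picture the idea in this question: a URL scheme cannot contain `#` or `/` (RFC 3986 limits schemes to letters, digits, `+`, `-`, and `.`), so a colon that appears after either character cannot be a scheme delimiter. The sketch below only treats the text before a colon as a protocol when no `#` or `/` comes first. The method names are hypothetical and this is not presented as Sanitize's implementation.

```ruby
# Illustrative sketch with hypothetical names, not Sanitize's source.
# A colon only delimits a protocol if no "/" or "#" precedes it.
def extract_protocol(href)
  match = /\A\s*([^\/#:]*):/.match(href)
  match && match[1].downcase
end

def href_allowed?(href, allowed_protocols)
  protocol = extract_protocol(href)
  if protocol
    # A protocol was found, so it must be whitelisted explicitly.
    allowed_protocols.include?(protocol)
  else
    # No protocol found: fragment-only, path-only, or otherwise relative.
    allowed_protocols.include?(:relative)
  end
end

allowed = ['http', 'https', 'mailto', :relative]

href_allowed?('#my:id', allowed)               # => true  (colon follows "#")
href_allowed?('/notes:2011', allowed)          # => true  (colon follows "/")
href_allowed?('javascript:alert(1)', allowed)  # => false (protocol not whitelisted)
```

The pattern is deliberately loose about what counts as a protocol, so odd prefixes such as `java\tscript:` are still captured and rejected rather than slipping through as relative. It assumes the attribute value has already been entity-decoded by the HTML parser.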

@rgrove rgrove reopened this Nov 1, 2017
@rgrove rgrove added bug and removed question labels Nov 1, 2017
@rgrove rgrove changed the title from "Sanitize doesn't like colon inside href attribute" to "URL fragment identifiers containing colons are stripped even when relative URLs are allowed" Nov 1, 2017
@rgrove
Owner

rgrove commented Dec 30, 2024

Turns out this was fixed in #87 (released in v2.1.0) way back in 2013, but this issue wasn't mentioned in that PR so no link was established. When a new comment was added here in 2017, I must not have remembered that PR, and I reopened the issue. But as far as I can tell everything's working fine!

If anyone's still having problems with URL fragments that contain colons, please share some code that reproduces the problem (be sure to mention what version of Sanitize you're using), and I'll try to find time to investigate before another 7 years go by. 😄

@rgrove rgrove closed this as completed Dec 30, 2024
3 participants