Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

escape non-breaking space as \20, as required in RFC 7613, section 7.3 #21

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nitram509
Copy link
Contributor

The method XmppStringUtils.escapeLocalpart(String localpart)
does not escape non-breaking spaces.
But RF7613 section 7.3 is explicitly mentioning, that all spaces should be escaped as \20.

This PR adds tests and incl. the fix.

Any feedback welcome.

@Flowdalic
Copy link
Member

Thanks for your contribution. :)

Unfortunately I am not sure if this is correct:
The method you modified deals with XEP-0106 escaping, not the PRECIS handling of the localpart. Applying PRECIS is done by the XmppStringprep implementation in jxmpp. Hence I don't think the change is correct.

@nitram509
Copy link
Contributor Author

Hi @Flowdalic thank you for this feedback.
It seems I was misled by the method name itself - good catch.
I will have another look later the day - if this PR is meaningless, I will close it.

@nitram509
Copy link
Contributor Author

I did dig deeper in the specs and found a few references...
In the XEP-0106 spec is written

Furthermore, since localparts use the UsernameCaseMapped profile (RFC 7613 [2]) of PRECIS any space character disallowed by category N (section 9.14) of the RFC 7564 [3] IdentifierClass is also forbidden.

To be honest, I find it very difficult to understand the various cross-references about characters classes, e.g. spaces. Also, I don't understand what is actually meant in https://tools.ietf.org/html/rfc7564#section-9.14 ... there is no reference, what is meant by {Zs}. :-/

What I found as well in the above-cited RFC 7564 https://tools.ietf.org/html/rfc7564#section-5.3 "A note about spaces" ... they explicitly mention that the identifier part should not contain spaces.

I tend to say, that when the previous implementation used "Character.isWhitespace()" they already included the general idea about excluding whitespaces from localpart. But a few specific ones were not yet included, but falling IMHO clearly into the spaces category.

(clarification question:)
What makes you thinking, the patch is not correct?
How would you interpret the specs?

@Flowdalic
Copy link
Member

there is no reference, what is meant by {Zs}

From RFC 7563: "…core properties defined by the Unicode Standard". See also https://unicode.org/reports/tr44/

But a few specific ones were not yet included, but falling IMHO clearly into the spaces category.

Right, but if you simply map them to \20, as your patch would do, then the mapping function is no longer injective.

Could you elaborate a bit about the motivation for this PR? What was the trigger?

What makes you thinking, the patch is not correct?

I am not sure if it is correct or not. It is underspecified in XEP-0106 if '<space>' means U+0020, or all of whitespace. Since it is escaped to \20, I'd assume the authors intention is/was to just escape U+0020.

How would you interpret the specs?

I'd just escape U+0020. Because even after thinking a little while about it, I could not come up with a reason to escape more whitespace characters to \20. It also means that the transformation is not reversible (at leaset, if you escape more code points to the same escape sequence).

@nitram509
Copy link
Contributor Author

From RFC 7563: "…core properties defined by the Unicode Standard". See also https://unicode.org/reports/tr44/

👍 thanks.

IMHO, this reference to Unicodes {Zs} "feels" a bit unsharp, considering that these categories are prone to change...

Occasionally, a character property value is changed to prevent incorrect generalizations about a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking. (from section 2.3.1, https://unicode.org/reports/tr44/)


Right, but if you simply map them to \20, as your patch would do, then the mapping function is no longer injective.

Very true. Information gets lost and reverse-transformation/decoding produces different results.

Could you elaborate a bit about the motivation for this PR? What was the trigger?

I'm participating 'Hacktoberfest' and stumbled upon jxmpp. Frankly said, my original intent was just to add some tests and increase code coverage. Then I found this issue.

I'd just escape U+0020. Because even after thinking a little while about it, I could not come up with a reason to escape more whitespace characters to \20. It also means that the transformation is not reversible (at least, if you escape more code points to the same escape sequence).

Fair point.
BUT then, the current implementation is wrong because of using Character.isWhitespace.
Should I adjust this PR towards this?

@nitram509
Copy link
Contributor Author

@Flowdalic may I ask, if you have any opinion on my previous comment?

@Flowdalic
Copy link
Member

BUT then, the current implementation is wrong because of using Character.isWhitespace.

Right

Should I adjust this PR towards this?

If "towards this" means replacing the isWhitespace check with == ' ', then yes, that should be ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants