bom, encoding and test EXPath-file-writeText3-002 #70

Open
benibela opened this issue Oct 25, 2016 · 7 comments

Comments

@benibela

The test EXPath-file-writeText3-002 assumes encoding utf-16 is written as big-endian with BOM.

It could just as well mean little-endian, each with or without BOM.
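To make the four readings concrete, here is a minimal Python sketch that prints the bytes each interpretation of "utf-16" would produce. Python and the sample string `"hi"` are assumptions chosen purely for illustration; the test itself is written against file:write-text.

```python
import codecs

text = "hi"  # hypothetical sample content, not taken from the test suite

# The four byte sequences that "utf-16" could plausibly mean.
variants = {
    # fe ff 00 68 00 69
    "big-endian, with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    # 00 68 00 69
    "big-endian, no BOM": text.encode("utf-16-be"),
    # ff fe 68 00 69 00
    "little-endian, with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    # 68 00 69 00
    "little-endian, no BOM": text.encode("utf-16-le"),
}

for label, data in variants.items():
    print(f"{label:26} {data.hex(' ')}")
```

Only the first variant matches what the test expects, as far as the issue describes it.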

@benibela
Author

Also EXPath-file-appendText3-002

@michaelhkay
Member

There was email correspondence on this subject at the time, see for example

https://lists.w3.org/Archives/Public/public-expath/2012Jul/0005.html

This seemed to reach a level of consensus though I don't think this was well captured in the final spec.

You're free of course to interpret the spec any way you like but we will achieve better interoperability between implementations if implementors respect the test suite as defining a consensus interpretation.

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

Michael Kay
Saxonica

@benibela
Author

There you wrote

  • file:append-text#3 does not write a BOM

contrary to EXPath-file-appendText3-002

Despite a remark on it: <modified by="Christian Grün" on="2013-11-20" change="Alternative without BOM added"/>
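One reason the append case matters can be sketched in Python (an illustration only; it assumes the target file already holds BOM-prefixed big-endian UTF-16, and the helper name `encode_with_bom` is hypothetical). If every append writes its own BOM, the second BOM ends up in the middle of the file and decodes as the character U+FEFF:

```python
import codecs

def encode_with_bom(text: str) -> bytes:
    """Encode as big-endian UTF-16 and prefix a BOM (hypothetical helper)."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Assumed existing file content: written once, with a BOM.
existing = encode_with_bom("ab")

# Appending with a second BOM vs. appending bare big-endian bytes.
appended_with_bom = existing + encode_with_bom("cd")
appended_without_bom = existing + "cd".encode("utf-16-be")

# Python's "utf-16" decoder consumes only the leading BOM; a later
# FE FF pair is decoded as an ordinary U+FEFF character.
with_bom_text = appended_with_bom.decode("utf-16")
without_bom_text = appended_without_bom.decode("utf-16")

print("\ufeff" in with_bom_text)     # stray U+FEFF present?
print("\ufeff" in without_bom_text)
```

This does not settle what the test should expect when the file does not yet exist; it only shows why a BOM on every append is awkward.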

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

But I only wanted to deal with HTML.
There the WHATWG gave a clear definition, "UTF-16" means always little-endian: https://www.w3.org/TR/encoding/#utf-16le

@michaelhkay
Member

Wikipedia article on UTF-16 says

If the BOM is missing, RFC 2781 (https://tools.ietf.org/html/rfc2781) says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

So it looks as if WhatWG are playing their usual game - ignore standards, just endorse the bugs in existing products.

But we're concerned here with writing of text, not reading. All the specs seem to agree that if you're writing, the most important thing is to include a BOM so that the reader knows what the endianness actually is.
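The reader's side of this can be sketched as follows. This is a minimal Python illustration, not any implementation's actual behaviour: the function name `decode_utf16` is hypothetical, and the big-endian fallback follows the RFC 2781 rule quoted above.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, trusting a BOM if present.

    Without a BOM, fall back to big-endian as RFC 2781 recommends.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")  # no BOM: assume big-endian

# With a BOM, either byte order round-trips.
print(decode_utf16(b"\xfe\xff\x00h\x00i"))
print(decode_utf16(b"\xff\xfeh\x00i\x00"))

# Without a BOM, little-endian bytes are silently misread by this reader.
print(decode_utf16(b"h\x00i\x00") == "hi")
```

A reader that assumed little-endian by default would have the mirror-image problem with BOM-less big-endian files, which is why the BOM is what makes the output unambiguous.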

Michael Kay
Saxonica

@michaelhkay
Member

On 26 Oct 2016, at 10:47, Benito van der Zander wrote:

> There you wrote
>
> file:append-text#3 does not write a BOM
> contrary to EXPath-file-appendText3-002

Well, I'm sure my message wasn't the last word on the subject, but it's hard to reconstruct the decisions at this distance.

> Despite a remark on it
>
> While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.
>
> But I only wanted to deal with HTML.
> There the WHATWG gave a clear definition, "UTF-16" means always little-endian: https://www.w3.org/TR/encoding/#utf-16le

Glory be, everything WhatWG does is weird.

Michael Kay
Saxonica

@benibela
Author

All the specs seem to agree that if you're writing, the most important thing is to include a BOM so that the reader knows what the endianness actually is.

There is also JSON. There the BOM is forbidden: https://tools.ietf.org/html/rfc7159#section-8.1

Well I'm sure my message wasn't the last word on the subject but it's hard to reconstruct the decisions at this distance.

It has been a while.

It seems times are changing, and newer standards take a different view.

@michaelhkay
Member

There is also JSON. There the BOM is forbidden: https://tools.ietf.org/html/rfc7159#section-8.1

Actually, not quite. It says that "implementations" shall not add a BOM. It doesn't say what an "implementation" is. With normal separation of concerns the JSON output will be written as characters, and the encoding to UTF-16 will be done by a library that has no idea that the text it is encoding is JSON, and therefore is under no obligation to conform to the JSON specification.
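The separation of concerns described here can be illustrated with a small Python sketch (Python is an assumption for illustration; the sample object is hypothetical). The JSON serializer produces characters only, and the generic "utf-16" codec, which knows nothing about JSON, is the layer that adds the BOM:

```python
import codecs
import json

# Step 1: the JSON layer produces characters, with no encoding involved.
chars = json.dumps({"a": 1})  # hypothetical sample data

# Step 2: a general-purpose codec turns characters into bytes.
# Python's "utf-16" codec prepends a BOM in the platform's native byte order.
data = chars.encode("utf-16")

has_bom = data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)

# Decoding with the same codec strips the BOM again before the JSON
# parser ever sees the text.
roundtrip = json.loads(data.decode("utf-16"))
print(roundtrip == {"a": 1})
```

Whether the combined pipeline counts as an "implementation" in the sense of RFC 7159 is exactly the open question; the sketch only shows that the BOM originates outside the JSON layer.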

Michael Kay
Saxonica
