You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I was patching a fork of the roo gem to fix several xls parsing error and I ran into this weird state, where the characters are encoded in a different format than expected when loading files generated by this gem. I traced it down to write_string and found this nugget:
# Handle utf8 strings
if is_utf8?(str)
str_utf16le = utf8_to_16le(str)
return write_utf16le_string(row, col, str_utf16le, args[3])
end
This causes some cells to be arbitrarily encoded in UTF-16LE, which gets loaded as UTF-8 by the spreadsheet gem. I noticed the same conversion, but to UTF-16BE in chart axis saving. --don't mix the streams--
Then I went to read the write_utf16le_string method to see if it was doing something odd. I did not expect it to convert the string again, but here it calls utf16be_to_16le on an already encoded utf-16le string.
def write_utf16le_string(*args)
# Check for a cell reference in A1 notation and substitute row and column
args = row_col_notation(args)
return -1 if (args.size < 3) # Check the number of args
row, col, str, format = args
# Change from UTF16 big-endian to little endian
str = utf16be_to_16le(str)
write_utf16be_string(row, col, str, format)
end
Which in turn doesn't even use encoding but instead unpacks and repacks the string.
def utf16be_to_16le(utf16be)
utf16be.unpack('n*').pack('v*')
end
And then the write_utf16be_string at the end of write_utf16le_string itself calls utf16be_to_16le to convert it back to 16le. At this point I starting losing track of be's and le's in trying to figure out what the actual encoding becomes.
def utf16be_to_16le(...)
...
# Change from UTF16 big-endian to little endian
str = utf16be_to_16le(str)
WHY does it do this??? Is it a mistake, or something to appease an older version of excel? There's 3 levels of arbitrary re-encoding for utf-8 without clarifying comments and it clearly produces an output that other readers can't correctly interpret and which differs from excel 2010's xls outputs.
The text was updated successfully, but these errors were encountered:
I was patching a fork of the roo gem to fix several xls parsing error and I ran into this weird state, where the characters are encoded in a different format than expected when loading files generated by this gem. I traced it down to
write_string
and found this nugget:This causes some cells to be arbitrarily encoded in UTF-16LE, which gets loaded as UTF-8 by the spreadsheet gem. I noticed the same conversion, but to UTF-16BE in chart axis saving. --don't mix the streams--
Then I went to read the
write_utf16le_string
method to see if it was doing something odd. I did not expect it to convert the string again, but here it callsutf16be_to_16le
on an already encoded utf-16le string.Which in turn doesn't even use encoding but instead unpacks and repacks the string.
And then the
write_utf16be_string
at the end ofwrite_utf16le_string
itself callsutf16be_to_16le
to convert it back to 16le. At this point I starting losing track of be's and le's in trying to figure out what the actual encoding becomes.WHY does it do this??? Is it a mistake, or something to appease an older version of excel? There's 3 levels of arbitrary re-encoding for utf-8 without clarifying comments and it clearly produces an output that other readers can't correctly interpret and which differs from excel 2010's xls outputs.
The text was updated successfully, but these errors were encountered: