UTF-8 Arbitrarily converted to UTF-16LE or UTF-16BE #41

Open
MSeal opened this issue Dec 19, 2014 · 0 comments


MSeal commented Dec 19, 2014

I was patching a fork of the roo gem to fix several xls parsing errors when I ran into a weird state: in files generated by this gem, some characters are encoded differently than expected when the file is loaded. I traced it down to write_string and found this nugget:

    # Handle utf8 strings
    if is_utf8?(str)
      str_utf16le = utf8_to_16le(str)
      return write_utf16le_string(row, col, str_utf16le, args[3])
    end

This causes some cells to be arbitrarily encoded in UTF-16LE, which the spreadsheet gem then loads as if it were UTF-8. I noticed the same kind of conversion, but to UTF-16BE, in the chart axis saving code. --don't mix the streams--
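To illustrate the symptom, here is a minimal sketch using only core String methods. It does not call this gem or the spreadsheet gem; the sample string and variable names are made up for the example.

```ruby
# Illustration only: what happens when UTF-16LE bytes are read as UTF-8.
utf8 = "h\u00e9"                                  # "hé" as a UTF-8 string

# The raw bytes a UTF-16LE conversion would put into the cell.
utf16le_bytes = utf8.encode('UTF-16LE').b

# A reader that assumes UTF-8 just relabels those bytes.
misread = utf16le_bytes.dup.force_encoding('UTF-8')
misread_valid = misread.valid_encoding?

# A reader that knows the bytes are UTF-16LE can decode them properly.
recovered = utf16le_bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')

puts "misread as UTF-8 valid? #{misread_valid}"
puts "decoded as UTF-16LE:    #{recovered}"
```

For ASCII-only text the misread string would still be valid UTF-8, just with NUL bytes interleaved, which may be why only some cells look broken.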

Then I read the write_utf16le_string method to see if it was doing something odd. I did not expect it to convert the string again, but it calls utf16be_to_16le on a string that is already UTF-16LE.

  def write_utf16le_string(*args)
    # Check for a cell reference in A1 notation and substitute row and column
    args = row_col_notation(args)

    return -1 if (args.size < 3)                  # Check the number of args

    row, col, str, format = args

    # Change from UTF16 big-endian to little endian
    str = utf16be_to_16le(str)

    write_utf16be_string(row, col, str, format)
  end

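utf16be_to_16le, in turn, doesn't use Ruby's encoding support at all. It just unpacks the string as big-endian 16-bit values and repacks them as little-endian, i.e. it swaps the two bytes of every 16-bit unit.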

  def utf16be_to_16le(utf16be)
    utf16be.unpack('n*').pack('v*')
  end

And then write_utf16be_string, which is called at the end of write_utf16le_string, itself calls utf16be_to_16le to convert the string back to 16le. At this point I started losing track of BEs and LEs while trying to figure out what the actual encoding ends up being.

  def write_utf16be_string(...)
    ...
    # Change from UTF16 big-endian to little endian
    str = utf16be_to_16le(str)
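To keep track of the byte order, here is a rough trace of only the three conversions quoted in this issue. It assumes utf8_to_16le really yields UTF-16LE bytes and ignores whatever else the real methods do; the lambda and step names are made up for the trace.

```ruby
# Same body as utf16be_to_16le; swaps the bytes of each 16-bit unit.
swap = ->(s) { s.unpack('n*').pack('v*') }

utf8 = "A\u00e9"                            # sample text, "Aé"

# Step 0: write_string converts UTF-8 to UTF-16LE (assumed behaviour of utf8_to_16le).
step0 = utf8.encode('UTF-16LE').b

# Step 1: write_utf16le_string applies utf16be_to_16le to the LE bytes.
step1 = swap.call(step0)

# Step 2: write_utf16be_string applies utf16be_to_16le again.
step2 = swap.call(step1)

puts "after step 1 is UTF-16BE: #{step1 == utf8.encode('UTF-16BE').b}"
puts "after step 2 is UTF-16LE: #{step2 == step0}"
```

If those are the only conversions on this path, the two swaps cancel out and the cell ends up holding UTF-16LE.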

WHY does it do this??? Is it a mistake, or something to appease an older version of Excel? There are three levels of arbitrary re-encoding for UTF-8 input with no clarifying comments, and the result is clearly something other readers can't interpret correctly and that differs from the xls files Excel 2010 writes.
