Incorrect decoding (truncation) of `ByteString`, but not `Text` #91

NorfairKing · 2022-06-21T18:18:25Z

There seems to be a bug in tagsoup: characters that do not fit into Latin-1,
when written as numeric HTML entities, are truncated when parsed as `ByteString`.
Here is an example:

> HTML.parseTags ("&#128512;" :: ByteString)
[TagText "\NUL"]
> HTML.parseTags ("&#128512;" :: Text)
[TagText "\128512"]


NorfairKing · 2022-06-21T18:24:19Z

Here is the bug:

tagsoup/src/Text/StringLike.hs

Line 64 in 44c32d6

fromChar = BS.singleton

Which uses

https://github.com/haskell/bytestring/blob/4e62154aca912e0154f3bdbaa14e9ff448c2d85e/Data/ByteString/Char8.hs#L295

and that function truncates because it uses:

-- | Unsafe conversion between 'Char' and 'Word8'. This is a no-op and
-- silently truncates to 8 bits Chars > '\255'. It is provided as
-- convenience for ByteString construction.
c2w :: Char -> Word8
c2w = fromIntegral . ord
{-# INLINE c2w #-}

https://github.com/haskell/bytestring/blob/4e62154aca912e0154f3bdbaa14e9ff448c2d85e/Data/ByteString/Internal.hs#L797-L802
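To make the truncation concrete, here is a minimal sketch that reproduces it outside of tagsoup. The helper name `truncateChar` is made up for illustration; it only mirrors the `fromIntegral . ord` definition of `c2w` quoted above. The code point 128512 is 0x1F600, whose low byte is 0x00, which would explain the `"\NUL"` in the report.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper with the same definition as bytestring's c2w:
-- converting Int to Word8 keeps only the low 8 bits of the code point.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

main :: IO ()
main = do
  let c = '\128512'                     -- U+1F600, i.e. 0x1F600
  print (ord c `mod` 256)               -- low byte of the code point
  print (truncateChar c)                -- same value, as a Word8
  print (BS.unpack (BS8.singleton c))   -- the byte actually stored
```

Any character whose code point is a multiple of 256 ends up as a NUL byte this way; other characters above `'\255'` silently turn into unrelated Latin-1 bytes.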

ndmitchell · 2022-07-01T14:15:18Z

Thanks for the report - agreed that isn't ideal. What would you expect HTML.parseTags ("😀" :: ByteString) to do? There aren't a huge number of options if you want a text string, and have asked for ByteString as your pieces.

NorfairKing · 2022-07-01T15:15:05Z

> What would you expect HTML.parseTags ("😀" :: ByteString) to do?

I've thought about this a lot.
I'm not sure, but I'd be ok with a UTF-8 encoding of "\128512" 🤷 or for the instance to be removed...

EDIT: Lol, GitHub displays that as an emoji.
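For reference, a rough sketch of what the UTF-8 option would produce. `fromCharUtf8` is a hypothetical name and not part of tagsoup; it goes through `Data.Text.Encoding.encodeUtf8` from the text package as one possible way to encode a single `Char`.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical alternative to BS.singleton: emit the UTF-8 bytes of the
-- character instead of truncating its code point to 8 bits.
fromCharUtf8 :: Char -> BS.ByteString
fromCharUtf8 = TE.encodeUtf8 . T.singleton

main :: IO ()
main = print (BS.unpack (fromCharUtf8 '\128512'))  -- [240,159,152,128]
```

Under this scheme one `Char` can become up to four bytes, so the resulting `ByteString` only makes sense to consumers that also assume UTF-8.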

ndmitchell · 2022-07-01T16:06:06Z

Shoving UTF-8 into a bytestring seems like you are treating the type as a different type (having such a type widely used in Haskell would be great, and maybe with text it will be one day). Removing the instance breaks it for people who want to use that, knowing its limitations - I probably wouldn't add such an instance today, but removing it seems too far. Documenting the caveats seems a good idea regardless though.

NorfairKing · 2022-07-01T16:28:20Z

> Documenting the caveats seems a good idea regardless though.

👍 I still mostly consume Data.ByteString.Lazy but have to decode it before calling parseTags...
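The decode-first workaround could look roughly like the sketch below. It assumes the input bytes are UTF-8, which the thread does not state, and `parseTagsUtf8` is a made-up helper name. Invalid byte sequences are replaced rather than raising, via `lenientDecode`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)
import Text.HTML.TagSoup (Tag (..), parseTags)

-- Hypothetical wrapper: decode the lazy ByteString to strict Text first
-- (assuming UTF-8 input), then parse, so entities above '\255' survive.
parseTagsUtf8 :: LBS.ByteString -> [Tag T.Text]
parseTagsUtf8 = parseTags . TE.decodeUtf8With lenientDecode . LBS.toStrict

main :: IO ()
main = print (parseTagsUtf8 "<p>&#128512;</p>")
```

This keeps the `ByteString` instance out of the picture entirely, at the cost of materialising the whole document as strict `Text` before parsing.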

ndmitchell added a commit that referenced this issue Sep 19, 2022

#91, note that StringLike on ByteString is not full fidelity

fc86451
