Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect decoding (truncation) of ByteString, but not Text #91

Open
NorfairKing opened this issue Jun 21, 2022 · 5 comments
Open

Incorrect decoding (truncation) of ByteString, but not Text #91

NorfairKing opened this issue Jun 21, 2022 · 5 comments

Comments

@NorfairKing
Copy link

There seems to be a bug in tagsoup, that makes it so that characters that do not fit
into latin1, when utf8-encoded as HTML entities, are truncated when parsed.
Here is an example:

> HTML.parseTags ("😀" :: ByteString)
[TagText "\NUL"]
> HTML.parseTags ("😀" :: Text)
[TagText "\128512"]
@NorfairKing
Copy link
Author

Here is the bug:

fromChar = BS.singleton

Which uses

https://github.com/haskell/bytestring/blob/4e62154aca912e0154f3bdbaa14e9ff448c2d85e/Data/ByteString/Char8.hs#L295

and that function truncates because it uses:

-- | Unsafe conversion between 'Char' and 'Word8'. This is a no-op and
-- silently truncates to 8 bits Chars > '\255'. It is provided as
-- convenience for ByteString construction.
c2w :: Char -> Word8
c2w = fromIntegral . ord
{-# INLINE c2w #-}

https://github.com/haskell/bytestring/blob/4e62154aca912e0154f3bdbaa14e9ff448c2d85e/Data/ByteString/Internal.hs#L797-L802

@ndmitchell
Copy link
Owner

ndmitchell commented Jul 1, 2022

Thanks for the report - agreed that isn't ideal. What would you expect HTML.parseTags ("😀" :: ByteString) to do? There aren't a huge number of options if you want a text string, and have asked for ByteString as your pieces.

@NorfairKing
Copy link
Author

NorfairKing commented Jul 1, 2022

What would you expect HTML.parseTags ("😀" :: ByteString) to do?

I've thought about this a lot.
I'm not sure, but I'd be ok with a UTF-8 encoding of "\128512"🤷 or for the instance to be removed...

EDIT: Lol, github displays that as an emoji.

@ndmitchell
Copy link
Owner

Shoving UTF8 into a bytestring seems like you are treating the type as a different type (having such a type widely used in Haskell would be great, and maybe with text it will be one day). Removing the instance breaks it for people who want to use that, knowing its limitations - I probably wouldn't add such a instance today, but removing it seems too far. Documenting the caveats seems a good idea regardless though.

@NorfairKing
Copy link
Author

NorfairKing commented Jul 1, 2022

Documenting the caveats seems a good idea regardless though.

👍 I still mostly consume Data.ByteString.Lazy but have to decode it before calling parseTags...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants