Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SString::AppendUTF8 converts 'this' to UTF16 if it is non-ascii #76581

Open
jakobbotsch opened this issue Oct 4, 2022 · 10 comments
Open

SString::AppendUTF8 converts 'this' to UTF16 if it is non-ascii #76581

jakobbotsch opened this issue Oct 4, 2022 · 10 comments
Assignees
Milestone

Comments

@jakobbotsch
Copy link
Member

I was trying to use some of the UTF8 string handling over in #76505 and noticed that AppendUTF8 in turn calls the Append(SString&) overload. This overload calls End() that converts non-ascii UTF8 back to Unicode. This behavior seems quite odd, I would expect appending an UTF8 string to an UTF8 string to stay in UTF8.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 4, 2022
@ghost
Copy link

ghost commented Oct 4, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

Issue Details

I was trying to use some of the UTF8 string handling over in #76505 and noticed that AppendUTF8 in turn calls the Append(SString&) overload. This overload calls End() that converts non-ascii UTF8 back to Unicode. This behavior seems quite odd, I would expect appending an UTF8 string to an UTF8 string to stay in UTF8.

Author: jakobbotsch
Assignees: -
Labels:

area-System.Text.Json, untriaged

Milestone: -

@jakobbotsch jakobbotsch added area-VM-coreclr and removed area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Oct 4, 2022
@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Oct 4, 2022

This is by-design. The preferred encoding for SString is UTF-16 and it is sticky if I recall. Starting with UTF8 is what you likely want here. Also, at the end you can always convert the string fully to UTF8 by calling GetUTF8().

Can you provide an example on precisely what you want and how the SString was instantiated?

@AaronRobinsonMSFT AaronRobinsonMSFT added the question Answer questions and provide assistance, not an issue with source code or documentation. label Oct 4, 2022
@AaronRobinsonMSFT AaronRobinsonMSFT added this to the Future milestone Oct 4, 2022
@jkotas
Copy link
Member

jkotas commented Oct 4, 2022

The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.

When I have UTF-8 string and trying to append another UTF-8 string to it, the result should stay as UTF-8. There is no reason for the result to get converted to UTF-16 (what is happening here).

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Oct 4, 2022

Can you provide an example on precisely what you want and how the SString was instantiated?

I was trying to switch CEEInfo::appendClassName to UTF8, and my initial straightforward/naive implementations slowed down the JIT by an unexpected amount in checked builds (~25% more time taken to JIT SPC with PMI.dll on x86). When I stepped in I noticed this particular behavior for the UTF8-UTF8 string case, but I was not hitting this, I don't think there is an issue if the UTF8 strings stay within ASCII.

These are the versions I tried:

Version 1 just used TypeString and SString::GetUTF8(). The GetUTF8() here was very hot, 25% of the time in the profile I captured was spent inside that.

For version 2 I tried to avoid TypeString after discussing with @jkotas, since we thought it was related to the fact that it builds the string as UTF16 and thus incurs some unnecessary conversions. As you can see, I just instantiated it with StackSString ss;, maybe I should have used something else here?

For version 3 I avoided SString and used an append lambda function to copy directly into the buffer argument. This should avoid (at least) one copy, so it should be faster.

Calling these three versions of appendClassName in a loop from JIT 100000 times with the System.Runtime.CompilerServices.CastHelpers class as an argument gives the following results on x86:
Version 1: 2391 ms
Version 2: 1656 ms
Version 3: 172 ms

It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8 path ends up creating quite a few temporary copies.

@AaronRobinsonMSFT
Copy link
Member

The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.

I agree with this. If I recall the issue is that SString, by default, starts out as "Unicode". By this, I mean if the default constructor is used it will be in "Unicode" mode and when calling an AppendUTF8, it is appending to "Unicode" and therefore by design. If the constructor for specifying UTF8 is used, then the AppendUTF8 should behave as described. This is what I recall seeing. The pattern I was thinking of is SString str{SSttring::Utf8, ""};

It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8 path ends up creating quite a few temporary copies.

On this we completely agree. I very much dislike the SString as is and it is one of my longer term goals to slowly improve the API so it is more performant and intuitive.

@jkotas
Copy link
Member

jkotas commented Oct 4, 2022

I agree with this. If I recall the issue is that SString, by default, starts out as "Unicode".

SString starts as REPRESENTATION_EMPTY without specific encoding.

@jakobbotsch
Copy link
Member Author

If the constructor for specifying UTF8 is used, then the AppendUTF8 should behave as described.

Well, this is what I was reporting -- it doesn't seem to. It calls Append(SString&) and that function calls End() which converts multi-byte 'this' into UTF16. For example:

SString str(SString::Utf8, "Planlægnings");
str.AppendUTF8("periode"); // 'str' is converted to UTF16 inside this call, before starting to append "periode"

@AaronRobinsonMSFT
Copy link
Member

@jakobbotsch I see the issue now. This is clearly unintuitive and I agree with your expectations here. Based on how this was implemented I get the why, but it shouldn't be this way. I'll add it to my list of things to fix up.

@AaronRobinsonMSFT AaronRobinsonMSFT removed the question Answer questions and provide assistance, not an issue with source code or documentation. label Oct 4, 2022
@AaronRobinsonMSFT AaronRobinsonMSFT modified the milestones: Future, 8.0.0 Oct 4, 2022
@mangod9
Copy link
Member

mangod9 commented Jul 3, 2023

@AaronRobinsonMSFT can this be closed based on #76571, or is there more work required here?

@AaronRobinsonMSFT
Copy link
Member

I need to get back to this. I don't believe it is fixed yet.

@AaronRobinsonMSFT AaronRobinsonMSFT modified the milestones: 8.0.0, Future Jul 3, 2023
@AaronRobinsonMSFT AaronRobinsonMSFT self-assigned this Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants