-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Wide Strings in SOCI for Enhanced Unicode Handling #1133
Conversation
Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known. I'm thinking of adding another argument to "soci::use()" that lets the developer override the data type that's used for the underlying ODBC call. Another issue is the currently non-existing N'' enclosure for unicode strings for MSSQL in case of soci::use(). Another issue is the stream interface. Currently std::wstring isn't supported and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This globally looks good but there are globally 2 issues:
- The new functionality needs to be documented, notably it should be clearly stated that
wstring
andwchar_t
are only supported in the ODBC backend (and only when using SQL Server?). - The use of/checks for C++17 are confusing as it's not clear if it is required for wide char support or if it's just some kind of optimization (in the latter case I'd drop it, it's not worth the extra code complexity).
…hub.com/ORDIS-Co-Ltd/soci into wstring_support_with_unicode_conversion
This commit updates the Unicode conversion functions to handle UTF-16 on Windows and UTF-32 on other platforms. The changes include: 1. Updating the `utf8_to_wide` and `wide_to_utf8` functions to handle UTF-32 on Unix/Linux platforms. 2. Updating the `copy_from_string` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms. 3. Updating the `bind_by_pos` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms. 4. Adding a test case for wide strings in the ODBC MSSQL tests.
Thanks, I'm doing some final checks and I realize I have some more questions:
What do you think? |
Not sure if it's really needed. There was a char, so I added a wchar_t. 🤷🏻♂️
I would have to look into that. But I can't promise when I can get to that. I have a full plate at the moment. 😕
Good point. I will do that today, as it doesn't take long.
I fully agree. But at the time, I didn't want to make this a larger PR than it already is (and I expected it to be). It's quite a lot of fun to get it all right on all 3 platforms, as you know. |
Thanks, I can try to do (2) and do (3) myself, I just wanted to check if there was any reason not to. As for 4.1, I wanted to make it a.s.a.p. just because I feel like I've already delayed it way too much... |
Okay, then I will try to see if I can get 4 done using an LLM. I fed the relevant sources and the diff to 2 LLMs and one of the responses look very good for mysql. I will look into this later tonight. |
Concerning (2): after looking at it again, I think this We want using |
So we don't have to convert them before using them. Naturally, there are always pros and cons. |
Sorry, I still don't understand: under which circumstances would you have wide strings as part of the query instead of using them as parameters? Anyhow, I've realized there is a worse problem here: std::string str_out_utf8;
sql << "select wide_text from soci_test", into(str_out_utf8); doesn't work correctly, the existing "MS SQL wide string" test only passes because it uses ASCII strings. Changing them to this: std::wstring const str_in = L"Привет, SOCI!";
std::string const str_in_utf8 = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, SOCI!"; makes it fail, at least when using unixODBC SQL Server driver under Linux. |
But actually I don't see anything that would even attempt to make this work, so I think it's just the test which is wrong and needs to be removed. I.e. there is no provision for retrieving |
Also, while doing (3) I've realized that this code looks like something extracted from another project as it's not written in SOCI style. What is its copyright/licence and can we include it in SOCI at all? |
I've pushed my changes to #1179, but I still have important questions about the origin of the Unicode-related functions code. |
That is likely because of the influence that OpenAI's, Anthropic's and Alibaba's LLMs had on the code. 😉 |
I ran the code through copyleaks. |
Good catch. I think this was the first test I added and at the time it did not contain unicode strings yet, as I focussed only on wstring at first. So, I think this is an issue that should be fixed. I will look into this tomorrow (it's 1:38 here now). |
OK, thanks, I think we could merge #1179 then and enhance it further in other PRs because I really believe that the other problem is just due to support for interoperability between FYI, I did change/fix a couple of things in the Unicode functions (calling a void function |
Thank you for the fixes. |
Sorry, it's worse than I thought: it's broken under Windows when using native ODBC driver, see e.g. this build job. If even this doesn't work, we really can't merge it. |
Hmm... the tests did pass before. I am not sure what the cause is. |
They passed before because they used 7 bit ASCII strings. I've modified them to use non-ASCII ones and checked that this still worked under Linux, but not under Windows -- where they fail. |
I see. |
It's actually a limitation of the Microsoft compiler. A better explanation follows within the next 30 minutes. |
When using wide string literals (L"…") in C++ on Windows, particularly with non-ASCII characters like Cyrillic, it's important to understand how these are interpreted:
leads to:
leads to: This is also the reason I used this method in the unicode tests. Here are the updated versions of the test methods: TEST_CASE("MS SQL wide string", "[odbc][mssql][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
// std::wstring const str_in = L"Привет, SOCI!";
std::wstring const str_in = L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!";
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::wstring str_out;
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out == str_in);
}
TEST_CASE("MS SQL wide string vector", "[odbc][mssql][vector][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
std::vector<std::wstring> const str_in = {
L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!", // Привет, SOCI!
L"\u041F\u0440\u0438\u0432\u0435\u0442, World!", // Привет, World!
L"\u041F\u0440\u0438\u0432\u0435\u0442, Universe!", // Привет, Universe!
L"\u041F\u0440\u0438\u0432\u0435\u0442, Galaxy!"}; // Привет, Galaxy!
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::vector<std::wstring> str_out(4);
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out.size() == str_in.size());
for (std::size_t i = 0; i != str_in.size(); ++i)
{
CHECK(str_out[i] == str_in[i]);
}
} Result: |
Strange, I thought we already used But we can use |
I remember that I have had lots of trouble before on Windows because of this issue. |
We're using Generally, the following should work as well: std::u16string const utf16 = u"Привет, SOCI!";
std::wstring str_in(utf16.begin(), utf16.end()); but it doesn't with MSVC. I created another test (scroll right for comments): TEST_CASE("MS SQL utf8-to-utf16 string", "[odbc][mssql][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
std::string const utf8 = "Привет, SOCI!"; // works
//std::string const utf8 = u8"Привет, SOCI!"; // doesn't work
// std::string const utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\x53\x4F\x43\x49\x21"; // works
std::u16string utf16 = soci::details::utf8_to_utf16(utf8);
std::wstring str_in(utf16.begin(), utf16.end());
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::wstring str_out;
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out == str_in);
} |
OK, I figured it out: the tests in Also, the fact that scalar ODBC wstring test and the first 2 checks in the vector test pass is due to the fact that the column size is 40 and when interpreting the file as CP1252 and not UTF-8 (as MSVC did), the last 2 strings became longer than 40 characters, while the first 2 ones still fit. Finally, while debugging this, I've reread the code more carefully and I see many redundant comments and other things which don't make sense (why have |
It's really not great that the more I look at this code, the more problems I find in it :-( Dividing the column size by |
Great that you were able to figure this out. I wasn't able to back then. And due to posts on numerous websites (StackOverflow and others) suggesting an issue with MSVC while the builds with GCC and Clang worked, I ended up believing that it's really MSVC. I took a wrong turn when I should have dug deeper. Clearly a mistake by me. 🤷🏻♂️ The I'm much more aware of (and experienced with) the issues that come with LLM generated code than I was when most of this code was created. |
I've removed most of unnecessary LLM-generated stuff and made conversions more efficient to avoid copying potentially huge strings multiple times. There is still a lot of completely unused code remaining, notably conversions from UTF-8 and also one of UTF-16/32 conversions to it. I'm not sure if it should be removed or preserved to be used later for I'm also closing this PR and will merge #1179 if nobody sees any other problems with it. |
This pull request adds comprehensive support for wide strings (
wchar_t
,std::wstring
) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server'sNVARCHAR
andNTEXT
. This enhancement is crucial for applications that require robust handling of international and multi-language data.Key Changes:
Introduced
exchange_type_traits
andexchange_traits
Specializations:Updated ODBC Backend:
wchar_t
andstd::wstring
.Enhanced Buffer Management:
Improved Unicode Support:
Extended Test Coverage:
Notes:
This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.
Example usage
Here are a few examples showing how the new wide string features can be used with the ODBC backend.
Example 1: Handling
std::wstring
in SQL QueriesInserting and Selecting
std::wstring
DataExample 2: Working with
wchar_t
VectorsInserting and Selecting Wide Characters
Example 3: Using
std::wstring
with thesql
Stream OperatorInserting and Selecting
std::wstring
Data with Stream OperatorIn this example:
soci::session
object is created to connect to the database.NVARCHAR
column.std::wstring
is defined for insertion.sql
stream operator is used to insert thestd::wstring
into the database. Note the use ofN'
to indicate a Unicode string in SQL Server.std::wstring
is retrieved from the database using thesql
stream operator and thesoci::into
function.std::wcout
.These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (
wchar_t
,std::wstring
).Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated separately instead of being combined with their base characters. This limitation may result in incorrect conversions for strings containing combining characters. A potential solution would be to incorporate Unicode normalization before the conversion process to ensure that combining characters are properly combined with their base characters.
Unicode defines several normalization forms (e.g., NFC, NFD, NFKC, NFKD), each with its own set of rules and behaviors. Choosing the appropriate normalization form is crucial, as different forms may produce different results.
To have full Unicode support, linking against a library like ICU or iconv is necessary. It can be made optional.
Disclaimer: This text is partially AI generated.