Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange issue with strange characters on linux #86

Open
aersam opened this issue Apr 3, 2024 · 18 comments
Open

Strange issue with strange characters on linux #86

aersam opened this issue Apr 3, 2024 · 18 comments

Comments

@aersam
Copy link

aersam commented Apr 3, 2024

with this code:

from arrow_odbc import read_arrow_batches_from_odbc
         reader = read_arrow_batches_from_odbc(
            query=sql,
            connection_string=self.connection_string,
            max_binary_size=20000,
            max_text_size=20000,
        )
        print(sql)
        print(reader.schema)

I get this output:


SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]

User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool

Please note the name of the [time_stämp] column in the schema

@aersam
Copy link
Author

aersam commented Apr 3, 2024

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

@pacman82
Copy link
Owner

pacman82 commented Apr 3, 2024

This does not happen on Windows because on windows arrow-odbc-py is compiled to use UTF-16 encoding to talk with the Database. On Linux and OS-X however the system local is usually UTF-8 and therefore arrow-odbc-py assumes the narrow encoding returned from the database is UTF-8. However, if your System local is configured not to be UTF-8 this assumption is wrong.

So my first piece of advice would be to check whether your System local is UTF-8. Other narrow ASCII encodings are currently not supported by arrow-odbc-py.

@aersam
Copy link
Author

aersam commented Apr 3, 2024

Ok, I'll try. For completness, here's the link to my failing gh action: https://github.com/bmsuisse/odbc2deltalake/actions/runs/8538422686

@pacman82
Copy link
Owner

pacman82 commented Apr 3, 2024

Looking at this it is more likely that somehow binary and character size is confused. I just do not know there. Yet, likely the system local is innocent. Would you kindly enable debug logging and execute again. In the beginning it should log the database column names and their type, even before conversion to arrow. I would be interested in seeing that.

@pacman82
Copy link
Owner

pacman82 commented Apr 3, 2024

@aersam
Copy link
Author

aersam commented Apr 3, 2024

Here we go:

DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x55c451f30050'
WARN State: 01S00, Native error: 0, Message: [Microsoft][ODBC Driver 18 for SQL Server]Invalid connection string attribute
WARN State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed database context to 'db_to_delta_test'.
WARN State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed language setting to us_english.
DEBUG Database managment system name as reported by ODBC: Microsoft SQL Server
DEBUG ODBC driver reported for column 0. Relational type: BigInt; Nullability: NoNulls; Name: 'User_-_iD';
DEBUG ODBC driver reported for column 1. Relational type: Varchar { length: Some(100) }; Nullability: Nullable; Name: 'FirstName';
DEBUG ODBC driver reported for column 2. Relational type: WVarchar { length: None }; Nullability: Nullable; Name: 'LastName';
DEBUG ODBC driver reported for column 3. Relational type: Decimal { precision: 15, scale: 3 }; Nullability: Nullable; Name: 'Age';
DEBUG ODBC driver reported for column 4. Relational type: Varchar { length: Some(10) }; Nullability: NoNulls; Name: 'companyid';
DEBUG ODBC driver reported for column 5. Relational type: BigInt; Nullability: Nullable; Name: 'time_stäm';
DEBUG ODBC driver reported for column 6. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: '__timestamp';
DEBUG ODBC driver reported for column 7. Relational type: Bit; Nullability: Nullable; Name: '__is_deleted';
DEBUG ODBC driver reported for column 8. Relational type: Bit; Nullability: Nullable; Name: '__is_full_load';
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '2' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '2' reported 100.
DEBUG Relational type of column 1: Varchar { length: Some(100) }
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '3' reported -9.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG Relational type of column 2: WVarchar { length: None }
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '5' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '5' reported 10.
DEBUG Relational type of column 4: Varchar { length: Some(10) }
INFO Column 'User_-_iD'
Bytes used per row: 8
INFO Column 'FirstName'
Bytes used per row: 409
INFO Column 'LastName'
Bytes used per row: 20009
INFO Column 'Age'
Bytes used per row: 26
INFO Column 'companyid'
Bytes used per row: 49
INFO Column 'time_stäm'
Bytes used per row: 16
INFO Column '__timestamp'
Bytes used per row: 24
INFO Column '__is_deleted'
Bytes used per row: 9
INFO Column '__is_full_load'
Bytes used per row: 9
INFO Total memory usage per row for single transit buffer: 20559
SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]
User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool
DEBUG SQLFreeHandle dropped 0x55c451ef1b20 of type Stmt.
DEBUG SQLFreeHandle dropped 0x55c451f30050 of type Dbc.

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

I could reproduce the Bug.

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

Actually MSFT has one of the more solid ODBC drivers I would say. I also admire many of the things they achieved in terms of engineering. Yet this time, I think it is on them. It is not a configuration issue of the client local. The umlaut ä is rendered correctly. Yet the string is one letter short. My guess is this has to do with the fact that ä is two bytes in UTF-8, yet the text length is returned in characters not bytes. Anyhow this bug needs to be fixed by Microsoft.

Windows version of of arrow-odbc-py is not affected, because on Windows I choose to use the wide UTF-16 version of the ODBC interface. I am happy to help writing and detailing the issue (however I also suspect that a knowledgeable MS employee would be a ble to reproduce it by looking at this thread alone), but I can not figure out how to report a bug to MS.

Best, Markus

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

As a workaround I'll give it a try to supply larger buffers than requested by the drivers ...

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

To be fair to Microsoft, the function in question SQLDescribeCol is document to return the length in characters not bytes. https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqldescribecol-function?view=sql-server-ver16. Yet it is clearly intended to determine buffer lengths (usually expressed in bytes). Driver may be "fine" and the ODBC standard is broken here. Well, anyhow, seems in the narrow case I'll start multiplying its values by 3 and for wide code by 2, this should put it all on the safe side even in worst case scenarios. I need to fix this upstream in odbc-api though.

@aersam
Copy link
Author

aersam commented Apr 4, 2024

Hm, interesting. Would you think I can reproduce this behavior if I write it in C#/.Net Core using MSFT-only libs?

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

Maybe not, depends if ADO relies on ODBC. Also relies on how wasteful authors of these libs are then using memory. I could avoid a lot of trouble if I would just allocate 4kb of buffer for each column name you wanna now. In that case you the name would need to be very long and contain special characters.

A minimal reproducing example in plain C, maybe based on an example for SQLDescribeCol would be feasable though. Just watch the string length output argument for a column name whose name differ in byte and character length.

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

NameLengthPtr
[Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in *ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in *ColumnName is truncated to BufferLength minus the length of a null-termination character.

This formulation alone in the documents is problematic

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

It confuses binary buffer length and character length. Very likely related to the bug in question.

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

Definitly this. If you want to see it fail under windows too, you can use a character in the column name which is 4 Bytes in UTF-16 instead of two. E.g. 𐐏. Also MariaDB has the same bug; PostgreSQL is fine though.

@pacman82
Copy link
Owner

pacman82 commented Apr 4, 2024

Sadly allocating larger buffers is not a suitable workaround. The drivers only copy the bytes in the length of character data in the application provided buffers. The "good" news is that this makes the bug more relevant to fix.

Sorry, I cannot help here. I really tried.

@aersam
Copy link
Author

aersam commented Apr 5, 2024

Thank you for all your efforts! I'll try creating a repo in C# and getting this to the correct guys at Redmond - it's an interesting bug, after all 🙂

@pacman82
Copy link
Owner

pacman82 commented Apr 5, 2024

Probably not even needed, just stating that SQLDescribeCol truncates column names those binary length exceeds the character length, is a precise description, which should enable them to reproduce.

@pacman82 pacman82 closed this as completed Apr 5, 2024
@pacman82 pacman82 reopened this Apr 5, 2024
@pacman82
Copy link
Owner

pacman82 commented Apr 5, 2024

Remark: Using SQLColAttribute instead of SQLDescribeCol might work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants