Strange issue with strange characters on linux #86

aersam · 2024-04-03T12:39:55Z

with this code:

from arrow_odbc import read_arrow_batches_from_odbc
         reader = read_arrow_batches_from_odbc(
            query=sql,
            connection_string=self.connection_string,
            max_binary_size=20000,
            max_text_size=20000,
        )
        print(sql)
        print(reader.schema)

I get this output:


SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]

User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool

Please note the name of the [time_stämp] column in the schema

The text was updated successfully, but these errors were encountered:

aersam · 2024-04-03T12:43:11Z

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

pacman82 · 2024-04-03T12:57:38Z

This does not happen on Windows because on windows arrow-odbc-py is compiled to use UTF-16 encoding to talk with the Database. On Linux and OS-X however the system local is usually UTF-8 and therefore arrow-odbc-py assumes the narrow encoding returned from the database is UTF-8. However, if your System local is configured not to be UTF-8 this assumption is wrong.

So my first piece of advice would be to check whether your System local is UTF-8. Other narrow ASCII encodings are currently not supported by arrow-odbc-py.

aersam · 2024-04-03T13:04:58Z

Ok, I'll try. For completness, here's the link to my failing gh action: https://github.com/bmsuisse/odbc2deltalake/actions/runs/8538422686

pacman82 · 2024-04-03T13:10:41Z

Looking at this it is more likely that somehow binary and character size is confused. I just do not know there. Yet, likely the system local is innocent. Would you kindly enable debug logging and execute again. In the beginning it should log the database column names and their type, even before conversion to arrow. I would be interested in seeing that.

pacman82 · 2024-04-03T13:11:47Z

You can log to standard error using: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.log_to_stderr

aersam · 2024-04-03T14:20:49Z

Here we go:

DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x55c451f30050'
WARN State: 01S00, Native error: 0, Message: [Microsoft][ODBC Driver 18 for SQL Server]Invalid connection string attribute
WARN State: 01000, Native error: 5701, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed database context to 'db_to_delta_test'.
WARN State: 01000, Native error: 5703, Message: [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Changed language setting to us_english.
DEBUG Database managment system name as reported by ODBC: Microsoft SQL Server
DEBUG ODBC driver reported for column 0. Relational type: BigInt; Nullability: NoNulls; Name: 'User_-_iD';
DEBUG ODBC driver reported for column 1. Relational type: Varchar { length: Some(100) }; Nullability: Nullable; Name: 'FirstName';
DEBUG ODBC driver reported for column 2. Relational type: WVarchar { length: None }; Nullability: Nullable; Name: 'LastName';
DEBUG ODBC driver reported for column 3. Relational type: Decimal { precision: 15, scale: 3 }; Nullability: Nullable; Name: 'Age';
DEBUG ODBC driver reported for column 4. Relational type: Varchar { length: Some(10) }; Nullability: NoNulls; Name: 'companyid';
DEBUG ODBC driver reported for column 5. Relational type: BigInt; Nullability: Nullable; Name: 'time_stäm';
DEBUG ODBC driver reported for column 6. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: '__timestamp';
DEBUG ODBC driver reported for column 7. Relational type: Bit; Nullability: Nullable; Name: '__is_deleted';
DEBUG ODBC driver reported for column 8. Relational type: Bit; Nullability: Nullable; Name: '__is_full_load';
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '2' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '2' reported 100.
DEBUG Relational type of column 1: Varchar { length: Some(100) }
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '3' reported -9.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG Relational type of column 2: WVarchar { length: None }
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '3' reported 0.
DEBUG SQLColAttribute called with attribute 'ConciseType' for column '5' reported 12.
DEBUG SQLColAttribute called with attribute 'DisplaySize' for column '5' reported 10.
DEBUG Relational type of column 4: Varchar { length: Some(10) }
INFO Column 'User_-_iD'
Bytes used per row: 8
INFO Column 'FirstName'
Bytes used per row: 409
INFO Column 'LastName'
Bytes used per row: 20009
INFO Column 'Age'
Bytes used per row: 26
INFO Column 'companyid'
Bytes used per row: 49
INFO Column 'time_stäm'
Bytes used per row: 16
INFO Column '__timestamp'
Bytes used per row: 24
INFO Column '__is_deleted'
Bytes used per row: 9
INFO Column '__is_full_load'
Bytes used per row: 9
INFO Total memory usage per row for single transit buffer: 20559
SELECT [User - iD] AS [User_-_iD], [FirstName] AS [FirstName], [LastName] AS [LastName], [Age] AS [Age], [companyid] AS [companyid], CAST([time stämp] AS BIGINT) AS [time_stämp], CAST(GETUTCDATE() AS datetime2(6)) AS __timestamp, CAST(0 AS BIT) AS __is_deleted, CAST(1 AS BIT) AS __is_full_load FROM [dbo].[user]
User_-_iD: int64 not null
FirstName: string
LastName: string
Age: decimal128(15, 3)
companyid: string not null
time_stäm: int64
__timestamp: timestamp[us]
__is_deleted: bool
__is_full_load: bool
DEBUG SQLFreeHandle dropped 0x55c451ef1b20 of type Stmt.
DEBUG SQLFreeHandle dropped 0x55c451f30050 of type Dbc.

pacman82 · 2024-04-04T15:58:41Z

I could reproduce the Bug.

Only happens on Linux. Could also be a bug with MSFT's ODBC Driver, but I guess MSFT does not do bugs 😉

Actually MSFT has one of the more solid ODBC drivers I would say. I also admire many of the things they achieved in terms of engineering. Yet this time, I think it is on them. It is not a configuration issue of the client local. The umlaut ä is rendered correctly. Yet the string is one letter short. My guess is this has to do with the fact that ä is two bytes in UTF-8, yet the text length is returned in characters not bytes. Anyhow this bug needs to be fixed by Microsoft.

Windows version of of arrow-odbc-py is not affected, because on Windows I choose to use the wide UTF-16 version of the ODBC interface. I am happy to help writing and detailing the issue (however I also suspect that a knowledgeable MS employee would be a ble to reproduce it by looking at this thread alone), but I can not figure out how to report a bug to MS.

Best, Markus

pacman82 · 2024-04-04T16:04:20Z

As a workaround I'll give it a try to supply larger buffers than requested by the drivers ...

pacman82 · 2024-04-04T16:11:23Z

To be fair to Microsoft, the function in question SQLDescribeCol is document to return the length in characters not bytes. https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqldescribecol-function?view=sql-server-ver16. Yet it is clearly intended to determine buffer lengths (usually expressed in bytes). Driver may be "fine" and the ODBC standard is broken here. Well, anyhow, seems in the narrow case I'll start multiplying its values by 3 and for wide code by 2, this should put it all on the safe side even in worst case scenarios. I need to fix this upstream in odbc-api though.

aersam · 2024-04-04T18:16:29Z

Hm, interesting. Would you think I can reproduce this behavior if I write it in C#/.Net Core using MSFT-only libs?

pacman82 · 2024-04-04T19:27:11Z

Maybe not, depends if ADO relies on ODBC. Also relies on how wasteful authors of these libs are then using memory. I could avoid a lot of trouble if I would just allocate 4kb of buffer for each column name you wanna now. In that case you the name would need to be very long and contain special characters.

A minimal reproducing example in plain C, maybe based on an example for SQLDescribeCol would be feasable though. Just watch the string length output argument for a column name whose name differ in byte and character length.

pacman82 · 2024-04-04T19:28:36Z

NameLengthPtr
[Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in *ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in *ColumnName is truncated to BufferLength minus the length of a null-termination character.

This formulation alone in the documents is problematic

pacman82 · 2024-04-04T19:29:35Z

It confuses binary buffer length and character length. Very likely related to the bug in question.

pacman82 · 2024-04-04T19:46:44Z

Definitly this. If you want to see it fail under windows too, you can use a character in the column name which is 4 Bytes in UTF-16 instead of two. E.g. 𐐏. Also MariaDB has the same bug; PostgreSQL is fine though.

pacman82 · 2024-04-04T19:58:15Z

Sadly allocating larger buffers is not a suitable workaround. The drivers only copy the bytes in the length of character data in the application provided buffers. The "good" news is that this makes the bug more relevant to fix.

Sorry, I cannot help here. I really tried.

aersam · 2024-04-05T06:33:33Z

Thank you for all your efforts! I'll try creating a repo in C# and getting this to the correct guys at Redmond - it's an interesting bug, after all 🙂

pacman82 · 2024-04-05T08:05:00Z

Probably not even needed, just stating that SQLDescribeCol truncates column names those binary length exceeds the character length, is a precise description, which should enable them to reproduce.

pacman82 · 2024-04-05T09:31:30Z

Remark: Using SQLColAttribute instead of SQLDescribeCol might work.

pacman82 closed this as completed Apr 5, 2024

pacman82 reopened this Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strange issue with strange characters on linux #86

Strange issue with strange characters on linux #86

aersam commented Apr 3, 2024 •

edited

Loading

aersam commented Apr 3, 2024

pacman82 commented Apr 3, 2024

aersam commented Apr 3, 2024

pacman82 commented Apr 3, 2024

pacman82 commented Apr 3, 2024

aersam commented Apr 3, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

aersam commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

aersam commented Apr 5, 2024

pacman82 commented Apr 5, 2024 •

edited

Loading

pacman82 commented Apr 5, 2024

Strange issue with strange characters on linux #86

Strange issue with strange characters on linux #86

Comments

aersam commented Apr 3, 2024 • edited Loading

aersam commented Apr 3, 2024

pacman82 commented Apr 3, 2024

aersam commented Apr 3, 2024

pacman82 commented Apr 3, 2024

pacman82 commented Apr 3, 2024

aersam commented Apr 3, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

aersam commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

pacman82 commented Apr 4, 2024

aersam commented Apr 5, 2024

pacman82 commented Apr 5, 2024 • edited Loading

pacman82 commented Apr 5, 2024

aersam commented Apr 3, 2024 •

edited

Loading

pacman82 commented Apr 5, 2024 •

edited

Loading