[BUG] Fix `.str.length()` on Unicode strings (#2579)
Previously, the `.str.length()` method would count the number of bytes in the UTF-8 string. This is inconsistent with Python's `len()` and pandas' `str.len()` which count Unicode codepoints. For instance, on the string "😉test", the number of bytes is 8, whereas the number of codepoints is 5. This PR makes Daft consistent with that behavior. There doesn't seem to be a way now to reproduce the original behavior; maybe we should add a `.byte_length()` method for that.
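To make the difference concrete, here is a minimal sketch in plain Python (it does not call Daft) that reproduces the two counts for the string from the description. The helper names `utf8_byte_length` and `codepoint_length` are hypothetical and exist only for illustration; they are not part of Daft's API.

```python
def utf8_byte_length(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s (the old behavior described above)."""
    return len(s.encode("utf-8"))


def codepoint_length(s: str) -> int:
    """Number of Unicode codepoints in s, which is what Python's len() returns."""
    return len(s)


s = "😉test"
byte_count = utf8_byte_length(s)       # the emoji takes 4 bytes, "test" takes 4 more
codepoint_count = codepoint_length(s)  # the emoji is 1 codepoint, "test" is 4 more

print(f"bytes={byte_count} codepoints={codepoint_count}")  # bytes=8 codepoints=5
```

For pure ASCII input the two counts agree, which is why the discrepancy only shows up on strings containing multi-byte characters.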
Showing 3 changed files with 9 additions and 3 deletions.