[BUG] Fix .str.length() on Unicode strings (#2579)
Previously, the `.str.length()` method would count the number of bytes
in the UTF-8 string. This is inconsistent with Python's `len()` and
pandas' `str.len()` which count Unicode codepoints. For instance, on the
string "😉test", the number of bytes is 8, whereas the number of
codepoints is 5. This PR makes Daft consistent with that behavior.

There no longer seems to be a way to reproduce the original byte-counting
behavior; maybe we should add a `.byte_length()` method for that.
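
To make the byte/codepoint distinction concrete, here is a minimal sketch in plain Python (not Daft's API). The `byte_length` helper is hypothetical: it only illustrates what the suggested `.byte_length()` method might compute, and no such method is confirmed to exist.

```python
# Illustrative only: plain Python, not Daft's API.
s = "😉test"

# UTF-8 byte count: the emoji takes 4 bytes and "test" takes 4 more.
byte_count = len(s.encode("utf-8"))

# Python's len() counts Unicode codepoints: 1 emoji + 4 ASCII letters.
codepoint_count = len(s)


def byte_length(text: str) -> int:
    """Hypothetical helper mirroring the old byte-counting behavior."""
    return len(text.encode("utf-8"))


print(byte_count, codepoint_count, byte_length(s))
```

For ASCII-only strings the two counts coincide, which is presumably why the existing tests did not catch the difference.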
Vince7778 authored Jul 29, 2024
1 parent 4c3d1b5 commit 8fce5b5
Showing 3 changed files with 9 additions and 3 deletions.
2 changes: 1 addition & 1 deletion src/daft-core/src/array/ops/utf8.rs
@@ -610,7 +610,7 @@ impl Utf8Array {
             .iter()
             .map(|val| {
                 let v = val?;
-                Some(v.len() as u64)
+                Some(v.chars().count() as u64)
             })
             .collect::<arrow2::array::UInt64Array>()
             .with_validity(self_arrow.validity().cloned());
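
The hunk above swaps Rust's `str::len()`, which returns the length in bytes, for `chars().count()`, which walks the string and counts Unicode scalar values. As a rough sketch of why the two differ, the Python below counts codepoints directly from UTF-8 bytes by skipping continuation bytes (those of the form `0b10xxxxxx`). The function name `count_codepoints` is made up for illustration, and this is not necessarily how Rust implements the count internally.

```python
def count_codepoints(data: bytes) -> int:
    """Count codepoints in valid UTF-8 input.

    Every codepoint starts with exactly one non-continuation byte, so
    counting the bytes that are NOT of the form 0b10xxxxxx gives the
    number of codepoints. Assumes `data` is well-formed UTF-8.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


raw = "😉test".encode("utf-8")
old_result = len(raw)               # byte count, the pre-fix behavior
new_result = count_codepoints(raw)  # codepoint count, the post-fix behavior
print(old_result, new_result)
```

Note that the byte length is available without scanning the string, whereas a codepoint count has to visit every byte.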
6 changes: 6 additions & 0 deletions tests/series/test_utf8_ops.py
@@ -265,6 +265,12 @@ def test_series_utf8_length_all_null() -> None:
     assert result.to_pylist() == [None, None, None]
 
 
+def test_series_utf8_length_unicode() -> None:
+    s = Series.from_arrow(pa.array(["😉test", "hey̆"]))
+    result = s.str.length()
+    assert result.to_pylist() == [5, 4]
+
+
 @pytest.mark.parametrize(
     ["data", "expected"],
     [
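
The second test string is expected to have length 4 even though it renders as three visible characters. A small sketch of the reason, assuming the string is spelled as "hey" followed by U+0306 COMBINING BREVE (the escape below is that assumption made explicit): codepoint counting treats a combining mark as its own codepoint.

```python
import unicodedata

# Assumed spelling of the test string: "hey" + U+0306 COMBINING BREVE.
s = "hey\u0306"

# len() counts codepoints, so the combining mark is counted separately.
codepoints = len(s)

# unicodedata.combining() returns a non-zero class for combining marks.
combining_marks = sum(1 for ch in s if unicodedata.combining(ch) != 0)

print(codepoints, combining_marks)
```

So `.str.length()` after this change reports codepoints, not user-perceived characters (grapheme clusters), which matches Python's `len()`.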
4 changes: 2 additions & 2 deletions tests/table/utf8/test_length.py
@@ -5,6 +5,6 @@
 
 
 def test_utf8_length():
-    table = MicroPartition.from_pydict({"col": ["foo", None, "barbaz", "quux"]})
+    table = MicroPartition.from_pydict({"col": ["foo", None, "barbaz", "quux", "😉test", ""]})
     result = table.eval_expression_list([col("col").str.length()])
-    assert result.to_pydict() == {"col": [3, None, 6, 4]}
+    assert result.to_pydict() == {"col": [3, None, 6, 4, 5, 0]}
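
As a cross-check on the expected values in this test, the following pure-Python reference model reproduces them without Daft. `reference_lengths` is a hypothetical helper written for this sketch; it propagates nulls the way the `val?` in the Rust closure does and counts codepoints with `len()`.

```python
from typing import List, Optional


def reference_lengths(values: List[Optional[str]]) -> List[Optional[int]]:
    """Reference model: None stays None, strings map to codepoint counts."""
    return [None if v is None else len(v) for v in values]


data = ["foo", None, "barbaz", "quux", "😉test", ""]
expected = reference_lengths(data)
print(expected)
```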
