
feat: Add str.normalize() #20483

Open · wants to merge 7 commits into base: main
Conversation

etiennebacher (Contributor)

This is my first contribution to the Rust part, so there are probably some quirks here and there. I used the suggestion in #11455 to use the unicode_normalization crate and mostly followed #12878. I don't know whether you want to add this function, or to implement it this way, but it was good training for me anyway.

Note that I'm not very familiar with this method, so double-checking the output and maybe adding more corner cases to the test suite would be nice.
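For anyone double-checking the output: Python's stdlib `unicodedata.normalize` can serve as a reference for the expected behavior of each form. A small sanity-check sketch (not part of this PR) on characters like those in the benchmark below:

```python
import unicodedata

# NFKC replaces compatibility characters with their canonical equivalents:
# a superscript two becomes "2", fullwidth Latin letters become ASCII.
assert unicodedata.normalize("NFKC", "01\u00b23") == "0123"
assert unicodedata.normalize("NFKC", "\uff2b\uff21") == "KA"  # fullwidth K, A

# NFC/NFD only change the underlying representation, not the visible text:
decomposed = "e\u0301"  # "e" + combining acute accent
assert unicodedata.normalize("NFC", decomposed) == "\u00e9"  # precomposed "é"
assert unicodedata.normalize("NFD", "\u00e9") == decomposed
```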

Quick performance check after `make build-release`:

import polars as pl
import time
import pandas as pd

N = 20_000_000
txt = ["01²3", "株式会社", "ሎ", "KADOKAWA Future"]

ser = pd.Series(txt * N)
start = time.time()
ser.str.normalize('NFKC')
print("Pandas:", time.time() - start)

ser = pl.Series(txt * N)
start = time.time()
ser.str.normalize('NFKC')
print("Polars:", time.time() - start)
Pandas: 11.836752653121948
Polars: 11.922921657562256

I'm a bit disappointed with the performance; maybe I missed something obvious. There are also a couple of open performance issues in the Rust crate used: https://github.com/unicode-rs/unicode-normalization/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen+performance
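As an aside, micro-benchmarks like the one above are usually more stable with `time.perf_counter` and a best-of-several-runs harness. A generic sketch (the `bench` helper is illustrative, not part of the PR), shown here against the stdlib normalizer:

```python
import time
import unicodedata

def bench(fn, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

data = ["01\u00b23", "\u682a\u5f0f\u4f1a\u793e"] * 10_000
t = bench(lambda: [unicodedata.normalize("NFKC", s) for s in data])
assert t >= 0.0
```

Taking the minimum over repeats filters out one-off noise such as cache warm-up or background load.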

Fixes #5799
Fixes #11455

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Dec 27, 2024

codecov bot commented Dec 27, 2024

Codecov Report

Attention: Patch coverage is 84.05797% with 11 lines in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (3aaf4c2) to head (cb023f8).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../polars-python/src/lazyframe/visitor/expr_nodes.rs 0.00% 8 Missing ⚠️
.../polars-ops/src/chunked_array/strings/normalize.rs 91.30% 2 Missing ⚠️
...rates/polars-plan/src/dsl/function_expr/strings.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20483   +/-   ##
=======================================
  Coverage   79.01%   79.01%           
=======================================
  Files        1563     1564    +1     
  Lines      220596   220665   +69     
  Branches     2492     2492           
=======================================
+ Hits       174306   174367   +61     
- Misses      45717    45725    +8     
  Partials      573      573           


orlp (Collaborator) commented Dec 27, 2024

This kernel should not be written by collecting to a temporary String for each string. It should instead be something like this to re-use the allocation:

pub fn normalize_with<F: Fn(&str, &mut String)>(ca: &StringChunked, normalizer: F) -> StringChunked {
    // One scratch buffer reused across all rows, so each row's normalized
    // output is written into an existing allocation instead of a fresh String.
    let mut buffer = String::new();
    let mut builder = StringChunkedBuilder::new(ca.name().clone(), ca.len());
    for opt_s in ca.iter() {
        if let Some(s) = opt_s {
            buffer.clear();
            normalizer(s, &mut buffer);
            builder.append_value(&buffer);
        } else {
            builder.append_null();
        }
    }
    builder.finish()
}

pub fn normalize(ca: &StringChunked, form: UnicodeForm) -> StringChunked {
    match form {
        UnicodeForm::NFC => normalize_with(ca, |s, b| b.extend(s.nfc())),
        UnicodeForm::NFKC => normalize_with(ca, |s, b| b.extend(s.nfkc())),
        UnicodeForm::NFD => normalize_with(ca, |s, b| b.extend(s.nfd())),
        UnicodeForm::NFKD => normalize_with(ca, |s, b| b.extend(s.nfkd())),
    }
}
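Whatever the final kernel looks like, the corner cases mentioned earlier can lean on the algebraic properties of Unicode normalization: every form is idempotent, and canonically (resp. compatibly) equivalent strings agree after renormalizing to NFC (resp. NFKC). A stdlib-only sketch of such property checks (names are illustrative, not the PR's test suite):

```python
import unicodedata

FORMS = ["NFC", "NFKC", "NFD", "NFKD"]
# Includes the angstrom sign U+212B, precomposed and decomposed "é", and "".
samples = ["01\u00b23", "\u00e9", "e\u0301", "\u212b", ""]

for form in FORMS:
    for s in samples:
        once = unicodedata.normalize(form, s)
        # Normalization is idempotent: applying the same form twice is a no-op.
        assert unicodedata.normalize(form, once) == once
        # Canonical forms preserve canonical equivalence, compatibility
        # forms preserve compatibility equivalence.
        canonical = "NFC" if form in ("NFC", "NFD") else "NFKC"
        assert unicodedata.normalize(canonical, once) == unicodedata.normalize(canonical, s)
```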

etiennebacher (Contributor, Author)

Thanks @orlp, I naively followed unicode_normalization's example but should have given this more thought.

Updated benchmark:

Pandas: 20.463711977005005
Polars: 16.712544441223145

(I can't really explain the change in magnitude compared to the first run, but the gap between Polars and pandas is now consistently there.)

@etiennebacher etiennebacher marked this pull request as draft December 28, 2024 08:19
@etiennebacher etiennebacher marked this pull request as ready for review December 28, 2024 09:12
ritchie46 (Member) commented Dec 29, 2024

Thanks for your first contribution, @etiennebacher. Before implementing features, we should first decide whether we want them (this is shown by the accepted tag).

For one, I am not entirely sure that we want this in the main library. It seems like quite a large dependency (with all the Unicode tables), which might be better suited for a plugin.

Let me get back to this; I want to see how much this dependency adds and how important this feature is.

etiennebacher (Contributor, Author)

Sure, no problem with letting this be a plugin functionality.

I don't mind this being closed, but whatever the outcome, the two issues mentioned in the original post should be updated.

drumtorben

This functionality is already in the polars-ds extension:
https://polars-ds-extension.readthedocs.io/en/latest/string.html#polars_ds.string.normalize_string

Successfully merging this pull request may close these issues:

- Implementation of .str.normalize Method for String (Unicode) Normalization
- Unicode Normalize with Python

4 participants