[FEAT] Add .str.split() API for splitting string columns. #1409

Merged
merged 3 commits into from
Sep 28, 2023


clarkzinzow
Contributor

This PR adds an Expression.str.split() API for splitting strings in string columns on a pattern.

Closes #1388
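
A minimal sketch of what splitting a string column on a pattern produces, assuming a list-style layout of flat child values plus offsets. It uses only the Rust standard library. `SplitResult` and `split_column` are hypothetical names for illustration, not Daft's implementation, and the null handling (null in, null and empty out) is an assumption.

```rust
/// Hypothetical list-of-strings column: flat child values plus offsets.
/// Row i owns values[offsets[i]..offsets[i + 1]].
#[derive(Debug, PartialEq)]
struct SplitResult {
    values: Vec<String>,
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

/// Split every non-null row on `pattern`; null rows stay null and contribute no values.
fn split_column(rows: &[Option<&str>], pattern: &str) -> SplitResult {
    let mut values = Vec::new();
    // Offsets always start at 0 and gain one entry per row.
    let mut offsets = vec![0];
    let mut validity = Vec::with_capacity(rows.len());
    for row in rows {
        if let Some(s) = row {
            values.extend(s.split(pattern).map(str::to_string));
        }
        validity.push(row.is_some());
        offsets.push(values.len());
    }
    SplitResult { values, offsets, validity }
}

fn main() {
    let out = split_column(&[Some("a,b,c"), None, Some("d")], ",");
    println!("{:?}", out);
}
```

For three input rows the offsets come out as `[0, 3, 3, 4]`: one more entry than there are rows, with the null row spanning an empty range.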

@github-actions github-actions bot added the enhancement New feature or request label Sep 22, 2023
codecov bot commented Sep 22, 2023

Codecov Report

Merging #1409 (cacafe9) into main (558b31e) will increase coverage by 0.03%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1409      +/-   ##
==========================================
+ Coverage   74.61%   74.64%   +0.03%     
==========================================
  Files          60       60              
  Lines        6034     6042       +8     
==========================================
+ Hits         4502     4510       +8     
  Misses       1532     1532              
Files Coverage Δ
daft/expressions/expressions.py 91.35% <100.00%> (+0.07%) ⬆️
daft/series.py 92.24% <100.00%> (+0.10%) ⬆️

@@ -129,7 +129,8 @@ impl FullNull for ListArray {
         Self::new(
             Field::new(name, dtype.clone()),
             empty_flat_child,
-            OffsetsBuffer::try_from(repeat(0).take(length).collect::<Vec<_>>()).unwrap(),
+            OffsetsBuffer::try_from(repeat(0).take(length + 1).collect::<Vec<_>>())
Contributor Author

@jaychia FYI, I made this fix as a drive-by: the offsets buffer should be one element longer than the data array and validity mask.
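
To illustrate why the offsets buffer needs `length + 1` entries: row `i` of a list array spans `offsets[i]..offsets[i + 1]`, so `length` rows need `length + 1` offsets. The sketch below uses a plain `Vec<i64>` in place of `OffsetsBuffer`; `full_null_offsets` and `num_rows` are hypothetical helper names, not functions from the PR.

```rust
use std::iter::repeat;

/// Offsets for a list array of `length` rows that are all empty.
/// Row i spans offsets[i]..offsets[i + 1], so `length` rows need `length + 1` entries.
fn full_null_offsets(length: usize) -> Vec<i64> {
    repeat(0).take(length + 1).collect()
}

/// Number of rows described by an offsets buffer.
fn num_rows(offsets: &[i64]) -> usize {
    offsets.len().saturating_sub(1)
}

fn main() {
    let offsets = full_null_offsets(3);
    // prints "[0, 0, 0, 0] describes 3 rows"
    println!("{:?} describes {} rows", offsets, num_rows(&offsets));
}
```

With only `length` zeros, as in the line the hunk removes, the same buffer would describe `length - 1` rows and disagree with the validity mask.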

Contributor

Hmm, odd, this should have been caught by an assert in the constructor. Maybe we need to add one?
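
A rough sketch of the kind of constructor check suggested here, assuming a simplified array with only an offsets vector and a validity mask. `SimpleListArray` and `try_new` are hypothetical stand-ins, not Daft's `ListArray` API, and returning a `Result` rather than asserting is a choice made here so the check is easy to exercise.

```rust
/// Hypothetical, simplified stand-in for a list array; not Daft's actual type.
struct SimpleListArray {
    offsets: Vec<i64>,
    validity: Vec<bool>,
}

impl SimpleListArray {
    /// Rejects offsets buffers that are not exactly one longer than the validity mask.
    fn try_new(offsets: Vec<i64>, validity: Vec<bool>) -> Result<Self, String> {
        if offsets.len() != validity.len() + 1 {
            return Err(format!(
                "expected {} offsets for {} rows, got {}",
                validity.len() + 1,
                validity.len(),
                offsets.len()
            ));
        }
        Ok(Self { offsets, validity })
    }

    /// Number of rows in the array.
    fn len(&self) -> usize {
        self.validity.len()
    }
}

fn main() {
    // The pre-fix shape: 3 offsets for 3 rows is rejected.
    assert!(SimpleListArray::try_new(vec![0; 3], vec![false; 3]).is_err());
    // The post-fix shape: 4 offsets for 3 rows is accepted.
    let arr = SimpleListArray::try_new(vec![0; 4], vec![false; 3]).unwrap();
    println!("rows = {}, offsets = {}", arr.len(), arr.offsets.len());
}
```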

Contributor

@jaychia jaychia left a comment

Mostly lgtm, added a few comments. Nice tests!

src/daft-core/src/array/ops/utf8.rs: 2 outdated review threads (resolved)
@clarkzinzow clarkzinzow merged commit 069432d into main Sep 28, 2023
24 checks passed
@clarkzinzow clarkzinzow deleted the clark/expr-str-split branch September 28, 2023 20:22
Development

Successfully merging this pull request may close these issues.

Add .str.split() expression
2 participants