Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] add list.value_counts() #2902

Merged
merged 25 commits into from
Oct 7, 2024
Merged

[FEAT] add list.value_counts() #2902

merged 25 commits into from
Oct 7, 2024

Conversation

andrewgazelka
Copy link
Member

@andrewgazelka andrewgazelka commented Sep 24, 2024

This PR implements the list.value_counts() function and refactors Map types to use a more explicit key-value structure. It also includes various improvements and bug fixes across the codebase.

Key changes

  1. Implemented list.value_counts() function

  2. Refactored Map type representation

    • Updated DataType enum in src/daft-schema/src/dtype.rs to use explicit key and value types
  3. Improved error handling and type checking

    • Enhanced type checking in various parts of the codebase
    • Improved error messages for better debugging
  4. Performance optimizations

    • Refactored some operations to be more efficient, especially in list and map operations
  5. Code cleanup and minor improvements

    • Removed unnecessary clones and improved code readability
    • Updated comments and documentation

daft/expressions/expressions.py Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Copy link

codspeed-hq bot commented Sep 24, 2024

CodSpeed Performance Report

Merging #2902 will not alter performance

Comparing andrew/list-value_counts (35d65d3) with main (f5cf5af)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Sep 24, 2024

Codecov Report

Attention: Patch coverage is 77.24771% with 124 lines in your changes missing coverage. Please review.

Project coverage is 78.36%. Comparing base (46c5a7e) to head (35d65d3).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
src/common/tracing/src/lib.rs 15.38% 22 Missing ⚠️
src/daft-core/src/array/fixed_size_list_array.rs 0.00% 20 Missing ⚠️
src/daft-core/src/array/mod.rs 39.28% 17 Missing ⚠️
src/daft-functions/src/list/value_counts.rs 61.90% 16 Missing ⚠️
src/daft-table/src/lib.rs 44.44% 10 Missing ⚠️
src/daft-dsl/src/functions/map/get.rs 52.94% 8 Missing ⚠️
src/daft-core/src/array/ops/from_arrow.rs 76.92% 6 Missing ⚠️
src/daft-core/src/series/ops/list.rs 57.14% 6 Missing ⚠️
src/daft-schema/src/schema.rs 54.54% 5 Missing ⚠️
src/daft-core/src/array/ops/list.rs 96.00% 4 Missing ⚠️
... and 5 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2902      +/-   ##
==========================================
+ Coverage   76.34%   78.36%   +2.01%     
==========================================
  Files         597      603       +6     
  Lines       71388    71464      +76     
==========================================
+ Hits        54504    56002    +1498     
+ Misses      16884    15462    -1422     
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
daft/expressions/expressions.py 93.75% <100.00%> (+0.37%) ⬆️
src/daft-core/src/array/list_array.rs 84.11% <100.00%> (+0.58%) ⬆️
src/daft-core/src/array/ops/arrow2/comparison.rs 98.70% <100.00%> (ø)
src/daft-core/src/array/ops/cast.rs 87.50% <100.00%> (+0.02%) ⬆️
src/daft-core/src/array/ops/groups.rs 98.18% <100.00%> (ø)
src/daft-core/src/array/ops/map.rs 87.50% <100.00%> (+14.77%) ⬆️
src/daft-core/src/array/ops/sparse_tensor.rs 100.00% <100.00%> (ø)
src/daft-core/src/array/serdes.rs 87.74% <100.00%> (+0.32%) ⬆️
src/daft-core/src/array/struct_array.rs 79.71% <ø> (ø)
src/daft-core/src/datatypes/logical.rs 74.07% <ø> (ø)
... and 28 more

... and 81 files with indirect coverage changes

@andrewgazelka andrewgazelka force-pushed the andrew/list-value_counts branch 4 times, most recently from b983578 to ba181af Compare September 25, 2024 19:10
@andrewgazelka andrewgazelka changed the title value counts [FEAT] add list.value_counts() Sep 25, 2024
@github-actions github-actions bot added the enhancement New feature or request label Sep 25, 2024
@andrewgazelka andrewgazelka marked this pull request as ready for review September 25, 2024 21:13
daft/expressions/expressions.py Outdated Show resolved Hide resolved
daft/expressions/expressions.py Show resolved Hide resolved
src/arrow2/src/offset.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/list_array.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
@andrewgazelka andrewgazelka force-pushed the andrew/list-value_counts branch from a4e1c7b to e063b04 Compare September 27, 2024 00:02
@andrewgazelka andrewgazelka force-pushed the andrew/list-value_counts branch from 9aee66d to 4a6850a Compare October 1, 2024 20:16
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing what I have right now, seems you wanted to iterate a bit more on it.

Could we also update the PR description so we get a proper commit message?

daft/expressions/expressions.py Outdated Show resolved Hide resolved
src/arrow2/src/array/map/mod.rs Show resolved Hide resolved
src/arrow2/src/array/map/mod.rs Outdated Show resolved Hide resolved
src/arrow2/src/array/mod.rs Outdated Show resolved Hide resolved
src/arrow2/src/array/mod.rs Outdated Show resolved Hide resolved
src/arrow2/src/array/mod.rs Outdated Show resolved Hide resolved
tests/integration/iceberg/test_table_load.py Outdated Show resolved Hide resolved
tests/table/map/test_map_get.py Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
tests/expressions/test_expressions.py Outdated Show resolved Hide resolved
@andrewgazelka andrewgazelka force-pushed the andrew/list-value_counts branch from 78fa527 to dcc0a36 Compare October 3, 2024 23:54
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing offline---what happens with lists of nested values?

Also, could we add this new expression to docs/source/api_docs/expressions.rst?

src/daft-micropartition/src/micropartition.rs Show resolved Hide resolved
src/daft-core/src/array/mod.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/mod.rs Outdated Show resolved Hide resolved
src/daft-core/src/series/ops/concat.rs Show resolved Hide resolved
src/daft-micropartition/src/ops/eval_expressions.rs Outdated Show resolved Hide resolved
src/daft-schema/src/dtype.rs Outdated Show resolved Hide resolved
src/daft-schema/src/dtype.rs Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for all the changes!

@andrewgazelka andrewgazelka merged commit 98dbadb into main Oct 7, 2024
40 checks passed
@andrewgazelka andrewgazelka deleted the andrew/list-value_counts branch October 7, 2024 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants