Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add a list.contains API #2773

Closed
wants to merge 7 commits into from
Closed

[FEAT] Add a list.contains API #2773

wants to merge 7 commits into from

Conversation

raunakab
Copy link
Contributor

Overview

Added a list.contains API to check whether or not a column of type List or FixedSizeList contains a given value.

@github-actions github-actions bot added the enhancement New feature or request label Aug 31, 2024
@raunakab raunakab marked this pull request as draft August 31, 2024 01:13
@raunakab raunakab linked an issue Aug 31, 2024 that may be closed by this pull request
@raunakab raunakab requested review from jaychia and Vince7778 August 31, 2024 01:13
Copy link

codspeed-hq bot commented Aug 31, 2024

CodSpeed Performance Report

Merging #2773 will degrade performances by 27.25%

Comparing ronnie/list-contains (d63ce68) with main (d9d916b)

Summary

❌ 1 regressions
✅ 16 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main ronnie/list-contains Change
test_iter_rows_first_row[100 Small Files] 184.3 ms 253.3 ms -27.25%

@raunakab raunakab marked this pull request as ready for review September 9, 2024 22:38
@jaychia jaychia requested review from kevinzwang and removed request for jaychia and Vince7778 September 12, 2024 23:53
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative way to write the kernel is to check equality between the flattened list and the values and then for each list range, determine if there are any true values in that equality array. The potential advantage of this is that you can use the vectorized arrow2 equality op. The potential disadvantage is that you can't early terminate if you find an equal value in a list.

Also, please add tests!

daft/expressions/expressions.py Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/list.rs Outdated Show resolved Hide resolved
Returns:
Expression: a List expression which are the lists which all contain at least one instance of the given value.
"""
return Expression._from_pyexpr(native.list_contains(self._expr, value._expr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should convert value into an Expression using Expression._to_expression first before using it. That way it works with literals too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about self then?

Copy link
Contributor Author

@raunakab raunakab Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

self_ = Expression(self)._to_expression()
value = Expression(value)._to_expression()
return Expression._from_pyexpr(native.list_contains(self_, value))

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self is already an expression so there's no need. It should look something like

value_expr = Expression.to_expression(value)
return Expression._from_pyexpr(native.list_contains(self._expr, value_expr._expr))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I don't understand.

value is already of type Expression. self is of type ExpressionListNamespace. When I grab self._expr and value._expr, I am grabbing their internal PyExpr type, no? That is what the native.list_contains function expects.

Copy link
Contributor Author

@raunakab raunakab Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we perform the Expression._to_expression conversion first? Looking at the implementation, it seems to take in an object and return it untouched if it is of type Expression (and if not, return lit(object)).


pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> {
assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len());
assert_eq!(self.child_data_type(), values.data_type(), "Expected values to be a column of type <T> and self to be column of type List<T>, but instead got values: {} and self: {}", values.data_type(), self.child_data_type());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should consider casting values to the child dtype when possible so that it doesn't have to be exactly the same dtype.

@@ -475,6 +476,31 @@ impl ListArray {
self.validity().cloned(),
))
}

pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> {
assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values should be able to be either the same cardinality as self or a cardinality of 1 (for a literal). See the other list functions for some examples.

assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len());
assert_eq!(self.child_data_type(), values.data_type(), "Expected values to be a column of type <T> and self to be column of type List<T>, but instead got values: {} and self: {}", values.data_type(), self.child_data_type());
let founds = with_match_iterable_daft_types!(values.data_type(), |$T| {
let mut founds = vec![];
Copy link
Member

@kevinzwang kevinzwang Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using arrow2::array::BooleanArray::from_trusted_len_values_iter here instead if you can. That way you can construct an array directly from an iterator instead of creating a vec, which actually uses 8x more memory since boolean values are bytes in rust vectors but bits in the boolean array.

@@ -701,6 +727,31 @@ impl FixedSizeListArray {
self.validity().cloned(),
))
}

pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as ListArray::list_contains

let value_field = value.to_field(schema)?;
match &data_field.dtype {
DataType::List(inner_type) => {
if inner_type.as_ref() == &value_field.dtype {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about value type. if we can cast it to the inner type it should be okay

@@ -151,6 +151,32 @@ macro_rules! with_match_comparable_daft_types {(
}
})}

#[macro_export]
macro_rules! with_match_iterable_daft_types {(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the purpose of this macro. Shouldn't list contains work for any comparable type? You can use with_match_comparable_daft_types if so.

let data_field = data.to_field(schema)?;
let value_field = value.to_field(schema)?;
match &data_field.dtype {
DataType::List(inner_type) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another check you should do with inner_type is to check if it is a type that you can check equality with the values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the dtype can be a FixedSizeList

@samster25
Copy link
Member

@raunakab @kevinzwang status on this PR?

@jaychia
Copy link
Contributor

jaychia commented Nov 20, 2024

Hey folks, status? @raunakab

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 2.46914% with 79 lines in your changes missing coverage. Please review.

Project coverage is 77.35%. Comparing base (5fee192) to head (d63ce68).
Report is 59 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-functions/src/list/contains.rs 0.00% 48 Missing ⚠️
src/daft-core/src/array/ops/list.rs 0.00% 16 Missing ⚠️
src/daft-core/src/series/ops/list.rs 0.00% 14 Missing ⚠️
daft/expressions/expressions.py 50.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2773      +/-   ##
==========================================
- Coverage   77.41%   77.35%   -0.07%     
==========================================
  Files         678      679       +1     
  Lines       83680    83332     -348     
==========================================
- Hits        64783    64460     -323     
+ Misses      18897    18872      -25     
Files with missing lines Coverage Δ
src/daft-functions/src/list/mod.rs 100.00% <100.00%> (ø)
daft/expressions/expressions.py 93.26% <50.00%> (-0.12%) ⬇️
src/daft-core/src/series/ops/list.rs 58.55% <0.00%> (-5.95%) ⬇️
src/daft-core/src/array/ops/list.rs 92.39% <0.00%> (-2.14%) ⬇️
src/daft-functions/src/list/contains.rs 0.00% <0.00%> (ø)

... and 10 files with indirect coverage changes

@raunakab
Copy link
Contributor Author

raunakab commented Dec 3, 2024

This is a stale PR. Closing in favour of #3479.

@raunakab raunakab closed this Dec 3, 2024
@raunakab raunakab deleted the ronnie/list-contains branch December 3, 2024 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a .list.contains() expression
4 participants