-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add a list.contains
API
#2773
Conversation
CodSpeed Performance ReportMerging #2773 will degrade performances by 27.25%Comparing Summary
Benchmarks breakdown
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative way to write the kernel is to check equality between the flattened list and the values and then for each list range, determine if there are any true values in that equality array. The potential advantage of this is that you can use the vectorized arrow2 equality op. The potential disadvantage is that you can't early terminate if you find an equal value in a list.
Also, please add tests!
Returns: | ||
Expression: a List expression which are the lists which all contain at least one instance of the given value. | ||
""" | ||
return Expression._from_pyexpr(native.list_contains(self._expr, value._expr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should convert value
into an Expression
using Expression._to_expression
first before using it. That way it works with literals too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about self
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
self_ = Expression(self)._to_expression()
value = Expression(value)._to_expression()
return Expression._from_pyexpr(native.list_contains(self_, value))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self is already an expression so there's no need. It should look something like
value_expr = Expression.to_expression(value)
return Expression._from_pyexpr(native.list_contains(self._expr, value_expr._expr))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I don't understand.
value
is already of type Expression
. self
is of type ExpressionListNamespace
. When I grab self._expr
and value._expr
, I am grabbing their internal PyExpr
type, no? That is what the native.list_contains
function expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we perform the Expression._to_expression
conversion first? Looking at the implementation, it seems to take in an object
and return it untouched if it is of type Expression
(and if not, return lit(object)
).
|
||
pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> { | ||
assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len()); | ||
assert_eq!(self.child_data_type(), values.data_type(), "Expected values to be a column of type <T> and self to be column of type List<T>, but instead got values: {} and self: {}", values.data_type(), self.child_data_type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should consider casting values to the child dtype when possible so that it doesn't have to be exactly the same dtype.
src/daft-core/src/array/ops/list.rs
Outdated
@@ -475,6 +476,31 @@ impl ListArray { | |||
self.validity().cloned(), | |||
)) | |||
} | |||
|
|||
pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> { | |||
assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values
should be able to be either the same cardinality as self or a cardinality of 1 (for a literal). See the other list functions for some examples.
assert_eq!(self.len(), values.len(), "Expected two lists with the same length, instead got self.len() = {} and values.len() = {}", self.len(), values.len()); | ||
assert_eq!(self.child_data_type(), values.data_type(), "Expected values to be a column of type <T> and self to be column of type List<T>, but instead got values: {} and self: {}", values.data_type(), self.child_data_type()); | ||
let founds = with_match_iterable_daft_types!(values.data_type(), |$T| { | ||
let mut founds = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using arrow2::array::BooleanArray::from_trusted_len_values_iter
here instead if you can. That way you can construct an array directly from an iterator instead of creating a vec, which actually uses 8x more memory since boolean values are bytes in rust vectors but bits in the boolean array.
@@ -701,6 +727,31 @@ impl FixedSizeListArray { | |||
self.validity().cloned(), | |||
)) | |||
} | |||
|
|||
pub fn list_contains(&self, values: Series) -> DaftResult<BooleanArray> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as ListArray::list_contains
let value_field = value.to_field(schema)?; | ||
match &data_field.dtype { | ||
DataType::List(inner_type) => { | ||
if inner_type.as_ref() == &value_field.dtype { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about value type. if we can cast it to the inner type it should be okay
@@ -151,6 +151,32 @@ macro_rules! with_match_comparable_daft_types {( | |||
} | |||
})} | |||
|
|||
#[macro_export] | |||
macro_rules! with_match_iterable_daft_types {( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the purpose of this macro. Shouldn't list contains work for any comparable type? You can use with_match_comparable_daft_types
if so.
let data_field = data.to_field(schema)?; | ||
let value_field = value.to_field(schema)?; | ||
match &data_field.dtype { | ||
DataType::List(inner_type) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another check you should do with inner_type
is to check if it is a type that you can check equality with the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the dtype can be a FixedSizeList
@raunakab @kevinzwang status on this PR? |
Hey folks, status? @raunakab |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2773 +/- ##
==========================================
- Coverage 77.41% 77.35% -0.07%
==========================================
Files 678 679 +1
Lines 83680 83332 -348
==========================================
- Hits 64783 64460 -323
+ Misses 18897 18872 -25
|
This is a stale PR. Closing in favour of #3479. |
Overview
Added a
list.contains
API to check whether or not a column of typeList
orFixedSizeList
contains a given value.