-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement where api #298
base: master
Are you sure you want to change the base?
implement where api #298
Conversation
Codecov Report
@@ Coverage Diff @@
## master #298 +/- ##
==========================================
+ Coverage 83.24% 83.26% +0.01%
==========================================
Files 22 22
Lines 6149 6287 +138
Branches 1247 1273 +26
==========================================
+ Hits 5119 5235 +116
- Misses 734 749 +15
- Partials 296 303 +7
Continue to review full report at Codecov.
|
Here is the matrix that represent the field type result of
|
The first part of the TODO was to accept list, tuple, and any sort of ndarray, not just bool arrays. Can we make that change? |
exetera/core/fields.py
Outdated
raise NotImplementedError("Where does not support condition on indexed string fields at present") | ||
cond = cond.data[:] | ||
elif callable(cond): | ||
raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: mehthod -> method
exetera/core/fields.py
Outdated
a = a.data[:] | ||
if isinstance(b, Field): | ||
b = b.data[:] | ||
return np.where(cond, a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still returning a numpy array rather than a field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still returning a numpy array rather than a field
The logic of module-level where
API will be almost same as instance-level where
API. Think we can focus on one first, e.g. instance-level where
API.
exetera/core/fields.py
Outdated
@@ -143,6 +161,41 @@ def _ensure_valid(self): | |||
if not self._valid_reference: | |||
raise ValueError("This field no longer refers to a valid underlying field object") | |||
|
|||
def where(self, cond:Union[list, tuple, np.ndarray, Field], b, inplace=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the callable signature to cond's type information
exetera/core/fields.py
Outdated
result_mem_field.data.write(result_ndarray) | ||
|
||
elif isinstance(self, (IndexedStringField, FixedStringField)) or isinstance(b, (IndexedStringField, FixedStringField)): | ||
result_mem_field = IndexedStringMemField(self._session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right. Why are we causing an operation with fixed string field to output an indexed string field?
It doesn't make the logic much more complicated. Also, I would make that a separate method probably, because I can imagine us needing it elsewhere in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For FixedStringField, you can refer to the matrix I listed above. Only when two FixedStringField will generate FixedStringField, otherwise it will be IndexedStringField.
|
Sorry, accidental close |
Fixed string type promotion should not result in indexed strings, I think. Here is a revised version of the table below: |
|
exetera/core/fields.py
Outdated
|
||
result_mem_field = None | ||
|
||
if isinstance(self, IndexedStringField) and isinstance(b, IndexedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When doing the type checking need to check that it's one of two types: isinstance(self, (IndexedStringField, IndexedStringMemField))
.
exetera/core/fields.py
Outdated
cond = cond(self.data[:]) | ||
else: | ||
raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we could just do return where(cond, self, b)
and then the rest of the body of this method can be put into the global where
function.
exetera/core/fields.py
Outdated
other_field_row_count = len(other_field.data[:]) | ||
data_converted_to_str = np.where([True]*other_field_row_count, other_field.data[:], [""]*other_field_row_count) | ||
maxLength = 0 | ||
re_match = re.findall(r"<U(\d+)|S(\d+)", str(data_converted_to_str.dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U can be <U
or >U
tests/test_fields.py
Outdated
@@ -2169,6 +2169,167 @@ def test_indexed_string_isin(self, data, isin_data, expected): | |||
np.testing.assert_array_equal(expected, result) | |||
|
|||
|
|||
WHERE_BOOLEAN_COND = RAND_STATE.randint(0, 2, 20).tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing tests for when cond
is a field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently unittest for cond is a field is missing. I'm trying to add one.
So for the indexedstringfield, we will throw out the exception.
How should we deal with the FixedStringField? As we can't use string as boolean value directly, so which case should be considered True for fixedstringfield, and which case is False?
WHERE_FIXED_STRING_TESTS = [ | ||
(lambda f: f > 5, "create_numeric", {"nformat": "int8"}, WHERE_NUMERIC_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA), | ||
(lambda f: f > 2, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA), | ||
(WHERE_BOOLEAN_COND, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2300: can we also do this for float32?
tests/test_fields.py
Outdated
np.testing.assert_array_equal(result.data[:], expected_result) | ||
|
||
# reload to test FixedStringMemField | ||
a_mem_field, b_mem_field = a_field, b_field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to before the first subtest
tests/test_fields.py
Outdated
|
||
expected_result = where_oracle(cond, a_field_data, b_field_data) | ||
|
||
with self.subTest(f"Test instance where method: a is {type(a_field)}, b is {type(b_field)}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to after the mem fields are created
tests/test_fields.py
Outdated
|
||
# reload to test FixedStringMemField | ||
a_mem_field, b_mem_field = a_field, b_field | ||
if isinstance(a_field, fields.FixedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condition can be removed
tests/test_fields.py
Outdated
a_mem_field = fields.FixedStringMemField(self.s, a_kwarg["length"]) | ||
a_mem_field.data.write(np.array(a_field_data)) | ||
|
||
if isinstance(b_field, fields.FixedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condition can be removed
tests/test_fields.py
Outdated
b_mem_field = fields.FixedStringMemField(self.s, b_kwarg["length"]) | ||
b_mem_field.data.write(np.array(b_field_data)) | ||
|
||
with self.subTest(f"Test instance where method: a is {type(a_mem_field)}, b is {type(b_mem_field)}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all four combinations:
a_field, b_field
a_field, b_mem_field
a_mem_field, b_field
a_mem_field, b_mem_field
|
||
|
||
@parameterized.expand(WHERE_INDEXED_STRING_TESTS) | ||
def test_instance_field_where_return_indexed_string_mem_field(self, cond, a_creator, a_kwarg, a_field_data, b_creator, b_kwarg, b_field_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here with combinations of hdf5 and mem fields
exetera/core/fields.py
Outdated
if isinstance(cond, (list, tuple, np.ndarray)): | ||
cond = cond | ||
elif isinstance(cond, Field): | ||
if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking for both hdf5 and mem field types
exetera/core/fields.py
Outdated
if isinstance(cond, (list, tuple, np.ndarray)): | ||
cond = cond | ||
elif isinstance(cond, Field): | ||
if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking both hdf5 and mem field types
exetera/core/fields.py
Outdated
if isinstance(cond, (list, tuple, np.ndarray)): | ||
cond = cond | ||
elif isinstance(cond, Field): | ||
if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking hdf5 and mem field types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these exception handling messages are fixed, I think we are good to go
exetera/core/fields.py
Outdated
else: | ||
raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") | ||
elif callable(cond): | ||
raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"module method fields.where
doesn't support callable cond
parameter, please use the instance method where
if you need to use a callable cond
parameter"
exetera/core/fields.py
Outdated
if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
cond = cond.data[:] | ||
else: | ||
raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where
only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond
parameter at present."
exetera/core/fields.py
Outdated
if l: | ||
maxLength = int(l) | ||
else: | ||
raise ValueError("The return dtype of instance method `where` doesn't match '<U(\d+)' or 'S(\d+)' when one of the field is FixedStringField") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"The return dtype of instance method where
doesn't match '<U(\d+)' or 'S(\d+)' when one of the fields is a fixed string field"
exetera/core/fields.py
Outdated
if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
cond = cond.data[:] | ||
else: | ||
raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where
only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
cond = cond.data[:] | ||
else: | ||
raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where
only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
elif callable(cond): | ||
cond = cond(self.data[:]) | ||
else: | ||
raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where
only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
cond = cond.data[:] | ||
else: | ||
raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
elif callable(cond): | ||
cond = cond(self.data[:]) | ||
else: | ||
raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
fix #224
TODO:
(1) Accept types Field, np.ndarray, list, tuple, for where function and methods
(2) where always returns a MemField, in case of inplace return self after changing internals to match type if necessary, otherwise returning fresh field
(3) categorical is typed down to integer, treat timestamp as float64 fields
For checking for multiple types: isinstance(cond, (list, tuple, np.ndarray))