implement where api #298

Liyuan-Chen-1024 · 2022-05-17T12:32:05Z

TODO:
(1) Accept types Field, np.ndarray, list, tuple, for where function and methods
(2) where always returns a MemField; if inplace, return self after changing internals to match the type if necessary, otherwise return a fresh field
(3) categorical is typed down to integer, treat timestamp as float64 fields
For checking for multiple types: isinstance(cond, (list, tuple, np.ndarray))

codecov-commenter · 2022-05-17T12:40:51Z

Codecov Report

Merging #298 (6da082d) into master (1267885) will increase coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master     #298      +/-   ##
==========================================
+ Coverage   83.24%   83.26%   +0.01%     
==========================================
  Files          22       22              
  Lines        6149     6287     +138     
  Branches     1247     1273      +26     
==========================================
+ Hits         5119     5235     +116     
- Misses        734      749      +15     
- Partials      296      303       +7

Impacted Files	Coverage Δ
exetera/core/abstract_types.py	`63.35% <50.00%> (-0.10%)`	⬇️
exetera/core/fields.py	`90.08% <73.56%> (-0.57%)`	⬇️
exetera/core/operations.py	`87.00% <100.00%> (+0.05%)`	⬆️


Liyuan-Chen-1024 · 2022-05-17T12:50:32Z

Here is the matrix that represents the field type returned by where for different pairs of input fields. Each (col, row) pair represents (a, b) in where(cond, a, b).

	NumericField	CategoricalField	IndexedStringField	FixedStringField
NumericField	NumericMemF	NumericMemF	IndexedStringMemF	IndexedStringMemF
CategoricalField	NumericMemF	NumericMemF	IndexedStringMemF	IndexedStringMemF
IndexedStringField	IndexedStringMemF	IndexedStringMemF	IndexedStringMemF	IndexedStringMemF
FixedStringField	IndexedStringMemF	IndexedStringMemF	IndexedStringMemF	FixedStringMemF

ericspod · 2022-05-18T17:47:21Z

The first part of the TODO was to accept list, tuple, and any sort of ndarray, not just bool arrays. Can we make that change?
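A minimal sketch of what accepting non-boolean conditions could look like, assuming plain NumPy truthiness in which non-zero values count as true. The helper name `as_bool_cond` is hypothetical and is not part of ExeTera.

```python
import numpy as np


def as_bool_cond(cond):
    """Coerce a list, tuple, or ndarray of any numeric dtype to a boolean mask.

    Hypothetical helper for illustration only; non-zero entries become True.
    """
    if not isinstance(cond, (list, tuple, np.ndarray)):
        raise TypeError("cond must be a list, tuple, or numpy ndarray")
    return np.asarray(cond).astype(bool)


a = np.array([10, 20, 30, 40])
b = np.array([-1, -2, -3, -4])

int_mask = as_bool_cond([0, 2, 0, 1])          # integers: non-zero is True
float_mask = as_bool_cond((0.0, 0.5, 0.0, 3))  # floats behave the same way
result = np.where(int_mask, a, b)              # picks from a where the mask is True
```

Coercing once at the boundary keeps the rest of the logic independent of the dtype the caller happened to pass.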

atbenmurray · 2022-05-30T13:25:44Z

exetera/core/fields.py

+            raise NotImplementedError("Where does not support condition on indexed string fields at present")
+        cond = cond.data[:]
+    elif callable(cond):
+        raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.")


typo: mehthod -> method

atbenmurray · 2022-05-30T13:26:17Z

exetera/core/fields.py

+        a = a.data[:]
+    if isinstance(b, Field):
+        b = b.data[:]
+    return np.where(cond, a, b)


This is still returning a numpy array rather than a field

The logic of the module-level where API will be almost the same as that of the instance-level where API. I think we can focus on one first, e.g. the instance-level where API.

atbenmurray · 2022-05-30T13:27:36Z

exetera/core/fields.py

@@ -143,6 +161,41 @@ def _ensure_valid(self):
        if not self._valid_reference:
            raise ValueError("This field no longer refers to a valid underlying field object")

+    def where(self, cond:Union[list, tuple, np.ndarray, Field], b, inplace=False):


Please add the callable signature to cond's type information

atbenmurray · 2022-05-30T13:30:22Z

exetera/core/fields.py

+            result_mem_field.data.write(result_ndarray)
+
+        elif isinstance(self, (IndexedStringField, FixedStringField)) or isinstance(b, (IndexedStringField, FixedStringField)):
+            result_mem_field = IndexedStringMemField(self._session)


This doesn't seem right. Why are we causing an operation with fixed string field to output an indexed string field?
It doesn't make the logic much more complicated. Also, I would make that a separate method probably, because I can imagine us needing it elsewhere in the future.

For FixedStringField, you can refer to the matrix I listed above. The result is a FixedStringField only when both inputs are FixedStringFields; otherwise it is an IndexedStringField.

atbenmurray · 2022-05-30T13:35:11Z

atbenmurray · 2022-05-30T13:35:42Z

Sorry, accidental close

atbenmurray · 2022-05-30T13:37:00Z

Fixed string type promotion should not result in indexed strings, I think. Here is a revised version of the table below:

atbenmurray · 2022-05-30T13:47:01Z

a	b	result	notes
numeric	numeric	numeric
categorical	numeric	numeric
categorical	categorical	numeric	we could support categorical of the same dictionary if the categorical types are identical
fixed string	numeric	fixed string	type is 'S' with length `max`, where `max` is the larger of the longest numeric representation and the fixed string length
fixed string	categorical	fixed string	see above (and treat categorical like numeric)
fixed string	fixed string	fixed string	longest fixed string
indexed string	numeric	indexed string
indexed string	categorical	indexed string
indexed string	fixed string	indexed string
indexed string	indexed string	indexed string
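The revised table can be read as a ranking in which indexed string outranks fixed string, which outranks numeric, with categorical treated like numeric. The sketch below encodes that reading; the names `where_result_kind` and `fixed_string_length` are hypothetical, and it assumes the table is symmetric in a and b, which the table itself does not state.

```python
NUMERIC = "numeric"
CATEGORICAL = "categorical"
FIXED = "fixed string"
INDEXED = "indexed string"

# Higher rank wins; categorical is treated like numeric, as in the table.
_RANK = {NUMERIC: 0, CATEGORICAL: 0, FIXED: 1, INDEXED: 2}
_RESULT = {0: NUMERIC, 1: FIXED, 2: INDEXED}


def where_result_kind(a_kind, b_kind):
    """Return the result kind for where(cond, a, b) under the revised table."""
    return _RESULT[max(_RANK[a_kind], _RANK[b_kind])]


def fixed_string_length(fixed_len, numeric_values):
    """Length of the result when a fixed string meets numeric data.

    Assumes the 'longest numeric representation' is the longest str() form.
    """
    longest_numeric = max((len(str(v)) for v in numeric_values), default=0)
    return max(fixed_len, longest_numeric)


kind = where_result_kind(FIXED, CATEGORICAL)
length = fixed_string_length(3, [1, 22222])
```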

ericspod · 2022-06-13T14:23:40Z

exetera/core/fields.py

+
+        result_mem_field = None
+
+        if isinstance(self, IndexedStringField) and isinstance(b, IndexedStringField):


When doing the type checking, we need to check that it's one of two types: isinstance(self, (IndexedStringField, IndexedStringMemField)).
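A small sketch of that check, using empty stand-in classes rather than the real ExeTera types. It assumes, as the comment implies, that the in-memory variant does not inherit from the HDF5-backed one, so a single-class isinstance would miss it.

```python
class IndexedStringField:
    """Stand-in for the HDF5-backed field type (not the real class)."""


class IndexedStringMemField:
    """Stand-in for the in-memory field type (not the real class)."""


class NumericMemField:
    """Stand-in for an unrelated field type (not the real class)."""


# One tuple per family keeps the HDF5 and memory variants together.
INDEXED_STRING_TYPES = (IndexedStringField, IndexedStringMemField)


def both_indexed_string(a, b):
    """True when a and b are indexed string fields of either variant."""
    return isinstance(a, INDEXED_STRING_TYPES) and isinstance(b, INDEXED_STRING_TYPES)


mixed = both_indexed_string(IndexedStringField(), IndexedStringMemField())
other = both_indexed_string(IndexedStringMemField(), NumericMemField())
```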

ericspod · 2022-06-13T14:38:49Z

exetera/core/fields.py

+            cond = cond(self.data[:])
+        else:
+            raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField")
+


Here we could just do return where(cond, self, b) and then the rest of the body of this method can be put into the global where function.
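A rough sketch of that delegation, using a stand-in `SketchField` class that only wraps an ndarray behind `.data`; it is not the ExeTera field implementation. The instance method resolves a callable cond against its own data and hands everything else to the module-level function, which wraps the result in a field rather than returning a bare array.

```python
import numpy as np


class SketchField:
    """Stand-in for a field: wraps an ndarray exposed through `.data`."""

    def __init__(self, values):
        self.data = np.asarray(values)

    def where(self, cond, b):
        # Only the instance method can resolve a callable, since it knows
        # which data the callable should be applied to.
        if callable(cond):
            cond = cond(self.data[:])
        return where(cond, self, b)


def where(cond, a, b):
    """Module-level where: the shared logic lives here."""
    if callable(cond):
        raise NotImplementedError("callable cond needs the instance method")
    if isinstance(cond, SketchField):
        cond = cond.data[:]
    a_data = a.data[:] if isinstance(a, SketchField) else a
    b_data = b.data[:] if isinstance(b, SketchField) else b
    # Wrap the ndarray so callers always get a field back.
    return SketchField(np.where(cond, a_data, b_data))


picked = SketchField([1, 6, 3, 9]).where(lambda f: f > 5, SketchField([0, 0, 0, 0]))
```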

atbenmurray · 2022-06-16T10:20:19Z

exetera/core/fields.py

+        other_field_row_count = len(other_field.data[:])
+        data_converted_to_str = np.where([True]*other_field_row_count, other_field.data[:], [""]*other_field_row_count)
+        maxLength = 0
+        re_match = re.findall(r"<U(\d+)|S(\d+)", str(data_converted_to_str.dtype))


U can be <U or >U
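One way to avoid the byte-order prefix altogether is to read the width from the dtype object instead of parsing its string form. This is an alternative sketch, not the regex approach used in the PR, and the helper name `string_width` is hypothetical. It relies on NumPy storing unicode strings as 4 bytes per character and byte strings as 1 byte per character.

```python
import numpy as np


def string_width(arr):
    """Return the character width of a NumPy 'S' or 'U' array.

    Works for '<U', '>U' and '|S' alike because no string parsing is involved.
    """
    dtype = np.asarray(arr).dtype
    if dtype.kind == "S":
        return dtype.itemsize        # 1 byte per character
    if dtype.kind == "U":
        return dtype.itemsize // 4   # UCS-4: 4 bytes per character
    raise ValueError(f"expected an 'S' or 'U' dtype, got {dtype}")


little = string_width(np.array(["abc", "de"]))       # native byte order
big = string_width(np.array(["hello"], dtype=">U7"))  # big-endian unicode
raw = string_width(np.array([b"abcd"]))               # byte strings
```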

atbenmurray · 2022-06-16T10:20:35Z

tests/test_fields.py

@@ -2169,6 +2169,167 @@ def test_indexed_string_isin(self, data, isin_data, expected):
                np.testing.assert_array_equal(expected, result)


+WHERE_BOOLEAN_COND = RAND_STATE.randint(0, 2, 20).tolist()


Are we missing tests for when cond is a field?

Yes, a unit test for the case where cond is a field is currently missing. I'm adding one.
For an IndexedStringField, we will raise an exception.
How should we deal with a FixedStringField? A string can't be used directly as a boolean value, so which values of a fixed string field should count as True, and which as False?

atbenmurray · 2022-06-27T14:53:04Z

tests/test_fields.py

+WHERE_FIXED_STRING_TESTS = [
+    (lambda f: f > 5, "create_numeric", {"nformat": "int8"}, WHERE_NUMERIC_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA),
+    (lambda f: f > 2, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA),
+    (WHERE_BOOLEAN_COND,  "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA),


2300: can we also do this for float32?

atbenmurray · 2022-06-27T14:59:43Z

tests/test_fields.py

+            np.testing.assert_array_equal(result.data[:], expected_result)
+
+        # reload to test FixedStringMemField
+        a_mem_field, b_mem_field = a_field, b_field


Move this to before the first subtest

atbenmurray · 2022-06-27T15:00:01Z

tests/test_fields.py

+
+        expected_result = where_oracle(cond, a_field_data, b_field_data)
+
+        with self.subTest(f"Test instance where method: a is {type(a_field)}, b is {type(b_field)}"):


Move this to after the mem fields are created

atbenmurray · 2022-06-27T15:00:14Z

tests/test_fields.py

+
+        # reload to test FixedStringMemField
+        a_mem_field, b_mem_field = a_field, b_field
+        if isinstance(a_field, fields.FixedStringField):


condition can be removed

atbenmurray · 2022-06-27T15:00:28Z

tests/test_fields.py

+            a_mem_field = fields.FixedStringMemField(self.s, a_kwarg["length"])
+            a_mem_field.data.write(np.array(a_field_data))
+
+        if isinstance(b_field, fields.FixedStringField):


condition can be removed

atbenmurray · 2022-06-27T15:01:11Z

tests/test_fields.py

+            b_mem_field = fields.FixedStringMemField(self.s, b_kwarg["length"])
+            b_mem_field.data.write(np.array(b_field_data))
+
+        with self.subTest(f"Test instance where method: a is {type(a_mem_field)}, b is {type(b_mem_field)}"):


Do all four combinations:
a_field, b_field
a_field, b_mem_field
a_mem_field, b_field
a_mem_field, b_mem_field
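The four pairings listed above can be generated rather than written out by hand. The sketch below uses `itertools.product`; the helper name `field_combinations` is hypothetical, and the arguments are whatever field objects the test has already created.

```python
from itertools import product


def field_combinations(a_field, a_mem_field, b_field, b_mem_field):
    """Return every (a, b) pairing of HDF5-backed and in-memory fields."""
    return list(product((a_field, a_mem_field), (b_field, b_mem_field)))


# Strings stand in for real field objects here.
combos = field_combinations("a_field", "a_mem_field", "b_field", "b_mem_field")
for a, b in combos:
    # In the real test, each pairing would get its own self.subTest(...) block.
    label = f"a is {a}, b is {b}"
```

Iterating over the product keeps the subtest body in one place, so a later change to the assertion applies to all four pairings.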

atbenmurray · 2022-06-27T15:01:35Z

tests/test_fields.py

+
+
+    @parameterized.expand(WHERE_INDEXED_STRING_TESTS)
+    def test_instance_field_where_return_indexed_string_mem_field(self, cond, a_creator, a_kwarg, a_field_data, b_creator, b_kwarg, b_field_data):


Same here with combinations of hdf5 and mem fields

atbenmurray · 2022-06-27T15:03:38Z

exetera/core/fields.py

+    if isinstance(cond, (list, tuple, np.ndarray)):
+        cond = cond
+    elif isinstance(cond, Field):
+        if isinstance(cond, (NumericField, CategoricalField)):


still not checking for both hdf5 and mem field types

atbenmurray · 2022-06-27T15:04:09Z

exetera/core/fields.py

+        if isinstance(cond, (list, tuple, np.ndarray)):
+            cond = cond
+        elif isinstance(cond, Field):
+            if isinstance(cond, (NumericField, CategoricalField)):


still not checking both hdf5 and mem field types

atbenmurray · 2022-06-27T15:04:36Z

exetera/core/fields.py

+        if isinstance(cond, (list, tuple, np.ndarray)):
+            cond = cond
+        elif isinstance(cond, Field):
+            if isinstance(cond, (NumericField, CategoricalField)):


still not checking hdf5 and mem field types

atbenmurray

If these exception handling messages are fixed, I think we are good to go

atbenmurray · 2022-07-25T13:31:32Z

exetera/core/fields.py

+        else:
+            raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
+    elif callable(cond):
+        raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.")


Typo, please replace with:

"module method fields.where doesn't support callable cond parameter, please use the instance method where if you need to use a callable cond parameter"

atbenmurray · 2022-07-25T13:33:58Z

exetera/core/fields.py

+        if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
+            cond = cond.data[:]
+        else:
+            raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")


Typo, please replace with:

"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

atbenmurray · 2022-07-25T13:35:38Z

exetera/core/fields.py

+                    if l:
+                        maxLength = int(l)
+            else:
+                raise ValueError("The return dtype of instance method `where` doesn't match '<U(\d+)' or 'S(\d+)' when one of the field is FixedStringField")


Typo, please replace with:

"The return dtype of instance method where doesn't match '<U(\d+)' or 'S(\d+)' when one of the fields is a fixed string field"

atbenmurray · 2022-07-25T13:36:58Z

exetera/core/fields.py

+            if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
+                cond = cond.data[:]
+            else:
+                raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")


Typo, please replace with:

"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

atbenmurray · 2022-07-25T13:38:04Z

exetera/core/fields.py

+            if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
+                cond = cond.data[:]
+            else:
+                raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")


Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

atbenmurray · 2022-07-25T13:38:10Z

exetera/core/fields.py

+        elif callable(cond):
+            cond = cond(self.data[:])
+        else:
+            raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.")


Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

atbenmurray · 2022-07-25T13:39:22Z

exetera/core/fields.py

+            if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
+                cond = cond.data[:]
+            else:
+                raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")


Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

atbenmurray · 2022-07-25T13:39:29Z

exetera/core/fields.py

+        elif callable(cond):
+            cond = cond(self.data[:])
+        else:
+            raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.")


Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

implement where api

aa7301a

Liyuan-Chen-1024 requested review from ericspod and atbenmurray May 17, 2022 13:04

implement where return memfield

834b5a9

atbenmurray reviewed May 30, 2022

atbenmurray requested changes May 30, 2022

atbenmurray closed this May 30, 2022

atbenmurray reopened this May 30, 2022

Liyuan-Chen-1024 added 3 commits June 9, 2022 19:14

implement fixed string field and add parameterized unittest

fd79955

implement where for two indexed string fields

a564305

implement instance where when one field indexedstringfield, the other…

fa62e3b

… is not

ericspod reviewed Jun 13, 2022

Liyuan-Chen-1024 added 2 commits June 14, 2022 15:25

add IndexStringMemField check; move where logic to global

dea5b9c

combine the logic of 'a&b is indexedstringfield' and 'a|b is indexeds…

bd51519

…tringfield'

atbenmurray reviewed Jun 16, 2022

Liyuan-Chen-1024 added 6 commits June 17, 2022 11:55

add unittest when cond is field; add >U

6da082d

field indexing

0f83acd

Merge branch 'master' of github.com:KCL-BMEIS/ExeTera into where_with…

213ddac

…_mixed_fields

add mem field test

127511d

move code change to another branch

c0746fd

add combination of field and its memfield

78c7b95

atbenmurray reviewed Jul 5, 2022

add mem field check; add float dtype check

86fdeed

atbenmurray reviewed Jul 25, 2022

fix exception handling messages

0d51c11

