-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dfview #297
base: master
Are you sure you want to change the base?
Dfview #297
Conversation
add get_spans() in Field class, similar to get_spans() in Session class
to add CSVDataset file as the import required in module init
…d string, indexed string, etc.
…ction in core.operations. Provide get_spans methods in fields using data attribute.
Modify the get_spans functions in Session to call field method and operation method.
exetera/core/fields.py
Outdated
if self._filter_wrapper is None: # poential returns: raise error or return a full-index array | ||
return None | ||
else: | ||
return self._filter_wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to return a read-only field array for this.
exetera/core/fields.py
Outdated
""" | ||
Return if the dataframe's name matches the field h5group path; if not, means this field is a view. | ||
""" | ||
if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name: | |
return self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name |
exetera/core/fields.py
Outdated
@@ -243,7 +264,11 @@ def dtype(self): | |||
return self._dataset.dtype | |||
|
|||
def __getitem__(self, item): | |||
return self._dataset[item] | |||
if self._field_instance.filter is not None: | |||
mask = self._field_instance.filter[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apply the item to select indices from the mask first then use these to get values from self._dataset:
idx = mask[item]
exetera/core/fields.py
Outdated
return self._dataset[item] | ||
if self._field_instance.filter is not None and not isinstance(self._field_instance, IndexedStringField): | ||
mask = self._field_instance.filter[:] | ||
data = self._dataset[:][mask] # as HDF5 does not support unordered mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same idea here, want to get indices from mask rather than resolve all mask indices then select from result.
@@ -483,16 +494,17 @@ def complete(self): | |||
|
|||
|
|||
class ReadOnlyIndexedFieldArray: | |||
def __init__(self, field, indices, values): | |||
def __init__(self, chunksize, indices, values, field): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: check this is right?
exetera/core/fields.py
Outdated
index[ir + 1] - np.int64(startindex)].tobytes().decode() | ||
return results | ||
else: | ||
mask = self._field_instance.filter[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, slice the mask not the data.
exetera/core/fields.py
Outdated
value = self._values[start:stop].tobytes().decode() | ||
return value | ||
else: | ||
mask = self._field_instance.filter[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Samesies.
exetera/core/dataframe.py
Outdated
# add filter | ||
if filter is not None: | ||
nformat = 'int32' | ||
if len(filter) > 0 and np.max(filter) >= 2**31 - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils.INT64_INDEX_LENGTH
for name, field in self._columns.items(): | ||
newfld = field.create_like(ddf, name) | ||
field.apply_filter(filter_to_apply_, target=newfld) | ||
ddf._add_view(field, filter_to_apply_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if the same dataset
exetera/core/dataframe.py
Outdated
@@ -478,13 +524,12 @@ def apply_filter(self, filter_to_apply, ddf=None): | |||
:returns: a dataframe contains all the fields filterd, self if ddf is not set | |||
""" | |||
filter_to_apply_ = val.validate_filter(filter_to_apply) | |||
|
|||
if ddf is not None: | |||
if ddf is not None and ddf is not self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ddf = self if ddf is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ddf not in (None, self)
exetera/core/dataframe.py
Outdated
if ddf is not None: | ||
if not isinstance(ddf, DataFrame): | ||
raise TypeError("The destination object must be an instance of DataFrame.") | ||
ddf._write_filter(np.where(filter_to_apply_ == True)[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that you return the filter reference here so you can directly assign it during add_view (line 608)
exetera/core/fields.py
Outdated
return None | ||
else: | ||
return self._filter_wrapper[:] | ||
return self._filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead code
exetera/core/fields.py
Outdated
|
||
@property | ||
def filter(self): | ||
if self._filter_wrapper is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return filter field rather than dereferencing
exetera/core/fields.py
Outdated
""" | ||
self._references.remove(field) | ||
|
||
def concreate_all_fields(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: concrete_all_fields
exetera/core/fields.py
Outdated
@@ -310,6 +333,8 @@ def clear(self): | |||
Replaces current dataset with empty dataset. | |||
:return: None | |||
""" | |||
if len(self._references) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this check inside the notification method
view.update(self, msg) | ||
|
||
def update(self, subject, msg=None): | ||
if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is being notified by its own field array
# It needs to notify other fields that it is about to change before the change goes ahead
self.notify(msg) | ||
self.detach() | ||
|
||
if isinstance(subject, HDF5Field): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is being notified by the field that owns the data that it has a view of
# At present, the behavior is that it copies the data and then detaches from the view that notified it, as it
# no longer has an observation relationship with that field
exetera/core/fields.py
Outdated
def update(self, subject, msg=None): | ||
if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)): | ||
self.notify(msg) | ||
self.detach() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detach should be the responsibility of the observer, not the subject, as the subject could instead do something clever that maintains the relationship
def attach(self, view): | ||
self._view_refs.append(view) | ||
|
||
def detach(self, view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This detach is actually notify_deletion
. This is a standard part of subject_observer, but so is detach
, which is for the observer to detach from a given subject
def attach(self, view): | ||
self._view_refs.append(view) | ||
|
||
def detach(self, view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also deletion function (subject del observer)
if utils.is_sorted(mask): | ||
return self._dataset[mask] | ||
else: | ||
return self._dataset[np.sort(mask)][mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an issue on benchmarking: filtering/item -> hdf5 or hdf5/numpy -> filter.
exetera/core/fields.py
Outdated
bytestr[index[ir] - np.int64(startindex): | ||
index[ir + 1] - np.int64(startindex)].tobytes().decode() | ||
return results | ||
if self._field_instance.filter is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is not a view so no filtered_index to deal with
else: | ||
mask = self._field_instance.filter[item] | ||
if utils.is_sorted(mask): | ||
index_s = self._indices[mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# the filtered indices represent a filter operation
We need to evaluate whether this saves anything. H5py is horrifically slow and the risk is that all of this work is lost when loading slices through its api. I would suggest just loading all of the data and applying the filtered index rather than trying to gain time here, unless we do a series of detailed benchmarks that can give us a heuristic to decide whether to do this or not
1) has hdf5 group but no dataset 2) can be re-recognized in a new session
add document
Codecov Report
@@ Coverage Diff @@
## master #297 +/- ##
==========================================
- Coverage 83.39% 82.63% -0.77%
==========================================
Files 22 22
Lines 6191 6478 +287
Branches 1247 1324 +77
==========================================
+ Hits 5163 5353 +190
- Misses 733 802 +69
- Partials 295 323 +28
Continue to review full report at Codecov.
|
df.view() to create a view from the df, contains all fields as a reference
df.apply_filter() to set filter