Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dfview #297

Open
wants to merge 239 commits into
base: master
Choose a base branch
from
Open

Dfview #297

wants to merge 239 commits into from

Conversation

deng113jie
Copy link
Contributor

df.view() to create a view from the df, contains all fields as a reference
df.apply_filter() to set filter

jie and others added 30 commits March 11, 2021 16:17
add get_spans() in Field class, similar to get_spans() in Session class
to add CSVDataset file as the import required in module init
…ction in core.operations.

Provide get_spans methods in fields using data attribute.
Modify the get_spans functions in Session to call field method and operation method.
if self._filter_wrapper is None: # poential returns: raise error or return a full-index array
return None
else:
return self._filter_wrapper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to return a read-only field array for this.

"""
Return if the dataframe's name matches the field h5group path; if not, means this field is a view.
"""
if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name:
return self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name

@@ -243,7 +264,11 @@ def dtype(self):
return self._dataset.dtype

def __getitem__(self, item):
return self._dataset[item]
if self._field_instance.filter is not None:
mask = self._field_instance.filter[:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply the item to select indices from the mask first then use these to get values from self._dataset:

idx = mask[item]

return self._dataset[item]
if self._field_instance.filter is not None and not isinstance(self._field_instance, IndexedStringField):
mask = self._field_instance.filter[:]
data = self._dataset[:][mask] # as HDF5 does not support unordered mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same idea here, want to get indices from mask rather than resolve all mask indices then select from result.

@@ -483,16 +494,17 @@ def complete(self):


class ReadOnlyIndexedFieldArray:
def __init__(self, field, indices, values):
def __init__(self, chunksize, indices, values, field):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: check this is right?

index[ir + 1] - np.int64(startindex)].tobytes().decode()
return results
else:
mask = self._field_instance.filter[:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, slice the mask not the data.

value = self._values[start:stop].tobytes().decode()
return value
else:
mask = self._field_instance.filter[:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Samesies.

@deng113jie deng113jie marked this pull request as ready for review May 24, 2022 16:51
@KCL-BMEIS KCL-BMEIS deleted a comment from codecov-commenter May 24, 2022
# add filter
if filter is not None:
nformat = 'int32'
if len(filter) > 0 and np.max(filter) >= 2**31 - 1:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.INT64_INDEX_LENGTH

for name, field in self._columns.items():
newfld = field.create_like(ddf, name)
field.apply_filter(filter_to_apply_, target=newfld)
ddf._add_view(field, filter_to_apply_)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if the same dataset

@@ -478,13 +524,12 @@ def apply_filter(self, filter_to_apply, ddf=None):
:returns: a dataframe contains all the fields filterd, self if ddf is not set
"""
filter_to_apply_ = val.validate_filter(filter_to_apply)

if ddf is not None:
if ddf is not None and ddf is not self:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ddf = self if ddf is None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ddf not in (None, self)

if ddf is not None:
if not isinstance(ddf, DataFrame):
raise TypeError("The destination object must be an instance of DataFrame.")
ddf._write_filter(np.where(filter_to_apply_ == True)[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that you return the filter reference here so you can directly assign it during add_view (line 608)

return None
else:
return self._filter_wrapper[:]
return self._filter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code


@property
def filter(self):
if self._filter_wrapper is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return filter field rather than dereferencing

"""
self._references.remove(field)

def concreate_all_fields(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: concrete_all_fields

@@ -310,6 +333,8 @@ def clear(self):
Replaces current dataset with empty dataset.
:return: None
"""
if len(self._references) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this check inside the notification method

view.update(self, msg)

def update(self, subject, msg=None):
if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# This field is being notified by its own field array
# It needs to notify other fields that it is about to change before the change goes ahead

self.notify(msg)
self.detach()

if isinstance(subject, HDF5Field):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# This field is being notified by the field that owns the data that it has a view of
# At present, the behavior is that it copies the data and then detaches from the view that notified it, as it
# no longer has an observation relationship with that field

def update(self, subject, msg=None):
if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)):
self.notify(msg)
self.detach()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detach should be the responsibility of the observer, not the subject, as the subject could instead do something clever that maintains the relationship

def attach(self, view):
self._view_refs.append(view)

def detach(self, view=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This detach is actually notify_deletion. This is a standard part of subject_observer, but so is detach, which is for the observer to detach from a given subject

def attach(self, view):
self._view_refs.append(view)

def detach(self, view=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also deletion function (subject del observer)

if utils.is_sorted(mask):
return self._dataset[mask]
else:
return self._dataset[np.sort(mask)][mask]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an issue on benchmarking: filtering/item -> hdf5 or hdf5/numpy -> filter.

bytestr[index[ir] - np.int64(startindex):
index[ir + 1] - np.int64(startindex)].tobytes().decode()
return results
if self._field_instance.filter is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# This field is not a view so no filtered_index to deal with

else:
mask = self._field_instance.filter[item]
if utils.is_sorted(mask):
index_s = self._indices[mask]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# the filtered indices represent a filter operation
We need to evaluate whether this saves anything. H5py is horrifically slow and the risk is that all of this work is lost when loading slices through its api. I would suggest just loading all of the data and applying the filtered index rather than trying to gain time here, unless we do a series of detailed benchmarks that can give us a heuristic to decide whether to do this or not

1) has hdf5 group but no dataset
2) can be re-recognized in a new session
@codecov-commenter
Copy link

codecov-commenter commented May 26, 2022

Codecov Report

Merging #297 (3153f2b) into master (bc66b84) will decrease coverage by 0.76%.
The diff coverage is 73.04%.

@@            Coverage Diff             @@
##           master     #297      +/-   ##
==========================================
- Coverage   83.39%   82.63%   -0.77%     
==========================================
  Files          22       22              
  Lines        6191     6478     +287     
  Branches     1247     1324      +77     
==========================================
+ Hits         5163     5353     +190     
- Misses        733      802      +69     
- Partials      295      323      +28     
Impacted Files Coverage Δ
exetera/core/abstract_types.py 63.15% <57.14%> (-0.30%) ⬇️
exetera/core/fields.py 87.82% <69.59%> (-3.22%) ⬇️
exetera/core/dataframe.py 85.81% <84.93%> (-1.55%) ⬇️
exetera/core/dataset.py 94.91% <100.00%> (+0.32%) ⬆️
exetera/core/utils.py 78.32% <100.00%> (+0.62%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc66b84...3153f2b. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants