KCL-BMEIS · deng113jie · Mar 11, 2021 · Mar 12, 2021 · Mar 12, 2021 · Mar 15, 2021
diff --git a/docs/index.rst b/docs/index.rst
@@ -36,6 +36,7 @@ Getting started
     dataset.md
     dataframe.md
     field.md
+    view.md
 
 .. toctree::
     :maxdepth: 1

diff --git a/docs/view.md b/docs/view.md
@@ -0,0 +1,48 @@
+# View
+## What is a view
+A view is a special field that has a ‘source_field’ entry in its HDF5 attributes. When this attribute is present, the view initializes its dataset from the location specified by ‘source_field’ rather than from its own HDF5 storage.
+
+
+The benefit of using a view is reduced disk IO during apply_index and apply_filter operations. The index or filter is stored in the dataframe; the view reads from the source field and combines that data with the index or filter to perform the filtering without writing an extra copy of the data.
+
+Specifically, dataframe._filters_grp is the HDF5 group where the indices or filters are stored. A boolean filter is converted to an integer index in the apply_filter function. The index filter is stored as a NumericField by the dataframe._add_view function.
+
+![View Structure](view_arch.png)
+
+## How to generate a view
+
+The user can either 1) call apply_filter or apply_index with a destination dataframe in the same dataset to generate views with index filters, or 2) call dataframe.view() to generate views without index filters. The internal function dataframe._add_view() takes care of the construction. Please note that a view can only be created from a field that exists in the same dataset/file, because associating a view with a field from a different file would introduce a lot of uncertainty.
+
+A view is just a special instance of a field, so its construction is similar: 1) call field.base_view_constructor to set up the h5group, and 2) call the constructor of the specific field type to initialize the field instance. There is one additional action: 3) attach the view to the source field so that the source field can notify the view if the underlying data changes. These three actions can be seen in the upper part of dataframe._add_view().
+
+You can tell whether a field is a view by calling field.is_view(). This method checks for the ‘source_field’ attribute in the field’s HDF5 group; if the attribute is present, the field is a view.
+
+## Fetch data from a view
+Because a view behaves just like a field to users, you can still use field.data[:] to fetch the data from the view. In the field implementation, the member ‘data’ is a FieldArray, so the difference between a view and a field lies in the initialization of the FieldArray. Normally, the FieldArray loads the HDF5 dataset of the current HDF5 group (where the field is stored); for a view, however, the FieldArray loads data from the HDF5 group specified by the ‘source_field’ attribute.
+
+When there is a filter/index for the view (field.filter is not None), the FieldArray fetches the filter/index first and uses it to mask the underlying data. This logic can be found in FieldArray.__getitem__, or in IndexFieldArray.__getitem__ for indexed strings.
+
+## Life-cycle of a view
+### A view from a field
+Step 0: You have a field in the dataframe and call dataframe.apply_filter(), apply_index() or view().
+
+Step 1: The view is created and attached to the source field. The attach method is defined on the field, but it is called in dataframe._add_view.
+
+Step 2: When view.data is accessed, the view initializes a FieldArray that points to the source field rather than its own dataset.
+
+Step 3: When field.data.write or field.data.clear is called, meaning the data will be modified, data.write or data.clear calls field.update() to notify the field of the action. The field then passes the notification to its views in field.notify(). On receiving the notification, view.update() can perform certain actions.
+
+Step 4: At the moment, view.update() copies the original data to its own dataset, re-initializes the data interface and deletes the ‘source_field’ attribute (so that it is no longer a view).
+
+![View Structure](view_life.png)
+
+### An existing view
+Because the view is stored in the HDF5 file, the view relationship persists across sessions. Upon loading a dataset (in dataset.__init__), the dataset checks whether there are any views and calls dataframe._bind_view() to attach each view to its source field during initialization of the dataset/dataframe/fields. This is why a view can only be created from a field that exists in the same dataset (HDF5 file).
+
+
+## Future work
+### Data fetching performance
+Different ways of getting data out of HDF5 can differ considerably in performance. For example, it is generally better to read data from HDF5 by chunk rather than by index. In the current implementation (fieldarray.__getitem__), we first mask the index filter with the item, then fetch the data from HDF5. As HDF5 doesn't support unordered data access, we sort the mask and restore the original order when returning the data. Further work can be done on how to order the application of the index filter and the item (specified by the user through __getitem__). For example, with a large volume of data and a small index filter, it might make sense to mask the filter first. With large filters, however, it will be faster to load the data into memory first. Where the boundary lies is worth investigating.
+
+### Dependency between views
+In the current implementation, all views depend on the source field. When the data in the field changes, every attached view copies the data over and writes its own copy. This is inefficient when many views are attached. A better approach could be for only one of the views to copy the data and become the source field for the remaining views. This needs a detailed design and implementation in fields.update().
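Note on "Data fetching performance" in docs/view.md: the sort-then-restore step described there can be sketched as follows. This is a minimal illustration rather than the code in fieldarray.__getitem__; the function name `fetch_through_index` is hypothetical, and a NumPy array stands in for the HDF5 dataset. It assumes the backing store only accepts ascending, non-repeating row numbers.

```python
import numpy as np


def fetch_through_index(source, index_filter, item=slice(None)):
    """Return source[index_filter][item] using a single ascending, unique read.

    `source` stands in for the HDF5 dataset of the source field (assumption).
    """
    # Apply the user's item to the index filter first, so only the
    # requested rows are looked up in the source.
    wanted = np.atleast_1d(np.asarray(index_filter)[item])
    # Ascending unique row numbers, plus the position of each wanted row
    # within that sorted list.
    unique_sorted, inverse = np.unique(wanted, return_inverse=True)
    # One ordered read from the backing store.
    fetched = source[unique_sorted]
    # Restore the order (and any repeated rows) the caller asked for.
    return fetched[inverse]
```

With this arrangement the cost of the read grows with the number of requested rows, which is the "mask the filter first" strategy; the alternative discussed in the doc would load the source into memory and index it there.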
diff --git a/docs/view_arch.png b/docs/view_arch.png
diff --git a/docs/view_life.png b/docs/view_life.png
diff --git a/exetera/core/abstract_types.py b/exetera/core/abstract_types.py
@@ -65,6 +65,11 @@ def indexed(self):
     def data(self):
         raise NotImplementedError()
 
+    @property
+    @abstractmethod
+    def filter(self):
+        raise NotImplementedError()
+
     @abstractmethod
     def __bool__(self):
         raise NotImplementedError()
@@ -491,3 +496,39 @@ def ordered_merge_right(self, right_on, left_on,
                             left_field_sources=tuple(), right_field_sinks=None,
                             right_to_left_map=None, right_unique=False, left_unique=False):
         raise NotImplementedError()
+
+
+class SubjectObserver(ABC):
+    def attach(self, observer):
+        """
+        Attach the observer (view) to the subject (field).
+        """
+        raise NotImplementedError()
+
+    def detach(self, observer):
+        """
+        Detach the observer (view) from the subject (field); this removes the association between the observer and the subject.
+        This method is called by the observer.
+        """
+        raise NotImplementedError()
+
+    def notify_deletion(self, observer=None):
+        """
+        Delete the observer from the subject, but called from the subject side.
+        """
+        raise NotImplementedError()
+
+    def notify(self, msg=None):
+        """
+        Called by the Subject to notify the observer on something.
+        """
+        raise NotImplementedError()
+
+    def update(self, subject, msg=None):
+        """
+        Called inside the observer, to perform actions based on subject and message type.
+        """
+        raise NotImplementedError()
+
+
+
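Note on SubjectObserver: the intended interplay between attach, notify, update and detach can be illustrated with a small stand-alone sketch. The classes `SourceField` and `View` below are hypothetical and use plain Python lists instead of HDF5 storage. The sketch assumes the notification is sent before the data is overwritten, so the view can still copy the original values, as the life-cycle description in docs/view.md suggests.

```python
class SourceField:
    """Plays the subject role: holds data and notifies attached views."""

    def __init__(self, values):
        self.values = list(values)
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def detach(self, view):
        self._views.remove(view)

    def write(self, values):
        # Assumption: notify before changing the data so that views can
        # still copy the original values.
        self.notify(msg='write')
        self.values = list(values)

    def notify(self, msg=None):
        # Iterate over a copy because views detach themselves in update().
        for view in list(self._views):
            view.update(self, msg)


class View:
    """Plays the observer role: reads from the source until it changes."""

    def __init__(self, source):
        self.source = source
        self.own_values = None
        source.attach(self)

    def is_view(self):
        return self.source is not None

    @property
    def data(self):
        return self.source.values if self.is_view() else self.own_values

    def update(self, subject, msg=None):
        # Copy the original data, then stop being a view.
        self.own_values = list(subject.values)
        subject.detach(self)
        self.source = None
```

After `field.write(...)`, each view in this sketch holds its own copy of the old data and is no longer attached, which mirrors Step 3 and Step 4 of the documented life-cycle.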
diff --git a/exetera/core/dataframe.py b/exetera/core/dataframe.py
@@ -17,6 +17,7 @@
 from exetera.core import fields as fld
 from exetera.core import operations as ops
 from exetera.core import validation as val
+from exetera.core.utils import INT64_INDEX_LENGTH
 import h5py
 import csv as csvlib
 
@@ -58,10 +59,16 @@ def __init__(self,
         self.name = name
         self._columns = OrderedDict()
         self._dataset = dataset
-        self._h5group = h5group
+        self._h5group = h5group  # the HDF5 group to store all fields
 
         for subg in h5group.keys():
-            self._columns[subg] = dataset.session.get(h5group[subg])
+            if subg[0] != '_':  # groups prefixed with '_' store metadata, for example filters
+                self._columns[subg] = dataset.session.get(h5group[subg])
+
+        if '_filters' not in h5group.keys():
+            self._filters_grp = self._h5group.create_group('_filters')
+        else:
+            self._filters_grp = h5group['_filters']
 
     @property
     def columns(self):
@@ -101,15 +108,67 @@ def add(self,
             nfield.data.write(field.data[:])
         self._columns[dname] = nfield
 
+    def _add_view(self, field: fld.Field, filter: np.ndarray = None):
+        """
+        Internal function called by apply_filter to add a field view into the dataframe.
+
+        :param field: The field to apply filter to.
+        :param filter: The filter to apply.
+        :return: The field view.
+
+        """
+        # add view
+        h5group = fld.base_view_contructor(field._session, self, field)
+        view = type(field)(field._session, h5group, self, write_enabled=True)
+        field.attach(view)
+        self._columns[view.name] = view
+
+        # add filter
+        if filter is not None:
+            nformat = 'int32'
+            if len(filter) > 0 and np.max(filter) >= INT64_INDEX_LENGTH:
+                nformat = 'int64'
+            filter_name = view.name
+            if filter_name not in self._filters_grp.keys():
+                fld.numeric_field_constructor(self._dataset.session, self._filters_grp, filter_name, nformat)
+                filter_field = fld.NumericField(self._dataset.session, self._filters_grp[filter_name], self,
+                                                write_enabled=True)
+                filter_field.data.write(filter)
+            else:
+                filter_field = fld.NumericField(self._dataset.session, self._filters_grp[filter_name], self,
+                                                write_enabled=True)
+                if nformat not in filter_field._fieldtype:
+                    filter_field = filter_field.astype(nformat)
+                filter_field.data.clear()
+                filter_field.data.write(filter)
+
+            view._filter_index_wrapper = fld.ReadOnlyFieldArray(filter_field, 'values')  # read-only
+
+        return self._columns[view.name]
+
+    def _bind_view(self, view: fld.Field, source_field: fld.Field):
+        """
+        Bind a view (reference field) that already exists but has not yet been attached to its source field, for
+        instance while initializing an existing dataset/dataframe.
+        :param view: The view field.
+        :param source_field: The original field.
+        """
+        source_field.attach(view)
+        if view.name in self._filters_grp.keys():
+            filter_field = fld.NumericField(self._dataset.session, self._filters_grp[view.name], self,
+                                            write_enabled=True)
+            view._filter_index_wrapper = fld.ReadOnlyFieldArray(filter_field, 'values')  # read-only
+
     def drop(self,
              name: str):
         """
         Drop a field from this dataframe as well as the HDF5 Group
 
         :param name: name of field to be dropped
         """
-        del self._columns[name]
-        del self._h5group[name]
+        del self._columns[name]  # should always be
+        if name in self._h5group.keys():  # in case of reference only
+            del self._h5group[name]
 
     def create_group(self,
                      name: str):
@@ -317,8 +376,10 @@ def __delitem__(self, name):
         if not self.__contains__(name=name):
             raise ValueError("There is no field named '{}' in this dataframe".format(name))
         else:
-            del self._h5group[name]
-            del self._columns[name]
+            del self._columns[name]  # should always be
+            if name in self._h5group.keys():  # in case of reference only
+                del self._h5group[name]
+
 
     def delete_field(self, field):
         """
@@ -478,18 +539,23 @@ def apply_filter(self, filter_to_apply, ddf=None):
         :returns: a dataframe contains all the fields filterd, self if ddf is not set
         """
         filter_to_apply_ = val.validate_filter(filter_to_apply)
-
-        if ddf is not None:
-            if not isinstance(ddf, DataFrame):
-                raise TypeError("The destination object must be an instance of DataFrame.")
+        ddf = self if ddf is None else ddf
+        if not isinstance(ddf, DataFrame):
+            raise TypeError("The destination object must be an instance of DataFrame.")
+        if ddf == self:
+            for field in self._columns.values():
+                field.apply_filter(filter_to_apply_, in_place=True)
+        elif ddf.dataset == self.dataset:  # another df in the same ds, create view
+            filter_to_apply_ = filter_to_apply_.nonzero()[0]
+            for name, field in self._columns.items():
+                if name in ddf:
+                    del ddf[name]
+                ddf._add_view(field, filter_to_apply_)
+        else:  # another df in different ds, do hard copy
             for name, field in self._columns.items():
                 newfld = field.create_like(ddf, name)
                 field.apply_filter(filter_to_apply_, target=newfld)
-            return ddf
-        else:
-            for field in self._columns.values():
-                field.apply_filter(filter_to_apply_, in_place=True)
-            return self
+        return ddf
 
     def apply_index(self, index_to_apply, ddf=None):
         """
@@ -514,20 +580,23 @@ def apply_index(self, index_to_apply, ddf=None):
         :param ddf: optional- the destination data frame
         :returns: a dataframe contains all the fields re-indexed, self if ddf is not set
         """
-        if ddf is not None:
-            if not isinstance(ddf, DataFrame):
-                raise TypeError("The destination object must be an instance of DataFrame.")
+        ddf = self if ddf is None else ddf
+        if not isinstance(ddf, DataFrame):
+            raise TypeError("The destination object must be an instance of DataFrame.")
+        if ddf == self:  # in_place
+            val.validate_all_field_length_in_df(self)
+            for field in self._columns.values():
+                field.apply_index(index_to_apply, in_place=True)
+        elif ddf.dataset == self.dataset:  # view
+            for name, field in self._columns.items():
+                if name in ddf:
+                    del ddf[name]
+                ddf._add_view(field, index_to_apply)
+        else:  # hard copy
             for name, field in self._columns.items():
                 newfld = field.create_like(ddf, name)
                 field.apply_index(index_to_apply, target=newfld)
-            return ddf
-        else:
-            val.validate_all_field_length_in_df(self) 
-
-            for field in self._columns.values():
-                field.apply_index(index_to_apply, in_place=True)
-            return self
-
+        return ddf
 
     def sort_values(self, by: Union[str, List[str]], ddf: DataFrame = None, axis=0, ascending=True, kind='stable'):
         """
@@ -981,6 +1050,17 @@ def describe(self, include=None, exclude=None, output='terminal'):
                 print('\n')
         return result
 
+    def view(self):
+        """
+        Create a view of this dataframe.
+        """
+        view_name = '_' + self.name + '_view'
+        if view_name in self.dataset:
+            self.dataset.drop(view_name)
+        dfv = self.dataset.create_dataframe(view_name)
+        for f in self.columns.values():
+            dfv._add_view(f)
+        return dfv
 
 
 class HDF5DataFrameGroupBy(DataFrameGroupBy):
@@ -1656,4 +1736,4 @@ def _ordered_merge(left: DataFrame,
         if right[k].indexed:
             ops.ordered_map_valid_indexed_stream(right[k], right_map, dest_f, invalid)
         else:
-            ops.ordered_map_valid_stream(right[k], right_map, dest_f, invalid)
+            ops.ordered_map_valid_stream(right[k], right_map, dest_f, invalid)
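Note on apply_filter and _add_view in dataframe.py: the conversion of a boolean filter into a stored integer index can be sketched in isolation as below. The helper names are hypothetical, and `INDEX_LIMIT` is only an assumed stand-in for `INT64_INDEX_LENGTH`, whose actual value is defined in exetera.core.utils and is not shown in this diff.

```python
import numpy as np

# Assumption: a stand-in threshold; the real INT64_INDEX_LENGTH may differ.
INDEX_LIMIT = 2 ** 31 - 1


def index_format(index):
    """Pick the narrowest integer format that can hold every row number."""
    if len(index) > 0 and np.max(index) >= INDEX_LIMIT:
        return 'int64'
    return 'int32'


def filter_to_index(bool_filter):
    """Convert a boolean filter into the integer index stored for a view."""
    # nonzero() returns the positions of the True entries, in ascending order.
    index = np.asarray(bool_filter, dtype=bool).nonzero()[0]
    return index.astype(index_format(index))
```

Storing row positions rather than the boolean mask lets apply_filter and apply_index share one storage format for a view's filter.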
diff --git a/exetera/core/dataset.py b/exetera/core/dataset.py
@@ -48,11 +48,20 @@ def __init__(self, session, dataset_path, mode, name):
         self._file = h5py.File(dataset_path, mode)
         self._dataframes = dict()
 
+        # initialize the dataframe and fields
         for group in self._file.keys():
             if group not in ('trash',):
                 h5group = self._file[group]
                 dataframe = edf.HDF5DataFrame(self, group, h5group=h5group)
                 self._dataframes[group] = dataframe
+        # bind the views
+        for df in self._dataframes.values():
+            for field in df.columns.values():
+                if field.is_view():
+                    source_name = field._field.attrs['source_field']
+                    idx = source_name.rfind('/')
+                    source_field = self._dataframes[source_name[1:idx]][source_name[idx+1:]]
+                    df._bind_view(field, source_field)
 
     @property
     def session(self):
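Note on the view binding loop in dataset.__init__: the slicing of the 'source_field' attribute can be read as splitting an HDF5-style path into a dataframe name and a field name. The sketch below assumes the attribute has the form '/dataframe/field', which is inferred from the slicing in the diff rather than stated in it; the function name `split_source_path` is hypothetical.

```python
def split_source_path(source_name):
    """Split an assumed '/dataframe/field' path into its two parts."""
    idx = source_name.rfind('/')
    # Drop the leading '/' from the dataframe part; the field name is
    # everything after the last '/'.
    return source_name[1:idx], source_name[idx + 1:]
```

Because only the last '/' is used as the separator, a deeper path such as '/a/b/c' would yield the dataframe name 'a/b' in this sketch.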