Dfview #297

Open
wants to merge 239 commits into
base: master
a2d7008
fixing issue #86 from upstream:
Mar 11, 2021
62925bb
add unit test for Field get_spans() function
Mar 12, 2021
0e313dc
remove unuseful line comments
Mar 12, 2021
e211371
add dataset, datafreame class
deng113jie Mar 15, 2021
39e4535
Merge remote-tracking branch 'upstream/master'
deng113jie Mar 15, 2021
329a7cc
closing issue 92, reset the dataset when call field.data.clear
deng113jie Mar 15, 2021
d9d8b02
closing issue 92, reset the dataset when call field.data.clear
deng113jie Mar 15, 2021
f7ba342
Merge branch 'master' into patch92
deng113jie Mar 15, 2021
21f0fa9
add unittest for field.data.clear function
deng113jie Mar 15, 2021
c9363ef
recover the dataset file to avoid merge error when fixing issue 92
deng113jie Mar 15, 2021
14fc1f3
fix end_of_file char in dataset.py
deng113jie Mar 15, 2021
2d13342
add get_span for index string field
deng113jie Mar 16, 2021
666073e
unittest for get_span functions on different types of field, eg. fixe…
deng113jie Mar 17, 2021
73aa50e
Merge remote-tracking branch 'upstream/master'
deng113jie Mar 18, 2021
689cc3f
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 18, 2021
8ba818f
dataframe basic methods and unittest
deng113jie Mar 19, 2021
abb3337
more dataframe operations
deng113jie Mar 22, 2021
3180cbd
fix upstream merge conflict
deng113jie Mar 24, 2021
9b9c420
minor fixing
deng113jie Mar 24, 2021
55989d6
update get_span to field subclass
deng113jie Mar 24, 2021
cd69d04
solve conflict
deng113jie Mar 24, 2021
f2136d5
intermedia commit due to test pr 118
deng113jie Mar 24, 2021
30953e3
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 24, 2021
0dccc6e
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 24, 2021
000463d
Implementate get_spans(ndarray) and get_spans(ndarray1, ndarray2) fun…
deng113jie Mar 24, 2021
37972b5
Merge branch 'dataframe'
deng113jie Mar 24, 2021
74c1dad
Move the get_spans functions from persistence to operations.
deng113jie Mar 25, 2021
bf210c4
Merge branch 'dataframe'
deng113jie Mar 25, 2021
95c1645
minor edits for pull request
deng113jie Mar 25, 2021
5db42d2
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 25, 2021
664e255
remove dataframe for pull request
deng113jie Mar 25, 2021
02265fe
remove dataframe test for pr
deng113jie Mar 25, 2021
f536652
add dataframe
deng113jie Mar 25, 2021
bafe9cf
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 25, 2021
223dbe9
fix get_spans_for_2_fields_by_spans, fix the unittest
deng113jie Mar 25, 2021
cc48016
Merge branch 'master' into dataframe
deng113jie Mar 25, 2021
948ce1a
Initial commit for is_sorted method on Field
atbenmurray Mar 25, 2021
37b8ac2
minor edits for the pr
deng113jie Mar 26, 2021
0369c92
fix minor edit error for pr
deng113jie Mar 26, 2021
2096828
Merge branch 'master' into dataframe
deng113jie Mar 26, 2021
f213240
add apply_index and apply_filter methods on fields
deng113jie Mar 26, 2021
b050d74
Merging from recent PRs
atbenmurray Mar 26, 2021
76b5ff1
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie Mar 26, 2021
fe36b94
Adding in missing tests for all field types for is_sorted
atbenmurray Mar 26, 2021
daa6012
update the apply filter and apply index on Fields
deng113jie Mar 26, 2021
5c43f38
minor updates to line up w/ upstream
deng113jie Mar 26, 2021
459b91c
update apply filter & apply index methods in fields that differ if de…
deng113jie Mar 26, 2021
c0ac960
updated the apply_index and apply_filter methods in fields. Use oldda…
deng113jie Mar 29, 2021
dd0867d
add dataframe basic functions and operations; working on dataset to e…
deng113jie Mar 30, 2021
e52d825
add functions in dataframe
deng113jie Apr 1, 2021
463ea70
integrates the dataset, dataframe into the session
deng113jie Apr 6, 2021
76d1952
update the fieldsimporter and field.create_like methods to call dataf…
deng113jie Apr 7, 2021
7cfeceb
add license info to a few files
deng113jie Apr 7, 2021
b1cb082
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 8, 2021
eaac2b6
csv_reader_with_njit
Liyuan-Chen-1024 Apr 8, 2021
a9ce1fb
change output_excel from string to int
Liyuan-Chen-1024 Apr 9, 2021
113a83f
Merge branch 'master' of github.com:KCL-BMEIS/ExeTera into importer_c…
Liyuan-Chen-1024 Apr 9, 2021
375982c
solve merge conflict
Liyuan-Chen-1024 Apr 9, 2021
e9d1053
initialize column_idx matrix outside of the njit function
Liyuan-Chen-1024 Apr 9, 2021
e1ed80d
use np.fromfile to load the file into byte array
Liyuan-Chen-1024 Apr 9, 2021
f4fe394
Merge branch 'master' into field_is_sorted_method
atbenmurray Apr 11, 2021
a057677
Refactoring and reformatting of some of the dataset / dataframe code;…
atbenmurray Apr 11, 2021
0845a63
Merge branch 'issort' into dataframe
deng113jie Apr 12, 2021
4d2886a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie Apr 12, 2021
db3ec9f
Work on fast csv reading
atbenmurray Apr 12, 2021
f2efedc
Address issue #138 on minor tweaks
deng113jie Apr 12, 2021
4926330
remove draft group.py from repo
deng113jie Apr 12, 2021
56bb190
Improved performance from the fast csv reader through avoiding ndarra…
atbenmurray Apr 12, 2021
04d810b
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 13, 2021
f0b7e37
fix dataframe api
deng113jie Apr 13, 2021
18d49a6
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 13, 2021
737eeed
fixing #13 and #14, add dest parameter to get_spans(), tidy up the fi…
deng113jie Apr 13, 2021
732762d
minor fix remove dataframe and file property from dataset, as not use…
deng113jie Apr 13, 2021
ab6508c
minor fix on unittest
deng113jie Apr 13, 2021
39027f7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 14, 2021
358d82b
add docstring for dataset
deng113jie Apr 14, 2021
98a4d7f
copy/move for dataframe; docstrings
deng113jie Apr 15, 2021
e6b1a57
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 15, 2021
a0e0167
categorical field: convert from byte int to value int within njit fun…
Liyuan-Chen-1024 Apr 15, 2021
204bd39
merge
Liyuan-Chen-1024 Apr 15, 2021
c788b96
Adding in of pseudocode version of fast categorical lookup
atbenmurray Apr 15, 2021
60f2ba9
clean up the comments
Liyuan-Chen-1024 Apr 15, 2021
bba4829
Merge branch 'importer_csv_reader' of github.com:KCL-BMEIS/ExeTera in…
Liyuan-Chen-1024 Apr 15, 2021
c341eb2
docstrings for dataframe
deng113jie Apr 16, 2021
b23f1d8
Major reworking of apply_filter / apply_index for fields; they should…
atbenmurray Apr 16, 2021
63bd5a0
add unittest for various fields in dataframe
deng113jie Apr 16, 2021
650014e
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 16, 2021
cb9f2a2
add unittest for Dataframe.add/drop/move
deng113jie Apr 16, 2021
013f401
minor change on name to make sure name in consistent over dataframe, …
deng113jie Apr 16, 2021
18ce7ce
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie Apr 16, 2021
8657081
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie Apr 16, 2021
51e2fec
Completed initial pass of memory fields for all types
atbenmurray Apr 16, 2021
955aede
categloric field.keys will return byte key as string, thus minor chan…
deng113jie Apr 16, 2021
039d8ee
solved the byte to string issue, problem is dof python 3.7 and 3.8
deng113jie Apr 16, 2021
547bb88
Miscellaneous field fixes; fixed issues with dataframe apply_filter /…
atbenmurray Apr 16, 2021
700635f
Moving most binary op logic out into a static method in FieldDataOps
atbenmurray Apr 16, 2021
dec92ca
Resolved conflicts in dataframe.py
atbenmurray Apr 16, 2021
b631932
Dataframe copy, move and drop operations have been moved out of the D…
atbenmurray Apr 17, 2021
4804417
Fixing accidental introduction of CRLF to abstract_types
atbenmurray Apr 17, 2021
f16cb09
Fixed bug where apply_filter and apply_index weren't returning a fiel…
atbenmurray Apr 17, 2021
37dac08
Fixed issue in timestamp_field_create_like when group is set and is a…
atbenmurray Apr 17, 2021
8c62e0a
persistence.filter_duplicate_fields now supports fields as well as nd…
atbenmurray Apr 17, 2021
cfcb69b
sort_on message now shows in verbose mode under all circumstances
atbenmurray Apr 17, 2021
22504ef
Fixed bug in apply filter when a destination dataset is applied
atbenmurray Apr 17, 2021
23c373d
Added a test to catch dataframe.apply_filter bug
atbenmurray Apr 17, 2021
98624e6
Bug fix: categorical_field_constructor in fields.py was returning num…
atbenmurray Apr 17, 2021
76d8717
Copying data before filtering, as filtering in h5py is very slow
atbenmurray Apr 17, 2021
44a9c3d
Adding apply_spans functions to fields
atbenmurray Apr 18, 2021
210f847
Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had be…
atbenmurray Apr 18, 2021
f8829ae
Merge commit 'refs/pull/149/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 19, 2021
a7d6673
Issues found with indexed strings and merging; fixes found for apply_…
atbenmurray Apr 19, 2021
3d322c2
Updated merge functions to consistently return memory fields if not p…
atbenmurray Apr 19, 2021
294ec3a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 20, 2021
e8edd9d
concate cat keys instead of padding
Liyuan-Chen-1024 Apr 20, 2021
c2ba9ff
some docstring for fields
deng113jie Apr 20, 2021
1a19815
dataframe copy/move/drop and unittest
deng113jie Apr 20, 2021
1fb0362
Fixing issue with dataframe move/copy being static
atbenmurray Apr 20, 2021
937368e
Updating HDF5Field writeable methods to account for prior changes
atbenmurray Apr 20, 2021
cddcf66
Adding merge functionality for dataframes
atbenmurray Apr 20, 2021
534cbd4
dataset.drop is a member method of Dataset as it did not make sense f…
atbenmurray Apr 20, 2021
e5dc536
Added missing methods / properties to DataFrame ABC
atbenmurray Apr 20, 2021
9b1a4a9
minor update on dataframe static function
deng113jie Apr 20, 2021
1967685
minor update
deng113jie Apr 20, 2021
6c3270a
Merge commit 'refs/pull/157/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 20, 2021
6bdb08e
minor update session
deng113jie Apr 21, 2021
3680436
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 21, 2021
cf5f5a6
minor comments update
deng113jie Apr 21, 2021
23ad71a
minor comments update
deng113jie Apr 21, 2021
75eefc0
add unittest for csv_reader_speedup.py
Liyuan-Chen-1024 Apr 21, 2021
3ddc916
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 21, 2021
3a6dc51
Merge commit 'refs/pull/137/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 22, 2021
c02fe32
count operation; logical not for numeric fields
deng113jie Apr 26, 2021
58159d0
remove csv speed up work from commit
deng113jie Apr 27, 2021
a7b477d
minor update
deng113jie Apr 27, 2021
903f3b4
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 27, 2021
29f736d
unit test for logical not in numeric field
deng113jie Apr 27, 2021
7fd9bdc
patch for get_spans for datastore
deng113jie Apr 28, 2021
04df757
tests for two fields
deng113jie Apr 28, 2021
e47e15c
add as type to numeric field
deng113jie Apr 29, 2021
a4b14fb
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie Apr 29, 2021
5492b94
seperate the unittest of get_spans by datastore reader
deng113jie Apr 29, 2021
25320bd
unittest for astype
deng113jie Apr 29, 2021
e289c6b
Merge branch 'dspatch'
deng113jie May 4, 2021
a59c13a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 4, 2021
87df0bc
update astype for fields, update logical_not for numeric fields
deng113jie May 10, 2021
0875149
remove dataframe view commits
deng113jie May 10, 2021
c335831
remove kwargs in get_spans in session, add fields back for backward c…
deng113jie May 11, 2021
bdf783a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 11, 2021
ea20c60
remove filter view tests
deng113jie May 11, 2021
778d56c
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 27, 2021
611601a
partial commit on viewer
deng113jie Jun 10, 2021
66867b7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Sep 20, 2021
fbe396f
remote view from git
deng113jie Sep 20, 2021
c2c7185
add df.describe unittest
deng113jie Sep 22, 2021
78cc222
sync with upstream
deng113jie Sep 23, 2021
001134c
Delete python-publish.yml
deng113jie Sep 23, 2021
eb0bb76
Update python-app.yml
deng113jie Sep 23, 2021
d646ac2
Update python-app.yml
deng113jie Sep 23, 2021
b55775b
dataframe describe function
deng113jie Sep 23, 2021
0d23098
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie Sep 23, 2021
7774c6f
sync with upstream
deng113jie Sep 23, 2021
ae1d621
Update python-app.yml
deng113jie Sep 30, 2021
3d5738e
alternative get_timestamp notebook for discussion
deng113jie Oct 5, 2021
4685c6b
update the notebook output of linux and mac
deng113jie Oct 5, 2021
dc38d28
update format
deng113jie Oct 5, 2021
0df34bc
update the to_timestamp and to_timestamp function in utils
deng113jie Oct 11, 2021
87353e3
add unittest for utils to_timestamp and to_datetimie
deng113jie Oct 11, 2021
87abe47
fix for pr
deng113jie Oct 11, 2021
a3719ef
setup github action specific for windows for cython
deng113jie Oct 12, 2021
ed42f70
minor workflow fix
deng113jie Oct 12, 2021
2157da2
add example pyx file
deng113jie Oct 12, 2021
1abeaa7
fix package upload command on win; as the git action
deng113jie Oct 12, 2021
e77562e
add twine as tools
deng113jie Oct 12, 2021
03208aa
add linux action file
deng113jie Oct 12, 2021
35430f2
update the linux build command
deng113jie Oct 12, 2021
de3e7e5
build workflow for macos
deng113jie Oct 12, 2021
a8af750
minor update the macos workflow
deng113jie Oct 12, 2021
d41a24b
fixed timestamp issue on windows by add timezone info to datetime
deng113jie Oct 14, 2021
c98b87c
finanlize workflow file, compile react to publish action only
deng113jie Oct 14, 2021
a57c413
avoid the bytearray vs string error in windows by converting result to
deng113jie Oct 14, 2021
764650b
fixing string vs bytesarray issue
deng113jie Oct 14, 2021
4676901
update categorical field key property, change the key, value to bytes if
deng113jie Oct 15, 2021
e5d74c6
solved index must be np.int64 error
deng113jie Oct 15, 2021
030d587
all unittest error on windoes removed
deng113jie Oct 15, 2021
55e62eb
Merge branch 'master' into win_actions
deng113jie Oct 15, 2021
7cf7bae
minor update on workflow file
deng113jie Oct 15, 2021
521142e
minor update workflow file
deng113jie Oct 15, 2021
9373fd2
minor fix: use pip install -r ; remove unused import in utils.py
deng113jie Oct 15, 2021
6f67ac4
update action file
deng113jie Oct 15, 2021
703a19a
remove change on test_presistence on uint32 to int32
deng113jie Oct 18, 2021
613532a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Nov 22, 2021
b981bb9
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Jan 17, 2022
e35c1c4
Merge branch 'KCL-BMEIS:master' into master
deng113jie Jan 25, 2022
a7ee946
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie Jan 25, 2022
0f319d1
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Feb 1, 2022
a5ab148
add output argument for describe function in dataframe, so that the r…
deng113jie Feb 1, 2022
f50ab1f
comment all print function in unittest
deng113jie Feb 1, 2022
94a074a
modify the remap function for categorical field and categorical mem f…
deng113jie Feb 1, 2022
ef563e9
Added check to ensure ExeTera entry point actually works after pip in…
ericspod Feb 8, 2022
e0caad0
Attempted Fix
ericspod Feb 9, 2022
ba06c9f
find_packages
ericspod Feb 9, 2022
25a3cc3
Tweak
ericspod Feb 9, 2022
2e03557
Merge commit 'refs/pull/253/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Feb 9, 2022
5f83d7e
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Feb 9, 2022
d932731
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Feb 14, 2022
fff9a19
fixing issue 214
deng113jie Feb 14, 2022
ffbc4f0
fixing bug on dataset set item
deng113jie Feb 17, 2022
7f82d58
fixing apply_span_src in fields.py
deng113jie Feb 17, 2022
474425a
revert change on field
deng113jie Feb 17, 2022
1c59265
add unittest for dataset setitem bug
deng113jie Feb 17, 2022
5002c65
examples using dataset generated by randomdataset
deng113jie Feb 18, 2022
efcde7c
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Feb 18, 2022
825aaf1
update on examples
deng113jie Feb 23, 2022
47e2e7e
update example: added two csv files and one json files and one import…
deng113jie Feb 23, 2022
071be03
minor update on readme
deng113jie Feb 23, 2022
c908397
remove output from notebooks
deng113jie Feb 23, 2022
72cca74
minor update on example notebooks
deng113jie Feb 25, 2022
bd9a85f
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 3, 2022
62d0d69
update examples
deng113jie Mar 7, 2022
40e8770
update the example notebooks
deng113jie Mar 8, 2022
81bbbcd
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 24, 2022
3507595
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 21, 2022
099c8f6
df view init commit
deng113jie Apr 27, 2022
3d6966e
dataframe view init commit:
deng113jie May 5, 2022
735b7a5
dataframe view updates 3:
deng113jie May 6, 2022
85169e6
change filtered data presentation from data array to field __getitem__
deng113jie May 9, 2022
9667dd7
minor update
deng113jie May 9, 2022
c333f6d
updated view functions,
deng113jie May 11, 2022
bb764bd
modify the association between view fields with field array, by assig…
deng113jie May 12, 2022
7803d76
update the view:
deng113jie May 18, 2022
dfb36ab
fixed the data[:] for indexed string fields
deng113jie May 18, 2022
5c93b43
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into df…
deng113jie May 18, 2022
fe9cee8
update Eric's comments
deng113jie May 23, 2022
c1ad9ba
minor update
deng113jie May 23, 2022
80c0339
update unittests for dataframe view
deng113jie May 24, 2022
6cb1d3e
minor update
deng113jie May 24, 2022
e8cf7f2
add persistence over view so that view
deng113jie May 25, 2022
3153f2b
add unittest for view presistence
deng113jie May 26, 2022
135260e
documents on future work
deng113jie May 26, 2022
1 change: 1 addition & 0 deletions docs/index.rst
@@ -36,6 +36,7 @@ Getting started
dataset.md
dataframe.md
field.md
view.md

.. toctree::
:maxdepth: 1
48 changes: 48 additions & 0 deletions docs/view.md
@@ -0,0 +1,48 @@
# View
## What is a view
A view is a special field that has a ‘source_field’ entry in its HDF5 attributes. In this case, the view initializes its data from the location specified by ‘source_field’ rather than from its own HDF5 storage.


The benefit of using a view is reduced disk I/O during apply_index and apply_filter operations. The index or filter is stored in the dataframe; the view reads from the source field and combines it with the index or filter to perform the filtering without writing an extra copy of the data.

Specifically, dataframe._filters_grp is the HDF5 group where the indices or filters are stored. A boolean filter is converted to an integer index in the apply_filter function. The index filter is stored as a NumericField by the dataframe._add_view function.
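
The conversion from a boolean filter to an integer index, and the choice between a 32-bit and a 64-bit index field, can be illustrated with a minimal pure-Python sketch. The helper names `mask_to_index` and `choose_index_format` and the threshold value are assumptions made for this illustration only; the real threshold is `INT64_INDEX_LENGTH` in exetera.core.utils, whose value is not shown here.

```python
from typing import List, Sequence

# Assumed threshold for illustration; the real constant is INT64_INDEX_LENGTH
# in exetera.core.utils and may have a different value.
ASSUMED_INT64_INDEX_LENGTH = 2 ** 31 - 1


def mask_to_index(mask: Sequence[bool]) -> List[int]:
    """Return the row positions at which the boolean mask is True."""
    return [i for i, keep in enumerate(mask) if keep]


def choose_index_format(index: Sequence[int],
                        threshold: int = ASSUMED_INT64_INDEX_LENGTH) -> str:
    """Pick 'int32' unless some position needs the wider 'int64' format."""
    if len(index) > 0 and max(index) >= threshold:
        return 'int64'
    return 'int32'


mask = [True, False, True, True, False]
index = mask_to_index(mask)          # positions of the rows to keep
nformat = choose_index_format(index)
```

Storing positions instead of a full-length boolean mask keeps the stored filter proportional to the number of selected rows rather than the length of the source field.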

![View Structure](view_arch.png)

## How to generate a view

The user can either 1) call apply_filter or apply_index to generate views with index filters, or 2) simply call dataframe.view() to generate views without index filters. The internal function dataframe._add_view() takes care of the construction. Please note that a view can only be created from a field that exists in the same dataset/file, as associating a view with a field from a different file would introduce a lot of uncertainty.

A view is just a special instance of a field, so its construction is similar: 1) call field.base_view_constructor to set up the h5group, and 2) call the specific field type constructor to initialize the field instance. There is also one special action: 3) attach the view to the source field so that the source field can notify the view if the underlying data changes. These three actions can be seen in the upper part of dataframe._add_view().

You can tell whether a field is a view by calling field.is_view(). This method checks for the ‘source_field’ attribute in the field’s HDF5 group; if the attribute is present, the field is a view.

## Fetch data from a view
Since a view behaves just like a field from the user's perspective, you can still use field.data[:] to fetch data from a view. In the field implementation, the member ‘data’ is a FieldArray, so the difference between a view and a field lies in how the FieldArray is initialized. Normally, the FieldArray loads the HDF5 dataset from the current HDF5 group (where the field is stored); for a view, however, the FieldArray loads data from the HDF5 group specified by the ‘source_field’ attribute.

When the view also has a filter/index (field.filter is not None), the FieldArray fetches the filter/index first and uses it to mask the underlying data. This logic can be found in FieldArray.__getitem__, or in IndexFieldArray.__getitem__ for indexed strings.
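
The read path described in this section can be sketched without HDF5 at all. The class below is a hypothetical stand-in (the name `ViewArraySketch` is not part of ExeTera) that assumes the source data behaves like a Python sequence; it only shows the idea of narrowing the index filter with the requested item before touching the source.

```python
from typing import Optional, Sequence, Union


class ViewArraySketch:
    """Read-only accessor that resolves reads against a shared source."""

    def __init__(self, source: Sequence,
                 index_filter: Optional[Sequence[int]] = None):
        self._source = source              # data owned by the source field
        self._index_filter = index_filter  # row positions, or None for all rows

    def __len__(self) -> int:
        if self._index_filter is None:
            return len(self._source)
        return len(self._index_filter)

    def __getitem__(self, item: Union[int, slice]):
        if self._index_filter is None:
            return self._source[item]
        # Narrow the filter with the requested item, then read the source.
        selected = self._index_filter[item]
        if isinstance(item, slice):
            return [self._source[i] for i in selected]
        return self._source[selected]


source = [10, 20, 30, 40, 50]
view = ViewArraySketch(source, index_filter=[4, 0, 2])
filtered = view[:]   # rows 4, 0 and 2 of the source, in that order
```

Because the view holds only a reference to the source and a list of positions, no second copy of the data exists until something forces one.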

## Life-cycle of a view
### A view from a field
Step0: You have a field in the dataframe and call dataframe.apply_filter(), apply_index() or view().

Step1: The view is created and attached to the source field. The attach method is defined on the field but called in dataframe._add_view.

Step2: When view.data is accessed, the view initializes a FieldArray that points to the source field rather than to its own dataset.

Step3: When field.data.write or field.data.clear is called, meaning the data will be modified, data.write or data.clear calls field.update() to notify the field of the action. The field then passes the notification on to its views in field.notify(). On receiving the notification, view.update() can perform the appropriate actions.

Step4: At the moment, view.update() copies the original data to its own dataset, re-initializes the data interface, and deletes the ‘source_field’ attribute (so that it is no longer a view).
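
The life-cycle steps can be sketched as a small observer pattern in plain Python. `FieldSketch` and `ViewSketch` are hypothetical classes written for this illustration, not the ExeTera implementation, and the sketch assumes that the notification reaches the views before the source data is modified.

```python
class FieldSketch:
    """Stands in for a source field (the subject)."""

    def __init__(self, values):
        self._values = list(values)
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def detach(self, view):
        self._views.remove(view)

    def read(self):
        return list(self._values)

    def write(self, values):
        # Assumption: notify first, so each view can still copy the old data.
        self.notify('write')
        self._values.extend(values)

    def notify(self, msg=None):
        # Iterate over a copy because views detach themselves in update().
        for view in list(self._views):
            view.update(self, msg)


class ViewSketch:
    """Stands in for a view (the observer)."""

    def __init__(self, source):
        self._source = source      # plays the role of 'source_field'
        self._own_values = None
        source.attach(self)

    def is_view(self):
        return self._source is not None

    def read(self):
        if self.is_view():
            return self._source.read()
        return list(self._own_values)

    def update(self, subject, msg=None):
        # Copy the data, then drop the link: this is now an ordinary field.
        self._own_values = subject.read()
        subject.detach(self)
        self._source = None


field = FieldSketch([1, 2, 3])
view = ViewSketch(field)
before = view.read()     # read through the source
field.write([4])         # triggers notify() and view.update()
after = view.read()      # read from the view's own copy
```

After the write, the view keeps the data as it was when the view was created, while the source field carries the new rows.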

![View Structure](view_life.png)

### An existing view
Because the view is stored in the HDF5 file, the view relationship persists across sessions. When a dataset is loaded (in dataset.__init__), the dataset checks whether any field is a view and calls dataframe._bind_view() to attach the view to its source field during initialization of the dataset/dataframe/fields. This is why a view can only be created from a field that exists in the same dataset (HDF5 file).
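
During binding, the dataset has to resolve the ‘source_field’ attribute back to a dataframe and a field. The sketch below mirrors the string slicing used in dataset.__init__ in this pull request, under the assumption that the attribute holds an absolute HDF5 path of the form '/dataframe/field'. The function name `split_source_path` and the error handling are additions for illustration.

```python
from typing import Tuple


def split_source_path(source_name: str) -> Tuple[str, str]:
    """Split '/<dataframe>/<field>' into (dataframe name, field name)."""
    idx = source_name.rfind('/')
    if not source_name.startswith('/') or idx <= 0:
        raise ValueError("expected a path like '/dataframe/field'")
    # Drop the leading '/', then cut at the last '/'.
    return source_name[1:idx], source_name[idx + 1:]


df_name, field_name = split_source_path('/patients/age')
```

Since both names come from a path inside one HDF5 file, the lookup can only succeed for a source field in the same dataset.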


## Future work
### Data fetching performance
The way data is read out of HDF5 can affect performance considerably. For example, it is generally better to read data from HDF5 by chunk rather than by index. In the current implementation (fieldarray.__getitem__), we first mask the index filter with the requested item, then fetch the data from HDF5. As HDF5 doesn't support unordered data access, we sort the mask and restore the original order when returning the data. Further work can be done on the order in which the index filter and the item (specified by the user through __getitem__) are applied. For example, with a large volume of data and a small index filter, it might make sense to apply the filter first. With large filters, however, it will likely be faster to load the data into memory first. Where the boundary lies is worth investigating.
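
The sort-then-restore step can be sketched independently of HDF5. In the example below, `increasing_only_store` is a hypothetical stand-in for a store that only accepts strictly increasing positions (similar to the restriction on list selections in h5py), and `fetch_in_any_order` is an illustrative helper, not the code in fieldarray.__getitem__.

```python
from typing import Callable, List, Sequence


def fetch_in_any_order(fetch_sorted: Callable[[List[int]], Sequence],
                       positions: Sequence[int]) -> list:
    """Fetch rows at arbitrary positions from a store that only accepts
    strictly increasing positions, returning them in the requested order."""
    unique_sorted = sorted(set(positions))   # one ordered, duplicate-free read
    rows = fetch_sorted(unique_sorted)
    row_at = dict(zip(unique_sorted, rows))
    return [row_at[p] for p in positions]    # restore order and duplicates


def increasing_only_store(data: Sequence) -> Callable[[List[int]], list]:
    """Build a toy fetch function that rejects unordered positions."""
    def fetch(positions: List[int]) -> list:
        if any(b <= a for a, b in zip(positions, positions[1:])):
            raise ValueError("positions must be strictly increasing")
        return [data[p] for p in positions]
    return fetch


store = increasing_only_store(['a', 'b', 'c', 'd'])
rows = fetch_in_any_order(store, [3, 0, 3, 1])
```

The extra sort and mapping cost is part of what makes the choice between filtering first and loading first worth measuring.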

### Dependency between views
In the current implementation, all views depend on the source field. When the data in the field changes, every attached view copies the data and writes its own copy. This is inefficient when many views are attached. A better approach could be for only one of the views to copy the data and then become the source field for the remaining views. This needs a detailed design and implementation in fields.update().
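
One possible shape for this idea is sketched below. It is purely hypothetical: the `Node` class and `release_views` function do not exist in ExeTera, per-view filters are ignored, and the sketch only shows that a single copy is enough when one view takes over as the source for the others.

```python
class Node:
    """A field, or a view when `source` is set (hypothetical sketch)."""

    def __init__(self, name, values=None, source=None):
        self.name = name
        self.values = list(values) if values is not None else None
        self.source = source
        self.views = []
        if source is not None:
            source.views.append(self)

    def read(self):
        if self.source is None:
            return list(self.values)
        return self.source.read()


def release_views(field):
    """Detach all views from `field` before it changes, copying only once.

    Returns the number of data copies that were made.
    """
    views, field.views = field.views, []
    if not views:
        return 0
    heir, rest = views[0], views[1:]
    heir.values = field.read()     # the single copy
    heir.source = None             # the heir is now an ordinary field
    for view in rest:
        view.source = heir         # remaining views follow the heir
        heir.views.append(view)
    return 1


f = Node('f', [1, 2])
a = Node('a', source=f)
b = Node('b', source=f)
copies = release_views(f)
f.values.append(3)                 # the original field can now change freely
```

With two views attached, the current behaviour described above would make two copies; this arrangement makes one, at the price of a chain of dependencies that would need careful handling.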
Binary file added docs/view_arch.png
Binary file added docs/view_life.png
41 changes: 41 additions & 0 deletions exetera/core/abstract_types.py
@@ -65,6 +65,11 @@ def indexed(self):
def data(self):
raise NotImplementedError()

@property
@abstractmethod
def filter(self):
raise NotImplementedError()

@abstractmethod
def __bool__(self):
raise NotImplementedError()
@@ -491,3 +496,39 @@ def ordered_merge_right(self, right_on, left_on,
left_field_sources=tuple(), right_field_sinks=None,
right_to_left_map=None, right_unique=False, left_unique=False):
raise NotImplementedError()


class SubjectObserver(ABC):
def attach(self, observer):
"""
Attach the observer (view) to the subject (field).
"""
raise NotImplementedError()

def detach(self, observer):
"""
Detach the observer (view) from the subject (field), removing the association between the observer and the subject.
This method is called by the observer.
"""
raise NotImplementedError()

def notify_deletion(self, observer=None):
"""
Delete the observer from the subject, but called from the subject side.
"""
raise NotImplementedError()

def notify(self, msg=None):
"""
Called by the Subject to notify the observer on something.
"""
raise NotImplementedError()

def update(self, subject, msg=None):
"""
Called inside the observer, to perform actions based on subject and message type.
"""
raise NotImplementedError()



134 changes: 107 additions & 27 deletions exetera/core/dataframe.py
@@ -17,6 +17,7 @@
from exetera.core import fields as fld
from exetera.core import operations as ops
from exetera.core import validation as val
from exetera.core.utils import INT64_INDEX_LENGTH
import h5py
import csv as csvlib

@@ -58,10 +59,16 @@ def __init__(self,
self.name = name
self._columns = OrderedDict()
self._dataset = dataset
self._h5group = h5group
self._h5group = h5group # the HDF5 group to store all fields

for subg in h5group.keys():
self._columns[subg] = dataset.session.get(h5group[subg])
if subg[0] != '_': # stores metadata, for example filters
self._columns[subg] = dataset.session.get(h5group[subg])

if '_filters' not in h5group.keys():
self._filters_grp = self._h5group.create_group('_filters')
else:
self._filters_grp = h5group['_filters']

@property
def columns(self):
@@ -101,15 +108,67 @@ def add(self,
nfield.data.write(field.data[:])
self._columns[dname] = nfield

def _add_view(self, field: fld.Field, filter: np.ndarray = None):
"""
Internal function called by apply_filter to add a field view into the dataframe.

:param field: The field to apply filter to.
:param filter: The filter to apply.
:return: The field view.

"""
# add view
h5group = fld.base_view_contructor(field._session, self, field)
view = type(field)(field._session, h5group, self, write_enabled=True)
field.attach(view)
self._columns[view.name] = view

# add filter
if filter is not None:
nformat = 'int32'
if len(filter) > 0 and np.max(filter) >= INT64_INDEX_LENGTH:
nformat = 'int64'
filter_name = view.name
if filter_name not in self._filters_grp.keys():
fld.numeric_field_constructor(self._dataset.session, self._filters_grp, filter_name, nformat)
filter_field = fld.NumericField(self._dataset.session, self._filters_grp[filter_name], self,
write_enabled=True)
filter_field.data.write(filter)
else:
filter_field = fld.NumericField(self._dataset.session, self._filters_grp[filter_name], self,
write_enabled=True)
if nformat not in filter_field._fieldtype:
filter_field = filter_field.astype(nformat)
filter_field.data.clear()
filter_field.data.write(filter)

view._filter_index_wrapper = fld.ReadOnlyFieldArray(filter_field, 'values') # read-only

return self._columns[view.name]

def _bind_view(self, view: fld.Field, source_field: fld.Field):
"""
Bind a view that already exists (as a reference field) but has not yet been attached to its source field, for
instance when initializing an existing dataset/dataframe.
:param view: The view field.
:param source_field: The original field.
"""
source_field.attach(view)
if view.name in self._filters_grp.keys():
filter_field = fld.NumericField(self._dataset.session, self._filters_grp[view.name], self,
write_enabled=True)
view._filter_index_wrapper = fld.ReadOnlyFieldArray(filter_field, 'values') # read-only

def drop(self,
name: str):
"""
Drop a field from this dataframe as well as the HDF5 Group

:param name: name of field to be dropped
"""
del self._columns[name]
del self._h5group[name]
del self._columns[name] # should always be
if name in self._h5group.keys(): # in case of reference only
del self._h5group[name]

def create_group(self,
name: str):
@@ -317,8 +376,10 @@ def __delitem__(self, name):
if not self.__contains__(name=name):
raise ValueError("There is no field named '{}' in this dataframe".format(name))
else:
del self._h5group[name]
del self._columns[name]
del self._columns[name] # should always be
if name in self._h5group.keys(): # in case of reference only
del self._h5group[name]


def delete_field(self, field):
"""
@@ -478,18 +539,23 @@ def apply_filter(self, filter_to_apply, ddf=None):
:returns: a dataframe contains all the fields filterd, self if ddf is not set
"""
filter_to_apply_ = val.validate_filter(filter_to_apply)

if ddf is not None:
if not isinstance(ddf, DataFrame):
raise TypeError("The destination object must be an instance of DataFrame.")
ddf = self if ddf is None else ddf
if not isinstance(ddf, DataFrame):
raise TypeError("The destination object must be an instance of DataFrame.")
if ddf == self:
for field in self._columns.values():
field.apply_filter(filter_to_apply_, in_place=True)
elif ddf.dataset == self.dataset: # another df in the same ds, create view
filter_to_apply_ = filter_to_apply_.nonzero()[0]
for name, field in self._columns.items():
if name in ddf:
del ddf[name]
ddf._add_view(field, filter_to_apply_)
Review comment (Contributor, Author): check if the same dataset

else: # another df in different ds, do hard copy
for name, field in self._columns.items():
newfld = field.create_like(ddf, name)
field.apply_filter(filter_to_apply_, target=newfld)
return ddf
else:
for field in self._columns.values():
field.apply_filter(filter_to_apply_, in_place=True)
return self
return ddf

def apply_index(self, index_to_apply, ddf=None):
"""
@@ -514,20 +580,23 @@ def apply_index(self, index_to_apply, ddf=None):
:param ddf: optional- the destination data frame
:returns: a dataframe contains all the fields re-indexed, self if ddf is not set
"""
if ddf is not None:
if not isinstance(ddf, DataFrame):
raise TypeError("The destination object must be an instance of DataFrame.")
ddf = self if ddf is None else ddf
if not isinstance(ddf, DataFrame):
raise TypeError("The destination object must be an instance of DataFrame.")
if ddf == self: # in_place
val.validate_all_field_length_in_df(self)
for field in self._columns.values():
field.apply_index(index_to_apply, in_place=True)
elif ddf.dataset == self.dataset: # view
for name, field in self._columns.items():
if name in ddf:
del ddf[name]
ddf._add_view(field, index_to_apply)
else: # hard copy
for name, field in self._columns.items():
newfld = field.create_like(ddf, name)
field.apply_index(index_to_apply, target=newfld)
return ddf
else:
val.validate_all_field_length_in_df(self)

for field in self._columns.values():
field.apply_index(index_to_apply, in_place=True)
return self

return ddf

def sort_values(self, by: Union[str, List[str]], ddf: DataFrame = None, axis=0, ascending=True, kind='stable'):
"""
@@ -981,6 +1050,17 @@ def describe(self, include=None, exclude=None, output='terminal'):
print('\n')
return result

def view(self):
"""
Create a view of this dataframe.
"""
view_name = '_' + self.name + '_view'
if view_name in self.dataset:
self.dataset.drop(view_name)
dfv = self.dataset.create_dataframe(view_name)
for f in self.columns.values():
dfv._add_view(f)
return dfv


class HDF5DataFrameGroupBy(DataFrameGroupBy):
@@ -1656,4 +1736,4 @@ def _ordered_merge(left: DataFrame,
if right[k].indexed:
ops.ordered_map_valid_indexed_stream(right[k], right_map, dest_f, invalid)
else:
ops.ordered_map_valid_stream(right[k], right_map, dest_f, invalid)
ops.ordered_map_valid_stream(right[k], right_map, dest_f, invalid)
9 changes: 9 additions & 0 deletions exetera/core/dataset.py
@@ -48,11 +48,20 @@ def __init__(self, session, dataset_path, mode, name):
self._file = h5py.File(dataset_path, mode)
self._dataframes = dict()

# initialize the dataframes and fields
for group in self._file.keys():
if group not in ('trash',):
h5group = self._file[group]
dataframe = edf.HDF5DataFrame(self, group, h5group=h5group)
self._dataframes[group] = dataframe
# bind the views
for df in self._dataframes.values():
for field in df.columns.values():
if field.is_view():
source_name = field._field.attrs['source_field']
idx = source_name.rfind('/')
source_field = self._dataframes[source_name[1:idx]][source_name[idx+1:]]
df._bind_view(field, source_field)

@property
def session(self):