Dfview #297
Open
deng113jie wants to merge 239 commits into master from dfview
a2d7008
fixing issue #86 from upstream:
62925bb
add unit test for Field get_spans() function
0e313dc
remove unuseful line comments
e211371
add dataset, datafreame class
deng113jie 39e4535
Merge remote-tracking branch 'upstream/master'
deng113jie 329a7cc
closing issue 92, reset the dataset when call field.data.clear
deng113jie d9d8b02
closing issue 92, reset the dataset when call field.data.clear
deng113jie f7ba342
Merge branch 'master' into patch92
deng113jie 21f0fa9
add unittest for field.data.clear function
deng113jie c9363ef
recover the dataset file to avoid merge error when fixing issue 92
deng113jie 14fc1f3
fix end_of_file char in dataset.py
deng113jie 2d13342
add get_span for index string field
deng113jie 666073e
unittest for get_span functions on different types of field, eg. fixe…
deng113jie 73aa50e
Merge remote-tracking branch 'upstream/master'
deng113jie 689cc3f
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie 8ba818f
dataframe basic methods and unittest
deng113jie abb3337
more dataframe operations
deng113jie 3180cbd
fix upstream merge conflict
deng113jie 9b9c420
minor fixing
deng113jie 55989d6
update get_span to field subclass
deng113jie cd69d04
solve conflict
deng113jie f2136d5
intermedia commit due to test pr 118
deng113jie 30953e3
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 0dccc6e
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie 000463d
Implementate get_spans(ndarray) and get_spans(ndarray1, ndarray2) fun…
deng113jie 37972b5
Merge branch 'dataframe'
deng113jie 74c1dad
Move the get_spans functions from persistence to operations.
deng113jie bf210c4
Merge branch 'dataframe'
deng113jie 95c1645
minor edits for pull request
deng113jie 5db42d2
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 664e255
remove dataframe for pull request
deng113jie 02265fe
remove dataframe test for pr
deng113jie f536652
add dataframe
deng113jie bafe9cf
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie 223dbe9
fix get_spans_for_2_fields_by_spans, fix the unittest
deng113jie cc48016
Merge branch 'master' into dataframe
deng113jie 948ce1a
Initial commit for is_sorted method on Field
atbenmurray 37b8ac2
minor edits for the pr
deng113jie 0369c92
fix minor edit error for pr
deng113jie 2096828
Merge branch 'master' into dataframe
deng113jie f213240
add apply_index and apply_filter methods on fields
deng113jie b050d74
Merging from recent PRs
atbenmurray 76b5ff1
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie fe36b94
Adding in missing tests for all field types for is_sorted
atbenmurray daa6012
update the apply filter and apply index on Fields
deng113jie 5c43f38
minor updates to line up w/ upstream
deng113jie 459b91c
update apply filter & apply index methods in fields that differ if de…
deng113jie c0ac960
updated the apply_index and apply_filter methods in fields. Use oldda…
deng113jie dd0867d
add dataframe basic functions and operations; working on dataset to e…
deng113jie e52d825
add functions in dataframe
deng113jie 463ea70
integrates the dataset, dataframe into the session
deng113jie 76d1952
update the fieldsimporter and field.create_like methods to call dataf…
deng113jie 7cfeceb
add license info to a few files
deng113jie b1cb082
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie eaac2b6
csv_reader_with_njit
Liyuan-Chen-1024 a9ce1fb
change output_excel from string to int
Liyuan-Chen-1024 113a83f
Merge branch 'master' of github.com:KCL-BMEIS/ExeTera into importer_c…
Liyuan-Chen-1024 375982c
solve merge conflict
Liyuan-Chen-1024 e9d1053
initialize column_idx matrix outside of the njit function
Liyuan-Chen-1024 e1ed80d
use np.fromfile to load the file into byte array
Liyuan-Chen-1024 f4fe394
Merge branch 'master' into field_is_sorted_method
atbenmurray a057677
Refactoring and reformatting of some of the dataset / dataframe code;…
atbenmurray 0845a63
Merge branch 'issort' into dataframe
deng113jie 4d2886a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie db3ec9f
Work on fast csv reading
atbenmurray f2efedc
Address issue #138 on minor tweaks
deng113jie 4926330
remove draft group.py from repo
deng113jie 56bb190
Improved performance from the fast csv reader through avoiding ndarra…
atbenmurray 04d810b
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie f0b7e37
fix dataframe api
deng113jie 18d49a6
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 737eeed
fixing #13 and #14, add dest parameter to get_spans(), tidy up the fi…
deng113jie 732762d
minor fix remove dataframe and file property from dataset, as not use…
deng113jie ab6508c
minor fix on unittest
deng113jie 39027f7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 358d82b
add docstring for dataset
deng113jie 98a4d7f
copy/move for dataframe; docstrings
deng113jie e6b1a57
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie a0e0167
categorical field: convert from byte int to value int within njit fun…
Liyuan-Chen-1024 204bd39
merge
Liyuan-Chen-1024 c788b96
Adding in of pseudocode version of fast categorical lookup
atbenmurray 60f2ba9
clean up the comments
Liyuan-Chen-1024 bba4829
Merge branch 'importer_csv_reader' of github.com:KCL-BMEIS/ExeTera in…
Liyuan-Chen-1024 c341eb2
docstrings for dataframe
deng113jie b23f1d8
Major reworking of apply_filter / apply_index for fields; they should…
atbenmurray 63bd5a0
add unittest for various fields in dataframe
deng113jie 650014e
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie cb9f2a2
add unittest for Dataframe.add/drop/move
deng113jie 013f401
minor change on name to make sure name in consistent over dataframe, …
deng113jie 18ce7ce
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie 8657081
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie 51e2fec
Completed initial pass of memory fields for all types
atbenmurray 955aede
categloric field.keys will return byte key as string, thus minor chan…
deng113jie 039d8ee
solved the byte to string issue, problem is dof python 3.7 and 3.8
deng113jie 547bb88
Miscellaneous field fixes; fixed issues with dataframe apply_filter /…
atbenmurray 700635f
Moving most binary op logic out into a static method in FieldDataOps
atbenmurray dec92ca
Resolved conflicts in dataframe.py
atbenmurray b631932
Dataframe copy, move and drop operations have been moved out of the D…
atbenmurray 4804417
Fixing accidental introduction of CRLF to abstract_types
atbenmurray f16cb09
Fixed bug where apply_filter and apply_index weren't returning a fiel…
atbenmurray 37dac08
Fixed issue in timestamp_field_create_like when group is set and is a…
atbenmurray 8c62e0a
persistence.filter_duplicate_fields now supports fields as well as nd…
atbenmurray cfcb69b
sort_on message now shows in verbose mode under all circumstances
atbenmurray 22504ef
Fixed bug in apply filter when a destination dataset is applied
atbenmurray 23c373d
Added a test to catch dataframe.apply_filter bug
atbenmurray 98624e6
Bug fix: categorical_field_constructor in fields.py was returning num…
atbenmurray 76d8717
Copying data before filtering, as filtering in h5py is very slow
atbenmurray 44a9c3d
Adding apply_spans functions to fields
atbenmurray 210f847
Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had be…
atbenmurray f8829ae
Merge commit 'refs/pull/149/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie a7d6673
Issues found with indexed strings and merging; fixes found for apply_…
atbenmurray 3d322c2
Updated merge functions to consistently return memory fields if not p…
atbenmurray 294ec3a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie e8edd9d
concate cat keys instead of padding
Liyuan-Chen-1024 c2ba9ff
some docstring for fields
deng113jie 1a19815
dataframe copy/move/drop and unittest
deng113jie 1fb0362
Fixing issue with dataframe move/copy being static
atbenmurray 937368e
Updating HDF5Field writeable methods to account for prior changes
atbenmurray cddcf66
Adding merge functionality for dataframes
atbenmurray 534cbd4
dataset.drop is a member method of Dataset as it did not make sense f…
atbenmurray e5dc536
Added missing methods / properties to DataFrame ABC
atbenmurray 9b1a4a9
minor update on dataframe static function
deng113jie 1967685
minor update
deng113jie 6c3270a
Merge commit 'refs/pull/157/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie 6bdb08e
minor update session
deng113jie 3680436
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie cf5f5a6
minor comments update
deng113jie 23ad71a
minor comments update
deng113jie 75eefc0
add unittest for csv_reader_speedup.py
Liyuan-Chen-1024 3ddc916
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 3a6dc51
Merge commit 'refs/pull/137/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie c02fe32
count operation; logical not for numeric fields
deng113jie 58159d0
remove csv speed up work from commit
deng113jie a7b477d
minor update
deng113jie 903f3b4
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 29f736d
unit test for logical not in numeric field
deng113jie 7fd9bdc
patch for get_spans for datastore
deng113jie 04df757
tests for two fields
deng113jie e47e15c
add as type to numeric field
deng113jie a4b14fb
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie 5492b94
seperate the unittest of get_spans by datastore reader
deng113jie 25320bd
unittest for astype
deng113jie e289c6b
Merge branch 'dspatch'
deng113jie a59c13a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 87df0bc
update astype for fields, update logical_not for numeric fields
deng113jie 0875149
remove dataframe view commits
deng113jie c335831
remove kwargs in get_spans in session, add fields back for backward c…
deng113jie bdf783a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie ea20c60
remove filter view tests
deng113jie 778d56c
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 611601a
partial commit on viewer
deng113jie 66867b7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie fbe396f
remote view from git
deng113jie c2c7185
add df.describe unittest
deng113jie 78cc222
sync with upstream
deng113jie 001134c
Delete python-publish.yml
deng113jie eb0bb76
Update python-app.yml
deng113jie d646ac2
Update python-app.yml
deng113jie b55775b
dataframe describe function
deng113jie 0d23098
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie 7774c6f
sync with upstream
deng113jie ae1d621
Update python-app.yml
deng113jie 3d5738e
alternative get_timestamp notebook for discussion
deng113jie 4685c6b
update the notebook output of linux and mac
deng113jie dc38d28
update format
deng113jie 0df34bc
update the to_timestamp and to_timestamp function in utils
deng113jie 87353e3
add unittest for utils to_timestamp and to_datetimie
deng113jie 87abe47
fix for pr
deng113jie a3719ef
setup github action specific for windows for cython
deng113jie ed42f70
minor workflow fix
deng113jie 2157da2
add example pyx file
deng113jie 1abeaa7
fix package upload command on win; as the git action
deng113jie e77562e
add twine as tools
deng113jie 03208aa
add linux action file
deng113jie 35430f2
update the linux build command
deng113jie de3e7e5
build workflow for macos
deng113jie a8af750
minor update the macos workflow
deng113jie d41a24b
fixed timestamp issue on windows by add timezone info to datetime
deng113jie c98b87c
finanlize workflow file, compile react to publish action only
deng113jie a57c413
avoid the bytearray vs string error in windows by converting result to
deng113jie 764650b
fixing string vs bytesarray issue
deng113jie 4676901
update categorical field key property, change the key, value to bytes if
deng113jie e5d74c6
solved index must be np.int64 error
deng113jie 030d587
all unittest error on windoes removed
deng113jie 55e62eb
Merge branch 'master' into win_actions
deng113jie 7cf7bae
minor update on workflow file
deng113jie 521142e
minor update workflow file
deng113jie 9373fd2
minor fix: use pip install -r ; remove unused import in utils.py
deng113jie 6f67ac4
update action file
deng113jie 703a19a
remove change on test_presistence on uint32 to int32
deng113jie 613532a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie b981bb9
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie e35c1c4
Merge branch 'KCL-BMEIS:master' into master
deng113jie a7ee946
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie 0f319d1
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie a5ab148
add output argument for describe function in dataframe, so that the r…
deng113jie f50ab1f
comment all print function in unittest
deng113jie 94a074a
modify the remap function for categorical field and categorical mem f…
deng113jie ef563e9
Added check to ensure ExeTera entry point actually works after pip in…
ericspod e0caad0
Attempted Fix
ericspod ba06c9f
find_packages
ericspod 25a3cc3
Tweak
ericspod 2e03557
Merge commit 'refs/pull/253/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie 5f83d7e
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie d932731
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie fff9a19
fixing issue 214
deng113jie ffbc4f0
fixing bug on dataset set item
deng113jie 7f82d58
fixing apply_span_src in fields.py
deng113jie 474425a
revert change on field
deng113jie 1c59265
add unittest for dataset setitem bug
deng113jie 5002c65
examples using dataset generated by randomdataset
deng113jie efcde7c
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 825aaf1
update on examples
deng113jie 47e2e7e
update example: added two csv files and one json files and one import…
deng113jie 071be03
minor update on readme
deng113jie c908397
remove output from notebooks
deng113jie 72cca74
minor update on example notebooks
deng113jie bd9a85f
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 62d0d69
update examples
deng113jie 40e8770
update the example notebooks
deng113jie 81bbbcd
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 3507595
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie 099c8f6
df view init commit
deng113jie 3d6966e
dataframe view init commit:
deng113jie 735b7a5
dataframe view updates 3:
deng113jie 85169e6
change filtered data presentation from data array to field __getitem__
deng113jie 9667dd7
minor update
deng113jie c333f6d
updated view functions,
deng113jie bb764bd
modify the association between view fields with field array, by assig…
deng113jie 7803d76
update the view:
deng113jie dfb36ab
fixed the data[:] for indexed string fields
deng113jie 5c93b43
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into df…
deng113jie fe9cee8
update Eric's comments
deng113jie c1ad9ba
minor update
deng113jie 80c0339
update unittests for dataframe view
deng113jie 6cb1d3e
minor update
deng113jie e8cf7f2
add persistence over view so that view
deng113jie 3153f2b
add unittest for view presistence
deng113jie 135260e
documents on future work
deng113jie
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,6 +36,7 @@ Getting started | |
dataset.md | ||
dataframe.md | ||
field.md | ||
view.md | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# View | ||
## What is a view | ||
A view is a special field that has a ‘source_field’ entry in its hdf5 attributes. A view initializes its dataset from the location specified by ‘source_field’ rather than from its own hdf5 storage. | ||
|
||
|
||
The benefit of using a view is reduced disk IO during apply_index and apply_filter operations. The index or filter is stored in the dataframe; the view reads from the source field and combines the data with the index or filter to perform the filtering without writing an extra copy of the data. | ||
|
||
Specifically, dataframe._filters_grp is the hdf5 group where the indices or filters are stored. A boolean filter is converted to an integer index during the apply_filter function. The index filter is stored as a NumericField by the dataframe._add_view function. | ||
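The conversion from a boolean filter to an integer index can be illustrated with plain NumPy. This is only a minimal sketch of the idea, not the code used by apply_filter; the helper name `filter_to_index` is hypothetical.

```python
import numpy as np

def filter_to_index(bool_filter):
    """Convert a boolean filter into the equivalent sorted integer index.

    Hypothetical helper for illustration; not part of the ExeTera API.
    """
    bool_filter = np.asarray(bool_filter, dtype=bool)
    # np.nonzero returns one array per dimension; a 1-D filter has only one.
    return np.nonzero(bool_filter)[0].astype(np.int64)

values = np.array([10, 20, 30, 40, 50])
keep = np.array([True, False, True, True, False])
index = filter_to_index(keep)

# The integer index selects the same rows as the boolean filter.
assert np.array_equal(values[index], values[keep])
```

Storing the integer form means apply_filter and apply_index can share a single representation for the stored filter.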
|
||
![View Structure](view_arch.png) | ||
|
||
## How to generate a view | ||
|
||
The user can either 1) call apply_filter or apply_index to generate views with index filters, or 2) simply call dataframe.view() to generate views without index filters. The internal function dataframe._add_view() takes care of the construction. Please note that a view can only be created from a field that exists in the same dataset/file, because associating a view with a field from a different file would introduce a lot of uncertainty. | ||
|
||
A view is only a special instance of a field, so its construction is similar: 1) call field.base_view_constructor to set up the h5group, and 2) call the specific field type constructor to initialize the field instance. However, there is also a special action: 3) attach the view to the source field so that the source field can notify the view if the underlying data is changed. These three actions can be seen in the upper part of dataframe._add_view(). | ||
|
||
You can tell whether a field is a view by calling field.is_view(). This method checks for the ‘source_field’ attribute in the field’s hdf5 group; if the attribute is present, the field is a view. | ||
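The attribute check can be sketched as follows. To keep the example self-contained, a small `GroupSketch` class with a dict-backed `attrs` stands in for an hdf5 group (h5py exposes group attributes through a mapping-like `attrs` object). The class, the attribute values and the path `/df/age` are all hypothetical.

```python
class GroupSketch:
    """Hypothetical stand-in for an hdf5 group; it only carries attributes."""

    def __init__(self, **attrs):
        self.attrs = dict(attrs)


def is_view(group):
    # A field is treated as a view when its group has a 'source_field' attribute.
    return 'source_field' in group.attrs


plain_field = GroupSketch(fieldtype='numeric')
view_field = GroupSketch(fieldtype='numeric', source_field='/df/age')

print(is_view(plain_field), is_view(view_field))
```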
|
||
## Fetch data from a view | ||
As a view is just like a field to users, you can still use field.data[:] to fetch the data from the view. In the field implementation, the member ‘data’ is a FieldArray. Hence, the difference between a view and a field lies in the initialization of the FieldArray. Normally, the FieldArray loads the hdf5 dataset of the current hdf5 group (where the field is stored); in the case of a view, however, the FieldArray loads data from the hdf5 group specified by the ‘source_field’ attribute. | ||
|
||
In addition, when there is a filter/index for the view (field.filter is not None), the FieldArray fetches the filter/index first and uses it to mask the underlying data. This can be found in FieldArray.__getitem__, or in IndexFieldArray.__getitem__ for indexed strings. | ||
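The way a stored index filter combines with the item passed to `__getitem__` can be sketched with NumPy arrays standing in for the hdf5 datasets. `ViewArraySketch` is a hypothetical class written for illustration, not the actual FieldArray.

```python
import numpy as np

class ViewArraySketch:
    """Hypothetical array wrapper that reads through an optional index filter."""

    def __init__(self, source, index_filter=None):
        self._source = source        # data of the source field
        self._filter = index_filter  # integer index stored for the view, or None

    def __len__(self):
        return len(self._source) if self._filter is None else len(self._filter)

    def __getitem__(self, item):
        if self._filter is None:
            return self._source[item]
        # Narrow the filter by the requested item, then read only those rows.
        return self._source[self._filter[item]]


source = np.array([5, 6, 7, 8, 9])
view = ViewArraySketch(source, np.array([0, 2, 4]))

print(view[:], view[1:], view[1])
```

Because the filter is narrowed before the source is read, only the rows that are actually requested are fetched.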
|
||
## Life-cycle of a view | ||
### A view from a field | ||
Step0: you have a field in the dataframe and call dataframe.apply_filter(), apply_index() or view(). | ||
|
||
Step1: the view is created and attached to the source field. The attach method is defined on the field, but is called in dataframe._add_view. | ||
|
||
Step2: when view.data is accessed, the view initializes a FieldArray that points to the source field rather than its own dataset. | ||
|
||
Step3: when field.data.write or field.data.clear is called, meaning the data will be modified, data.write or data.clear calls field.update() to notify the field of the action. The field then passes the notification to the views in field.notify(). Once it has received the notification, view.update() can perform certain actions. | ||
|
||
Step4: at the moment, view.update() copies the original data to its own dataset, re-initializes the data interface and deletes the ‘source_field’ attribute (so that it is no longer a view). | ||
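The steps above can be sketched in plain Python as an attach/notify/update cycle. `FieldSketch` and `ViewSketch` are hypothetical classes using in-memory lists instead of hdf5 datasets. The sketch also assumes that the notification is sent before the source data is modified, so that the view can still copy the original contents, and it treats `write` as an append.

```python
class FieldSketch:
    """Hypothetical source field that notifies attached views before changes."""

    def __init__(self, values):
        self._values = list(values)
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def detach(self, view):
        self._views.remove(view)

    def read(self):
        return list(self._values)

    def write(self, values):
        # Assumption: notify first, so views can keep the unmodified data.
        self.notify()
        self._values.extend(values)

    def notify(self):
        # Iterate over a copy because views detach themselves in update().
        for view in list(self._views):
            view.update()


class ViewSketch:
    """Hypothetical view that reads from its source until the source changes."""

    def __init__(self, source):
        self._source = source
        self._own = None
        source.attach(self)  # Step1: attach to the source field

    def is_view(self):
        return self._source is not None

    def read(self):
        # Step2: read through to the source while still a view.
        return self._source.read() if self.is_view() else list(self._own)

    def update(self):
        # Step4: copy the data, then stop being a view.
        self._own = self._source.read()
        self._source.detach(self)
        self._source = None


field = FieldSketch([1, 2, 3])
view = ViewSketch(field)
before = view.read()
field.write([4])  # Step3: the write triggers notify() and view.update()
after = view.read()
```

After the write, the view holds its own copy of the original data and no longer follows the source field.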
|
||
![View Structure](view_life.png) | ||
|
||
### An existing view | ||
As the view is stored in the hdf5 file, the view relationship persists across sessions. Upon loading a dataset (in dataset.__init__), the dataset checks whether there is a view and calls dataframe._bind_view() to attach the view to the field during initialization of the dataset/dataframe/fields. This is why a view can only be created from a field that exists in the same dataset (hdf5 file). | ||
|
||
|
||
## Future work | ||
### Data fetching performance | ||
Different ways of getting data out of HDF5 can affect performance considerably. For example, it is generally better to read data from HDF5 by chunk rather than by index. In the current implementation (fieldarray.__getitem__), we first apply the item to the index filter, then fetch the data from hdf5. As hdf5 does not support unordered data access, we sort the resulting indices and restore the original order when returning the data. Further work can be done on how to order the application of the index filter and the item (specified by the user through __getitem__). For example, with a large volume of data and a small index filter, it might make sense to apply the filter first. However, in the case of large filters, it will be faster to load the data into memory first. Where the boundary lies is worth investigating. | ||
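The sort-then-restore step can be sketched with NumPy. This is an illustration only, with a NumPy array standing in for the hdf5 dataset, and `fetch_in_sorted_order` is a hypothetical name. Note that the sketch uses `np.unique` with `return_inverse=True` rather than a plain sort, on the assumption that the backing store wants increasing indices without repeats; this is a substitute technique, not necessarily what fieldarray.__getitem__ does.

```python
import numpy as np

def fetch_in_sorted_order(source, index):
    """Read source[index] while only accessing the source in increasing order.

    Hypothetical helper; 'source' stands in for an hdf5 dataset.
    """
    index = np.asarray(index, dtype=np.int64)
    # unique_sorted is increasing and duplicate-free; inverse maps each
    # requested position back to its slot in unique_sorted.
    unique_sorted, inverse = np.unique(index, return_inverse=True)
    fetched = np.asarray(source[unique_sorted])
    # Restore the order (and any repeats) the caller asked for.
    return fetched[inverse]


source = np.arange(100, 110)
wanted = np.array([7, 2, 9, 2, 0])
result = fetch_in_sorted_order(source, wanted)

assert np.array_equal(result, source[wanted])
```

The cost of this approach is one sort of the index plus one gather in memory, which is part of the trade-off described above.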
|
||
### Dependency between views | ||
In the current implementation, all views depend on the source field. When the data in the field changes, every attached view copies the data over and writes its own copy. This is inefficient when a number of views are attached. A better approach could be for only one of the views to copy the data and become the source field for the remaining views. This needs a detailed design and implementation in fields.update(). |
check if the same dataset