fixing unittest errors on windows #222

Merged
merged 190 commits into from
Oct 18, 2021
190 commits
a2d7008
fixing issue #86 from upstream:
Mar 11, 2021
62925bb
add unit test for Field get_spans() function
Mar 12, 2021
0e313dc
remove unuseful line comments
Mar 12, 2021
e211371
add dataset, datafreame class
deng113jie Mar 15, 2021
39e4535
Merge remote-tracking branch 'upstream/master'
deng113jie Mar 15, 2021
329a7cc
closing issue 92, reset the dataset when call field.data.clear
deng113jie Mar 15, 2021
d9d8b02
closing issue 92, reset the dataset when call field.data.clear
deng113jie Mar 15, 2021
f7ba342
Merge branch 'master' into patch92
deng113jie Mar 15, 2021
21f0fa9
add unittest for field.data.clear function
deng113jie Mar 15, 2021
c9363ef
recover the dataset file to avoid merge error when fixing issue 92
deng113jie Mar 15, 2021
14fc1f3
fix end_of_file char in dataset.py
deng113jie Mar 15, 2021
2d13342
add get_span for index string field
deng113jie Mar 16, 2021
666073e
unittest for get_span functions on different types of field, eg. fixe…
deng113jie Mar 17, 2021
73aa50e
Merge remote-tracking branch 'upstream/master'
deng113jie Mar 18, 2021
689cc3f
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 18, 2021
8ba818f
dataframe basic methods and unittest
deng113jie Mar 19, 2021
abb3337
more dataframe operations
deng113jie Mar 22, 2021
3180cbd
fix upstream merge conflict
deng113jie Mar 24, 2021
9b9c420
minor fixing
deng113jie Mar 24, 2021
55989d6
update get_span to field subclass
deng113jie Mar 24, 2021
cd69d04
solve conflict
deng113jie Mar 24, 2021
f2136d5
intermedia commit due to test pr 118
deng113jie Mar 24, 2021
30953e3
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 24, 2021
0dccc6e
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 24, 2021
000463d
Implementate get_spans(ndarray) and get_spans(ndarray1, ndarray2) fun…
deng113jie Mar 24, 2021
37972b5
Merge branch 'dataframe'
deng113jie Mar 24, 2021
74c1dad
Move the get_spans functions from persistence to operations.
deng113jie Mar 25, 2021
bf210c4
Merge branch 'dataframe'
deng113jie Mar 25, 2021
95c1645
minor edits for pull request
deng113jie Mar 25, 2021
5db42d2
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Mar 25, 2021
664e255
remove dataframe for pull request
deng113jie Mar 25, 2021
02265fe
remove dataframe test for pr
deng113jie Mar 25, 2021
f536652
add dataframe
deng113jie Mar 25, 2021
bafe9cf
Merge remote-tracking branch 'upstream/master' into dataframe
deng113jie Mar 25, 2021
223dbe9
fix get_spans_for_2_fields_by_spans, fix the unittest
deng113jie Mar 25, 2021
cc48016
Merge branch 'master' into dataframe
deng113jie Mar 25, 2021
948ce1a
Initial commit for is_sorted method on Field
atbenmurray Mar 25, 2021
37b8ac2
minor edits for the pr
deng113jie Mar 26, 2021
0369c92
fix minor edit error for pr
deng113jie Mar 26, 2021
2096828
Merge branch 'master' into dataframe
deng113jie Mar 26, 2021
f213240
add apply_index and apply_filter methods on fields
deng113jie Mar 26, 2021
b050d74
Merging from recent PRs
atbenmurray Mar 26, 2021
76b5ff1
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie Mar 26, 2021
fe36b94
Adding in missing tests for all field types for is_sorted
atbenmurray Mar 26, 2021
daa6012
update the apply filter and apply index on Fields
deng113jie Mar 26, 2021
5c43f38
minor updates to line up w/ upstream
deng113jie Mar 26, 2021
459b91c
update apply filter & apply index methods in fields that differ if de…
deng113jie Mar 26, 2021
c0ac960
updated the apply_index and apply_filter methods in fields. Use oldda…
deng113jie Mar 29, 2021
dd0867d
add dataframe basic functions and operations; working on dataset to e…
deng113jie Mar 30, 2021
e52d825
add functions in dataframe
deng113jie Apr 1, 2021
463ea70
integrates the dataset, dataframe into the session
deng113jie Apr 6, 2021
76d1952
update the fieldsimporter and field.create_like methods to call dataf…
deng113jie Apr 7, 2021
7cfeceb
add license info to a few files
deng113jie Apr 7, 2021
b1cb082
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 8, 2021
eaac2b6
csv_reader_with_njit
Liyuan-Chen-1024 Apr 8, 2021
a9ce1fb
change output_excel from string to int
Liyuan-Chen-1024 Apr 9, 2021
113a83f
Merge branch 'master' of github.com:KCL-BMEIS/ExeTera into importer_c…
Liyuan-Chen-1024 Apr 9, 2021
375982c
solve merge conflict
Liyuan-Chen-1024 Apr 9, 2021
e9d1053
initialize column_idx matrix outside of the njit function
Liyuan-Chen-1024 Apr 9, 2021
e1ed80d
use np.fromfile to load the file into byte array
Liyuan-Chen-1024 Apr 9, 2021
f4fe394
Merge branch 'master' into field_is_sorted_method
atbenmurray Apr 11, 2021
a057677
Refactoring and reformatting of some of the dataset / dataframe code;…
atbenmurray Apr 11, 2021
0845a63
Merge branch 'issort' into dataframe
deng113jie Apr 12, 2021
4d2886a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into da…
deng113jie Apr 12, 2021
db3ec9f
Work on fast csv reading
atbenmurray Apr 12, 2021
f2efedc
Address issue #138 on minor tweaks
deng113jie Apr 12, 2021
4926330
remove draft group.py from repo
deng113jie Apr 12, 2021
56bb190
Improved performance from the fast csv reader through avoiding ndarra…
atbenmurray Apr 12, 2021
04d810b
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 13, 2021
f0b7e37
fix dataframe api
deng113jie Apr 13, 2021
18d49a6
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 13, 2021
737eeed
fixing #13 and #14, add dest parameter to get_spans(), tidy up the fi…
deng113jie Apr 13, 2021
732762d
minor fix remove dataframe and file property from dataset, as not use…
deng113jie Apr 13, 2021
ab6508c
minor fix on unittest
deng113jie Apr 13, 2021
39027f7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 14, 2021
358d82b
add docstring for dataset
deng113jie Apr 14, 2021
98a4d7f
copy/move for dataframe; docstrings
deng113jie Apr 15, 2021
e6b1a57
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 15, 2021
a0e0167
categorical field: convert from byte int to value int within njit fun…
Liyuan-Chen-1024 Apr 15, 2021
204bd39
merge
Liyuan-Chen-1024 Apr 15, 2021
c788b96
Adding in of pseudocode version of fast categorical lookup
atbenmurray Apr 15, 2021
60f2ba9
clean up the comments
Liyuan-Chen-1024 Apr 15, 2021
bba4829
Merge branch 'importer_csv_reader' of github.com:KCL-BMEIS/ExeTera in…
Liyuan-Chen-1024 Apr 15, 2021
c341eb2
docstrings for dataframe
deng113jie Apr 16, 2021
b23f1d8
Major reworking of apply_filter / apply_index for fields; they should…
atbenmurray Apr 16, 2021
63bd5a0
add unittest for various fields in dataframe
deng113jie Apr 16, 2021
650014e
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 16, 2021
cb9f2a2
add unittest for Dataframe.add/drop/move
deng113jie Apr 16, 2021
013f401
minor change on name to make sure name in consistent over dataframe, …
deng113jie Apr 16, 2021
18ce7ce
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie Apr 16, 2021
8657081
minor fixed of adding prefix b to string in test_session and test_dat…
deng113jie Apr 16, 2021
51e2fec
Completed initial pass of memory fields for all types
atbenmurray Apr 16, 2021
955aede
categloric field.keys will return byte key as string, thus minor chan…
deng113jie Apr 16, 2021
039d8ee
solved the byte to string issue, problem is dof python 3.7 and 3.8
deng113jie Apr 16, 2021
547bb88
Miscellaneous field fixes; fixed issues with dataframe apply_filter /…
atbenmurray Apr 16, 2021
700635f
Moving most binary op logic out into a static method in FieldDataOps
atbenmurray Apr 16, 2021
dec92ca
Resolved conflicts in dataframe.py
atbenmurray Apr 16, 2021
b631932
Dataframe copy, move and drop operations have been moved out of the D…
atbenmurray Apr 17, 2021
4804417
Fixing accidental introduction of CRLF to abstract_types
atbenmurray Apr 17, 2021
f16cb09
Fixed bug where apply_filter and apply_index weren't returning a fiel…
atbenmurray Apr 17, 2021
37dac08
Fixed issue in timestamp_field_create_like when group is set and is a…
atbenmurray Apr 17, 2021
8c62e0a
persistence.filter_duplicate_fields now supports fields as well as nd…
atbenmurray Apr 17, 2021
cfcb69b
sort_on message now shows in verbose mode under all circumstances
atbenmurray Apr 17, 2021
22504ef
Fixed bug in apply filter when a destination dataset is applied
atbenmurray Apr 17, 2021
23c373d
Added a test to catch dataframe.apply_filter bug
atbenmurray Apr 17, 2021
98624e6
Bug fix: categorical_field_constructor in fields.py was returning num…
atbenmurray Apr 17, 2021
76d8717
Copying data before filtering, as filtering in h5py is very slow
atbenmurray Apr 17, 2021
44a9c3d
Adding apply_spans functions to fields
atbenmurray Apr 18, 2021
210f847
Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had be…
atbenmurray Apr 18, 2021
f8829ae
Merge commit 'refs/pull/149/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 19, 2021
a7d6673
Issues found with indexed strings and merging; fixes found for apply_…
atbenmurray Apr 19, 2021
3d322c2
Updated merge functions to consistently return memory fields if not p…
atbenmurray Apr 19, 2021
294ec3a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 20, 2021
e8edd9d
concate cat keys instead of padding
Liyuan-Chen-1024 Apr 20, 2021
c2ba9ff
some docstring for fields
deng113jie Apr 20, 2021
1a19815
dataframe copy/move/drop and unittest
deng113jie Apr 20, 2021
1fb0362
Fixing issue with dataframe move/copy being static
atbenmurray Apr 20, 2021
937368e
Updating HDF5Field writeable methods to account for prior changes
atbenmurray Apr 20, 2021
cddcf66
Adding merge functionality for dataframes
atbenmurray Apr 20, 2021
534cbd4
dataset.drop is a member method of Dataset as it did not make sense f…
atbenmurray Apr 20, 2021
e5dc536
Added missing methods / properties to DataFrame ABC
atbenmurray Apr 20, 2021
9b1a4a9
minor update on dataframe static function
deng113jie Apr 20, 2021
1967685
minor update
deng113jie Apr 20, 2021
6c3270a
Merge commit 'refs/pull/157/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 20, 2021
6bdb08e
minor update session
deng113jie Apr 21, 2021
3680436
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 21, 2021
cf5f5a6
minor comments update
deng113jie Apr 21, 2021
23ad71a
minor comments update
deng113jie Apr 21, 2021
75eefc0
add unittest for csv_reader_speedup.py
Liyuan-Chen-1024 Apr 21, 2021
3ddc916
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 21, 2021
3a6dc51
Merge commit 'refs/pull/137/head' of https://github.com/KCL-BMEIS/Exe…
deng113jie Apr 22, 2021
c02fe32
count operation; logical not for numeric fields
deng113jie Apr 26, 2021
58159d0
remove csv speed up work from commit
deng113jie Apr 27, 2021
a7b477d
minor update
deng113jie Apr 27, 2021
903f3b4
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Apr 27, 2021
29f736d
unit test for logical not in numeric field
deng113jie Apr 27, 2021
7fd9bdc
patch for get_spans for datastore
deng113jie Apr 28, 2021
04df757
tests for two fields
deng113jie Apr 28, 2021
e47e15c
add as type to numeric field
deng113jie Apr 29, 2021
a4b14fb
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie Apr 29, 2021
5492b94
seperate the unittest of get_spans by datastore reader
deng113jie Apr 29, 2021
25320bd
unittest for astype
deng113jie Apr 29, 2021
e289c6b
Merge branch 'dspatch'
deng113jie May 4, 2021
a59c13a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 4, 2021
87df0bc
update astype for fields, update logical_not for numeric fields
deng113jie May 10, 2021
0875149
remove dataframe view commits
deng113jie May 10, 2021
c335831
remove kwargs in get_spans in session, add fields back for backward c…
deng113jie May 11, 2021
bdf783a
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 11, 2021
ea20c60
remove filter view tests
deng113jie May 11, 2021
778d56c
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie May 27, 2021
611601a
partial commit on viewer
deng113jie Jun 10, 2021
66867b7
Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera
deng113jie Sep 20, 2021
fbe396f
remote view from git
deng113jie Sep 20, 2021
c2c7185
add df.describe unittest
deng113jie Sep 22, 2021
78cc222
sync with upstream
deng113jie Sep 23, 2021
001134c
Delete python-publish.yml
deng113jie Sep 23, 2021
eb0bb76
Update python-app.yml
deng113jie Sep 23, 2021
d646ac2
Update python-app.yml
deng113jie Sep 23, 2021
b55775b
dataframe describe function
deng113jie Sep 23, 2021
0d23098
Merge branch 'master' of https://github.com/deng113jie/ExeTera
deng113jie Sep 23, 2021
7774c6f
sync with upstream
deng113jie Sep 23, 2021
ae1d621
Update python-app.yml
deng113jie Sep 30, 2021
3d5738e
alternative get_timestamp notebook for discussion
deng113jie Oct 5, 2021
4685c6b
update the notebook output of linux and mac
deng113jie Oct 5, 2021
dc38d28
update format
deng113jie Oct 5, 2021
0df34bc
update the to_timestamp and to_timestamp function in utils
deng113jie Oct 11, 2021
87353e3
add unittest for utils to_timestamp and to_datetimie
deng113jie Oct 11, 2021
87abe47
fix for pr
deng113jie Oct 11, 2021
a3719ef
setup github action specific for windows for cython
deng113jie Oct 12, 2021
ed42f70
minor workflow fix
deng113jie Oct 12, 2021
2157da2
add example pyx file
deng113jie Oct 12, 2021
1abeaa7
fix package upload command on win; as the git action
deng113jie Oct 12, 2021
e77562e
add twine as tools
deng113jie Oct 12, 2021
03208aa
add linux action file
deng113jie Oct 12, 2021
35430f2
update the linux build command
deng113jie Oct 12, 2021
de3e7e5
build workflow for macos
deng113jie Oct 12, 2021
a8af750
minor update the macos workflow
deng113jie Oct 12, 2021
d41a24b
fixed timestamp issue on windows by add timezone info to datetime
deng113jie Oct 14, 2021
c98b87c
finanlize workflow file, compile react to publish action only
deng113jie Oct 14, 2021
a57c413
avoid the bytearray vs string error in windows by converting result to
deng113jie Oct 14, 2021
764650b
fixing string vs bytesarray issue
deng113jie Oct 14, 2021
4676901
update categorical field key property, change the key, value to bytes if
deng113jie Oct 15, 2021
e5d74c6
solved index must be np.int64 error
deng113jie Oct 15, 2021
030d587
all unittest error on windoes removed
deng113jie Oct 15, 2021
55e62eb
Merge branch 'master' into win_actions
deng113jie Oct 15, 2021
7cf7bae
minor update on workflow file
deng113jie Oct 15, 2021
521142e
minor update workflow file
deng113jie Oct 15, 2021
9373fd2
minor fix: use pip install -r ; remove unused import in utils.py
deng113jie Oct 15, 2021
6f67ac4
update action file
deng113jie Oct 15, 2021
703a19a
remove change on test_presistence on uint32 to int32
deng113jie Oct 18, 2021
13 changes: 8 additions & 5 deletions .github/workflows/python-app.yml
@@ -12,8 +12,11 @@ on:
 jobs:
   build:
 
-    runs-on: ubuntu-latest
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [macos-latest, windows-latest, ubuntu-latest]
 
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python 3.7
@@ -23,8 +26,8 @@ jobs:
     - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
-        pip install flake8 numpy numba pandas h5py
-        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        pip install flake8
+        pip install -r requirements.txt
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
@@ -33,4 +36,4 @@ jobs:
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with unittest
      run: |
-        python -m unittest tests/*
+        python -m unittest
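The last hunk matters for Windows: `python -m unittest tests/*` relies on the shell expanding the glob, which POSIX shells do but the Windows runner's shell does not, whereas a bare `python -m unittest` falls back to test discovery, which behaves the same everywhere. A minimal sketch of what discovery does, using a throwaway test module created on the fly (file and class names are ours, not ExeTera's):

```python
import os
import tempfile
import textwrap
import unittest

# create a throwaway directory containing one test module
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'test_sample.py'), 'w') as f:
    f.write(textwrap.dedent('''\
        import unittest

        class TestSample(unittest.TestCase):
            def test_ok(self):
                self.assertEqual(1 + 1, 2)
    '''))

# `python -m unittest` with no arguments runs discovery roughly like this:
suite = unittest.TestLoader().discover(start_dir=tmp)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)
```

Because discovery walks the directory and imports modules matching `test*.py` itself, no shell globbing is needed on any platform.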
@@ -6,7 +6,7 @@
 # separate terms of service, privacy policy, and support
 # documentation.
 
-name: Upload Python Package
+name: Build & upload package on Linux
 
 on:
   release:
@@ -26,9 +26,15 @@ jobs:
     - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
-        pip install build
+        pip install flake8
+        pip install -r requirements.txt
+    - name: Set up GCC
+      uses: egor-tensin/setup-gcc@v1
+      with:
+        version: latest
+        platform: x64
     - name: Build package
-      run: python -m build
+      run: python setup.py bdist_wheel
     - name: Publish package
       uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
       with:
38 changes: 38 additions & 0 deletions .github/workflows/python-publish-macos.yml
@@ -0,0 +1,38 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on MacOS

on:
  release:
    types: [published]

jobs:
  deploy:

    runs-on: macos-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8
        pip install -r requirements.txt
    - name: Build package
      run: python setup.py bdist_wheel
    - name: Publish package
      run: |
        python3 -m twine upload dist/*
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
42 changes: 42 additions & 0 deletions .github/workflows/python-publish-win.yml
@@ -0,0 +1,42 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on Windows

on:
  release:
    types: [published]

jobs:
  deploy:

    runs-on: windows-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8
        pip install -r requirements.txt
    - name: Set up MinGW
      uses: egor-tensin/setup-mingw@v2
      with:
        platform: x64
    - name: Build package
      run: python setup.py bdist_wheel
    - name: Publish package
      run: |
        python3 -m twine upload dist/*
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
8 changes: 8 additions & 0 deletions exetera/_libs/ops.pyx
@@ -0,0 +1,8 @@
def fib(n):
    """Print the Fibonacci series up to n."""
    a, b = 0, 1
    while b < n:
        print(b)
        a, b = b, a + b

    print()

Review comment on `def fib(n):` (Member): Example code?

Reply (Contributor Author): Yep, so that the package can be compiled. Should be removed later once we have real functions in.
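As the review thread notes, `ops.pyx` exists only so that the `bdist_wheel` steps above have an extension to compile. A minimal `setup.py` sketch that would pick it up — the package name and layout here are assumptions, not ExeTera's actual build configuration:

```python
# Hypothetical build sketch: compiles exetera/_libs/ops.pyx into a C extension
# so that `python setup.py bdist_wheel` produces a platform-specific wheel.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="exetera",  # assumed distribution name
    ext_modules=cythonize("exetera/_libs/ops.pyx"),
)
```

This is why the Windows workflow needs MinGW and the Linux workflow needs GCC: a compiler must be present to build the extension on each runner.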
162 changes: 162 additions & 0 deletions exetera/core/dataframe.py
@@ -565,6 +565,168 @@ def groupby(self, by: Union[str, List[str]], hint_keys_is_sorted=False):

        return HDF5DataFrameGroupBy(self._columns, by, sorted_index, spans)

    def describe(self, include=None, exclude=None):
        """
        Show the basic statistics of the data in each field.

        :param include: A field name, a data type, a list of either, or 'all' to indicate the fields included in the calculation.
        :param exclude: A field name, a data type, or a list of either, to exclude from the calculation.
        :return: A dict containing the statistic results.
        """
        # check include and exclude conflicts
        if include is not None and exclude is not None:
            if isinstance(include, str):
                raise ValueError('Please do not use the exclude parameter when include is set as a single field.')
            elif isinstance(include, type):
                if isinstance(exclude, type) or (isinstance(exclude, list) and isinstance(exclude[0], type)):
                    raise ValueError(
                        'Please do not set exclude as a type when include is set as a single data type.')
            elif isinstance(include, list):
                if isinstance(include[0], str) and isinstance(exclude, str):
                    raise ValueError('Please do not set exclude to the same type as the include parameter.')
                elif isinstance(include[0], str) and isinstance(exclude, list) and isinstance(exclude[0], str):
                    raise ValueError('Please do not set exclude to the same type as the include parameter.')
                elif isinstance(include[0], type) and isinstance(exclude, type):
                    raise ValueError('Please do not set exclude to the same type as the include parameter.')
                elif isinstance(include[0], type) and isinstance(exclude, list) and isinstance(exclude[0], type):
                    raise ValueError('Please do not set exclude to the same type as the include parameter.')

        fields_to_calculate = []
        if include is not None:
            if isinstance(include, str):  # a single str
                if include == 'all':
                    fields_to_calculate = list(self.columns.keys())
                elif include in self.columns.keys():
                    fields_to_calculate = [include]
                else:
                    raise ValueError('The field to include is not in the dataframe.')
            elif isinstance(include, type):  # a single type
                for f in self.columns:
                    if not self[f].indexed and np.issubdtype(self[f].data.dtype, include):
                        fields_to_calculate.append(f)
                if len(fields_to_calculate) == 0:
                    raise ValueError('No such type appeared in the dataframe.')
            elif isinstance(include, list) and isinstance(include[0], str):  # a list of str
                for f in include:
                    if f in self.columns.keys():
                        fields_to_calculate.append(f)
                if len(fields_to_calculate) == 0:
                    raise ValueError('The fields to include are not in the dataframe.')
            elif isinstance(include, list) and isinstance(include[0], type):  # a list of types
                for t in include:
                    for f in self.columns:
                        if not self[f].indexed and np.issubdtype(self[f].data.dtype, t):
                            fields_to_calculate.append(f)
                if len(fields_to_calculate) == 0:
                    raise ValueError('No such type appeared in the dataframe.')
            else:
                raise ValueError('The include parameter can only be str, dtype, or a list of either.')
        else:  # include is None: numeric & timestamp fields only (no indexed strings)  TODO confirm the type
            for f in self.columns:
                if isinstance(self[f], fld.NumericField) or isinstance(self[f], fld.TimestampField):
                    fields_to_calculate.append(f)

        if len(fields_to_calculate) == 0:
            raise ValueError('No fields included to describe.')

        if exclude is not None:
            if isinstance(exclude, str):  # a single str
                if exclude in fields_to_calculate:
                    fields_to_calculate.remove(exclude)  # remove from list
            elif isinstance(exclude, type):  # a single type
                for f in fields_to_calculate:
                    if np.issubdtype(self[f].data.dtype, exclude):
                        fields_to_calculate.remove(f)
            elif isinstance(exclude, list) and isinstance(exclude[0], str):  # a list of str
                for f in exclude:
                    fields_to_calculate.remove(f)
            elif isinstance(exclude, list) and isinstance(exclude[0], type):  # a list of types
                for t in exclude:
                    for f in fields_to_calculate:
                        if np.issubdtype(self[f].data.dtype, t):
                            fields_to_calculate.remove(f)  # remove raises ValueError if the dtype is not present
            else:
                raise ValueError('The exclude parameter can only be str, dtype, or a list of either.')

        if len(fields_to_calculate) == 0:
            raise ValueError('All fields are excluded, no field left to describe.')

        # check whether any flexible (string-like) fields are present
        des_idxstr = False
        for f in fields_to_calculate:
            if isinstance(self[f], fld.CategoricalField) or isinstance(self[f], fld.FixedStringField) \
                    or isinstance(self[f], fld.IndexedStringField):
                des_idxstr = True

        # calculation
        result = {'fields': [], 'count': [], 'mean': [], 'std': [], 'min': [], '25%': [], '50%': [], '75%': [],
                  'max': []}

        if des_idxstr:
            result['unique'], result['top'], result['freq'] = [], [], []

        for f in fields_to_calculate:
            result['fields'].append(f)
            result['count'].append(len(self[f].data))

            if des_idxstr and (isinstance(self[f], fld.NumericField)
                               or isinstance(self[f], fld.TimestampField)):  # numeric, timestamp
                result['unique'].append('NaN')
                result['top'].append('NaN')
                result['freq'].append('NaN')

                result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
                result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
                result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
                # np.percentile expects percentages on a 0-100 scale
                result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
                result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
                result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
                result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

            elif des_idxstr and (isinstance(self[f], fld.CategoricalField)
                                 or isinstance(self[f], fld.IndexedStringField)
                                 or isinstance(self[f], fld.FixedStringField)):  # categorical & indexed/fixed string
                a, b = np.unique(self[f].data[:], return_counts=True)
                result['unique'].append(len(a))
                result['top'].append(a[np.argmax(b)])
                result['freq'].append(b[np.argmax(b)])

                result['mean'].append('NaN')
                result['std'].append('NaN')
                result['min'].append('NaN')
                result['25%'].append('NaN')
                result['50%'].append('NaN')
                result['75%'].append('NaN')
                result['max'].append('NaN')

            elif not des_idxstr:
                result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
                result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
                result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
                result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
                result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
                result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
                result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

        # display
        columns_to_show = ['fields', 'count', 'unique', 'top', 'freq', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
        for col in range(0, len(result['fields']), 5):  # show 5 fields per block
            for i in columns_to_show:
                if i in result:
                    print(i, end='\t')
                    for f in result[i][col:col + 5]:
                        print('{:>15}'.format(f), end='\t')
                    print('')
            print('\n')

        return result



class HDF5DataFrameGroupBy(DataFrameGroupBy):
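A detail worth flagging in the quartile rows of `describe`: `np.percentile` takes percentages on a 0-100 scale, so `25` is the first quartile while `0.25` is the 0.25th percentile, barely above the minimum:

```python
import numpy as np

data = np.array([0., 25., 50., 75., 100.])

# 0-100 scale: 25 means the first quartile
q1 = np.percentile(data, 25)
print(q1)       # 25.0

# 0.25 means the 0.25th percentile, almost the minimum
p_tiny = np.percentile(data, 0.25)
print(p_tiny)   # 0.25
```

Passing fractional values would make every quartile row collapse toward the minimum of the field.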
10 changes: 6 additions & 4 deletions exetera/core/field_importers.py
@@ -5,7 +5,8 @@
 from exetera.core import operations as ops
 from exetera.core.data_writer import DataWriter
 from exetera.core import utils
-from datetime import datetime, date
+from datetime import datetime, date, timezone
+import pytz
 
 INDEXED_STRING_FIELD_SIZE = 10  # guessing
 
@@ -307,14 +308,14 @@ def write_part(self, values):
             # ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S.%f%z')
             v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
                                   int(value[11:13]), int(value[14:16]), int(value[17:19]),
-                                  int(value[20:26]))
+                                  int(value[20:26]), tzinfo=timezone.utc)
         elif v_len == 25:
             # ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S%z')
             v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
-                                  int(value[11:13]), int(value[14:16]), int(value[17:19]))
+                                  int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
         elif v_len == 19:
             v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
-                                  int(value[11:13]), int(value[14:16]), int(value[17:19]))
+                                  int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
         else:
             raise ValueError(f"Date field '{self.field}' has unexpected format '{value}'")
         datetime_ts[i] = v_datetime.timestamp()
@@ -362,6 +363,7 @@ def write_part(self, values):
                 flags[i] = False
             else:
                 ts = datetime.strptime(value.decode(), '%Y-%m-%d')
+                ts = ts.replace(tzinfo=timezone.utc)
                 date_ts[i] = ts.timestamp()
 
         self.field.data.write_part(date_ts)
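The `tzinfo=timezone.utc` additions above are the core of the Windows timestamp fix: calling `.timestamp()` on a naive datetime consults the platform's local-time machinery (which on Windows can also raise `OSError` for out-of-range dates), while an aware datetime converts deterministically. A small sketch of the difference:

```python
from datetime import datetime, timezone

# naive: .timestamp() interprets this in the machine's local timezone,
# so the result differs between CI runners in different zones
naive = datetime(2021, 10, 14, 12, 0, 0)

# aware: the epoch offset is fixed regardless of platform or locale
aware = datetime(2021, 10, 14, 12, 0, 0, tzinfo=timezone.utc)

print(aware.timestamp())  # 1634212800.0, on every platform
```

Pinning imported date fields to UTC makes the stored epoch values (and the unit tests asserting them) identical across Linux, macOS, and Windows.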
10 changes: 8 additions & 2 deletions exetera/core/fields.py
@@ -1557,8 +1557,14 @@ def nformat(self):
     @property
     def keys(self):
         self._ensure_valid()
-        kv = self._field['key_values']
-        kn = self._field['key_names']
+        if isinstance(self._field['key_values'][0], str):  # convert to bytes to match behaviour on Linux
+            kv = [bytes(i, 'utf-8') for i in self._field['key_values']]
+        else:
+            kv = self._field['key_values']
+        if isinstance(self._field['key_names'][0], str):
+            kn = [bytes(i, 'utf-8') for i in self._field['key_names']]
+        else:
+            kn = self._field['key_names']
         keys = dict(zip(kv, kn))
         return keys
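The branching above guards against h5py handing back stored strings as `str` on some platform/version combinations and as `bytes` on others; normalising everything to bytes keeps the keys dictionary comparable in tests. A standalone sketch of the same idea (the helper name is ours, not ExeTera's):

```python
def normalise_to_bytes(values):
    """Return every element as bytes, encoding any str entries as UTF-8."""
    return [v if isinstance(v, bytes) else bytes(v, 'utf-8') for v in values]

# mixed inputs, as might come back from different h5py versions
kv = normalise_to_bytes(['yes', b'no'])
print(kv)  # [b'yes', b'no']
```

With both key names and key values normalised, `dict(zip(kv, kn))` produces the same mapping on every platform.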
4 changes: 2 additions & 2 deletions exetera/core/persistence.py
@@ -169,7 +169,7 @@ def _apply_sort_to_array(index, values):
 @njit
 def _apply_sort_to_index_values(index, indices, values):
 
-    s_indices = np.zeros_like(indices)
+    s_indices = np.zeros_like(indices, dtype=np.int64)
     s_values = np.zeros_like(values)
     accumulated = np.int64(0)
     s_indices[0] = 0
@@ -1029,7 +1029,7 @@ def apply_spans_concat(self, spans, reader, writer):
 
         src_index = reader.field['index'][:]
         src_values = reader.field['values'][:]
-        dest_index = np.zeros(reader.chunksize, src_index.dtype)
+        dest_index = np.zeros(reader.chunksize, np.int64)
         dest_values = np.zeros(reader.chunksize * 16, src_values.dtype)
 
         max_index_i = reader.chunksize
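These dtype changes matter because `np.zeros_like` inherits the dtype of its argument, and on Windows NumPy's default integer type is 32-bit, so index buffers could silently come out as `int32` and overflow (or fail `index must be np.int64` checks) where Linux produced `int64`. A small sketch of the difference, with a hypothetical index array:

```python
import numpy as np

# e.g. an index array that arrived as int32 (the Windows default int width)
indices = np.array([0, 10, 20], dtype=np.int32)

inherited = np.zeros_like(indices)               # inherits int32 from the source
forced = np.zeros_like(indices, dtype=np.int64)  # explicit, platform-independent

print(inherited.dtype)  # int32
print(forced.dtype)     # int64
```

Forcing `np.int64` everywhere an index buffer is allocated makes the accumulated offsets behave identically on all three CI platforms.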
2 changes: 1 addition & 1 deletion exetera/core/readerwriter.py
@@ -60,7 +60,7 @@ def dtype(self):
         return self.field['index'].dtype, self.field['values'].dtype
 
     def sort(self, index, writer):
-        field_index = self.field['index'][:]
+        field_index = np.array(self.field['index'][:], dtype=np.int64)
         field_values = self.field['values'][:]
         r_field_index, r_field_values =\
             pers._apply_sort_to_index_values(index, field_index, field_values)