fixing unittest errors on Windows (#222)
* fixing issue #86 from upstream:
add get_spans() in Field class, similar to get_spans() in Session class

* add unit test for Field get_spans() function

* remove unneeded line comments

* add dataset, dataframe classes

* closing issue 92: reset the dataset when field.data.clear is called

* closing issue 92: reset the dataset when field.data.clear is called

* add unittest for field.data.clear function

* recover the dataset file to avoid merge error when fixing issue 92

* fix end_of_file char in dataset.py

* add get_span for indexed string field

* unittest for the get_span functions on different types of field, e.g. fixed string, indexed string, etc.

* dataframe basic methods and unittest

* more dataframe operations

* minor fixing

* update get_span to field subclass

* intermediate commit due to testing PR 118

* Implement get_spans(ndarray) and get_spans(ndarray1, ndarray2) functions in core.operations.

Provide get_spans methods in fields using data attribute.
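For context, a minimal NumPy sketch of what a get_spans(ndarray) operation computes, namely the boundaries of runs of equal consecutive values. The helper name below is illustrative rather than the exact one added to core.operations:

    import numpy as np

    def get_spans_sketch(arr):
        # A span boundary sits at index 0, at every position where the value changes, and at len(arr).
        changes = np.where(arr[1:] != arr[:-1])[0] + 1
        return np.concatenate(([0], changes, [len(arr)]))

    # get_spans_sketch(np.array([1, 1, 2, 2, 2, 3])) -> array([0, 2, 5, 6])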

* Move the get_spans functions from persistence to operations.
Modify the get_spans functions in Session to call field method and operation method.

* minor edits for pull request

* remove dataframe for pull request

* remove dataframe test for pr

* add dataframe

* fix get_spans_for_2_fields_by_spans, fix the unittest

* Initial commit for is_sorted method on Field

* minor edits for the pr

* fix minor edit error for pr

* add apply_index and apply_filter methods on fields

* Adding in missing tests for all field types for is_sorted

* update the apply filter and apply index on Fields

* minor updates to line up w/ upstream

* update the apply_filter & apply_index methods in fields so that they differ depending on whether a destination field is set: if set, use dstfld.write because the new field is usually empty; if not set, write to self using fld.data[:]

* updated the apply_index and apply_filter methods in fields: use olddata[:] = newdata if the length of the old dataset equals the new one; clear() and write() the data if not.
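Taken together with the previous commit, the intended write pattern is roughly the sketch below, where Python lists stand in for the field .data objects and extend()/clear() stand in for the field's write()/clear() calls. The names are illustrative, not the exetera API:

    def apply_filter_sketch(fld_data, filter_mask, dst_data=None):
        # Keep only the entries whose mask value is True.
        new_data = [v for v, keep in zip(fld_data, filter_mask) if keep]
        if dst_data is not None:
            dst_data.extend(new_data)    # destination field is usually empty, so just write to it
            return dst_data
        if len(new_data) == len(fld_data):
            fld_data[:] = new_data       # same length: assign in place via data[:]
        else:
            fld_data.clear()             # otherwise clear() and then write() the new data
            fld_data.extend(new_data)
        return fld_data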

* add dataframe basic functions and operations; working on dataset to enable dataframe to create fields.

* add functions in dataframe
add dataset class
add functions in dataset
move dataset module to csvdataset

* integrate the dataset and dataframe into the session

* update the fieldsimporter and field.create_like methods to call dataframe.create
update the unittests to follow s.open_dataset and dataset.create_dataframe flow

* add license info to a few files

* csv_reader_with_njit

* change output_excel from string to int

* initialize column_idx matrix outside of the njit function

* use np.fromfile to load the file into byte array
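As a small illustration of this step, np.fromfile reads the whole file in one call and returns a flat byte array that an njit-compiled parser can scan without per-line Python overhead (the file name is just an example):

    import numpy as np

    raw = np.fromfile('data.csv', dtype=np.uint8)  # whole file as a flat uint8 byte array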

* Refactoring and reformatting of some of the dataset / dataframe code; moving Session and Dataset to abstract types; fixing of is_sorted tests that were broken with the merge of the new functionality

* Work on fast csv reading

* Address issue #138 with minor tweaks
Fix bug: create dataframes in the dataset constructor to map existing datasets
Fully sync the dataset with the h5file when adding a dataframe (group), removing a dataframe, or setting a dataframe.

* remove draft group.py from repo

* Improved performance from the fast csv reader through avoiding ndarray slicing

* fix dataframe api

* fixing #13 and #14, add dest parameter to get_spans(), tidy up the field/fields parameters

* minor fix: remove the dataframe and file properties from dataset, as they are not used so far.

* minor fix on unittest

* add docstring for dataset

* copy/move for dataframe; docstrings

* categorical field: convert from byte int to value int within njit function

* Adding in of pseudocode version of fast categorical lookup
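A toy version of such a lookup over plain uint8 code arrays; this is an illustrative sketch of the idea rather than the committed implementation:

    import numpy as np
    from numba import njit

    @njit
    def categorical_lookup(byte_codes, key_codes, key_values):
        # Map each raw byte code to its categorical value by scanning the small key table.
        out = np.zeros(len(byte_codes), dtype=np.int8)
        for i in range(len(byte_codes)):
            for k in range(len(key_codes)):
                if byte_codes[i] == key_codes[k]:
                    out[i] = key_values[k]
                    break
        return out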

* clean up the comments

* docstrings for dataframe

* Major reworking of apply_filter / apply_index for fields; they shouldn't destructively change self by default. Also addition of further mem versions of fields and factoring out of common functionality. Fix to field when indices / values are cleared but this leaves data pointing to the old field

* add unittest for various fields in dataframe
add dataframe.add/drop/move
add docstrings

* add unittest for Dataframe.add/drop/move

* minor change on name to make sure the name is consistent across dataframe, dataset.key and h5group

* minor fix: add prefix b to strings in test_session and test_dataset

* minor fix: add prefix b to strings in test_session and test_dataset

* Completed initial pass of memory fields for all types

* categorical field.keys returns byte keys as strings, hence a minor change to the unittest

* solved the byte-to-string issue; the problem comes from differences between Python 3.7 and 3.8

* Miscellaneous field fixes; fixed issues with dataframe apply_filter / apply_index

* Moving most binary op logic out into a static method in FieldDataOps

* Dataframe copy, move and drop operations have been moved out of the DataFrame static methods as python doesn't support static and instance method name overloading (my bad)

* Fixing accidental introduction of CRLF to abstract_types

* Fixed bug where apply_filter and apply_index weren't returning a field on all code paths; beefed up tests to cover this

* Fixed issue in timestamp_field_create_like when group is set and is a dataframe

* persistence.filter_duplicate_fields now supports fields as well as ndarrays

* sort_on message now shows in verbose mode under all circumstances

* Fixed bug in apply filter when a destination dataset is applied

* Added a test to catch dataframe.apply_filter bug

* Bug fix: categorical_field_constructor in fields.py was returning a numeric field when passed an h5py group as the destination for the field

* Copying data before filtering, as filtering in h5py is very slow
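The pattern being described is one bulk read of the h5py dataset into memory followed by NumPy boolean indexing on the copy, roughly as below (file and dataset names are hypothetical):

    import h5py
    import numpy as np

    with h5py.File('example.h5', 'r') as f:
        dset = f['values']
        filt = np.ones(len(dset), dtype=bool)  # stand-in boolean filter
        data = dset[:]                         # copy the whole dataset into memory once
        filtered = data[filt]                  # in-memory filtering is far faster than dset[filt]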

* Adding apply_spans functions to fields

* Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had been written but not run

* Issues found with indexed strings and merging; fixes found for apply_filter and apply_index when being passed a field rather than an ndarray; both with augmented testing

* Updated merge functions to consistently return memory fields if not provided with outputs but provided with fields

* concatenate categorical keys instead of padding

* some docstring for fields

* dataframe copy/move/drop and unittest

* Fixing issue with dataframe move/copy being static

* Updating HDF5Field writeable methods to account for prior changes

* Adding merge functionality for dataframes

* dataset.drop is a member method of Dataset as it did not make sense for it to be static or outside of the class

* Added missing methods / properties to DataFrame ABC

* minor update on dataframe static function

* minor update

* minor update session

* minor comments update

* minor comments update

* add unittest for csv_reader_speedup.py

* count operation; logical not for numeric fields

* remove csv speed up work from commit

* minor update

* unit test for logical not in numeric field

* patch for get_spans for datastore

* tests for two fields

* add astype to numeric field

* separate the unittest of get_spans by datastore reader

* unittest for astype

* update astype for fields, update logical_not for numeric fields
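In plain NumPy terms the two operations amount to the following; the exetera field methods wrap calls like these over the field's data, so the snippet shows the underlying ndarray behaviour rather than the field API itself:

    import numpy as np

    values = np.array([0, 1, 2, 0, 3], dtype=np.int32)
    as_float = values.astype(np.float64)  # astype: same values stored with a different numeric dtype
    mask = np.logical_not(values)         # logical_not: True wherever the value is zero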

* remove dataframe view commits

* remove kwargs in get_spans in session, add fields back for backward compatibility

* remove filter view tests

* partial commit on viewer

* remove view from git

* add df.describe unittest

* sync with upstream

* Delete python-publish.yml

* Update python-app.yml

* Update python-app.yml

* dataframe describe function

* sync with upstream

* Update python-app.yml

* alternative get_timestamp notebook for discussion

* update the notebook output of linux and mac

* update format

* update the to_timestamp and to_datetime functions in utils;
fix the current datetime.timestamp() error in test_fields and
test_sessions

* add unittest for utils to_timestamp and to_datetime

* fix for pr

* setup github action specific for windows for cython

* minor workflow fix

* add example pyx file

* fix package upload command on Windows, as the GitHub action
gh-action-pypi-publish works only on Linux

* add twine as tools

* add linux action file

* update the linux build command

* build workflow for macos

* minor update the macos workflow

* fixed timestamp issue on Windows by adding timezone info to datetime
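The root cause is that timestamp() on a naive datetime is interpreted in local time and can raise OSError on Windows for dates at or before the epoch, while a timezone-aware UTC datetime converts the same way on every platform. A minimal illustration:

    from datetime import datetime, timezone

    aware = datetime(1970, 1, 1, tzinfo=timezone.utc)
    print(aware.timestamp())  # 0.0 on every platform
    # datetime(1970, 1, 1).timestamp() is naive and may raise OSError on Windows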

* finalize workflow file; compile only in reaction to the publish action

* avoid the bytearray vs string error on Windows by converting the result to
bytearray

* fixing string vs bytearray issue

* update categorical field key property: change the key and value to bytes if
they are str

* solved the 'index must be np.int64' error
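Background for this one: with NumPy 1.x the default integer dtype on Windows is 32-bit, so index arrays built from Python ints come out as int32 unless the dtype is given explicitly, for example:

    import numpy as np

    idx = np.array([0, 5, 9])                   # int32 by default on Windows, int64 on Linux/macOS
    idx = np.array([0, 5, 9], dtype=np.int64)   # explicit dtype behaves the same everywhere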

* all unittest errors on Windows removed

* minor update on workflow file

* minor update workflow file

* minor fix: use pip install -r ; remove unused import in utils.py

* update action file

* remove the uint32-to-int32 change in test_persistence

Co-authored-by: jie <[email protected]>
Co-authored-by: Ben Murray <[email protected]>
Co-authored-by: clyyuanzi-london <[email protected]>
4 people authored Oct 18, 2021
1 parent 2149a38 commit b6864c1
Showing 21 changed files with 558 additions and 48 deletions.
13 changes: 8 additions & 5 deletions .github/workflows/python-app.yml
@@ -12,8 +12,11 @@ on:
jobs:
build:

runs-on: ubuntu-latest

runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [macos-latest, windows-latest, ubuntu-latest]

steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
@@ -23,8 +26,8 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 numpy numba pandas h5py
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install flake8
pip install -r requirements.txt
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
@@ -33,4 +36,4 @@
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with unittest
run: |
python -m unittest tests/*
python -m unittest
@@ -6,7 +6,7 @@
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package
name: Build & upload package on Linux

on:
release:
@@ -26,9 +26,15 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
pip install flake8
pip install -r requirements.txt
- name: Set up GCC
uses: egor-tensin/setup-gcc@v1
with:
version: latest
platform: x64
- name: Build package
run: python -m build
run: python setup.py bdist_wheel
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
38 changes: 38 additions & 0 deletions .github/workflows/python-publish-macos.yml
@@ -0,0 +1,38 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on MacOS

on:
release:
types: [published]

jobs:
deploy:

runs-on: macos-latest

steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8
pip install -r requirements.txt
- name: Build package
run: python setup.py bdist_wheel
- name: Publish package
run: |
python3 -m twine upload dist/*
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
42 changes: 42 additions & 0 deletions .github/workflows/python-publish-win.yml
@@ -0,0 +1,42 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on Windows

on:
release:
types: [published]

jobs:
deploy:

runs-on: windows-latest

steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8
pip install -r requirements.txt
- name: Set up MinGW
uses: egor-tensin/setup-mingw@v2
with:
platform: x64
- name: Build package
run: python setup.py bdist_wheel
- name: Publish package
run: |
python3 -m twine upload dist/*
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
8 changes: 8 additions & 0 deletions exetera/_libs/ops.pyx
@@ -0,0 +1,8 @@
def fib(n):
"""Print the Fibonacci series up to n."""
a, b = 0, 1
while b < n:
print(b)
a, b = b, a + b

print()
162 changes: 162 additions & 0 deletions exetera/core/dataframe.py
@@ -565,6 +565,168 @@ def groupby(self, by: Union[str, List[str]], hint_keys_is_sorted=False):

return HDF5DataFrameGroupBy(self._columns, by, sorted_index, spans)

def describe(self, include=None, exclude=None):
"""
Show the basic statistics of the data in each field.
:param include: The field name or data type, or simply 'all', to indicate the fields included in the calculation.
:param exclude: The field name or data type to exclude from the calculation.
:return: A dataframe containing the statistic results.
"""
# check include and exclude conflicts
if include is not None and exclude is not None:
if isinstance(include, str):
raise ValueError('Please do not use exclude parameter when include is set as a single field.')
elif isinstance(include, type):
if isinstance(exclude, type) or (isinstance(exclude, list) and isinstance(exclude[0], type)):
raise ValueError(
'Please do not set exclude as a type when include is set as a single data type.')
elif isinstance(include, list):
if isinstance(include[0], str) and isinstance(exclude, str):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], str) and isinstance(exclude, list) and isinstance(exclude[0], str):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], type) and isinstance(exclude, type):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], type) and isinstance(exclude, list) and isinstance(exclude[0], type):
raise ValueError('Please do not use exclude as the same type as the include parameter.')

fields_to_calculate = []
if include is not None:
if isinstance(include, str): # a single str
if include == 'all':
fields_to_calculate = list(self.columns.keys())
elif include in self.columns.keys():
fields_to_calculate = [include]
else:
raise ValueError('The field to include is not in the dataframe.')
elif isinstance(include, type): # a single type
for f in self.columns:
if not self[f].indexed and np.issubdtype(self[f].data.dtype, include):
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('No such type appeared in the dataframe.')
elif isinstance(include, list) and isinstance(include[0], str): # a list of str
for f in include:
if f in self.columns.keys():
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('The fields to include are not in the dataframe.')

elif isinstance(include, list) and isinstance(include[0], type): # a list of type
for t in include:
for f in self.columns:
if not self[f].indexed and np.issubdtype(self[f].data.dtype, t):
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('No such type appeared in the dataframe.')

else:
raise ValueError('The include parameter can only be str, dtype, or list of either.')

else: # include is None, numeric & timestamp fields only (no indexed strings) TODO confirm the type
for f in self.columns:
if isinstance(self[f], fld.NumericField) or isinstance(self[f], fld.TimestampField):
fields_to_calculate.append(f)

if len(fields_to_calculate) == 0:
raise ValueError('No fields included to describe.')

if exclude is not None:
if isinstance(exclude, str):
if exclude in fields_to_calculate: # exclude
fields_to_calculate.remove(exclude) # remove from list
elif isinstance(exclude, type): # a type
for f in fields_to_calculate:
if np.issubdtype(self[f].data.dtype, exclude):
fields_to_calculate.remove(f)
elif isinstance(exclude, list) and isinstance(exclude[0], str): # a list of str
for f in exclude:
fields_to_calculate.remove(f)

elif isinstance(exclude, list) and isinstance(exclude[0], type): # a list of type
for t in exclude:
for f in fields_to_calculate:
if np.issubdtype(self[f].data.dtype, t):
fields_to_calculate.remove(f) # remove will raise ValueError if dtype not present

else:
raise ValueError('The exclude parameter can only be str, dtype, or list of either.')

if len(fields_to_calculate) == 0:
raise ValueError('All fields are excluded, no field left to describe.')
# if flexible (str) fields
des_idxstr = False
for f in fields_to_calculate:
if isinstance(self[f], fld.CategoricalField) or isinstance(self[f], fld.FixedStringField) or isinstance(
self[f], fld.IndexedStringField):
des_idxstr = True
# calculation
result = {'fields': [], 'count': [], 'mean': [], 'std': [], 'min': [], '25%': [], '50%': [], '75%': [],
'max': []}

# count
if des_idxstr:
result['unique'], result['top'], result['freq'] = [], [], []

for f in fields_to_calculate:
result['fields'].append(f)
result['count'].append(len(self[f].data))

if des_idxstr and (isinstance(self[f], fld.NumericField) or isinstance(self[f],
fld.TimestampField)): # numeric, timestamp
result['unique'].append('NaN')
result['top'].append('NaN')
result['freq'].append('NaN')

result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

elif des_idxstr and (isinstance(self[f], fld.CategoricalField) or isinstance(self[f],
fld.IndexedStringField) or isinstance(
self[f], fld.FixedStringField)): # categorical & indexed string & fixed string
a, b = np.unique(self[f].data[:], return_counts=True)
result['unique'].append(len(a))
result['top'].append(a[np.argmax(b)])
result['freq'].append(b[np.argmax(b)])

result['mean'].append('NaN')
result['std'].append('NaN')
result['min'].append('NaN')
result['25%'].append('NaN')
result['50%'].append('NaN')
result['75%'].append('NaN')
result['max'].append('NaN')

elif not des_idxstr:
result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

# display
columns_to_show = ['fields', 'count', 'unique', 'top', 'freq', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
# 5 fields each time for display
for col in range(0, len(result['fields']), 5): # 5 column each time
for i in columns_to_show:
if i in result:
print(i, end='\t')
for f in result[i][col:col + 5 if col + 5 < len(result[i]) - 1 else len(result[i])]:
print('{:>15}'.format(f), end='\t')
print('')
print('\n')

return result



class HDF5DataFrameGroupBy(DataFrameGroupBy):
10 changes: 6 additions & 4 deletions exetera/core/field_importers.py
@@ -5,7 +5,8 @@
from exetera.core import operations as ops
from exetera.core.data_writer import DataWriter
from exetera.core import utils
from datetime import datetime, date
from datetime import datetime, date, timezone
import pytz

INDEXED_STRING_FIELD_SIZE = 10 # guessing

@@ -307,14 +308,14 @@ def write_part(self, values):
# ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S.%f%z')
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]),
int(value[20:26]))
int(value[20:26]), tzinfo=timezone.utc)
elif v_len == 25:
# ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S%z')
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]))
int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
elif v_len == 19:
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]))
int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
else:
raise ValueError(f"Date field '{self.field}' has unexpected format '{value}'")
datetime_ts[i] = v_datetime.timestamp()
@@ -362,6 +363,7 @@ def write_part(self, values):
flags[i] = False
else:
ts = datetime.strptime(value.decode(), '%Y-%m-%d')
ts = ts.replace(tzinfo=timezone.utc)
date_ts[i] = ts.timestamp()

self.field.data.write_part(date_ts)
10 changes: 8 additions & 2 deletions exetera/core/fields.py
@@ -1557,8 +1557,14 @@ def nformat(self):
@property
def keys(self):
self._ensure_valid()
kv = self._field['key_values']
kn = self._field['key_names']
if isinstance(self._field['key_values'][0], str): # convert to bytes to stay consistent with Linux
kv = [bytes(i, 'utf-8') for i in self._field['key_values']]
else:
kv = self._field['key_values']
if isinstance(self._field['key_names'][0], str):
kn = [bytes(i, 'utf-8') for i in self._field['key_names']]
else:
kn = self._field['key_names']
keys = dict(zip(kv, kn))
return keys

4 changes: 2 additions & 2 deletions exetera/core/persistence.py
@@ -169,7 +169,7 @@ def _apply_sort_to_array(index, values):
@njit
def _apply_sort_to_index_values(index, indices, values):

s_indices = np.zeros_like(indices)
s_indices = np.zeros_like(indices, dtype=np.int64)
s_values = np.zeros_like(values)
accumulated = np.int64(0)
s_indices[0] = 0
@@ -1029,7 +1029,7 @@ def apply_spans_concat(self, spans, reader, writer):

src_index = reader.field['index'][:]
src_values = reader.field['values'][:]
dest_index = np.zeros(reader.chunksize, src_index.dtype)
dest_index = np.zeros(reader.chunksize, np.int64)
dest_values = np.zeros(reader.chunksize * 16, src_values.dtype)

max_index_i = reader.chunksize
2 changes: 1 addition & 1 deletion exetera/core/readerwriter.py
@@ -60,7 +60,7 @@ def dtype(self):
return self.field['index'].dtype, self.field['values'].dtype

def sort(self, index, writer):
field_index = self.field['index'][:]
field_index = np.array(self.field['index'][:], dtype=np.int64)
field_values = self.field['values'][:]
r_field_index, r_field_values =\
pers._apply_sort_to_index_values(index, field_index, field_values)