fixing unittest errors on Windows (#222)
* fixing issue #86 from upstream:
add get_spans() in Field class, similar to get_spans() in Session class

* add unit test for Field get_spans() function

* remove unneeded line comments

* add dataset, dataframe classes

* closing issue 92: reset the dataset when field.data.clear is called

* closing issue 92: reset the dataset when field.data.clear is called

* add unittest for field.data.clear function

* recover the dataset file to avoid merge error when fixing issue 92

* fix end_of_file char in dataset.py

* add get_span for indexed string field

* unittest for the get_span functions on different types of field, e.g. fixed string, indexed string, etc.

* dataframe basic methods and unittest

* more dataframe operations

* minor fixing

* update get_span to field subclass

* intermediate commit due to testing PR 118

* Implement get_spans(ndarray) and get_spans(ndarray1, ndarray2) functions in core.operations.

Provide get_spans methods in fields using data attribute.
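For context, a minimal NumPy sketch of what a get_spans(ndarray) operation computes, namely the boundaries of runs of equal consecutive values. The helper name below is illustrative rather than the exact one added to core.operations:

    import numpy as np

    def get_spans_sketch(arr):
        # A span boundary sits at index 0, at every position where the value changes, and at len(arr).
        changes = np.where(arr[1:] != arr[:-1])[0] + 1
        return np.concatenate(([0], changes, [len(arr)]))

    # get_spans_sketch(np.array([1, 1, 2, 2, 2, 3])) -> array([0, 2, 5, 6])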

* Move the get_spans functions from persistence to operations.
Modify the get_spans functions in Session to call field method and operation method.

* minor edits for pull request

* remove dataframe for pull request

* remove dataframe test for pr

* add dataframe

* fix get_spans_for_2_fields_by_spans, fix the unittest

* Initial commit for is_sorted method on Field

* minor edits for the pr

* fix minor edit error for pr

* add apply_index and apply_filter methods on fields

* Adding in missing tests for all field types for is_sorted

* update the apply filter and apply index on Fields

* minor updates to line up w/ upstream

* update the apply_filter & apply_index methods in fields so that they differ depending on whether a destination field is set: if set, use dstfld.write because the new field is usually empty; if not set, write to self using fld.data[:]

* updated the apply_index and apply_filter methods in fields: use olddata[:] = newdata if the length of the old dataset equals the new one; clear() and write() the data if not.
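Taken together with the previous commit, the intended write pattern is roughly the sketch below, where Python lists stand in for the field .data objects and extend()/clear() stand in for the field's write()/clear() calls. The names are illustrative, not the exetera API:

    def apply_filter_sketch(fld_data, filter_mask, dst_data=None):
        # Keep only the entries whose mask value is True.
        new_data = [v for v, keep in zip(fld_data, filter_mask) if keep]
        if dst_data is not None:
            dst_data.extend(new_data)    # destination field is usually empty, so just write to it
            return dst_data
        if len(new_data) == len(fld_data):
            fld_data[:] = new_data       # same length: assign in place via data[:]
        else:
            fld_data.clear()             # otherwise clear() and then write() the new data
            fld_data.extend(new_data)
        return fld_data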

* add dataframe basic functions and operations; working on dataset to enable dataframe to create fields.

* add functions in dataframe
add dataset class
add functions in dataset
move dataset module to csvdataset

* integrate the dataset and dataframe into the session

* update the fieldsimporter and field.create_like methods to call dataframe.create
update the unittests to follow s.open_dataset and dataset.create_dataframe flow

* add license info to a few files

* csv_reader_with_njit

* change output_excel from string to int

* initialize column_idx matrix outside of the njit function

* use np.fromfile to load the file into byte array
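As a small illustration of this step, np.fromfile reads the whole file in one call and returns a flat byte array that an njit-compiled parser can scan without per-line Python overhead (the file name is just an example):

    import numpy as np

    raw = np.fromfile('data.csv', dtype=np.uint8)  # whole file as a flat uint8 byte array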

* Refactoring and reformatting of some of the dataset / dataframe code; moving Session and Dataset to abstract types; fixing of is_sorted tests that were broken with the merge of the new functionality

* Work on fast csv reading

* Address issue #138 with minor tweaks
Fix bug: create dataframes in the dataset constructor to map existing datasets
Fully sync the dataset with the h5file when adding a dataframe (group), removing a dataframe, or setting a dataframe.

* remove draft group.py from repo

* Improved performance from the fast csv reader through avoiding ndarray slicing

* fix dataframe api

* fixing #13 and #14, add dest parameter to get_spans(), tidy up the field/fields parameters

* minor fix: remove the dataframe and file properties from dataset, as they are not used so far.

* minor fix on unittest

* add docstring for dataset

* copy/move for dataframe; docstrings

* categorical field: convert from byte int to value int within njit function

* Adding in of pseudocode version of fast categorical lookup
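A toy version of such a lookup over plain uint8 code arrays; this is an illustrative sketch of the idea rather than the committed implementation:

    import numpy as np
    from numba import njit

    @njit
    def categorical_lookup(byte_codes, key_codes, key_values):
        # Map each raw byte code to its categorical value by scanning the small key table.
        out = np.zeros(len(byte_codes), dtype=np.int8)
        for i in range(len(byte_codes)):
            for k in range(len(key_codes)):
                if byte_codes[i] == key_codes[k]:
                    out[i] = key_values[k]
                    break
        return out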

* clean up the comments

* docstrings for dataframe

* Major reworking of apply_filter / apply_index for fields; they shouldn't destructively change self by default. Also addition of further mem versions of fields and factoring out of common functionality. Fix to field when indices / values are cleared but this leaves data pointing to the old field

* add unittest for various fields in dataframe
add dataframe.add/drop/move
add docstrings

* add unittest for Dataframe.add/drop/move

* minor change on name to make sure the name is consistent across dataframe, dataset.key and h5group

* minor fix: add prefix b to strings in test_session and test_dataset

* minor fix: add prefix b to strings in test_session and test_dataset

* Completed initial pass of memory fields for all types

* categorical field.keys returns byte keys as strings, hence a minor change to the unittest

* solved the byte-to-string issue; the problem comes from differences between Python 3.7 and 3.8

* Miscellaneous field fixes; fixed issues with dataframe apply_filter / apply_index

* Moving most binary op logic out into a static method in FieldDataOps

* Dataframe copy, move and drop operations have been moved out of the DataFrame static methods as python doesn't support static and instance method name overloading (my bad)

* Fixing accidental introduction of CRLF to abstract_types

* Fixed bug where apply_filter and apply_index weren't returning a field on all code paths; beefed up tests to cover this

* Fixed issue in timestamp_field_create_like when group is set and is a dataframe

* persistence.filter_duplicate_fields now supports fields as well as ndarrays

* sort_on message now shows in verbose mode under all circumstances

* Fixed bug in apply filter when a destination dataset is applied

* Added a test to catch dataframe.apply_filter bug

* Bug fix: categorical_field_constructor in fields.py was returning a numeric field when passed an h5py group as the destination for the field

* Copying data before filtering, as filtering in h5py is very slow
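The pattern being described is one bulk read of the h5py dataset into memory followed by NumPy boolean indexing on the copy, roughly as below (file and dataset names are hypothetical):

    import h5py
    import numpy as np

    with h5py.File('example.h5', 'r') as f:
        dset = f['values']
        filt = np.ones(len(dset), dtype=bool)  # stand-in boolean filter
        data = dset[:]                         # copy the whole dataset into memory once
        filtered = data[filt]                  # in-memory filtering is far faster than dset[filt]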

* Adding apply_spans functions to fields

* Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had been written but not run

* Issues found with indexed strings and merging; fixes found for apply_filter and apply_index when being passed a field rather than an ndarray; both with augmented testing

* Updated merge functions to consistently return memory fields if not provided with outputs but provided with fields

* concatenate categorical keys instead of padding

* some docstring for fields

* dataframe copy/move/drop and unittest

* Fixing issue with dataframe move/copy being static

* Updating HDF5Field writeable methods to account for prior changes

* Adding merge functionality for dataframes

* dataset.drop is a member method of Dataset as it did not make sense for it to be static or outside of the class

* Added missing methods / properties to DataFrame ABC

* minor update on dataframe static function

* minor update

* minor update session

* minor comments update

* minor comments update

* add unittest for csv_reader_speedup.py

* count operation; logical not for numeric fields

* remove csv speed up work from commit

* minor update

* unit test for logical not in numeric field

* patch for get_spans for datastore

* tests for two fields

* add astype to numeric field

* separate the unittest of get_spans by datastore reader

* unittest for astype

* update astype for fields, update logical_not for numeric fields
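In plain NumPy terms the two operations amount to the following; the exetera field methods wrap calls like these over the field's data, so the snippet shows the underlying ndarray behaviour rather than the field API itself:

    import numpy as np

    values = np.array([0, 1, 2, 0, 3], dtype=np.int32)
    as_float = values.astype(np.float64)  # astype: same values stored with a different numeric dtype
    mask = np.logical_not(values)         # logical_not: True wherever the value is zero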

* remove dataframe view commits

* remove kwargs in get_spans in session, add fields back for backward compatibility

* remove filter view tests

* partial commit on viewer

* remove view from git

* add df.describe unittest

* sync with upstream

* Delete python-publish.yml

* Update python-app.yml

* Update python-app.yml

* dataframe describe function

* sync with upstream

* Update python-app.yml

* alternative get_timestamp notebook for discussion

* update the notebook output of linux and mac

* update format

* update the to_timestamp and to_datetime functions in utils;
fix the current datetime.timestamp() error in test_fields and
test_sessions

* add unittest for utils to_timestamp and to_datetime

* fix for pr

* setup github action specific for windows for cython

* minor workflow fix

* add example pyx file

* fix package upload command on Windows, as the GitHub action
gh-action-pypi-publish works only on Linux

* add twine as tools

* add linux action file

* update the linux build command

* build workflow for macos

* minor update the macos workflow

* fixed timestamp issue on Windows by adding timezone info to datetime
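The root cause is that timestamp() on a naive datetime is interpreted in local time and can raise OSError on Windows for dates at or before the epoch, while a timezone-aware UTC datetime converts the same way on every platform. A minimal illustration:

    from datetime import datetime, timezone

    aware = datetime(1970, 1, 1, tzinfo=timezone.utc)
    print(aware.timestamp())  # 0.0 on every platform
    # datetime(1970, 1, 1).timestamp() is naive and may raise OSError on Windows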

* finalize workflow file; compile only in reaction to the publish action

* avoid the bytearray vs string error on Windows by converting the result to
bytearray

* fixing string vs bytearray issue

* update categorical field key property: change the key and value to bytes if
they are str

* solved the 'index must be np.int64' error
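Background for this one: with NumPy 1.x the default integer dtype on Windows is 32-bit, so index arrays built from Python ints come out as int32 unless the dtype is given explicitly, for example:

    import numpy as np

    idx = np.array([0, 5, 9])                   # int32 by default on Windows, int64 on Linux/macOS
    idx = np.array([0, 5, 9], dtype=np.int64)   # explicit dtype behaves the same everywhere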

* all unittest errors on Windows removed

* minor update on workflow file

* minor update workflow file

* minor fix: use pip install -r ; remove unused import in utils.py

* update action file

* remove the uint32-to-int32 change in test_persistence

Co-authored-by: jie <[email protected]>
Co-authored-by: Ben Murray <[email protected]>
Co-authored-by: clyyuanzi-london <[email protected]>
4 people authored Oct 18, 2021
1 parent 2149a38 commit b6864c1
Showing 21 changed files with 558 additions and 48 deletions.
13 changes: 8 additions & 5 deletions .github/workflows/python-app.yml
@@ -12,8 +12,11 @@ on:
jobs:
build:

runs-on: ubuntu-latest

runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [macos-latest, windows-latest, ubuntu-latest]

steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
@@ -23,8 +26,8 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 numpy numba pandas h5py
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install flake8
pip install -r requirements.txt
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
@@ -33,4 +36,4 @@
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with unittest
run: |
python -m unittest tests/*
python -m unittest
@@ -6,7 +6,7 @@
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package
name: Build & upload package on Linux

on:
release:
@@ -26,9 +26,15 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
pip install flake8
pip install -r requirements.txt
- name: Set up GCC
uses: egor-tensin/setup-gcc@v1
with:
version: latest
platform: x64
- name: Build package
run: python -m build
run: python setup.py bdist_wheel
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
38 changes: 38 additions & 0 deletions .github/workflows/python-publish-macos.yml
@@ -0,0 +1,38 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on MacOS

on:
release:
types: [published]

jobs:
deploy:

runs-on: macos-latest

steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8
pip install -r requirements.txt
- name: Build package
run: python setup.py bdist_wheel
- name: Publish package
run: |
python3 -m twine upload dist/*
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
42 changes: 42 additions & 0 deletions .github/workflows/python-publish-win.yml
@@ -0,0 +1,42 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Build & upload package on Windows

on:
release:
types: [published]

jobs:
deploy:

runs-on: windows-latest

steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8
pip install -r requirements.txt
- name: Set up MinGW
uses: egor-tensin/setup-mingw@v2
with:
platform: x64
- name: Build package
run: python setup.py bdist_wheel
- name: Publish package
run: |
python3 -m twine upload dist/*
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
8 changes: 8 additions & 0 deletions exetera/_libs/ops.pyx
@@ -0,0 +1,8 @@
def fib(n):
"""Print the Fibonacci series up to n."""
a, b = 0, 1
while b < n:
print(b)
a, b = b, a + b

print()
162 changes: 162 additions & 0 deletions exetera/core/dataframe.py
@@ -565,6 +565,168 @@ def groupby(self, by: Union[str, List[str]], hint_keys_is_sorted=False):

return HDF5DataFrameGroupBy(self._columns, by, sorted_index, spans)

def describe(self, include=None, exclude=None):
"""
Show the basic statistics of the data in each field.
:param include: The field name or data type, or simply 'all', to indicate the fields included in the calculation.
:param exclude: The field name or data type to exclude from the calculation.
:return: A dataframe containing the statistic results.
"""
# check include and exclude conflicts
if include is not None and exclude is not None:
if isinstance(include, str):
raise ValueError('Please do not use exclude parameter when include is set as a single field.')
elif isinstance(include, type):
if isinstance(exclude, type) or (isinstance(exclude, list) and isinstance(exclude[0], type)):
raise ValueError(
'Please do not set exclude as a type when include is set as a single data type.')
elif isinstance(include, list):
if isinstance(include[0], str) and isinstance(exclude, str):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], str) and isinstance(exclude, list) and isinstance(exclude[0], str):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], type) and isinstance(exclude, type):
raise ValueError('Please do not use exclude as the same type as the include parameter.')
elif isinstance(include[0], type) and isinstance(exclude, list) and isinstance(exclude[0], type):
raise ValueError('Please do not use exclude as the same type as the include parameter.')

fields_to_calculate = []
if include is not None:
if isinstance(include, str): # a single str
if include == 'all':
fields_to_calculate = list(self.columns.keys())
elif include in self.columns.keys():
fields_to_calculate = [include]
else:
raise ValueError('The field to include is not in the dataframe.')
elif isinstance(include, type): # a single type
for f in self.columns:
if not self[f].indexed and np.issubdtype(self[f].data.dtype, include):
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('No such type appeared in the dataframe.')
elif isinstance(include, list) and isinstance(include[0], str): # a list of str
for f in include:
if f in self.columns.keys():
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('The fields to include are not in the dataframe.')

elif isinstance(include, list) and isinstance(include[0], type): # a list of type
for t in include:
for f in self.columns:
if not self[f].indexed and np.issubdtype(self[f].data.dtype, t):
fields_to_calculate.append(f)
if len(fields_to_calculate) == 0:
raise ValueError('No such type appeared in the dataframe.')

else:
raise ValueError('The include parameter can only be str, dtype, or list of either.')

else: # include is None, numeric & timestamp fields only (no indexed strings) TODO confirm the type
for f in self.columns:
if isinstance(self[f], fld.NumericField) or isinstance(self[f], fld.TimestampField):
fields_to_calculate.append(f)

if len(fields_to_calculate) == 0:
raise ValueError('No fields included to describe.')

if exclude is not None:
if isinstance(exclude, str):
if exclude in fields_to_calculate: # exclude
fields_to_calculate.remove(exclude) # remove from list
elif isinstance(exclude, type): # a type
for f in fields_to_calculate:
if np.issubdtype(self[f].data.dtype, exclude):
fields_to_calculate.remove(f)
elif isinstance(exclude, list) and isinstance(exclude[0], str): # a list of str
for f in exclude:
fields_to_calculate.remove(f)

elif isinstance(exclude, list) and isinstance(exclude[0], type): # a list of type
for t in exclude:
for f in fields_to_calculate:
if np.issubdtype(self[f].data.dtype, t):
fields_to_calculate.remove(f) # remove will raise ValueError if dtype not present

else:
raise ValueError('The exclude parameter can only be str, dtype, or list of either.')

if len(fields_to_calculate) == 0:
raise ValueError('All fields are excluded, no field left to describe.')
# if flexible (str) fields
des_idxstr = False
for f in fields_to_calculate:
if isinstance(self[f], fld.CategoricalField) or isinstance(self[f], fld.FixedStringField) or isinstance(
self[f], fld.IndexedStringField):
des_idxstr = True
# calculation
result = {'fields': [], 'count': [], 'mean': [], 'std': [], 'min': [], '25%': [], '50%': [], '75%': [],
'max': []}

# count
if des_idxstr:
result['unique'], result['top'], result['freq'] = [], [], []

for f in fields_to_calculate:
result['fields'].append(f)
result['count'].append(len(self[f].data))

if des_idxstr and (isinstance(self[f], fld.NumericField) or isinstance(self[f],
fld.TimestampField)): # numeric, timestamp
result['unique'].append('NaN')
result['top'].append('NaN')
result['freq'].append('NaN')

result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

elif des_idxstr and (isinstance(self[f], fld.CategoricalField) or isinstance(self[f],
fld.IndexedStringField) or isinstance(
self[f], fld.FixedStringField)): # categorical & indexed string & fixed string
a, b = np.unique(self[f].data[:], return_counts=True)
result['unique'].append(len(a))
result['top'].append(a[np.argmax(b)])
result['freq'].append(b[np.argmax(b)])

result['mean'].append('NaN')
result['std'].append('NaN')
result['min'].append('NaN')
result['25%'].append('NaN')
result['50%'].append('NaN')
result['75%'].append('NaN')
result['max'].append('NaN')

elif not des_idxstr:
result['mean'].append("{:.2f}".format(np.mean(self[f].data[:])))
result['std'].append("{:.2f}".format(np.std(self[f].data[:])))
result['min'].append("{:.2f}".format(np.min(self[f].data[:])))
result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 25)))
result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 50)))
result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 75)))
result['max'].append("{:.2f}".format(np.max(self[f].data[:])))

# display
columns_to_show = ['fields', 'count', 'unique', 'top', 'freq', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
# 5 fields each time for display
for col in range(0, len(result['fields']), 5): # 5 column each time
for i in columns_to_show:
if i in result:
print(i, end='\t')
for f in result[i][col:col + 5 if col + 5 < len(result[i]) - 1 else len(result[i])]:
print('{:>15}'.format(f), end='\t')
print('')
print('\n')

return result



class HDF5DataFrameGroupBy(DataFrameGroupBy):
10 changes: 6 additions & 4 deletions exetera/core/field_importers.py
@@ -5,7 +5,8 @@
from exetera.core import operations as ops
from exetera.core.data_writer import DataWriter
from exetera.core import utils
from datetime import datetime, date
from datetime import datetime, date, timezone
import pytz

INDEXED_STRING_FIELD_SIZE = 10 # guessing

@@ -307,14 +308,14 @@ def write_part(self, values):
# ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S.%f%z')
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]),
int(value[20:26]))
int(value[20:26]), tzinfo=timezone.utc)
elif v_len == 25:
# ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S%z')
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]))
int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
elif v_len == 19:
v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]),
int(value[11:13]), int(value[14:16]), int(value[17:19]))
int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc)
else:
raise ValueError(f"Date field '{self.field}' has unexpected format '{value}'")
datetime_ts[i] = v_datetime.timestamp()
@@ -362,6 +363,7 @@ def write_part(self, values):
flags[i] = False
else:
ts = datetime.strptime(value.decode(), '%Y-%m-%d')
ts = ts.replace(tzinfo=timezone.utc)
date_ts[i] = ts.timestamp()

self.field.data.write_part(date_ts)
10 changes: 8 additions & 2 deletions exetera/core/fields.py
@@ -1557,8 +1557,14 @@ def nformat(self):
@property
def keys(self):
self._ensure_valid()
kv = self._field['key_values']
kn = self._field['key_names']
if isinstance(self._field['key_values'][0], str): # convert to bytes to stay consistent with Linux
kv = [bytes(i, 'utf-8') for i in self._field['key_values']]
else:
kv = self._field['key_values']
if isinstance(self._field['key_names'][0], str):
kn = [bytes(i, 'utf-8') for i in self._field['key_names']]
else:
kn = self._field['key_names']
keys = dict(zip(kv, kn))
return keys

4 changes: 2 additions & 2 deletions exetera/core/persistence.py
@@ -169,7 +169,7 @@ def _apply_sort_to_array(index, values):
@njit
def _apply_sort_to_index_values(index, indices, values):

s_indices = np.zeros_like(indices)
s_indices = np.zeros_like(indices, dtype=np.int64)
s_values = np.zeros_like(values)
accumulated = np.int64(0)
s_indices[0] = 0
@@ -1029,7 +1029,7 @@ def apply_spans_concat(self, spans, reader, writer):

src_index = reader.field['index'][:]
src_values = reader.field['values'][:]
dest_index = np.zeros(reader.chunksize, src_index.dtype)
dest_index = np.zeros(reader.chunksize, np.int64)
dest_values = np.zeros(reader.chunksize * 16, src_values.dtype)

max_index_i = reader.chunksize
2 changes: 1 addition & 1 deletion exetera/core/readerwriter.py
@@ -60,7 +60,7 @@ def dtype(self):
return self.field['index'].dtype, self.field['values'].dtype

def sort(self, index, writer):
field_index = self.field['index'][:]
field_index = np.array(self.field['index'][:], dtype=np.int64)
field_values = self.field['values'][:]
r_field_index, r_field_values =\
pers._apply_sort_to_index_values(index, field_index, field_values)