From b6864c13c91678c6f8fcb35a9dc986427d8ab870 Mon Sep 17 00:00:00 2001 From: deng113jie Date: Mon, 18 Oct 2021 19:20:38 +0100 Subject: [PATCH] Fixing unittest errors on Windows (#222) * fixing issue #86 from upstream: add get_spans() in Field class, similar to get_spans() in Session class * add unit test for Field get_spans() function * remove unhelpful line comments * add dataset, dataframe classes * closing issue #92: reset the dataset when calling field.data.clear * closing issue #92: reset the dataset when calling field.data.clear * add unittest for field.data.clear function * recover the dataset file to avoid merge errors when fixing issue #92 * fix end-of-file char in dataset.py * add get_spans for indexed string field * unittest for get_spans functions on different types of field, e.g. fixed string, indexed string, etc. * dataframe basic methods and unittest * more dataframe operations * minor fixes * update get_spans in the field subclasses * intermediate commit for testing PR 118 * Implement get_spans(ndarray) and get_spans(ndarray1, ndarray2) functions in core.operations. Provide get_spans methods in fields using the data attribute. * Move the get_spans functions from persistence to operations. Modify the get_spans functions in Session to call the field method and operation method. * minor edits for pull request * remove dataframe for pull request * remove dataframe test for PR * add dataframe * fix get_spans_for_2_fields_by_spans, fix the unittest * Initial commit for is_sorted method on Field * minor edits for the PR * fix minor edit error for PR * add apply_index and apply_filter methods on fields * Adding in missing tests for all field types for is_sorted * update the apply_filter and apply_index methods on Fields * minor updates to line up w/ upstream * update apply_filter & apply_index methods in fields so that behaviour differs if a destination field is set: if set, use dstfld.write because the new field is usually empty; if not set, write to self using fld.data[:] * updated the apply_index and apply_filter methods in fields. Use olddata[:] = newdata if the length of the old dataset equals that of the new one; clear() and write() the data if not. * add dataframe basic functions and operations; working on dataset to enable dataframe to create fields. * add functions in dataframe; add dataset class; add functions in dataset; move dataset module to csvdataset * integrate the dataset and dataframe into the session * update the field_importers and field.create_like methods to call dataframe.create; update the unittests to follow the s.open_dataset and dataset.create_dataframe flow * add license info to a few files * csv_reader_with_njit * change output_excel from string to int * initialize column_idx matrix outside of the njit function * use np.fromfile to load the file into a byte array * Refactoring and reformatting of some of the dataset / dataframe code; moving Session and Dataset to abstract types; fixing of is_sorted tests that were broken with the merge of the new functionality * Work on fast csv reading * Address issue #138 with minor tweaks. Fix bug: create dataframes in the dataset construction method to map existing datasets. Fully sync the dataset with the h5file when adding, removing or setting a dataframe (group). * remove draft group.py from repo * Improved performance of the fast csv reader by avoiding ndarray slicing * fix dataframe API * fixing #13 and #14: add dest parameter to get_spans(), tidy up the field/fields parameters * minor fix: remove dataframe and file property from dataset, as not used so far.
* minor fix on unittest * add docstring for dataset * copy/move for dataframe; docstrings * categorical field: convert from byte int to value int within the njit function * Adding in a pseudocode version of the fast categorical lookup * clean up the comments * docstrings for dataframe * Major reworking of apply_filter / apply_index for fields; they shouldn't destructively change self by default. Also addition of further mem versions of fields and factoring out of common functionality. Fix to field for the case where indices / values are cleared but data is left pointing to the old field * add unittest for various fields in dataframe; add dataframe.add/drop/move; add docstrings * add unittest for Dataframe.add/drop/move * minor change to names to make sure they are consistent across dataframe, dataset.key and h5group * minor fix of adding prefix b to strings in test_session and test_dataset * minor fix of adding prefix b to strings in test_session and test_dataset * Completed initial pass of memory fields for all types * categorical field.keys will return byte keys as strings, thus a minor change to the unittest * solved the byte-to-string issue; the problem is a difference between Python 3.7 and 3.8 * Miscellaneous field fixes; fixed issues with dataframe apply_filter / apply_index * Moving most binary op logic out into a static method in FieldDataOps * Dataframe copy, move and drop operations have been moved out of the DataFrame static methods as Python doesn't support static and instance method name overloading (my bad) * Fixing accidental introduction of CRLF to abstract_types * Fixed bug where apply_filter and apply_index weren't returning a field on all code paths; beefed up tests to cover this * Fixed issue in timestamp_field_create_like when group is set and is a dataframe * persistence.filter_duplicate_fields now supports fields as well as ndarrays * sort_on message now shows in verbose mode under all circumstances * Fixed bug in apply_filter when a destination dataset is applied * Added a test to catch the dataframe.apply_filter bug * Bug fix: categorical_field_constructor in fields.py was returning a numeric field when passed an h5py group as a destination for the field * Copying data before filtering, as filtering in h5py is very slow * Adding apply_spans functions to fields * Fixed TestFieldApplySpansCount.test_timestamp_apply_spans that had been written but not run * Issues found with indexed strings and merging; fixes found for apply_filter and apply_index when being passed a field rather than an ndarray; both with augmented testing * Updated merge functions to consistently return memory fields if not provided with outputs but provided with fields * concatenate cat keys instead of padding * some docstrings for fields * dataframe copy/move/drop and unittest * Fixing issue with dataframe move/copy being static * Updating HDF5Field writeable methods to account for prior changes * Adding merge functionality for dataframes * dataset.drop is a member method of Dataset as it did not make sense for it to be static or outside of the class * Added missing methods / properties to DataFrame ABC * minor update on dataframe static function * minor update * minor update session * minor comments update * minor comments update * add unittest for csv_reader_speedup.py * count operation; logical not for numeric fields * remove csv speed-up work from commit * minor update * unit test for logical not in numeric field * patch for get_spans for datastore * tests for two fields * add astype to numeric field * separate the unittest of get_spans by datastore reader
* unittest for astype * update astype for fields, update logical_not for numeric fields * remove dataframe view commits * remove kwargs in get_spans in session, add fields back for backward compatibility * remove filter view tests * partial commit on viewer * remove view from git * add df.describe unittest * sync with upstream * Delete python-publish.yml * Update python-app.yml * Update python-app.yml * dataframe describe function * sync with upstream * Update python-app.yml * alternative get_timestamp notebook for discussion * update the notebook output for Linux and Mac * update format * update the to_timestamp and to_datetime functions in utils; fix the current datetime.timestamp() error in test_fields and test_sessions * add unittest for utils to_timestamp and to_datetime * fix for PR * set up a GitHub action specific to Windows for Cython * minor workflow fix * add example pyx file * fix package upload command on Windows, as the GitHub action gh-action-pypi-publish works only on Linux * add twine to the tooling * add Linux action file * update the Linux build command * build workflow for macOS * minor update to the macOS workflow * fixed timestamp issue on Windows by adding timezone info to datetime * finalize workflow file; compilation reacts to the publish action only * avoid the bytearray vs string error on Windows by converting the result to bytearray * fixing string vs bytearray issue * update categorical field key property: change the key, value to bytes if it is a str * solved the 'index must be np.int64' error * all unittest errors on Windows removed * minor update on workflow file * minor update to workflow file * minor fix: use pip install -r; remove unused import in utils.py * update action file * remove change in test_persistence from uint32 to int32 Co-authored-by: jie Co-authored-by: Ben Murray Co-authored-by: clyyuanzi-london <59363720+clyyuanzi-london@users.noreply.github.com> --- .github/workflows/python-app.yml | 13 +- ...n-publish.yml => python-publish-linux.yml} | 12 +- .github/workflows/python-publish-macos.yml | 38 +++ .github/workflows/python-publish-win.yml | 42 ++++ exetera/_libs/ops.pyx | 8 + exetera/core/dataframe.py | 162 ++++++++++++ exetera/core/field_importers.py | 10 +- exetera/core/fields.py | 10 +- exetera/core/persistence.py | 4 +- exetera/core/readerwriter.py | 2 +- exetera/core/utils.py | 3 +- requirements.txt | 1 + setup.py | 5 + tests/test_csv_reader_speedup.py | 2 +- tests/test_dataframe.py | 234 +++++++++++++++++- tests/test_fields.py | 21 +- tests/test_importer.py | 11 +- tests/test_parsers.py | 7 +- tests/test_persistence.py | 6 +- tests/test_session.py | 11 +- tests/test_utils.py | 4 +- 21 files changed, 558 insertions(+), 48 deletions(-) rename .github/workflows/{python-publish.yml => python-publish-linux.yml} (78%) create mode 100644 .github/workflows/python-publish-macos.yml create mode 100644 .github/workflows/python-publish-win.yml create mode 100644 exetera/_libs/ops.pyx diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml index f1c89e6a..505930a4 100644 --- a/.github/workflows/python-app.yml +++ b/.github/workflows/python-app.yml @@ -12,8 +12,11 @@ on: jobs: build: - runs-on: ubuntu-latest - + runs-on: ${{ matrix.os }} + strategy: + matrix: + os: [macos-latest, windows-latest, ubuntu-latest] + steps: - uses: actions/checkout@v2 - name: Set up Python 3.7 @@ -23,8 +26,8 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip - pip install flake8 numpy numba pandas h5py - if [ -f requirements.txt ]; then pip install -r
requirements.txt; fi + pip install flake8 + pip install -r requirements.txt - name: Lint with flake8 run: | # stop the build if there are Python syntax errors or undefined names @@ -33,4 +36,4 @@ jobs: flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics - name: Test with unittest run: | - python -m unittest tests/* + python -m unittest diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish-linux.yml similarity index 78% rename from .github/workflows/python-publish.yml rename to .github/workflows/python-publish-linux.yml index 3bfabfc1..18550be5 100644 --- a/.github/workflows/python-publish.yml +++ b/.github/workflows/python-publish-linux.yml @@ -6,7 +6,7 @@ # separate terms of service, privacy policy, and support # documentation. -name: Upload Python Package +name: Build & upload package on Linux on: release: @@ -26,9 +26,15 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip - pip install build + pip install flake8 + pip install -r requirements.txt + - name: Set up GCC + uses: egor-tensin/setup-gcc@v1 + with: + version: latest + platform: x64 - name: Build package - run: python -m build + run: python setup.py bdist_wheel - name: Publish package uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29 with: diff --git a/.github/workflows/python-publish-macos.yml b/.github/workflows/python-publish-macos.yml new file mode 100644 index 00000000..34ae492b --- /dev/null +++ b/.github/workflows/python-publish-macos.yml @@ -0,0 +1,38 @@ +# This workflow will upload a Python Package using Twine when a release is created +# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries + +# This workflow uses actions that are not certified by GitHub. +# They are provided by a third-party and are governed by +# separate terms of service, privacy policy, and support +# documentation. + +name: Build & upload package on MacOS + +on: + release: + types: [published] + +jobs: + deploy: + + runs-on: macos-latest + + steps: + - uses: actions/checkout@v2 + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: '3.x' + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install flake8 + pip install -r requirements.txt + - name: Build package + run: python setup.py bdist_wheel + - name: Publish package + run: | + python3 -m twine upload dist/* + env: + TWINE_USERNAME: __token__ + TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }} diff --git a/.github/workflows/python-publish-win.yml b/.github/workflows/python-publish-win.yml new file mode 100644 index 00000000..aff63278 --- /dev/null +++ b/.github/workflows/python-publish-win.yml @@ -0,0 +1,42 @@ +# This workflow will upload a Python Package using Twine when a release is created +# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries + +# This workflow uses actions that are not certified by GitHub. +# They are provided by a third-party and are governed by +# separate terms of service, privacy policy, and support +# documentation. 
+ +name: Build & upload package on Windows + +on: + release: + types: [published] + +jobs: + deploy: + + runs-on: windows-latest + + steps: + - uses: actions/checkout@v2 + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: '3.x' + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install flake8 + pip install -r requirements.txt + - name: Set up MinGW + uses: egor-tensin/setup-mingw@v2 + with: + platform: x64 + - name: Build package + run: python setup.py bdist_wheel + - name: Publish package + run: | + python3 -m twine upload dist/* + env: + TWINE_USERNAME: __token__ + TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }} diff --git a/exetera/_libs/ops.pyx b/exetera/_libs/ops.pyx new file mode 100644 index 00000000..40a18630 --- /dev/null +++ b/exetera/_libs/ops.pyx @@ -0,0 +1,8 @@ +def fib(n): + """Print the Fibonacci series up to n.""" + a, b = 0, 1 + while b < n: + print(b) + a, b = b, a + b + + print() \ No newline at end of file diff --git a/exetera/core/dataframe.py b/exetera/core/dataframe.py index 19d9e4b3..174b3db0 100644 --- a/exetera/core/dataframe.py +++ b/exetera/core/dataframe.py @@ -565,6 +565,168 @@ def groupby(self, by: Union[str, List[str]], hint_keys_is_sorted=False): return HDF5DataFrameGroupBy(self._columns, by, sorted_index, spans) + def describe(self, include=None, exclude=None): + """ + Show the basic statistics of the data in each field. + + :param include: The field name or data type or simply 'all' to indicate the fields included in the calculation. + :param exclude: The field name or data type to exclude from the calculation. + :return: A dictionary containing the statistic results. + + """ + # check include and exclude conflicts + if include is not None and exclude is not None: + if isinstance(include, str): + raise ValueError('Please do not use the exclude parameter when include is set as a single field.') + elif isinstance(include, type): + if isinstance(exclude, type) or (isinstance(exclude, list) and isinstance(exclude[0], type)): + raise ValueError( + 'Please do not set exclude as a type when include is set as a single data type.') + elif isinstance(include, list): + if isinstance(include[0], str) and isinstance(exclude, str): + raise ValueError('Please do not set exclude to the same type as the include parameter.') + elif isinstance(include[0], str) and isinstance(exclude, list) and isinstance(exclude[0], str): + raise ValueError('Please do not set exclude to the same type as the include parameter.') + elif isinstance(include[0], type) and isinstance(exclude, type): + raise ValueError('Please do not set exclude to the same type as the include parameter.') + elif isinstance(include[0], type) and isinstance(exclude, list) and isinstance(exclude[0], type): + raise ValueError('Please do not set exclude to the same type as the include parameter.') + + fields_to_calculate = [] + if include is not None: + if isinstance(include, str): # a single str + if include == 'all': + fields_to_calculate = list(self.columns.keys()) + elif include in self.columns.keys(): + fields_to_calculate = [include] + else: + raise ValueError('The field to include is not in the dataframe.') + elif isinstance(include, type): # a single type + for f in self.columns: + if not self[f].indexed and np.issubdtype(self[f].data.dtype, include): + fields_to_calculate.append(f) + if len(fields_to_calculate) == 0: + raise ValueError('No field of the given type appears in the dataframe.') + elif isinstance(include, list) and isinstance(include[0], str): # a list of str
+ for f in include: + if f in self.columns.keys(): + fields_to_calculate.append(f) + if len(fields_to_calculate) == 0: + raise ValueError('The fields to include are not in the dataframe.') + + elif isinstance(include, list) and isinstance(include[0], type): # a list of type + for t in include: + for f in self.columns: + if not self[f].indexed and np.issubdtype(self[f].data.dtype, t): + fields_to_calculate.append(f) + if len(fields_to_calculate) == 0: + raise ValueError('No field of the given type appears in the dataframe.') + + else: + raise ValueError('The include parameter can only be str, dtype, or list of either.') + + else: # include is None, numeric & timestamp fields only (no indexed strings) TODO confirm the type + for f in self.columns: + if isinstance(self[f], fld.NumericField) or isinstance(self[f], fld.TimestampField): + fields_to_calculate.append(f) + + if len(fields_to_calculate) == 0: + raise ValueError('No fields included to describe.') + + if exclude is not None: + if isinstance(exclude, str): + if exclude in fields_to_calculate: # exclude + fields_to_calculate.remove(exclude) # remove from list + elif isinstance(exclude, type): # a type + for f in list(fields_to_calculate): # iterate over a copy, as items are removed + if np.issubdtype(self[f].data.dtype, exclude): + fields_to_calculate.remove(f) + elif isinstance(exclude, list) and isinstance(exclude[0], str): # a list of str + for f in exclude: + fields_to_calculate.remove(f) + + elif isinstance(exclude, list) and isinstance(exclude[0], type): # a list of type + for t in exclude: + for f in list(fields_to_calculate): # iterate over a copy, as items are removed + if np.issubdtype(self[f].data.dtype, t): + fields_to_calculate.remove(f) # remove raises a ValueError if the dtype is not present + + else: + raise ValueError('The exclude parameter can only be str, dtype, or list of either.') + + if len(fields_to_calculate) == 0: + raise ValueError('All fields are excluded, no field left to describe.') + # check whether any flexible (str) fields are present + des_idxstr = False + for f in fields_to_calculate: + if isinstance(self[f], fld.CategoricalField) or isinstance(self[f], fld.FixedStringField) or isinstance( + self[f], fld.IndexedStringField): + des_idxstr = True + # calculation + result = {'fields': [], 'count': [], 'mean': [], 'std': [], 'min': [], '25%': [], '50%': [], '75%': [], + 'max': []} + + # count + if des_idxstr: + result['unique'], result['top'], result['freq'] = [], [], [] + + for f in fields_to_calculate: + result['fields'].append(f) + result['count'].append(len(self[f].data)) + + if des_idxstr and (isinstance(self[f], fld.NumericField) or isinstance(self[f], + fld.TimestampField)): # numeric, timestamp + result['unique'].append('NaN') + result['top'].append('NaN') + result['freq'].append('NaN') + + result['mean'].append("{:.2f}".format(np.mean(self[f].data[:]))) + result['std'].append("{:.2f}".format(np.std(self[f].data[:]))) + result['min'].append("{:.2f}".format(np.min(self[f].data[:]))) + result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.25))) + result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.5))) + result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.75))) + result['max'].append("{:.2f}".format(np.max(self[f].data[:]))) + + elif des_idxstr and (isinstance(self[f], fld.CategoricalField) or isinstance(self[f], + fld.IndexedStringField) or isinstance( + self[f], fld.FixedStringField)): # categorical & indexed string & fixed string + a, b = np.unique(self[f].data[:], return_counts=True) + result['unique'].append(len(a)) + result['top'].append(a[np.argmax(b)])
+ result['freq'].append(b[np.argmax(b)]) + + result['mean'].append('NaN') + result['std'].append('NaN') + result['min'].append('NaN') + result['25%'].append('NaN') + result['50%'].append('NaN') + result['75%'].append('NaN') + result['max'].append('NaN') + + elif not des_idxstr: + result['mean'].append("{:.2f}".format(np.mean(self[f].data[:]))) + result['std'].append("{:.2f}".format(np.std(self[f].data[:]))) + result['min'].append("{:.2f}".format(np.min(self[f].data[:]))) + result['25%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.25))) + result['50%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.5))) + result['75%'].append("{:.2f}".format(np.percentile(self[f].data[:], 0.75))) + result['max'].append("{:.2f}".format(np.max(self[f].data[:]))) + + # display + columns_to_show = ['fields', 'count', 'unique', 'top', 'freq', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'] + # 5 fields each time for display + for col in range(0, len(result['fields']), 5): # 5 columns at a time + for i in columns_to_show: + if i in result: + print(i, end='\t') + for f in result[i][col:col + 5 if col + 5 < len(result[i]) - 1 else len(result[i])]: + print('{:>15}'.format(f), end='\t') + print('') + print('\n') + + return result + class HDF5DataFrameGroupBy(DataFrameGroupBy): diff --git a/exetera/core/field_importers.py index 0b33e592..e42b05d7 100644 --- a/exetera/core/field_importers.py +++ b/exetera/core/field_importers.py @@ -5,7 +5,8 @@ from exetera.core import operations as ops from exetera.core.data_writer import DataWriter from exetera.core import utils -from datetime import datetime, date +from datetime import datetime, date, timezone +import pytz INDEXED_STRING_FIELD_SIZE = 10 # guessing @@ -307,14 +308,14 @@ def write_part(self, values): # ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S.%f%z') v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]), int(value[11:13]), int(value[14:16]), int(value[17:19]), - int(value[20:26])) + int(value[20:26]), tzinfo=timezone.utc) elif v_len == 25: # ts = datetime.strptime(value.decode(), '%Y-%m-%d %H:%M:%S%z') v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]), - int(value[11:13]), int(value[14:16]), int(value[17:19])) + int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc) elif v_len == 19: v_datetime = datetime(int(value[0:4]), int(value[5:7]), int(value[8:10]), - int(value[11:13]), int(value[14:16]), int(value[17:19])) + int(value[11:13]), int(value[14:16]), int(value[17:19]), tzinfo=timezone.utc) else: raise ValueError(f"Date field '{self.field}' has unexpected format '{value}'") datetime_ts[i] = v_datetime.timestamp() @@ -362,6 +363,7 @@ def write_part(self, values): flags[i] = False else: ts = datetime.strptime(value.decode(), '%Y-%m-%d') + ts = ts.replace(tzinfo=timezone.utc) date_ts[i] = ts.timestamp() self.field.data.write_part(date_ts) diff --git a/exetera/core/fields.py index 799ed4e4..e99a87b4 100644 --- a/exetera/core/fields.py +++ b/exetera/core/fields.py @@ -1557,8 +1557,14 @@ def nformat(self): @property def keys(self): self._ensure_valid() - kv = self._field['key_values'] - kn = self._field['key_names'] + if isinstance(self._field['key_values'][0], str): # convert to bytes to match the behaviour on Linux + kv = [bytes(i, 'utf-8') for i in self._field['key_values']] + else: + kv = self._field['key_values'] + if isinstance(self._field['key_names'][0], str): + kn = [bytes(i, 'utf-8') for i in self._field['key_names']]
+ else: + kn = self._field['key_names'] keys = dict(zip(kv, kn)) return keys diff --git a/exetera/core/persistence.py index d591f427..24bbfd61 100644 --- a/exetera/core/persistence.py +++ b/exetera/core/persistence.py @@ -169,7 +169,7 @@ def _apply_sort_to_array(index, values): @njit def _apply_sort_to_index_values(index, indices, values): - s_indices = np.zeros_like(indices) + s_indices = np.zeros_like(indices, dtype=np.int64) s_values = np.zeros_like(values) accumulated = np.int64(0) s_indices[0] = 0 @@ -1029,7 +1029,7 @@ def apply_spans_concat(self, spans, reader, writer): src_index = reader.field['index'][:] src_values = reader.field['values'][:] - dest_index = np.zeros(reader.chunksize, src_index.dtype) + dest_index = np.zeros(reader.chunksize, np.int64) dest_values = np.zeros(reader.chunksize * 16, src_values.dtype) max_index_i = reader.chunksize diff --git a/exetera/core/readerwriter.py index 4cb9c8bd..7710df68 100644 --- a/exetera/core/readerwriter.py +++ b/exetera/core/readerwriter.py @@ -60,7 +60,7 @@ def dtype(self): return self.field['index'].dtype, self.field['values'].dtype def sort(self, index, writer): - field_index = self.field['index'][:] + field_index = np.array(self.field['index'][:], dtype=np.int64) field_values = self.field['values'][:] r_field_index, r_field_values =\ pers._apply_sort_to_index_values(index, field_index, field_values) diff --git a/exetera/core/utils.py index b2fcb858..a65bb4e7 100644 --- a/exetera/core/utils.py +++ b/exetera/core/utils.py @@ -393,4 +393,5 @@ def one_dim_data_to_indexed_for_test(data, field_size): length += 1 indices[0, i + 1] = indices[0, i] + length - return indices, values, offsets, count_row \ No newline at end of file + return indices, values, offsets, count_row + diff --git a/requirements.txt index 3d49aac1..ba942618 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,3 +2,4 @@ numpy pandas h5py numba +cython \ No newline at end of file diff --git a/setup.py index 6c8ae893..dc66fb2e 100644 --- a/setup.py +++ b/setup.py @@ -1,5 +1,6 @@ from setuptools import setup from pkg_resources import parse_requirements +from Cython.Build import cythonize from os import path this_directory = path.abspath(path.dirname(__file__)) @@ -14,6 +15,9 @@ with open(path.join(this_directory, "requirements.txt")) as o: requirements = [str(r) for r in parse_requirements(o.read())] +pyxfiles = ['ops.pyx'] +pyx_full_path = [path.join(this_directory, 'exetera', '_libs', pyx) for pyx in pyxfiles] + setup( name='exetera', version=__version__, @@ -26,6 +30,7 @@ license='http://www.apache.org/licenses/LICENSE-2.0', packages=['exetera', 'exetera.core', 'exetera.processing'], scripts=['exetera/bin/exetera'], + ext_modules = cythonize(pyx_full_path), python_requires='>=3.7', install_requires=requirements ) diff --git a/tests/test_csv_reader_speedup.py index f3826e30..a2082d28 100644 --- a/tests/test_csv_reader_speedup.py +++ b/tests/test_csv_reader_speedup.py @@ -266,7 +266,7 @@ def test_read_file_on_only_categorical_field(self, mock_fromfile): # print(result) # print(df[field]) self.assertEqual(len(result), len(df[field])) - self.assertListEqual(result, list(df[field])) + self.assertListEqual([i.replace('\r', '') for i in result], list(df[field])) # remove \r due to Windows @patch("numpy.fromfile") diff --git a/tests/test_dataframe.py index 54736fde..70d97b57
100644 --- a/tests/test_dataframe.py +++ b/tests/test_dataframe.py @@ -70,11 +70,11 @@ def test_dataframe_create_numeric(self): a = df.create_numeric('a','int32') a.data.write(values) - total = np.sum(a.data[:]) + total = np.sum(a.data[:], dtype=np.int64) self.assertEqual(49997540637149, total) a.data[:] = a.data[:] * 2 - total = np.sum(a.data[:]) + total = np.sum(a.data[:], dtype=np.int64) self.assertEqual(99995081274298, total) def test_dataframe_create_categorical(self): @@ -891,4 +891,232 @@ def test_to_csv_with_row_filter_field(self): self.assertEqual(f.readlines(), ['val1\n', '0\n', '2\n']) os.close(fd_csv) - \ No newline at end of file + +class TestDataFrameDescribe(unittest.TestCase): + + def test_describe_default(self): + bio = BytesIO() + with session.Session() as s: + dst = s.open_dataset(bio, 'w', 'dst') + df = dst.create_dataframe('df') + df.create_numeric('num', 'int32').data.write([i for i in range(10)]) + df.create_fixed_string('fs1', 1).data.write([b'a' for i in range(20)]) + df.create_timestamp('ts1').data.write([1632234128 + i for i in range(20)]) + df.create_categorical('c1', 'int32', {'a': 1, 'b': 2}).data.write([1 for i in range(20)]) + df.create_indexed_string('is1').data.write(['abc' for i in range(20)]) + result = df.describe() + expected = {'fields': ['num', 'ts1'], 'count': [10, 20], 'mean': ['4.50', '1632234137.50'], + 'std': ['2.87', '5.77'], 'min': ['0.00', '1632234128.00'], '25%': ['0.02', '1632234128.05'], + '50%': ['0.04', '1632234128.10'], '75%': ['0.07', '1632234128.14'], + 'max': ['9.00', '1632234147.00']} + self.assertEqual(result, expected) + + def test_describe_include(self): + bio = BytesIO() + with session.Session() as s: + dst = s.open_dataset(bio, 'w', 'dst') + df = dst.create_dataframe('df') + df.create_numeric('num', 'int32').data.write([i for i in range(10)]) + df.create_fixed_string('fs1', 1).data.write([b'a' for i in range(20)]) + df.create_timestamp('ts1').data.write([1632234128 + i for i in range(20)]) + df.create_categorical('c1', 'int32', {'a': 1, 'b': 2}).data.write([1 for i in range(20)]) + df.create_indexed_string('is1').data.write(['abc' for i in range(20)]) + + result = df.describe(include='all') + expected = {'fields': ['num', 'fs1', 'ts1', 'c1', 'is1'], 'count': [10, 20, 20, 20, 20], + 'mean': ['4.50', 'NaN', '1632234137.50', 'NaN', 'NaN'], 'std': ['2.87', 'NaN', '5.77', 'NaN', 'NaN'], + 'min': ['0.00', 'NaN', '1632234128.00', 'NaN', 'NaN'], '25%': ['0.02', 'NaN', '1632234128.05', 'NaN', 'NaN'], + '50%': ['0.04', 'NaN', '1632234128.10', 'NaN', 'NaN'], '75%': ['0.07', 'NaN', '1632234128.14', 'NaN', 'NaN'], + 'max': ['9.00', 'NaN', '1632234147.00', 'NaN', 'NaN'], 'unique': ['NaN', 1, 'NaN', 1, 1], + 'top': ['NaN', b'a', 'NaN', 1, 'abc'], 'freq': ['NaN', 20, 'NaN', 20, 20]} + self.assertEqual(result, expected) + + result = df.describe(include='num') + expected = {'fields': ['num'], 'count': [10], 'mean': ['4.50'], 'std': ['2.87'], 'min': ['0.00'], + '25%': ['0.02'], '50%': ['0.04'], '75%': ['0.07'], 'max': ['9.00']} + self.assertEqual(result, expected) + + result = df.describe(include=['num', 'fs1']) + expected = {'fields': ['num', 'fs1'], 'count': [10, 20], 'mean': ['4.50', 'NaN'], 'std': ['2.87', 'NaN'], + 'min': ['0.00', 'NaN'], '25%': ['0.02', 'NaN'], '50%': ['0.04', 'NaN'], '75%': ['0.07', 'NaN'], + 'max': ['9.00', 'NaN'], 'unique': ['NaN', 1], 'top': ['NaN', b'a'], 'freq': ['NaN', 20]} + self.assertEqual(result, expected) + + result = df.describe(include=np.int32) + expected = {'fields': ['num', 'c1'], 'count': [10, 20], 'mean': 
['4.50', 'NaN'], 'std': ['2.87', 'NaN'], + 'min': ['0.00', 'NaN'], '25%': ['0.02', 'NaN'], '50%': ['0.04', 'NaN'], '75%': ['0.07', 'NaN'], + 'max': ['9.00', 'NaN'], 'unique': ['NaN', 1], 'top': ['NaN', 1], 'freq': ['NaN', 20]} + self.assertEqual(result, expected) + + result = df.describe(include=[np.int32, np.bytes_]) + expected = {'fields': ['num', 'c1', 'fs1'], 'count': [10, 20, 20], 'mean': ['4.50', 'NaN', 'NaN'], + 'std': ['2.87', 'NaN', 'NaN'], 'min': ['0.00', 'NaN', 'NaN'], '25%': ['0.02', 'NaN', 'NaN'], + '50%': ['0.04', 'NaN', 'NaN'], '75%': ['0.07', 'NaN', 'NaN'], 'max': ['9.00', 'NaN', 'NaN'], + 'unique': ['NaN', 1, 1], 'top': ['NaN', 1, b'a'], 'freq': ['NaN', 20, 20]} + self.assertEqual(result, expected) + + + def test_describe_exclude(self): + bio = BytesIO() + with session.Session() as s: + src = s.open_dataset(bio, 'w', 'src') + df = src.create_dataframe('df') + df.create_numeric('num', 'int32').data.write([i for i in range(10)]) + df.create_numeric('num2', 'int64').data.write([i for i in range(10)]) + df.create_fixed_string('fs1', 1).data.write([b'a' for i in range(20)]) + df.create_timestamp('ts1').data.write([1632234128 + i for i in range(20)]) + df.create_categorical('c1', 'int32', {'a': 1, 'b': 2}).data.write([1 for i in range(20)]) + df.create_indexed_string('is1').data.write(['abc' for i in range(20)]) + + result = df.describe(exclude='num') + expected = {'fields': ['num2', 'ts1'], 'count': [10, 20], 'mean': ['4.50', '1632234137.50'], + 'std': ['2.87', '5.77'], 'min': ['0.00', '1632234128.00'], '25%': ['0.02', '1632234128.05'], + '50%': ['0.04', '1632234128.10'], '75%': ['0.07', '1632234128.14'], + 'max': ['9.00', '1632234147.00']} + self.assertEqual(result, expected) + + result = df.describe(exclude=['num', 'num2']) + expected = {'fields': ['ts1'], 'count': [20], 'mean': ['1632234137.50'], 'std': ['5.77'], + 'min': ['1632234128.00'], '25%': ['1632234128.05'], '50%': ['1632234128.10'], + '75%': ['1632234128.14'], 'max': ['1632234147.00']} + self.assertEqual(result, expected) + + result = df.describe(exclude=np.int32) + expected = {'fields': ['num2', 'ts1'], 'count': [10, 20], 'mean': ['4.50', '1632234137.50'], + 'std': ['2.87', '5.77'], 'min': ['0.00', '1632234128.00'], '25%': ['0.02', '1632234128.05'], + '50%': ['0.04', '1632234128.10'], '75%': ['0.07', '1632234128.14'], + 'max': ['9.00', '1632234147.00']} + self.assertEqual(result, expected) + + result = df.describe(exclude=[np.int32, np.float64]) + expected = {'fields': ['num2'], 'count': [10], 'mean': ['4.50'], 'std': ['2.87'], 'min': ['0.00'], + '25%': ['0.02'], '50%': ['0.04'], '75%': ['0.07'], 'max': ['9.00']} + self.assertEqual(result, expected) + + def test_describe_include_and_exclude(self): + bio = BytesIO() + with session.Session() as s: + src = s.open_dataset(bio, 'w', 'src') + df = src.create_dataframe('df') + df.create_numeric('num', 'int32').data.write([i for i in range(10)]) + df.create_numeric('num2', 'int64').data.write([i for i in range(10)]) + df.create_fixed_string('fs1', 1).data.write([b'a' for i in range(20)]) + df.create_timestamp('ts1').data.write([1632234128 + i for i in range(20)]) + df.create_categorical('c1', 'int32', {'a': 1, 'b': 2}).data.write([1 for i in range(20)]) + df.create_indexed_string('is1').data.write(['abc' for i in range(20)]) + + # str, str + with self.assertRaises(Exception) as context: + df.describe(include='num', exclude='num') + self.assertTrue(isinstance(context.exception, ValueError)) + + # list of str , str + with self.assertRaises(Exception) as context: +
df.describe(include=['num', 'num2'], exclude='num') + self.assertTrue(isinstance(context.exception, ValueError)) + # list of str , type + result = df.describe(include=['num', 'num2'], exclude=np.int32) + expected = {'fields': ['num2'], 'count': [10], 'mean': ['4.50'], 'std': ['2.87'], 'min': ['0.00'], + '25%': ['0.02'], '50%': ['0.04'], '75%': ['0.07'], 'max': ['9.00']} + self.assertEqual(result, expected) + # list of str , list of str + with self.assertRaises(Exception) as context: + df.describe(include=['num', 'num2'], exclude=['num', 'num2']) + self.assertTrue(isinstance(context.exception, ValueError)) + # list of str , list of type + result = df.describe(include=['num', 'num2', 'ts1'], exclude=[np.int32, np.int64]) + expected = {'fields': ['ts1'], 'count': [20], 'mean': ['1632234137.50'], 'std': ['5.77'], + 'min': ['1632234128.00'], '25%': ['1632234128.05'], '50%': ['1632234128.10'], + '75%': ['1632234128.14'], 'max': ['1632234147.00']} + self.assertEqual(result, expected) + + # type, str + result = df.describe(include=np.number, exclude='num2') + expected = {'fields': ['num', 'ts1', 'c1'], 'count': [10, 20, 20], 'mean': ['4.50', '1632234137.50', 'NaN'], + 'std': ['2.87', '5.77', 'NaN'], 'min': ['0.00', '1632234128.00', 'NaN'], + '25%': ['0.02', '1632234128.05', 'NaN'], '50%': ['0.04', '1632234128.10', 'NaN'], + '75%': ['0.07', '1632234128.14', 'NaN'], 'max': ['9.00', '1632234147.00', 'NaN'], + 'unique': ['NaN', 'NaN', 1], 'top': ['NaN', 'NaN', 1], 'freq': ['NaN', 'NaN', 20]} + self.assertEqual(result, expected) + # type, type + with self.assertRaises(Exception) as context: + df.describe(include=np.int32, exclude=np.int64) + self.assertTrue(isinstance(context.exception, ValueError)) + # type, list of str + result = df.describe(include=np.number, exclude=['num', 'num2']) + expected = {'fields': ['ts1', 'c1'], 'count': [20, 20], 'mean': ['1632234137.50', 'NaN'], + 'std': ['5.77', 'NaN'], 'min': ['1632234128.00', 'NaN'], '25%': ['1632234128.05', 'NaN'], + '50%': ['1632234128.10', 'NaN'], '75%': ['1632234128.14', 'NaN'], 'max': ['1632234147.00', 'NaN'], + 'unique': ['NaN', 1], 'top': ['NaN', 1], 'freq': ['NaN', 20]} + self.assertEqual(result, expected) + # type, list of type + with self.assertRaises(Exception) as context: + df.describe(include=np.int32, exclude=[np.int64, np.float64]) + self.assertTrue(isinstance(context.exception, ValueError)) + + # list of type, str + result = df.describe(include=[np.int32, np.int64], exclude='num') + expected = {'fields': ['c1', 'num2'], 'count': [20, 10], 'mean': ['NaN', '4.50'], 'std': ['NaN', '2.87'], + 'min': ['NaN', '0.00'], '25%': ['NaN', '0.02'], '50%': ['NaN', '0.04'], '75%': ['NaN', '0.07'], + 'max': ['NaN', '9.00'], 'unique': [1, 'NaN'], 'top': [1, 'NaN'], 'freq': [20, 'NaN']} + self.assertEqual(result, expected) + # list of type, type + with self.assertRaises(Exception) as context: + df.describe(include=[np.int32, np.int64], exclude=np.int64) + self.assertTrue(isinstance(context.exception, ValueError)) + # list of type, list of str + result = df.describe(include=[np.int32, np.int64], exclude=['num', 'num2']) + expected = {'fields': ['c1'], 'count': [20], 'mean': ['NaN'], 'std': ['NaN'], 'min': ['NaN'], + '25%': ['NaN'], '50%': ['NaN'], '75%': ['NaN'], 'max': ['NaN'], 'unique': [1], 'top': [1], + 'freq': [20]} + self.assertEqual(result, expected) + # list of type, list of type + with self.assertRaises(Exception) as context: + df.describe(include=[np.int32, np.int64], exclude=[np.int32, np.int64]) + self.assertTrue(isinstance(context.exception, 
ValueError)) + + def test_raise_errors(self): + bio = BytesIO() + with session.Session() as s: + src = s.open_dataset(bio, 'w', 'src') + df = src.create_dataframe('df') + + df.create_fixed_string('fs1', 1).data.write([b'a' for i in range(20)]) + df.create_categorical('c1', 'int32', {'a': 1, 'b': 2}).data.write([1 for i in range(20)]) + df.create_indexed_string('is1').data.write(['abc' for i in range(20)]) + + with self.assertRaises(Exception) as context: + df.describe(include='num3') + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe(include=np.int8) + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe(include=['num3', 'num4']) + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe(include=[np.int8, np.uint]) + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe(include=float('3.14159')) + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe() + self.assertTrue(isinstance(context.exception, ValueError)) + + df.create_numeric('num', 'int32').data.write([i for i in range(10)]) + df.create_numeric('num2', 'int64').data.write([i for i in range(10)]) + df.create_timestamp('ts1').data.write([1632234128 + i for i in range(20)]) + + with self.assertRaises(Exception) as context: + df.describe(exclude=float('3.14159')) + self.assertTrue(isinstance(context.exception, ValueError)) + + with self.assertRaises(Exception) as context: + df.describe(exclude=['num', 'num2', 'ts1']) + self.assertTrue(isinstance(context.exception, ValueError)) \ No newline at end of file diff --git a/tests/test_fields.py b/tests/test_fields.py index c6709c62..4f380d99 100644 --- a/tests/test_fields.py +++ b/tests/test_fields.py @@ -674,8 +674,9 @@ def test_categorical_apply_filter(self): def test_timestamp_apply_filter(self): from datetime import datetime as D - data = [D(2020, 1, 1), D(2021, 5, 18), D(2950, 8, 17), D(1840, 10, 11), - D(2110, 11, 1), D(2002, 3, 3), D(2018, 2, 28), D(2400, 9, 1)] + from datetime import timezone + data = [D(2020, 1, 1, tzinfo=timezone.utc), D(2021, 5, 18, tzinfo=timezone.utc), D(2950, 8, 17, tzinfo=timezone.utc), D(1840, 10, 11, tzinfo=timezone.utc), + D(2110, 11, 1, tzinfo=timezone.utc), D(2002, 3, 3, tzinfo=timezone.utc), D(2018, 2, 28, tzinfo=timezone.utc), D(2400, 9, 1, tzinfo=timezone.utc)] data = np.asarray([d.timestamp() for d in data], dtype=np.float64) filt = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=bool) expected = data[filt].tolist() @@ -911,8 +912,9 @@ def test_categorical_apply_index(self): def test_timestamp_apply_index(self): from datetime import datetime as D - data = [D(2020, 1, 1), D(2021, 5, 18), D(2950, 8, 17), D(1840, 10, 11), - D(2110, 11, 1), D(2002, 3, 3), D(2018, 2, 28), D(2400, 9, 1)] + from datetime import timezone + data = [D(2020, 1, 1, tzinfo=timezone.utc), D(2021, 5, 18, tzinfo=timezone.utc), D(2950, 8, 17, tzinfo=timezone.utc), D(1840, 10, 11, tzinfo=timezone.utc), + D(2110, 11, 1, tzinfo=timezone.utc), D(2002, 3, 3, tzinfo=timezone.utc), D(2018, 2, 28, tzinfo=timezone.utc), D(2400, 9, 1, tzinfo=timezone.utc)] data = np.asarray([d.timestamp() for d in data], dtype=np.float64) indices = np.array([7, 0, 6, 1, 5, 2, 4, 3], dtype=np.int32) expected = data[indices].tolist() @@ -1069,8 +1071,9 @@ def 
test_categorical_apply_spans(self): def test_timestamp_apply_spans(self): spans = np.array([0, 2, 3, 6, 8], dtype=np.int32) from datetime import datetime as D - src_data = [D(2020, 1, 1), D(2021, 5, 18), D(2950, 8, 17), D(1840, 10, 11), - D(2021, 1, 1), D(2022, 5, 18), D(2951, 8, 17), D(1841, 10, 11)] + from datetime import timezone + src_data = [D(2020, 1, 1, tzinfo=timezone.utc), D(2021, 5, 1, tzinfo=timezone.utc), D(2950, 8, 17, tzinfo=timezone.utc), D(1840, 10, 11, tzinfo=timezone.utc), + D(2021, 1, 1, tzinfo=timezone.utc), D(2022, 5, 18, tzinfo=timezone.utc), D(2951, 8, 17, tzinfo=timezone.utc), D(1841, 10, 11, tzinfo=timezone.utc)] src_data = np.asarray([d.timestamp() for d in src_data], dtype=np.float64) expected = src_data[[0, 2, 3, 6]].tolist() @@ -1175,7 +1178,8 @@ def test_categorical_field_create_like(self): def test_timestamp_field_create_like(self): from datetime import datetime as D - data = [D(2020, 1, 1), D(2021, 5, 18), D(2950, 8, 17), D(1840, 10, 11)] + from datetime import timezone + data = [D(2020, 1, 1, tzinfo=timezone.utc), D(2021, 5, 18, tzinfo=timezone.utc), D(2950, 8, 17, tzinfo=timezone.utc), D(1840, 10, 11, tzinfo=timezone.utc)] data = np.asarray([d.timestamp() for d in data], dtype=np.float64) bio = BytesIO() @@ -1262,7 +1266,8 @@ def test_categorical_field_create_like(self): def test_timestamp_field_create_like(self): from datetime import datetime as D - data = [D(2020, 1, 1), D(2021, 5, 18), D(2950, 8, 17), D(1840, 10, 11)] + from datetime import timezone + data = [D(2020, 1, 1, tzinfo=timezone.utc), D(2021, 5, 18, tzinfo=timezone.utc), D(2950, 8, 17, tzinfo=timezone.utc), D(1840, 10, 11, tzinfo=timezone.utc)] data = np.asarray([d.timestamp() for d in data], dtype=np.float64) bio = BytesIO() diff --git a/tests/test_importer.py b/tests/test_importer.py index 9a340691..fd4140f3 100644 --- a/tests/test_importer.py +++ b/tests/test_importer.py @@ -169,10 +169,10 @@ def test_importer_date(self): importer.import_with_schema(s, self.ts, self.ds_name, bio, self.schema, self.files, False, {}, {}, chunk_row_size=self.chunk_row_size) ds = s.get_dataset(self.ds_name) df = ds.get_dataframe('schema_key') - self.assertEqual(df['birthday'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d").timestamp() for x in expected_birthday_date]) + self.assertEqual(df['birthday'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp() for x in expected_birthday_date]) with h5py.File(bio, 'r') as hf: - self.assertEqual(hf['schema_key']['birthday']['values'][:].tolist(), [datetime.strptime(x, "%Y-%m-%d").timestamp() for x in expected_birthday_date]) + self.assertEqual(hf['schema_key']['birthday']['values'][:].tolist(), [datetime.strptime(x, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp() for x in expected_birthday_date]) def test_importer_datetime_with_create_day_field(self): @@ -184,12 +184,12 @@ def test_importer_datetime_with_create_day_field(self): importer.import_with_schema(s, self.ts, self.ds_name, bio, self.schema, self.files, False, {}, {}, chunk_row_size=self.chunk_row_size) ds = s.get_dataset(self.ds_name) df = ds.get_dataframe('schema_key') - self.assertEqual(df['updated_at'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp() for x in expected_updated_at_list]) + self.assertEqual(df['updated_at'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc).timestamp() for x in expected_updated_at_list]) self.assertEqual(df['updated_at_day'].data[:].tolist(), 
expected_updated_at_date_list ) with h5py.File(bio, 'r') as hf: print(hf['schema_key']['updated_at']['values'][:]) - self.assertAlmostEqual(hf['schema_key']['updated_at']['values'][:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp() for x in expected_updated_at_list]) + self.assertAlmostEqual(hf['schema_key']['updated_at']['values'][:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc).timestamp() for x in expected_updated_at_list]) self.assertEqual(hf['schema_key']['updated_at_day']['values'][:].tolist(), expected_updated_at_date_list) @@ -344,10 +344,11 @@ def test_categorical_field_importer_with_small_chunk_size(self): ds = s.get_dataset(self.ds_name) df = ds.get_dataframe('schema_key') self.assertEqual(df['postcode'].data[:].tolist(), expected_postcode_value_list) + self.assertEqual(list(df['postcode'].keys.values()), expected_key_names) with h5py.File(bio, 'r') as hf: self.assertEqual(hf['schema_key']['postcode']['values'][:].tolist(), expected_postcode_value_list) - self.assertEqual(hf['schema_key']['postcode']['key_names'][:].tolist(), expected_key_names) + #self.assertEqual(hf['schema_key']['postcode']['key_names'][:].tolist(), expected_key_names) self.assertEqual(hf['schema_key']['postcode']['key_values'][:].tolist(), expected_key_values) diff --git a/tests/test_parsers.py b/tests/test_parsers.py index d2dff732..5201f23d 100644 --- a/tests/test_parsers.py +++ b/tests/test_parsers.py @@ -191,7 +191,7 @@ def test_read_csv_only_datetime_field(self): expected_updated_at_list = ['2020-05-12 07:00:00', '2020-05-13 01:00:00', '2020-05-14 03:00:00', '2020-05-15 03:00:00', '2020-05-16 03:00:00'] expected_updated_at_date_list = [b'2020-05-12', b'2020-05-13', b'2020-05-14',b'2020-05-15',b'2020-05-16'] - self.assertEqual(df['updated_at'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp() for x in expected_updated_at_list]) + self.assertEqual(df['updated_at'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc).timestamp() for x in expected_updated_at_list]) self.assertEqual(df['updated_at_day'].data[:].tolist(),expected_updated_at_date_list ) @@ -204,7 +204,7 @@ def test_read_csv_only_date_field(self): parsers.read_csv(self.csv_file_name, df, self.schema_dict, include=['birthday']) expected_birthday_date = [b'1990-01-01', b'1980-03-04', b'1970-04-05', b'1960-04-05', b'1950-04-05'] - self.assertEqual(df['birthday'].data[:].tolist(), [datetime.strptime(x.decode(), "%Y-%m-%d").timestamp() for x in expected_birthday_date]) + self.assertEqual(df['birthday'].data[:].tolist(), [datetime.strptime(x.decode(), "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp() for x in expected_birthday_date]) self.assertEqual(df['birthday_day'].data[:].tolist(), expected_birthday_date) @@ -231,7 +231,8 @@ def test_read_csv_with_schema_missing_field(self): missing_schema_dict = {'name': String()} parsers.read_csv(self.csv_file_name, df, missing_schema_dict) self.assertListEqual(df['id'].data[:], ['1','2','3','4','5']) - self.assertEqual(df['updated_at'].data[:],['2020-05-12 07:00:00', '2020-05-13 01:00:00', '2020-05-14 03:00:00', '2020-05-15 03:00:00', '2020-05-16 03:00:00']) + self.assertEqual([i.replace('\r', '') for i in df['updated_at'].data[:]], # remove \r due to windows + ['2020-05-12 07:00:00', '2020-05-13 01:00:00', '2020-05-14 03:00:00', '2020-05-15 03:00:00', '2020-05-16 03:00:00']) self.assertEqual(df['birthday'].data[:], ['1990-01-01', '1980-03-04', '1970-04-05', '1960-04-05', '1950-04-05']) 
self.assertEqual(df['postcode'].data[:], ['NW1', 'SW1P', 'E1', '', 'NW3']) diff --git a/tests/test_persistence.py b/tests/test_persistence.py index faf22d5e..65bd623d 100644 --- a/tests/test_persistence.py +++ b/tests/test_persistence.py @@ -540,7 +540,7 @@ def test_categorical_field_writer_from_reader(self): '', 'True', 'False', 'False', '', '', 'True', 'False', 'True', '', '', 'True', 'False', 'False', ''] value_map = {'': 0, 'False': 1, 'True': 2} - rw.CategoricalImporter(datastore, hf, 'foo', value_map, ts).write(values) + rw.CategoricalImporter(datastore, hf, 'foo', value_map, ts).write_strings(values) reader = datastore.get_reader(hf['foo']) writer = reader.get_writer(hf, 'foo2', ts) @@ -1042,7 +1042,7 @@ def filter_framework(name, raw_indices, raw_values, the_filter, expected): with h5py.File(bio, 'w') as hf: rw.IndexedStringWriter(datastore, hf, 'foo', ts).write(values) - raw_indices = hf['foo']['index'][:] + raw_indices = np.array(hf['foo']['index'][:], dtype=np.int64) raw_values = hf['foo']['values'][:] even_filter = np.zeros(len(values), bool) @@ -1098,7 +1098,7 @@ def index_framework(name, raw_indices, raw_values, the_indices, expected): with h5py.File(bio, 'w') as hf: rw.IndexedStringWriter(datastore, hf, 'foo', ts).write(values) - raw_indices = hf['foo']['index'][:] + raw_indices = np.array(hf['foo']['index'][:], dtype=np.int64) raw_values = hf['foo']['values'][:] even_indices = np.arange(0, len(values), 2) diff --git a/tests/test_session.py b/tests/test_session.py index 1120325f..463b6d63 100644 --- a/tests/test_session.py +++ b/tests/test_session.py @@ -74,8 +74,9 @@ def test_create_then_load_numeric(self): def test_create_then_load_timestamp(self): from datetime import datetime as D + from datetime import timezone bio = BytesIO() - contents = [D(2021, 2, 6), D(2020, 11, 5), D(2974, 8, 1), D(1873, 12, 28)] + contents = [D(2021, 2, 6, tzinfo=timezone.utc), D(2020, 11, 5, tzinfo=timezone.utc), D(2974, 8, 1, tzinfo=timezone.utc), D(1873, 12, 28, tzinfo=timezone.utc)] contents = [c.timestamp() for c in contents] with session.Session() as s: @@ -1010,11 +1011,11 @@ def test_write_then_read_numeric(self): a = fields.NumericField(s, hf['a'], None, write_enabled=True) a.data.write(values) - total = np.sum(a.data[:]) + total = np.sum(a.data[:], dtype=np.int64) self.assertEqual(49997540637149, total) a.data[:] = a.data[:] * 2 - total = np.sum(a.data[:]) + total = np.sum(a.data[:], dtype=np.int64) self.assertEqual(99995081274298, total) def test_write_then_read_categorical(self): @@ -1160,7 +1161,7 @@ def test_numeric_importer(self): def test_date_importer(self): - from datetime import datetime + from datetime import datetime, timezone bio = BytesIO() with session.Session() as s: dst = s.open_dataset(bio,'r+', 'dst') @@ -1171,4 +1172,4 @@ def test_date_importer(self): foo.import_part(indices, values, offsets, 0, written_row_count) expected_date_list = ['2020-05-10', '2020-05-12', '2020-05-12', '2020-05-15'] - self.assertListEqual(hf['foo'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d").timestamp() for x in expected_date_list]) + self.assertListEqual(hf['foo'].data[:].tolist(), [datetime.strptime(x, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp() for x in expected_date_list]) diff --git a/tests/test_utils.py b/tests/test_utils.py index 44cc7f2a..ac807007 100644 --- a/tests/test_utils.py +++ b/tests/test_utils.py @@ -15,7 +15,6 @@ from exetera.core.utils import find_longest_sequence_of, to_escaped, bytearray_to_escaped, get_min_max - class TestUtils(unittest.TestCase): def 
test_find_longest_sequence_of(self): @@ -100,4 +99,5 @@ def test_get_min_max_for_permitted_types(self): for value_type in permitted_numeric_types: (min_value, max_value) = get_min_max(value_type) self.assertEqual(min_value, expected_min_max_values[value_type][0]) - self.assertEqual(max_value, expected_min_max_values[value_type][1]) \ No newline at end of file + self.assertEqual(max_value, expected_min_max_values[value_type][1]) +
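
Reviewer note: a minimal usage sketch of the DataFrame.describe API added in this patch, distilled from the TestDataFrameDescribe cases above; the dataset and field names here are illustrative only, not part of the patch.

    from io import BytesIO
    import numpy as np
    from exetera.core import session

    bio = BytesIO()
    with session.Session() as s:
        ds = s.open_dataset(bio, 'w', 'ds')    # throwaway in-memory dataset for the example
        df = ds.create_dataframe('df')
        df.create_numeric('num', 'int32').data.write([i for i in range(10)])
        df.create_timestamp('ts1').data.write([1632234128 + i for i in range(10)])

        stats = df.describe()                  # default: numeric & timestamp fields only
        stats = df.describe(include='all')     # every field in the dataframe
        stats = df.describe(include=np.int32)  # select fields by dtype
        stats = df.describe(exclude=['ts1'])   # drop fields by name

describe() prints the table five fields at a time and returns the statistics as a dict of lists keyed by statistic name; conflicting include/exclude combinations raise ValueError, as exercised in the tests above.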