Changelog

Kartothek 4.0.1 (2021-04-XX)

  • Fixed dataset corruption after updates when table names other than "table" are used (#445).

Kartothek 4.0.0 (2021-03-17)

This is a major release of kartothek with breaking API changes.

Version 3.20.0 (2021-03-15)

This will be the final release in the 3.X series. Please ensure your existing codebase does not raise any DeprecationWarning from kartothek and migrate your import paths ahead of time to the new :mod:`kartothek.api` modules to ensure a smooth migration to 4.X.
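One way to check a codebase for remaining deprecation warnings is to escalate them to errors with the standard library `warnings` module. The following is a minimal sketch; `legacy_call` is a hypothetical stand-in for any deprecated call and is not part of kartothek.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a deprecated API; not a kartothek function.
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def raises_deprecation(func):
    """Return True if calling ``func`` emits a DeprecationWarning."""
    with warnings.catch_warnings():
        # Escalate DeprecationWarning to an exception inside this block only.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning:
            return True
    return False


if __name__ == "__main__":
    print(raises_deprecation(legacy_call))
```

The same escalation can be applied to a whole interpreter run with `python -W error::DeprecationWarning`.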

Version 3.19.1 (2021-02-24)

Version 3.19.0 (2021-02-12)

Version 3.18.0 (2021-01-25)

  • Add cube.suppress_index_on to switch off the default index creation for dimension columns
  • Fixed the import issue of the zstd module for kartothek.core._zmsgpack.
  • Fix a bug in kartothek.io_components.read.dispatch_metapartitions_from_factory where dispatch_by=[] was treated like dispatch_by=None and therefore did not merge all dataset partitions into a single partition.
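The distinction between an empty list and None in the last entry can be illustrated with a small, self-contained sketch. The `dispatch` function below is hypothetical and is not kartothek's implementation; it only assumes that None means "no grouping" and that an empty list means "group by nothing", which puts every partition into one group.

```python
from collections import defaultdict


def dispatch(partitions, dispatch_by=None):
    """Group partition dicts by the columns named in ``dispatch_by``.

    Illustrative only; this is not kartothek's implementation.
    """
    # Test for None explicitly: ``if not dispatch_by`` would also catch [],
    # which is exactly the kind of confusion described above.
    if dispatch_by is None:
        return [[p] for p in partitions]
    groups = defaultdict(list)
    for p in partitions:
        # With dispatch_by == [] the key is always (), so all partitions
        # end up in a single group.
        key = tuple(p[col] for col in dispatch_by)
        groups[key].append(p)
    return list(groups.values())


if __name__ == "__main__":
    parts = [
        {"country": "DE", "year": 2020},
        {"country": "DE", "year": 2021},
        {"country": "US", "year": 2020},
    ]
    print(len(dispatch(parts)), len(dispatch(parts, [])))
```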

Version 3.17.3 (2020-12-04)

  • Allow pyarrow==2 as a dependency.

Version 3.17.2 (2020-12-01)

  • #378 Improve logging information for potential buffer serialization errors

Version 3.17.1 (2020-11-24)

Bugfixes

  • Fix GitHub #375 by loosening checks of the supplied store argument

Version 3.17.0 (2020-11-23)

Improvements

Bugfixes

Version 3.16.0 (2020-09-29)

New functionality

  • Allow filtering of nans using "==", "!=" and "in" operators
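NaN needs special handling in filters because, under IEEE 754 rules, NaN never compares equal to itself. The sketch below shows one way NaN-aware "==", "!=" and "in" comparisons could be written for single values; the helper names are hypothetical and this is not kartothek's implementation.

```python
import math


def _is_nan(value):
    # Only floats can be NaN here; other types are never treated as NaN.
    return isinstance(value, float) and math.isnan(value)


def matches(value, op, operand):
    """NaN-aware evaluation of "==", "!=" and "in" for a single value."""
    if op == "==":
        return (_is_nan(value) and _is_nan(operand)) or value == operand
    if op == "!=":
        return not matches(value, "==", operand)
    if op == "in":
        return any(matches(value, "==", item) for item in operand)
    raise ValueError(f"unsupported operator: {op}")


if __name__ == "__main__":
    nan = float("nan")
    print([matches(v, "==", nan) for v in [1.0, nan, 3.0]])
```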

Bugfixes

  • Fix a regression which would not allow the usage of non-serializable stores even when using factories

Version 3.15.1 (2020-09-28)

  • Fix a packaging issue where typing_extensions was not properly specified as a requirement for python versions below 3.8

Version 3.15.0 (2020-09-28)

New functionality

Improvements

  • Reduce memory consumption during index write.
  • Allow simplekv stores and storefact URLs to be passed explicitly as input for the store arguments

Version 3.14.0 (2020-08-27)

New functionality

  • Add hash_dataset functionality

Improvements

  • Expand pandas version pin to include 1.1.X
  • Expand pyarrow version pin to include 1.x
  • Large addition to documentation for multi dataset handling (Kartothek Cubes)

Version 3.13.1 (2020-08-04)

  • Fix evaluation of "OR"-connected predicates (#295)
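As a rough illustration of what "OR"-connected predicates mean, the sketch below evaluates predicates given as a list of lists of (column, operator, value) tuples in disjunctive normal form: tuples in an inner list are combined with AND, and the inner lists are combined with OR. The `row_matches` helper is hypothetical and operates on plain dicts; it is not kartothek's implementation.

```python
import operator

# Supported comparison operators, mapped to plain Python callables.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, operand: value in operand,
}


def row_matches(row, predicates):
    """Evaluate DNF predicates: OR over the outer list, AND over each inner list."""
    return any(
        all(_OPS[op](row[col], operand) for col, op, operand in conjunction)
        for conjunction in predicates
    )


if __name__ == "__main__":
    rows = [{"a": 1, "b": "x"}, {"a": 5, "b": "y"}, {"a": 9, "b": "x"}]
    # (a < 3) OR (a > 4 AND b == "x")
    preds = [[("a", "<", 3)], [("a", ">", 4), ("b", "==", "x")]]
    print([r["a"] for r in rows if row_matches(r, preds)])
```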

Version 3.13.0 (2020-07-30)

Improvements

  • Update timestamp-related code in the Ktk Discover Cube functionality.
  • Support backward compatibility with old cubes and fix the CLI entry point.

Version 3.12.0 (2020-07-23)

New functionality

  • Introduction of cube functionality, which is built on multiple Kartothek datasets.
  • Basic Features - Extend, Query, Remove(Partitions), Delete (can delete entire datasets/cube), API, CLI, Core and IO features.
  • Advanced Features - Multi-Dataset with Single Table, Explicit physical Partitions, Seed based join system.

Version 3.11.0 (2020-07-15)

New functionality

Bug fixes

  • Performance of dataset update with delete_scope significantly improved for datasets with many partitions (#308)

Version 3.10.0 (2020-07-02)

Improvements

  • Dispatch performance improved for large datasets including metadata
  • Introduction of the dispatch_metadata kwarg to metapartitions read pipelines to allow for a transition to a future breaking release.

Bug fixes

Breaking changes in io_components.read

  • The dispatch_metapartitions and dispatch_metapartitions_from_factory will no longer attach index and metadata information to the created MP instances, unless explicitly requested.

Version 3.9.0 (2020-06-03)

Improvements

Version 3.8.2 (2020-04-09)

Improvements

  • Read performance improved, especially for partitioned datasets and queries with empty payload columns.

Bug fixes

  • GH262: Raise an exception when trying to partition on a column with null values to prevent silent data loss
  • Fix multiple index creation issues (cutting data, crashing) for uint data
  • Fix index update issues for some types resulting in TypeError: Trying to update an index with different types... messages.
  • Fix issues where index creation with empty partitions can lead to ValueError: Trying to create non-typesafe index
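The null-value check from GH262 can be pictured with a short sketch. Group-by style partitioning tends to skip rows whose key is null (pandas `groupby`, for example, drops null keys by default), so failing early is safer than writing an incomplete dataset. The `check_partition_values` helper below is hypothetical, works on plain dicts, and is not kartothek's implementation.

```python
import math


def _is_null(value):
    # Treat None and float NaN as null values.
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_partition_values(rows, partition_on):
    """Raise ValueError if any partition column holds a null value."""
    for position, row in enumerate(rows):
        for col in partition_on:
            if _is_null(row.get(col)):
                raise ValueError(
                    f"Row {position} has a null value in partition column {col!r}"
                )


if __name__ == "__main__":
    rows = [{"country": "DE", "value": 1}, {"country": None, "value": 2}]
    try:
        check_partition_values(rows, ["country"])
    except ValueError as exc:
        print(exc)
```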

Version 3.8.1 (2020-03-20)

Improvements

  • Only fix column ordering when restoring DataFrame if the ordering is incorrect.

Bug fixes

  • GH248 Fix an issue causing a ValueError to be raised when using dask_index_on on non-integer columns
  • GH255 Fix an issue causing the python interpreter to shut down when reading an empty file (see also https://issues.apache.org/jira/browse/ARROW-8142)

Version 3.8.0 (2020-03-12)

Improvements

  • Add keyword argument dask_index_on which reconstructs a dask index from a kartothek index when loading the dataset
  • Add method :func:`~kartothek.core.index.IndexBase.observed_values` which returns an array of all observed values of the index column
  • Updated and improved documentation w.r.t. guides and API documentation

Bug fixes

  • GH227 Fix a Type error when loading categorical data in dask without specifying it explicitly
  • No longer trigger the SettingWithCopyWarning when using bucketing
  • GH228 Fix an issue where empty header creation from a pyarrow schema would not normalize the schema which causes schema violations during update.
  • Fix an issue where :func:`~kartothek.io.eager.create_empty_dataset_header` would not accept a store factory.

Version 3.7.0 (2020-02-12)

Improvements

Version 3.6.2 (2019-12-17)

Improvements

Bug fixes

Version 3.6.1 (2019-12-11)

Bug fixes

  • Fix a regression introduced in 3.5.0 where predicates which allow multiple values for a field would generate duplicates

Version 3.6.0 (2019-12-03)

New functionality

Bug fixes

  • Fix addition of bogus index columns to Parquet files when using sort_partitions_by.
  • Fix bug where partition_on in write path drops empty DataFrames and can lead to datasets without tables.

Version 3.5.1 (2019-10-25)

Version 3.5.0 (2019-10-21)

New functionality

Bug fixes

  • Input to normalize_args is properly normalized to list
  • MetaPartition.load_dataframes now raises if table in columns argument doesn't exist
  • Require urlquote>=1.1.0 (where urlquote.quoting was introduced)
  • Improve performance for some cases where predicates are used with the in operator.
  • Correctly preserve :class:`~kartothek.core.index.ExplicitSecondaryIndex` dtype when index is empty
  • Fixed DeprecationWarning in pandas CategoricalDtype
  • Fixed broken docstring for store_dataframes_as_dataset
  • Internal operations no longer perform schema validations. This will improve performance for batched partition operations (e.g. partition_on) but will defer the validation in case of inconsistencies to the final commit. Exception messages will be less verbose in these cases than before.
  • Fix an issue where an empty dataframe of a partition in a multi-table dataset would raise a schema validation exception
  • Fix an issue where the dispatch_by keyword would disable partition pruning
  • Creating a dataset with non-existing columns as explicit index now raises a ValueError

Breaking changes

  • Remove support for pyarrow < 0.13.0
  • Move the docs module from io_components to core

Version 3.4.0 (2019-09-17)

  • Add support for pyarrow 0.14.1
  • Use urlquote for faster quoting/unquoting

Version 3.3.0 (2019-08-15)

  • Fix rejection of bool predicates in :func:`~kartothek.serialization.filter_array_like` when bool columns contain None
  • Streamline behavior of store_dataset_from_ddf when passing empty ddf.
  • Fix an issue where a segmentation fault may be raised when comparing MetaPartition instances
  • Expose a date_as_object flag in kartothek.core.index.as_flat_series

Version 3.2.0 (2019-07-25)

Version 3.1.1 (2019-07-12)

Version 3.1.0 (2019-07-10)

Breaking:

Version 3.0.0 (2019-05-02)

  • Initial public release