Kartothek 4.0.0 (2021-03-17)
This is a major release of kartothek with breaking API changes.
- Removal of complex user input (see gh427)
- Removal of multi table feature
- Removal of [kartothek.io.merge]{.title-ref} module
- Class `~kartothek.core.dataset.DatasetMetadata`{.interpreted-text
  role="class"} now has an attribute called [schema]{.title-ref} which
  replaces the previous attribute [table_meta]{.title-ref} and returns
  only a single schema
- All outputs which previously returned a sequence of dictionaries
  where each key-value pair would correspond to a table-data pair now
  return only one `pandas.DataFrame`{.interpreted-text role="class"}
- All read pipelines will now automatically infer the table to read
  such that it is no longer necessary to provide [table]{.title-ref}
  or [table_name]{.title-ref} as an input argument
- All writing pipelines which previously supported a complex user
  input type now expose an argument [table_name]{.title-ref} which can
  be used to continue usage of legacy datasets (i.e. datasets with an
  intrinsic, non-trivial table name). This usage is discouraged and we
  recommend that users migrate to a default table name (i.e. leave it
  None / [table]{.title-ref})
- All pipelines which previously accepted an argument
  [tables]{.title-ref} to select the subset of tables to load no
  longer accept this keyword. Instead, the table to load will be
  inferred
- Trying to read a multi-tabled dataset will now cause an exception
  telling users that this is no longer supported with kartothek 4.0
- The dict schema for
  `~kartothek.core.dataset.DatasetMetadataBase.to_dict`{.interpreted-text
  role="meth"} and
  `~kartothek.core.dataset.DatasetMetadata.from_dict`{.interpreted-text
  role="meth"} changed, replacing a dictionary in
  [table_meta]{.title-ref} with the simple [schema]{.title-ref}
- All pipeline arguments which previously accepted a dictionary of
  sequences to describe a table specific subset of columns now accept
  plain sequences (e.g. [columns]{.title-ref},
  [categoricals]{.title-ref})
- Remove the following list of deprecated arguments for io pipelines
    - label_filter
    - central_partition_metadata
    - load_dynamic_metadata
    - load_dataset_metadata
    - concat_partitions_on_primary_index
- Remove [output_dataset_uuid]{.title-ref} and
  [df_serializer]{.title-ref} from
  `kartothek.io.eager.commit_dataset`{.interpreted-text role="func"}
  since these arguments didn't have any effect
- Remove [metadata]{.title-ref}, [df_serializer]{.title-ref},
  [overwrite]{.title-ref}, [metadata_merger]{.title-ref} from
  `kartothek.io.eager.write_single_partition`{.interpreted-text
  role="func"}
- `~kartothek.io.eager.store_dataframes_as_dataset`{.interpreted-text
  role="func"} now requires a list as an input
- Default value for argument [date_as_object]{.title-ref} is now
  universally set to `True`. The behaviour for [False]{.title-ref}
  will be deprecated and removed in the next major release
- No longer allow passing [delete_scope]{.title-ref} as a delayed
  object to
  `~kartothek.io.dask.dataframe.update_dataset_from_ddf`{.interpreted-text
  role="func"}
- `~kartothek.io.dask.dataframe.update_dataset_from_ddf`{.interpreted-text
  role="func"} and
  `~kartothek.io.dask.dataframe.store_dataset_from_ddf`{.interpreted-text
  role="func"} now return a [dd.core.Scalar]{.title-ref} object. This
  enables all [dask.DataFrame]{.title-ref} graph optimizations by
  default.
- Remove argument [table_name]{.title-ref} from
  `~kartothek.io.dask.dataframe.collect_dataset_metadata`{.interpreted-text
  role="func"}
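
The argument changes for read pipelines listed above can be applied mechanically to existing call sites. The following is a minimal sketch and not part of kartothek: `migrate_read_kwargs`, `REMOVED_KWARGS`, `INFERRED_KWARGS` and `PER_TABLE_KWARGS` are hypothetical names introduced only for illustration, and the sketch assumes the legacy keyword arguments of a read call have been collected in a plain dictionary. It drops the keywords these notes describe as removed or inferred, and flattens a single-table dictionary of sequences into a plain sequence.

```python
from collections.abc import Mapping
from typing import Any, Dict

# Deprecated io pipeline arguments that the 4.0.0 notes list as removed.
REMOVED_KWARGS = frozenset({
    "label_filter",
    "central_partition_metadata",
    "load_dynamic_metadata",
    "load_dataset_metadata",
    "concat_partitions_on_primary_index",
})
# Table selectors that read pipelines now infer (assumption: safe to drop).
INFERRED_KWARGS = frozenset({"table", "table_name", "tables"})
# Arguments that used to accept {table_name: [column, ...]}.
PER_TABLE_KWARGS = frozenset({"columns", "categoricals"})


def migrate_read_kwargs(legacy: Mapping) -> Dict[str, Any]:
    """Hypothetical helper: translate pre-4.0 read kwargs to the 4.0 style."""
    migrated: Dict[str, Any] = {}
    for key, value in legacy.items():
        if key in REMOVED_KWARGS or key in INFERRED_KWARGS:
            continue  # no longer accepted, or inferred by the pipeline
        if key in PER_TABLE_KWARGS and isinstance(value, Mapping):
            if len(value) != 1:
                # Multi-table datasets cannot be expressed in 4.0.
                raise ValueError(
                    f"{key!r} refers to {len(value)} tables; "
                    "only single-table datasets can be migrated"
                )
            (per_table,) = value.values()
            value = list(per_table)  # plain sequence instead of a dict
        migrated[key] = value
    return migrated


legacy_call = {
    "dataset_uuid": "my_dataset",
    "tables": ["table"],
    "columns": {"table": ["a", "b"]},
    "load_dataset_metadata": False,
}
print(migrate_read_kwargs(legacy_call))
# {'dataset_uuid': 'my_dataset', 'columns': ['a', 'b']}
```

Raising on more than one table mirrors the note that multi-tabled datasets are no longer supported. Write pipelines are deliberately left out, since they still expose [table_name]{.title-ref} for legacy datasets.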