Context
Decoding on accelerators is expected to transfer the bulk of the data directly to the accelerator and only the metadata to the CPU. The CPU then has to schedule the decoding, preferably so that all columns are handled in a single host-device round trip.
For this, the CPU must be able to determine, from the metadata alone, the location of the column streams (and chunks/pages), their compression/encoding, and their decompressed/decoded sizes.
Additionally, if we transfer a large stripe (i.e. a row group in Parquet terms) to the device, we may decode it in multiple batches. This means that the decoded size must also be knowable for fractions of the stripe.
Metadata
In what follows, stripe means a Parquet row group, and stream means a contiguous range of bytes in the file that holds one column's data within the stripe; the Parquet term is column chunk.
This implies the following additional metadata:
Encoding: all of a stream's headers are stored in order, without the payload data, in a separate area for each stream. The headers indicate the type of encoding or compression, the byte size of the encoded data, and the value count.
As an example, consider a stream whose headers describe null flags for 8000 values, followed by the 7000 non-null values encoded as a compressed dictionary (alphabet) of 130 strings plus 7000 one-byte indices into the alphabet.
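This example stream can be sketched as a sequence of header records. The value counts (8000 null flags, 130 dictionary entries, 7000 indices) come from the text above; the byte sizes and field names here are hypothetical, for illustration only.

```python
# Illustrative header records for the example stream. Each header carries
# enough for the host to size every decode buffer without touching payload
# bytes. Byte sizes and encoding names are made up for this sketch.
headers = [
    {"encoding": "null_flags",    "encoded_bytes": 1000, "value_count": 8000},
    {"encoding": "zstd_dict",     "encoded_bytes": 800,  "value_count": 130,
     "decoded_bytes": 2000},  # decompressed size taken from the zstd header
    {"encoding": "dict_index_u8", "encoded_bytes": 7000, "value_count": 7000},
]

def stream_encoded_bytes(headers):
    """Total on-disk payload size of the stream, computed from headers alone."""
    return sum(h["encoded_bytes"] for h in headers)
```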
The zstd header supplies the decompressed size of the alphabet. This information is enough both to allocate the decode buffers and to schedule the work: for example, building scatter indices from the null flags, decompressing the alphabet, and decompressing the 8-bit indices into 32-bit indices can run as three parallel operations. Once both the null flags and the indices have been decoded, the indices are written out, scattered according to the nulls.
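The scatter-based dictionary decode can be sketched in host-side Python, with plain functions standing in for device kernels (the function names are ours, not from any particular implementation):

```python
def scatter_positions(null_flags):
    """Output row index for each non-null value, derived from the null flags."""
    return [i for i, present in enumerate(null_flags) if present]

def decode_dict(null_flags, indices, alphabet):
    """Scatter dictionary lookups into an output with one slot per row.

    Rows whose null flag is unset stay None; non-null rows receive
    alphabet[index], written at the position the scatter indices dictate.
    """
    out = [None] * len(null_flags)
    for pos, idx in zip(scatter_positions(null_flags), indices):
        out[pos] = alphabet[idx]
    return out
```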
The specific order and the parallel or sequential execution of these steps depend on the implementation and the data sizes. The crucial point is that these decisions can be made from the metadata alone.
The above does not provide a space estimate for decoding a fraction of the stripe. Consider a 1 GB stripe where almost all the data is in one map column whose constituent vectors hold 100M keys and values, with individual maps varying between 1 and 40000 keys. The total decoded size is visible, but the size of the first 1000 rows cannot be known without decoding the lengths. For this use case we recommend storing extra metadata for columns with repeated data: the decoded size at regular intervals of top-level rows, e.g. every 10% of them. For a stripe of 10K top-level rows, we would record an intermediate size for the first 1K, 2K, ..., 9K top-level rows.
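Assuming the intermediate sizes are stored as cumulative decoded byte counts at each boundary (a hypothetical layout; the proposal does not fix one), the memory budget for a batch covering the first k top-level rows can be bounded by rounding k up to the next boundary:

```python
# row_marks[i] holds the cumulative decoded size of the first (i + 1) * step
# top-level rows; total_size is the decoded size of the whole stripe.
def batch_size_upper_bound(row_marks, step, total_size, k):
    """Upper bound on the decoded size of the first k top-level rows."""
    i = (k + step - 1) // step  # number of boundaries needed to cover k rows
    if i == 0:
        return 0
    if i > len(row_marks):
        return total_size  # past the last recorded boundary: use the total
    return row_marks[i - 1]
```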
This is recorded for repeated-type columns and their repeated-type children. For a flat map or a non-top-level struct, it is recorded for the member columns as well.
With the above, we can schedule the decoding of all columns, as well as the follow-up processing, in a single message to the device, with no synchronous round trips to the host for allocating device memory.
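As a sketch of that single-message plan (the record layout here is hypothetical), the host can lay out one device arena covering every stream's decoded output and emit one op list, all from decoded sizes known up front:

```python
# From decoded sizes alone, compute an offset for each stream's output in a
# single device arena and bundle the decode ops into one plan, so allocation
# needs no device round trip.
def plan_decode(streams):
    """streams: list of dicts with 'name' and 'decoded_bytes'."""
    ops, offset = [], 0
    for s in streams:
        ops.append({"op": "decode", "stream": s["name"],
                    "out_offset": offset, "out_bytes": s["decoded_bytes"]})
        offset += s["decoded_bytes"]
    return {"arena_bytes": offset, "ops": ops}
```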
from @oerling. Cc: @helfman @zzhao0