Context
Decoding on accelerators is expected to transfer the bulk of the data directly to the accelerator and only the metadata to the CPU. The CPU then has to schedule the decoding, preferably so that all columns are handled in a single host-device round trip.
For this, the CPU must be able to determine, from the metadata alone, the location of the column streams (and chunks/pages), their compression/encoding, and their decompressed/decoded sizes.
Additionally, if we transfer a large stripe (i.e. a row group in Parquet terms) to the device, we may decode it in multiple batches. This means that the decoded size must also be knowable for fractions of the stripe.
Metadata
In what follows, stripe means a Parquet row group, and stream means a contiguous range of bytes in the file that holds one column's data within the stripe; the Parquet term is column chunk.
This implies the following additional metadata:
Encoding: all of a stream's headers are stored in order, without the payload data, in a separate area for each stream. The headers indicate the type of encoding or compression, the byte size of the encoded data, and the value count.
As an example, consider a stream whose headers describe null flags for 8000 values, followed by the 7000 non-null values encoded as a compressed dictionary (alphabet) of 130 strings plus 7000 one-byte indices into the alphabet.
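This example stream can be sketched as a sequence of header records. The value counts (8000 null flags, 130 dictionary entries, 7000 indices) come from the text above; the byte sizes and field names here are hypothetical, for illustration only.

```python
# Illustrative header records for the example stream. Each header carries
# enough for the host to size every decode buffer without touching payload
# bytes. Byte sizes and encoding names are made up for this sketch.
headers = [
    {"encoding": "null_flags",    "encoded_bytes": 1000, "value_count": 8000},
    {"encoding": "zstd_dict",     "encoded_bytes": 800,  "value_count": 130,
     "decoded_bytes": 2000},  # decompressed size taken from the zstd header
    {"encoding": "dict_index_u8", "encoded_bytes": 7000, "value_count": 7000},
]

def stream_encoded_bytes(headers):
    """Total on-disk payload size of the stream, computed from headers alone."""
    return sum(h["encoded_bytes"] for h in headers)
```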
The zstd header supplies the decompressed size of the alphabet. This information is enough both to allocate the decode buffers and to schedule the work: for example, building scatter indices from the null flags, decompressing the alphabet, and decompressing the 8-bit indices into 32-bit indices can run as three parallel operations. Once both the null flags and the indices have been decoded, the indices are written out, scattered according to the nulls.
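The scatter-based dictionary decode can be sketched in host-side Python, with plain functions standing in for device kernels (the function names are ours, not from any particular implementation):

```python
def scatter_positions(null_flags):
    """Output row index for each non-null value, derived from the null flags."""
    return [i for i, present in enumerate(null_flags) if present]

def decode_dict(null_flags, indices, alphabet):
    """Scatter dictionary lookups into an output with one slot per row.

    Rows whose null flag is unset stay None; non-null rows receive
    alphabet[index], written at the position the scatter indices dictate.
    """
    out = [None] * len(null_flags)
    for pos, idx in zip(scatter_positions(null_flags), indices):
        out[pos] = alphabet[idx]
    return out
```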
The specific order and the parallel or sequential execution of these steps depend on the implementation and the data sizes. The crucial point is that these decisions can be made from the metadata alone.
The above does not provide a space estimate for decoding a fraction of the stripe. Consider a 1 GB stripe where almost all the data is in one map column whose constituent vectors hold 100M keys and values, with individual maps varying between 1 and 40000 keys. The total decoded size is visible, but the size of the first 1000 rows cannot be known without decoding the lengths. For this use case we recommend storing extra metadata for columns with repeated data: the decoded size at regular intervals of top-level rows, e.g. every 10% of them. For a stripe of 10K top-level rows, we would record an intermediate size for the first 1K, 2K, ..., 9K top-level rows.
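Assuming the intermediate sizes are stored as cumulative decoded byte counts at each boundary (a hypothetical layout; the proposal does not fix one), the memory budget for a batch covering the first k top-level rows can be bounded by rounding k up to the next boundary:

```python
# row_marks[i] holds the cumulative decoded size of the first (i + 1) * step
# top-level rows; total_size is the decoded size of the whole stripe.
def batch_size_upper_bound(row_marks, step, total_size, k):
    """Upper bound on the decoded size of the first k top-level rows."""
    i = (k + step - 1) // step  # number of boundaries needed to cover k rows
    if i == 0:
        return 0
    if i > len(row_marks):
        return total_size  # past the last recorded boundary: use the total
    return row_marks[i - 1]
```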
This is recorded for repeated-type columns and their repeated-type children. For a flat map or a non-top-level struct, it is recorded for the member columns as well.
With the above, we can schedule the decoding of all columns, as well as the follow-up processing, in a single message to the device, with no synchronous round trips to the host for allocating device memory.
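As a sketch of that single-message plan (the record layout here is hypothetical), the host can lay out one device arena covering every stream's decoded output and emit one op list, all from decoded sizes known up front:

```python
# From decoded sizes alone, compute an offset for each stream's output in a
# single device arena and bundle the decode ops into one plan, so allocation
# needs no device round trip.
def plan_decode(streams):
    """streams: list of dicts with 'name' and 'decoded_bytes'."""
    ops, offset = [], 0
    for s in streams:
        ops.append({"op": "decode", "stream": s["name"],
                    "out_offset": offset, "out_bytes": s["decoded_bytes"]})
        offset += s["decoded_bytes"]
    return {"arena_bytes": offset, "ops": ops}
```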
from @oerling. Cc: @helfman @zzhao0