Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add StructTransform base class for PartitionKey and SortKey. add SortOrderComparators #7798

Merged
merged 3 commits into from
Nov 28, 2023

Conversation

stevenzwu
Copy link
Contributor

@stevenzwu stevenzwu commented Jun 7, 2023


/**
* A struct of partition values.
*
* <p>Instances of this class can produce partition values from a data row passed to {@link
* #partition(StructLike)}.
*/
public class PartitionKey implements StructLike, Serializable {
public class PartitionKey extends StructTransform {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of code has been moved to the non-public StructTransform class so that it can be reused for the new SortKey class

}

private static class SortOrderComparator implements Comparator<StructLike> {
private final SortKey leftKey;
Copy link
Contributor Author

@stevenzwu stevenzwu Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

included SortOrderComparator impl in this PR to demonstrate the usage of SortKey

@stevenzwu stevenzwu force-pushed the sort-key-comparator branch 2 times, most recently from 3970b55 to f93b012 Compare June 10, 2023 19:01
@stevenzwu stevenzwu force-pushed the sort-key-comparator branch from f93b012 to 55123d4 Compare June 23, 2023 17:13
@stevenzwu stevenzwu force-pushed the sort-key-comparator branch from 55123d4 to c824402 Compare August 15, 2023 05:34
@pvary
Copy link
Contributor

pvary commented Sep 13, 2023

@stevenzwu: Extracting the partition key from the row is on the hot path of writes. Based on the code I do not see any obvious issues, but it might worth to check the performance implications of this change?

WDYT?

@stevenzwu
Copy link
Contributor Author

@stevenzwu: Extracting the partition key from the row is on the hot path of writes. Based on the code I do not see any obvious issues, but it might worth to check the performance implications of this change?

@pvary there is no change in the logic. simple refactor with a common base class extracted. so I don't expect performance implications

@pvary
Copy link
Contributor

pvary commented Sep 20, 2023

@stevenzwu: Extracting the partition key from the row is on the hot path of writes. Based on the code I do not see any obvious issues, but it might worth to check the performance implications of this change?

@pvary there is no change in the logic. simple refactor with a common base class extracted. so I don't expect performance implications

Previous code had 1 less redirection/inheritance. Optimized loops can be a thing in this type of code paths, where the compiler could deduce that the different iterations of the loops are independent, and could be executed in batch. Adding more redirection/inheritance could break the optimization algorithm and instead of a parallel loop we can end up executing the loop sequentially. This could cause performance degradation. This is the only effect which might effect other users of the classes touched by this change, otherwise I think the change keeps the old behaviour correctly and introduces a new one, which we need.

import java.util.stream.Collectors;

/**
* A struct of flattened sort field values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand flattened in this context

Copy link
Contributor Author

@stevenzwu stevenzwu Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's assume the table schema is like

Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.StringType.get()),
            Types.NestedField.optional(
                2,
                "location",
                Types.StructType.of(
                    Types.NestedField.required(11, "lat", Types.FloatType.get()),
                    Types.NestedField.required(12, "long", Types.FloatType.get()))));

if the SortKey is applied on id and lat, the StructTransform only produces a flattened tuple. nested structure is not maintained anymore.

  private final Object[] transformedTuple;

/**
* A struct of flattened transformed values.
*
* <p>Instances of this class can produce sort values from a data row passed to {@link
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, can produce "transformed values"?

@stevenzwu stevenzwu force-pushed the sort-key-comparator branch from 78606fc to b671d27 Compare October 23, 2023 16:33
this.sortOrder = toCopy.sortOrder;
}

public SortKey copy() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the old method but we aren't actually copying here. Schema and Sort Order are not copies, just noting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right it is not deep copy of everything. but it is a still copy/clone implementation of the SortKey that we deemed correct.

Transform<?, ?> transform = fieldTransforms.get(i).transform();
Accessor<StructLike> accessor = schema.accessorForField(sourceFieldId);
Preconditions.checkArgument(
accessor != null, "Cannot build accessor for field: " + schema.findField(sourceFieldId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: checkArgs has a "condition, string, args" version so you don't have to do a string concat


/**
* Simple POJO for source field id and transform function. {@code Pair} class is not usable here
* in API module, as it has Avro dep and lives in core module.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as it has a Avro dep and is in the core module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx. will update to as it has an Avro dep and is in the core module

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@stevenzwu stevenzwu merged commit b21a8ce into apache:main Nov 28, 2023
46 checks passed
lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants