Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Ensure that partition stats files are considered for GC procedures #9284

Merged
merged 3 commits into from
Jan 18, 2024

Conversation

ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Dec 12, 2023

Expire snapshots and remove orphan files action/procedure should also consider partition stats files.

Fixes #9336

*/
@Deprecated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't use the Predicate<StatisticsFile> for partition stats. Hence, deprecated and introduced snapshot id based input like other methods in this class (manifest list, manifest etc)

@@ -475,18 +474,9 @@ public void testExpireSnapshotsWithStatisticFiles() throws Exception {
Assertions.assertThat(output.get(0)[5]).as("should be 1 deleted statistics file").isEqualTo(1L);

table.refresh();
List<StatisticsFile> statsWithSnapshotId1 =
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are checking containsExactly at line 482, no need of these checks.

@ajantha-bhat ajantha-bhat changed the title Spark-3.5: Ensure that partition stats files are considered for GC procedures Spark 3.5: Ensure that partition stats files are considered for GC procedures Dec 12, 2023
@@ -148,12 +148,49 @@ public static List<String> statisticsFilesLocations(Table table) {
* @param table table for which statistics files needs to be listed
* @param predicate predicate for filtering the statistics files
* @return the location of statistics files
* @deprecated use the {@code statisticsFilesLocationsForSnapshots(table, snapshotIds)} instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a note on when it was deprecated (1.5.0) and when it will be removed (1.6.0)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this already in previous commit.

* statistics files for all the snapshots will be returned.
* @return the location of statistics files
*/
public static List<String> statisticsFilesLocationsForSnapshots(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about calling it statisticsFilesLocations to match manifestListLocations that also accepts a set of snapshot IDs?

Copy link
Member Author

@ajantha-bhat ajantha-bhat Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecated method already has a name statisticsFilesLocations. So, If I keep the same name, compiler says ambiguous error when I call statisticsFilesLocationsForSnapshots(table, null)

Hence, I had to keep different name.

@aokolnychyi
Copy link
Contributor

@ajantha-bhat, this looks mostly good to me. Can we add the tests to all Spark versions to be safe?

@ajantha-bhat
Copy link
Member Author

@ajantha-bhat, this looks mostly good to me. Can we add the tests to all Spark versions to be safe?

Sure. I didn't initially add to other spark version thinking if I get any comments for test, I have to rework in all the spark version. So, I thought of adding tests in a single follow up PR for other spark versions.

Now that we don't have any specific comment on test, I can add it in this PR itself.

@ajantha-bhat ajantha-bhat changed the title Spark 3.5: Ensure that partition stats files are considered for GC procedures Spark: Ensure that partition stats files are considered for GC procedures Dec 20, 2023
@ajantha-bhat
Copy link
Member Author

@aokolnychyi: Thanks for the review.
Addressed all the comments except #9284 (comment) (with a reasoning).

Build passed and PR is ready.

@ajantha-bhat
Copy link
Member Author

ping @aokolnychyi

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, I had one optional comment. Let me know what you want to do there, @ajantha-bhat.

@ajantha-bhat
Copy link
Member Author

It looks good to me, I had one optional comment. Let me know what you want to do there, @ajantha-bhat.

@aokolnychyi : Are you referring to #9284 (comment) comment? I have replied why I can't change the name due to compile time ambiguity of same signature.
Other than that I don't see any more open comment. So, please let me know.

@aokolnychyi
Copy link
Contributor

Oops, I probably forgot to post the comment. I was thinking about statisticsFileLocations to match the naming of other methods like metadataFileLocations (Files -> File). It is optional, though.

@ajantha-bhat
Copy link
Member Author

Oops, I probably forgot to post the comment. I was thinking about statisticsFileLocations to match the naming of other methods like metadataFileLocations (Files -> File). It is optional, though.

statisticsFileLocations is an existing public method (before this PR). So, we can't rename it directly. We have to deprecate and introduce new one. I think we can leave it as it as.

The new method statisticsFilesLocationsForSnapshots can be renamed but it will become inconsistent with existing statisticsFileLocations.

@aokolnychyi
Copy link
Contributor

Yep, that's fair. Let's keep it as is then.

@aokolnychyi aokolnychyi merged commit 02836ea into apache:main Jan 18, 2024
41 checks passed
@aokolnychyi
Copy link
Contributor

Thanks, @ajantha-bhat!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure that partition stats files are considered for GC procedures
2 participants