Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Refactor Cleanup Task Handler #516

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

danielhumanmod
Copy link
Contributor

@danielhumanmod danielhumanmod commented Dec 10, 2024

Description

This PR is created to refactor ManifestFileCleanupTaskHandler.java. The key changes include:

  • Extract a base class for cleanup task handler FileCleanupTaskHandler.java
  • Create a generic cleanup task handler for batch files BatchFileCleanupTaskHandler.java
  • Remove coupled code in ManifestFileCleanupTaskHandler.java

Fixes #515

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • TableCleanupTaskHandlerTest.java
  • BatchFileCleanupTaskHandlerTest.java
  • ManifestFileCleanupTaskHandlerTest.java

Test Configuration:

  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

.atWarn()
.addKeyValue("batchFiles", batchFiles.toString())
.addKeyValue("tableId", tableId)
.log("File batch cleanup task scheduled, but the none of the file in batch exists");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.log("File batch cleanup task scheduled, but the none of the file in batch exists");
.log("File batch cleanup task scheduled, but none of the files in batch exists");


@Override
public boolean canHandleTask(TaskEntity task) {
throw new UnsupportedOperationException("This method must be implemented by subclasses.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use an abstract class then?

return LOGGER;
}

public static final class BatchFileCleanupTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a record?

allDeletes.join();
} catch (Exception e) {
LOGGER.error("Exception detected during batch files deletion", e);
return false;
Copy link
Contributor

@adutra adutra Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imho this a bit too coarse-grained. The likelihood of having one failed delete will grow bigger with the number of files to delete. Shouldn't we return a more meaningful result, e.g. a statistics object reporting how many files were deleted, and how many failed? Not saying you should change this now though, because returning a boolean is imposed by TaskHandler API, so we need to look into improving the API itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s a great point! Since this is the first time we’re introducing a batch files handler, it would be helpful to discuss how to refine the TaskHandler API for future scenarios. @collado-mike, do you have any suggestions or thoughts on this?

@danielhumanmod
Copy link
Contributor Author

Thank you for the valuable feedback, @adutra ! Your comments were super helpful in improving the code quality. I’ve incorporated your suggestions and updated the code in the latest commit.

@Override
public abstract boolean handleTask(TaskEntity task);

public abstract Logger getLogger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is imho not necessary, using this class' LOGGER is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is imho not necessary, using this class' LOGGER is enough.

Thanks for reminding, already fixed it!

// Schedule the deletion for each file asynchronously
List<CompletableFuture<Void>> deleteFutures =
validFiles.stream()
.map(file -> super.tryDelete(tableId, authorizedFileIO, null, file, null, 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'd have expected that this class actually leverages batched deletes of the underlying object store, but seems it does not.

List<String> batchFiles = cleanupTask.batchFiles();
try (FileIO authorizedFileIO = fileIOSupplier.apply(task)) {
List<String> validFiles =
batchFiles.stream().filter(file -> TaskUtils.exists(file, authorizedFileIO)).toList();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in line 69 are exists-checks performed for each individual file, which is inefficient (slow) and also incurs unnecessary cost for cloud object storages.

I'd prefer an approach that
a) leverages object-storage batch requests
b) does not perform the same operation more than once

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE REQUEST] Refactor Cleanup Task Handler
3 participants