[DAR-3846][External] Prevent upload of dataset items where the {item_path}/{item_name} already exists in the dataset #937

Merged on Oct 10, 2024 (1 commit)

Conversation

JBWilkie
Collaborator

@JBWilkie JBWilkie commented Oct 9, 2024

Problem

Recently, I discovered that it's possible to add slots to existing items. Uploading multi-file items with push was built on the assumption that this was impossible. Because we name slots differently depending on the merge mode, this has led to the following behaviour:

1: Upload some files as one merge mode --> No error
2: Upload the same files as a different merge mode --> No error
3: Upload the same files as the same merge mode as step 2 again --> You get an error about skipping files

We've decided that this type of scenario should be blocked in darwin-py, as most users would expect deduplication validation to take place at the item-name level.

Solution

Add a function to the UploadHandler constructor that runs the following before beginning the upload:

  • 1: Gets the full list of remote file paths from the target dataset
  • 2: Checks each planned full remote path against this list. If a path matches, that file is removed from the files to be uploaded and a warning is printed to the console
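
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `PlannedUpload` and `skip_existing_items` are hypothetical names, and the list of existing remote paths is assumed to have already been fetched from the target dataset.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PlannedUpload:
    """Hypothetical stand-in for a local file queued for upload."""

    item_path: str  # remote folder, e.g. "/" or "/scans"
    item_name: str  # remote item name, e.g. "image.png"

    @property
    def full_remote_path(self) -> str:
        # Join folder and name with exactly one slash between them.
        return f"{self.item_path.rstrip('/')}/{self.item_name}"


def skip_existing_items(
    planned: Iterable[PlannedUpload],
    existing_remote_paths: Iterable[str],
) -> Tuple[List[PlannedUpload], List[PlannedUpload]]:
    """Split planned uploads into (to_upload, skipped).

    `existing_remote_paths` is assumed to be the full list of
    {item_path}/{item_name} values already present in the dataset.
    """
    existing = set(existing_remote_paths)  # set for O(1) membership checks
    to_upload: List[PlannedUpload] = []
    skipped: List[PlannedUpload] = []
    for upload in planned:
        if upload.full_remote_path in existing:
            # Warn and drop the file rather than failing the whole upload.
            print(f"Warning: skipping {upload.full_remote_path}: "
                  "an item with this path already exists in the dataset")
            skipped.append(upload)
        else:
            to_upload.append(upload)
    return to_upload, skipped
```

Running the check once before the upload begins, rather than per file during the upload, means conflicting items are filtered out regardless of which merge mode was used to create them.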

Changelog

Prevent upload of dataset items where the {item_path}/{item_name} already exists in the dataset

@JBWilkie JBWilkie merged commit acb371d into master Oct 10, 2024
24 checks passed