dataset adding new items and saving #660

Open
noknok00 opened this issue Feb 9, 2022 · 1 comment
Comments

@noknok00

noknok00 commented Feb 9, 2022

Hi,
For the last few days I have been learning to use Datumaro, but I have a few questions that I believe are not covered in the documentation.
I am working on a personal project similar to CVAT (labelling images and training an object detection model) for self-learning purposes. Specifically, I am trying to integrate an "Active Learning" loop.

I don't fully understand how to integrate Datumaro into my project.
My idea is to let the user upload images during training (after each training loop), which means that the dataset changes while my application is in use.

I have been using a Datumaro project (adding sources) and Datumaro datasets (datumaro format), but I am unable to "persist" the changes.

I thought that by adding a "source" to a project, it would automatically "update" the dataset every time I reload the project.
For example, if I am using a "VOC" source directory, I expected that a change in the source directory (for example, adding a new image) would be reflected in the dataset the next time I reload the project.

I am also not sure how to "reload" a source; the only way I found was to delete the source and create a new one.

Now, my intention is to manipulate (add, remove, update) the dataset in memory (by modifying the dataset variable).
The only way I found to persist this was calling "dataset.save()", but I feel this is not the right way since (if I understand correctly) it overwrites (deletes and recreates) the original files, rather than just applying the changes.

I am not sure what the right term is, but if this were a database, calling the "save()" method would feel like dropping the table and creating it again. My expectation is that only the current dataset file is modified; in the case of the "datumaro format", that the JSON file is updated.

The workflow would be like:

  1. create an empty dataset (first-time use)
  2. the user uploads 10 images
  3. the user adds annotations to 5 images
  4. the training loop runs
  5. -- the user restarts the application -- << here is my question (for example, the user continues working on a different day)
  6. the dataset is reloaded
  7. the user uploads 15 more images
  8. the user annotates 5 more images

While the application is running, the changes are kept in memory. The problem comes when the application is restarted.
So, every time the user makes a change to the dataset (adds new items, modifies an item's annotations), do I have to call the "save()" method of the dataset class to persist the changes?

I tried the "commit" method of the Project class, but the changes in the dataset are not saved.

I am a little bit lost here. Can someone point me in the right direction?

Thanks.

@zhiltsov-max
Contributor

zhiltsov-max commented Feb 10, 2022

Hi, thank you for coming to us! I'll try to answer your questions. Basically, a "project" currently represents a repository and a "build tree" for one or more datasets (called "sources") stored on disk in the project directory. It is described in detail here.
For the repository part, you can commit changes and navigate over revisions, like in Git.
For the "build tree" part, you can more or less consider each source as a database. This should work as you expect for sources in the working directory (accessed in the project by just a source name, without a revision or stage).
If you want to load a dataset using the API, you can use the following:

from datumaro.components.project import Project

with Project('project/dir/') as project:
    # Load the working-tree version of the source as a dataset
    dataset = project.working_tree.make_dataset('source_name')
    # Work with the dataset, e.g.:
    # dataset.put(item)
    # dataset.remove(id)
    # dataset.get(id)

    # Then you can update it using save() or export().
    # Depending on the format, the command will only update what was changed:
    # add, replace or remove some images and related annotation files
    dataset.save(save_images=True)

All these operations can be done without a project as well:

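from datumaro.components.dataset import Dataset

# Create an empty dataset with the label categories you need
dataset = Dataset.from_iterable([], categories=['cat', 'dog', ...])
# The first save writes the dataset to the given location
dataset.save('path/')  # or export('path/', 'datumaro')
# modify the dataset, then save again to the same location
dataset.save()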

You can also use the bind() dataset method to set the dataset location. On saving and exporting, we check whether the location and format are the same as the bound ones to distinguish between patching and full saving.
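
To make the patching versus full-saving distinction more concrete, here is a minimal toy model in plain Python. It is not Datumaro code and does not reflect Datumaro's internals: `ToyDataset`, the `"toy-json"` format name, and the one-JSON-file-per-item layout are all assumptions made up for illustration. It only shows the general idea of remembering where a dataset was last written and, when saving to that same place, touching only the files that changed.

```python
import json
from pathlib import Path


class ToyDataset:
    """Toy stand-in (not Datumaro) for a dataset that patches files in place."""

    def __init__(self):
        self.items = {}        # item id -> annotation dict
        self._dirty = set()    # ids added or changed since the last save
        self._removed = set()  # ids removed since the last save
        self._bound_to = None  # (resolved path, format) of the last save/load

    @classmethod
    def load(cls, path, fmt="toy-json"):
        """Reload a saved dataset, e.g. after an application restart."""
        ds = cls()
        root = Path(path).resolve()
        for f in sorted(root.glob("*.json")):
            ds.items[f.stem] = json.loads(f.read_text())
        ds._bound_to = (root, fmt)
        return ds

    def put(self, item_id, annotations):
        self.items[item_id] = annotations
        self._dirty.add(item_id)
        self._removed.discard(item_id)

    def remove(self, item_id):
        if item_id in self.items:
            del self.items[item_id]
            self._dirty.discard(item_id)
            self._removed.add(item_id)

    def save(self, path=None, fmt="toy-json"):
        """Write the dataset and return "patch" or "full" to show what ran."""
        if path is None:
            if self._bound_to is None:
                raise ValueError("dataset is not bound to a location yet")
            target = self._bound_to
        else:
            target = (Path(path).resolve(), fmt)
        root = target[0]

        if target == self._bound_to:
            # Same location and format: only touch files that changed.
            mode = "patch"
            for item_id in self._removed:
                (root / f"{item_id}.json").unlink(missing_ok=True)
            to_write = self._dirty
        else:
            # New location or format: rewrite everything from scratch.
            mode = "full"
            root.mkdir(parents=True, exist_ok=True)
            for old in root.glob("*.json"):
                old.unlink()
            to_write = set(self.items)

        for item_id in to_write:
            (root / f"{item_id}.json").write_text(json.dumps(self.items[item_id]))

        self._bound_to = target
        self._dirty.clear()
        self._removed.clear()
        return mode
```

In this toy model, the first `save(path)` is a full write, later `save()` calls only patch, and a dataset restored with `load(path)` after a restart is already bound, so its next `save()` patches as well.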

You can find more info in Dataset API docs and examples in tests.

> I thought that by adding a "source" to a project, it would automatically "update" the dataset every time I reload the project.
> For example, if I am using a "VOC" source directory, I expected that a change in the source directory (for example, adding a new image) would be reflected in the dataset the next time I reload the project.
>
> I am also not sure how to "reload" a source; the only way I found was to delete the source and create a new one.

Basically, it should work like this. Maybe, if you made the changes manually, you only added the image to the image directory but not to a subset image list (like this)? "Reloading" in this case means loading the dataset from the project:

Project('path/').working_tree.make_dataset('source_name')
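
If a missing subset list entry is the suspected cause, a quick check along these lines may help. It is a minimal sketch that uses only the standard library and assumes the usual Pascal VOC layout (`JPEGImages/` for images, `ImageSets/Main/*.txt` for subset lists, with the image id as the first token on each line). The function name `find_unlisted_images` is hypothetical and not part of Datumaro.

```python
from pathlib import Path


def find_unlisted_images(voc_root):
    """Return ids of images in JPEGImages/ that no ImageSets/Main/*.txt lists."""
    root = Path(voc_root)
    lists_dir = root / "ImageSets" / "Main"
    images_dir = root / "JPEGImages"

    # Collect every image id mentioned in any subset list file.
    listed = set()
    if lists_dir.is_dir():
        for list_file in lists_dir.glob("*.txt"):
            for line in list_file.read_text().splitlines():
                parts = line.split()
                if parts:
                    # Class-specific lists look like "<id> <flag>"; keep the id.
                    listed.add(parts[0])

    # Collect the ids (file names without extension) of images on disk.
    on_disk = set()
    if images_dir.is_dir():
        on_disk = {p.stem for p in images_dir.iterdir() if p.is_file()}

    return sorted(on_disk - listed)
```

Any id returned here is an image that exists on disk but is not referenced by a subset list, which would be consistent with it not showing up after reloading the source.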

> I tried the "commit" method of the Project class, but the changes in the dataset are not saved.

This method works with project files, but updates to the project tree need to be saved with Project.save() first.
