dataset adding new items and saving #660

noknok00 · 2022-02-09T22:29:50Z

Hi,
Over the last few days I have been learning to use Datumaro, but I have a few questions that I believe are not covered in the documentation.
I am working on a personal project similar to CVAT (labelling images and training an object detection model) for self-learning purposes; specifically, I am trying to integrate an "Active Learning" loop.

I don't fully understand how to integrate Datumaro into my project.
My idea is to allow the user to upload images during training (after each training loop), which means that the dataset changes while my application is in use.

I have been using a Datumaro project (adding sources) and Datumaro datasets (datumaro format), but I am unable to "persist" the changes.

I thought that by adding a "source" to a project, it would automatically "update" the dataset every time I reload the project.
For example, if I am using a "VOC" source directory, I was expecting that a change made in the source directory (for example, adding a new image) would be reflected in the dataset the next time I reload the project.

Also, I am not sure how to "reload" a source; the only way I found was to delete the source and create a new one.

Now, my intention is to manipulate (add, remove, update) the dataset in memory (by modifying the dataset variable).
The only way I found to persist it was by calling "dataset.save()", but I feel this is not the right way since (if I understand correctly) it overwrites (deletes and recreates) the original files, rather than just "updating" the changes.

I am not sure what the right term is. If this were a database, it feels like calling the "save()" method drops the table and creates it again. My expectation is that it just modifies the current dataset file; in the case of the "datumaro format", that it updates the JSON file.

The workflow would be like this:

create an empty dataset (first-time use)
the user uploads 10 images
the user adds annotations to 5 images
a training loop runs
-- the user restarts the application -- << here is my question (for example, the user continues working on a different day)
the dataset is reloaded
the user uploads 15 more images
the user annotates 5 more images

While the application is running, the changes are kept in memory. The problem comes when the application is restarted.
So, every time the user makes a change to the dataset (adds new items, modifies an item's annotation), do I have to call the "save()" method of the dataset class (to persist the changes)?

I tried the "commit" method of the Project class, but the changes in the dataset are not saved.

I am a little bit lost here; can someone point me in the right direction?

Thanks.

zhiltsov-max · 2022-02-10T08:30:07Z

Hi, thank you for coming to us! I'll try to answer your questions. Basically, a "project" currently represents a repository and a "build tree" for one or more datasets (called "sources") stored on disk in the project directory. It is described in detail here.
For the repository part, you can commit changes and navigate over revisions, like in Git.
For the "build tree" part, you can more or less consider each source as a database. This should work as you expect for sources in the working directory (accessed by just a source name - without a revision or stage - in the project).
If you want to load a dataset using the API, you can use the following:

from datumaro.components.project import Project

with Project('project/dir/') as project:
  dataset = project.working_tree.make_dataset('source_name')
  # do stuff with the dataset, eg.
  # dataset.put(item)
  # dataset.remove(id)
  # dataset.get(id)

  # then you can update it using save() or export()
  # Depending on the format, the command will only update what was changed - 
  # add, replace or remove some images and related annotation files
  dataset.save(save_images=True)
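
On the question of whether save() has to be called after every single change: one option is to collect a user's edits and save once per batch. The following is only a sketch under assumptions. `apply_changes` and its parameters are hypothetical names, not part of Datumaro, and it assumes the dataset object exposes `put()`, `remove()` and `save()` and is already tied to a location (loaded from a project source, or saved to a directory before).

```python
def apply_changes(dataset, new_items=(), removed_ids=(), save_images=True):
    """Apply a batch of edits to an in-memory dataset, then persist them once.

    `dataset` is assumed to behave like a Datumaro Dataset that already
    knows where it is stored, so save() needs no path argument.
    """
    for item in new_items:
        # Assumption: put() adds the item, or replaces one with the same id
        dataset.put(item)

    for item_id in removed_ids:
        # Assumption: the items to remove live in the default subset
        dataset.remove(item_id)

    # One save per batch instead of one save per edit
    dataset.save(save_images=save_images)
```

With the project-based loading above, this would be called inside the `with Project(...)` block, on the dataset returned by make_dataset().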

All these operations can be done without a project as well:

from datumaro.components.dataset import Dataset

dataset = Dataset.from_iterable([], categories=['cat', 'dog', ...])
dataset.save('path/') # or export('path/', 'datumaro')
# modify dataset
dataset.save()

You can also use the dataset's bind() method to set the dataset location. On saving and exporting, we check whether the target location and format are the same as the ones the dataset is bound to, to decide between patching (updating in place) and a full save.
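
For the restart workflow from the question, the project-less variant could be wrapped in a small "load or create" helper. This is a minimal sketch, not a definitive implementation: `open_or_create` is a hypothetical name, the non-empty-directory check is an assumption about how to detect an earlier save, and it relies on `Dataset.load()` reading the datumaro format and remembering the location it loaded from.

```python
import os


def open_or_create(path, labels):
    """Return the dataset stored at `path`, creating an empty one on first use."""
    # Imported lazily so the application can start without touching Datumaro
    from datumaro.components.dataset import Dataset

    # Assumption: a non-empty directory means a dataset was saved here before
    if os.path.isdir(path) and os.listdir(path):
        # Loading keeps the location, so a later save() needs no path
        return Dataset.load(path)

    # First use: start empty and save once so the location is known
    dataset = Dataset.from_iterable([], categories=labels)
    dataset.save(path, save_images=True)
    return dataset
```

After a restart, the application would call `open_or_create('path/', ['cat', 'dog'])` again, modify the returned dataset, and call `dataset.save(save_images=True)` without a path.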

You can find more info in Dataset API docs and examples in tests.

> I thought that by adding a "source" to a project, it would automatically "update" the dataset every time I reload the project.
> For example, if I am using a "VOC" source directory, I was expecting that a change made in the source directory (for example, adding a new image) would be reflected in the dataset the next time I reload the project.
>
> Also, I am not sure how to "reload" a source; the only way I found was to delete the source and create a new one.

Basically, it should work like this. If you made the changes manually, maybe you only added the image to the image directory, but not to a subset image list (like this)? "Reloading" in this case means loading the dataset from the project again:

Project('path/').working_tree.make_dataset('source_name')
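
To check whether a manually added image is missing from the subset lists, something like the following could help. It is a sketch that assumes the standard Pascal VOC layout (images in `JPEGImages/`, subset lists in `ImageSets/Main/*.txt`, one image id per line, optionally followed by flags) and a flat image directory; `find_unlisted_images` and `IMAGE_EXTS` are hypothetical names, not Datumaro functions.

```python
from pathlib import Path

# Assumption: these extensions cover the images used in the dataset
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def find_unlisted_images(voc_root, task_dir="Main"):
    """Return sorted ids of images in JPEGImages/ that no subset list mentions."""
    root = Path(voc_root)

    listed = set()
    for list_file in (root / "ImageSets" / task_dir).glob("*.txt"):
        for line in list_file.read_text().splitlines():
            parts = line.split()
            if parts:
                # The first token is the image id; later tokens are optional flags
                listed.add(parts[0])

    # Image ids are the file names without extension
    images = {
        p.stem
        for p in (root / "JPEGImages").iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    }
    return sorted(images - listed)
```

Any id it returns is present on disk but belongs to no subset, which would explain why the image does not show up after loading the source.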

> I tried the "commit" method of the Project class, but the changes in the dataset are not saved.

This method works with project files, but updates to the project tree need to be saved with Project.save() first.
