
Create dataset loader for SleukRith Set #206

Closed
SamuelCahyawijaya opened this issue Dec 26, 2023 · 9 comments · Fixed by #556
Assignees
Labels
help wanted (Extra attention is needed), pr-ready (A PR that closes this issue is Ready to be reviewed)

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: sleukrith_ocr/sleukrith_ocr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sleukrith_ocr

Dataset sleukrith_ocr
Description SleukRith Set is the first dataset specifically created for Khmer palm leaf manuscripts. The dataset consists of annotated data from 657 pages of digitized palm leaf manuscripts which are selected arbitrarily from a large collection of existing and also recently digitized images. The dataset contains three types of data: isolated characters, words, and lines. Each type of data is annotated with the ground truth information which is very useful for evaluating and serving as a training set for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting.
Subsets -
Languages khm
Tasks Optical Character Recognition
License Unknown (unknown)
Homepage https://github.com/donavaly/SleukRith-Set
HF URL -
Paper URL https://dl.acm.org/doi/10.1145/3151509.3151510
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Dec 26, 2023
@Gyyz Gyyz removed their assignment Dec 29, 2023
@Gyyz
Contributor

Gyyz commented Dec 29, 2023

No idea what kind of binary file this is. I tried several packages like pickle to load train/test, but failed.
I'm putting the Google Drive links here in case someone wants to create the dataloader:

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}

@holylovenia
Contributor

https://github.com/donavaly/SleukRith-Set

Hi @Gyyz, the data homepage specifies this data format. I'll copy it here for convenience.

data file
- first 4 bytes (integer): width
- next 4 bytes (integer): height
- next 4 bytes (integer): nb samples
- width*height bytes per image (1 byte per pixel)

label file
- first 4 bytes (integer): nb classes
- next 4 bytes (integer): nb samples
- 4 bytes (integer) per label (label value in [0, nb_class[)
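
A minimal Python sketch for reading that layout, assuming little-endian 32-bit integers (byte order is not stated on the homepage) and placeholder file names:

import struct

import numpy as np

def load_images(path):
    # Header: width, height, nb samples as 4-byte integers
    # (little-endian is an assumption; the homepage does not say).
    with open(path, "rb") as f:
        width, height, n_samples = struct.unpack("<3i", f.read(12))
        pixels = np.frombuffer(f.read(width * height * n_samples), dtype=np.uint8)
    return pixels.reshape(n_samples, height, width)

def load_labels(path):
    # Header: nb classes, nb samples; then one 4-byte integer per label.
    with open(path, "rb") as f:
        n_classes, n_samples = struct.unpack("<2i", f.read(8))
        labels = np.frombuffer(f.read(4 * n_samples), dtype="<i4")
    return n_classes, labels

images = load_images("train_data")              # placeholder filename
n_classes, labels = load_labels("train_label")  # placeholder filename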

@Gyyz
Contributor

Gyyz commented Jan 8, 2024

Yes, thanks. They specified the data layout, but not the file format, so I still have no idea how to load the data.

@sabilmakbar sabilmakbar added the help wanted Extra attention is needed label Jan 15, 2024
@ssun32 ssun32 removed their assignment Feb 2, 2024
@akhdanfadh
Collaborator

akhdanfadh commented Feb 15, 2024

Hi, I managed to load the data. There are 113,206 48×48 images with 111 labels.


Q1. Label Mapping

The labels are given as numbers from 0 to 110. Based on the unextracted data, these numbers appear to map to characters via the label attribute in the .xml files (see below). To get the full mapping dictionary, I would need to iterate over those .xml files, because it seems they do not provide one (a sketch follows the snippet below). Should I proceed with this, or just load the labels as is (numbers)?

<CharAnno>
    <Char id="0" label="" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        ...
    </Char>
    ...
</CharAnno>
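
A rough sketch of that iteration (the directory name is an assumption on my part):

import glob
import xml.etree.ElementTree as ET

# Collect every distinct label attribute across the annotation files;
# the directory name "anno" is a placeholder.
labels = set()
for xml_path in sorted(glob.glob("anno/*.xml")):
    root = ET.parse(xml_path).getroot()
    for char in root.iter("Char"):
        labels.add(char.get("label"))
print(len(labels), "distinct labels found")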

Q2. Google Drive Wrapper

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

For further discussion: the data is provided on Google Drive and is large. As you may know, downloading large (>100MB) files from that drive takes more than a simple curl command. By testing on this SleukRith dataset, I noticed that DownloadManager can only download small files (here is a code sample). The thing is, several implemented dataloaders in SEACrowd for GDrive links are just a simple DownloadManager.download_and_extract call, even though the files are large. I haven't tested those dataloaders myself, but just for a note here, they are indosum, paracotta_id, squad_id train data, and wikilingua. Also, the sentiment_natasha_review link is GDrive but it gives a 404 error. As for a solution, I found a trick for downloading this kind of data without a third-party library from here (sketched below).
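
For reference, a sketch of that trick using only requests (untested against these exact files, and Drive's interstitial behavior changes over time):

import requests

def download_gdrive(file_id, dest):
    # Large files trigger a virus-scan warning page instead of the file;
    # re-request with the confirmation token Drive sets in a cookie.
    url = "https://drive.google.com/uc?export=download"
    session = requests.Session()
    resp = session.get(url, params={"id": file_id}, stream=True)
    token = None
    for key, value in resp.cookies.items():
        if key.startswith("download_warning"):
            token = value
    if token is not None:
        resp = session.get(url, params={"id": file_id, "confirm": token}, stream=True)
    with open(dest, "wb") as f:
        for chunk in resp.iter_content(chunk_size=32768):
            if chunk:
                f.write(chunk)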

@sabilmakbar @holylovenia

@holylovenia
Contributor

Wow, amazing that you can at least load the data! Wait, let me come back to this tonight.

@akhdanfadh
Collaborator

Waiting for further instructions before GitHub marks this issue as stale. @sabilmakbar @holylovenia

@holylovenia
Contributor

Waiting for further instructions before GitHub marks this issue as stale. @sabilmakbar @holylovenia

Sorry I missed this. 🙏 @sabilmakbar, do you have any suggestions on the download method?

Alternatively, if downloading the dataset via the dataloader is too difficult, we can use _LOCAL = True. For _LOCAL = True dataloaders, it is usually up to the user where to store the data; later, the user can use the data_dir flag/parameter to load the data (see the sketch below).
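
For illustration, usage would look something like this (both the script path and the data location below are placeholders):

from datasets import load_dataset

# The user downloads the files manually, then points data_dir at them;
# both paths here are placeholders.
dataset = load_dataset(
    "path/to/sleukrith_ocr.py",
    data_dir="/path/to/manually/downloaded/sleukrith",
)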

@sabilmakbar
Collaborator

sabilmakbar commented Mar 18, 2024

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

For no. 2, may I know why exactly the data can't be put in the $HOME/.cache folder? In my case, I can successfully download it using datasets.DownloadManager() (I'm not sure about the data-loading step post-download though, so you might want to test that yourself). Moreover, the HF download directory is located on disk, not in memory (in case the name "cache" makes it sound like it is stored in memory).

If you're curious about the download size (after extraction via the download_and_extract method), you can take a look at the snippet below:

import os

from datasets import DownloadManager as dl_manager

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}

local_dl_path = dl_manager().download_and_extract(_URLS)

def get_size_in_bytes(start_path="."):
    # Works for both a single file and a directory tree.
    total_size = 0
    if not os.path.isdir(start_path):
        total_size = os.path.getsize(start_path)
    else:
        for dirpath, dirnames, filenames in os.walk(start_path):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                # skip if it is a symbolic link
                if not os.path.islink(fp):
                    total_size += os.path.getsize(fp)
    return total_size

sum_of_size = 0
for key, file in local_dl_path.items():
    size_per_dir = get_size_in_bytes(file)
    sum_of_size += size_per_dir
    print(f"total file size of {key}: {size_per_dir} byte(s)")

print(f"total file size: {sum_of_size} byte(s)")
[image: snippet output showing per-key and total file sizes]

Hope this answers your questions, and please let me know if I misunderstood them, @akhdanfadh (apologies for not replying sooner; a few months ago I turned off GitHub notifications in my email).

@akhdanfadh
Collaborator

akhdanfadh commented Mar 29, 2024

Update on my Q1. Label Mapping

It turns out that id in the .xml files does not correspond to the loaded label; it is just an identifier for each recognized character within a file. Thus there will be no mapping, since I haven't found any, and the dataloader will load the labels as is, i.e., integers from 0 to 110 with no character mapped to them.


For _LOCAL = True dataloaders, usually it's up to the user where to store the data. Later the user can use the data_dir flag/parameter to load the data.

@holylovenia Got it, that answers the question. I have two simple solutions for this Google Drive issue; which do you think I should go with? I am currently implementing the latter.

  1. Just use _LOCAL = True and let the user download the data manually
  2. Use the third-party library gdown (pip install gdown) and update requirements.py (sketch follows this list)
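
A minimal sketch of option 2, using one of the URLs from the _URLS dict above (the output filename is arbitrary):

import gdown

# gdown handles the large-file confirmation page that plain
# urllib/requests downloads trip over.
url = "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5"
gdown.download(url, output="train_data", quiet=False)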

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

@sabilmakbar What I meant by "them" there is the DownloadManager. I don't know how reliable downloading from Google Drive is with that function; see my result below. I even used the same URL format as yours.
[image: result of the download attempt]

@akhdanfadh akhdanfadh added the in-progress Assignee has given confirmation on progress and ETA label Mar 29, 2024
@akhdanfadh akhdanfadh added pr-ready A PR that closes this issue is Ready to be reviewed and removed in-progress Assignee has given confirmation on progress and ETA labels Mar 29, 2024
sabilmakbar pushed a commit that referenced this issue May 14, 2024
* init commit

* uncommenting unused features

* handle additional package using try-except

* modified requirements back to master

* modified constants.py back to master

* remove gitignore

* remove constants

* remove abstract and keywords

* remove mod variable name

* add labelling description

* add url comment