
Create dataset loader for SleukRith Set #206

Closed
SamuelCahyawijaya opened this issue Dec 26, 2023 · 9 comments · Fixed by #556
Assignees
Labels
help wanted (Extra attention is needed), pr-ready (A PR that closes this issue is Ready to be reviewed)

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: sleukrith_ocr/sleukrith_ocr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sleukrith_ocr

Dataset sleukrith_ocr
Description SleukRith Set is the first dataset specifically created for Khmer palm leaf manuscripts. The dataset consists of annotated data from 657 pages of digitized palm leaf manuscripts which are selected arbitrarily from a large collection of existing and also recently digitized images. The dataset contains three types of data: isolated characters, words, and lines. Each type of data is annotated with the ground truth information which is very useful for evaluating and serving as a training set for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting.
Subsets -
Languages khm
Tasks Optical Character Recognition
License Unknown (unknown)
Homepage https://github.com/donavaly/SleukRith-Set
HF URL -
Paper URL https://dl.acm.org/doi/10.1145/3151509.3151510
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Dec 26, 2023
@Gyyz Gyyz removed their assignment Dec 29, 2023
@Gyyz
Contributor

Gyyz commented Dec 29, 2023

No idea what kind of binary file this is. I tried several packages like pickle to load train/test, but failed.
I'm putting the Google Drive links here in case someone wants to create the dataloader:

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}

@holylovenia
Contributor

https://github.com/donavaly/SleukRith-Set

Hi @Gyyz, the data homepage specifies this data format. I'll copy it here for convenience.

data file
- first 4 bytes (integer): width
- next 4 bytes (integer): height
- next 4 bytes (integer): nb samples
- width*height bytes per image (1 byte per pixel)

label file
- first 4 bytes (integer): nb classes
- next 4 bytes (integer): nb samples
- 4 bytes (integer) per label (label value in [0, nb_class[)
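
A minimal Python sketch for reading that layout, assuming little-endian 32-bit integers (byte order is not stated on the homepage) and placeholder file names:

import struct

import numpy as np

def load_images(path):
    # Header: width, height, nb samples as 4-byte integers
    # (little-endian is an assumption; the homepage does not say).
    with open(path, "rb") as f:
        width, height, n_samples = struct.unpack("<3i", f.read(12))
        pixels = np.frombuffer(f.read(width * height * n_samples), dtype=np.uint8)
    return pixels.reshape(n_samples, height, width)

def load_labels(path):
    # Header: nb classes, nb samples; then one 4-byte integer per label.
    with open(path, "rb") as f:
        n_classes, n_samples = struct.unpack("<2i", f.read(8))
        labels = np.frombuffer(f.read(4 * n_samples), dtype="<i4")
    return n_classes, labels

images = load_images("train_data")              # placeholder filename
n_classes, labels = load_labels("train_label")  # placeholder filename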

@Gyyz
Contributor

Gyyz commented Jan 8, 2024

Yes, thanks. They specified the data layout, but not the file format, so I still have no idea how to load the data.

@sabilmakbar sabilmakbar added the help wanted Extra attention is needed label Jan 15, 2024
@ssun32 ssun32 removed their assignment Feb 2, 2024
@akhdanfadh
Collaborator

akhdanfadh commented Feb 15, 2024

Hi, I managed to load the data. There are 113,206 48×48 images with 111 labels.


Q1. Label Mapping

The labels are given as numbers from 0 to 110. Based on the unextracted data, these numbers appear to map to characters via the label attribute in the .xml files (see below). To get the full mapping dictionary, I would need to iterate over those .xml files, because it seems they do not provide one (a sketch follows the snippet below). Should I proceed with this, or just load the labels as is (numbers)?

<CharAnno>
    <Char id="0" label="" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        ...
    </Char>
    ...
</CharAnno>
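
A rough sketch of that iteration (the directory name is an assumption on my part):

import glob
import xml.etree.ElementTree as ET

# Collect every distinct label attribute across the annotation files;
# the directory name "anno" is a placeholder.
labels = set()
for xml_path in sorted(glob.glob("anno/*.xml")):
    root = ET.parse(xml_path).getroot()
    for char in root.iter("Char"):
        labels.add(char.get("label"))
print(len(labels), "distinct labels found")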

Q2. Google Drive Wrapper

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

For further discussion: the data is provided on Google Drive and is large. As you may know, downloading large (>100MB) files from that drive takes more than a simple curl command. By testing on this SleukRith dataset, I noticed that DownloadManager can only download small files (here is a code sample). The thing is, several implemented dataloaders in SEACrowd for GDrive links are just a simple DownloadManager.download_and_extract call, even though the files are large. I haven't tested those dataloaders myself, but just for a note here, they are indosum, paracotta_id, squad_id train data, and wikilingua. Also, the sentiment_natasha_review link is GDrive but it gives a 404 error. As for a solution, I found a trick for downloading this kind of data without a third-party library from here (sketched below).
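
For reference, a sketch of that trick using only requests (untested against these exact files, and Drive's interstitial behavior changes over time):

import requests

def download_gdrive(file_id, dest):
    # Large files trigger a virus-scan warning page instead of the file;
    # re-request with the confirmation token Drive sets in a cookie.
    url = "https://drive.google.com/uc?export=download"
    session = requests.Session()
    resp = session.get(url, params={"id": file_id}, stream=True)
    token = None
    for key, value in resp.cookies.items():
        if key.startswith("download_warning"):
            token = value
    if token is not None:
        resp = session.get(url, params={"id": file_id, "confirm": token}, stream=True)
    with open(dest, "wb") as f:
        for chunk in resp.iter_content(chunk_size=32768):
            if chunk:
                f.write(chunk)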

@sabilmakbar @holylovenia

@holylovenia
Contributor

Wow, amazing that you can at least load the data! Wait, let me come back to this tonight.

@akhdanfadh
Collaborator

Waiting for further instructions before GitHub marks this issue as stale. @sabilmakbar @holylovenia

@holylovenia
Contributor

Waiting for further instructions before GitHub marks this issue as stale. @sabilmakbar @holylovenia

Sorry I missed this. 🙏 @sabilmakbar, do you have any suggestions on the download method?

Alternatively, if downloading the dataset via the dataloader is too difficult, we can use _LOCAL = True. For _LOCAL = True dataloaders, it is usually up to the user where to store the data; later, the user can use the data_dir flag/parameter to load the data (see the sketch below).
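
For illustration, usage would look something like this (both the script path and the data location below are placeholders):

from datasets import load_dataset

# The user downloads the files manually, then points data_dir at them;
# both paths here are placeholders.
dataset = load_dataset(
    "path/to/sleukrith_ocr.py",
    data_dir="/path/to/manually/downloaded/sleukrith",
)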

@sabilmakbar
Collaborator

sabilmakbar commented Mar 18, 2024

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

For no. 2, may I know why exactly the data can't be put in the $HOME/.cache folder? In my case, I can successfully download it using datasets.DownloadManager() (I'm not sure about the data-loading step post-download though, so you might want to test that yourself). Moreover, the HF download directory is located on disk, not in memory (in case the name "cache" makes it sound like it is stored in memory).

If you're curious about the download size (after extraction via the download_and_extract method), you can take a look at the snippet below:

import os

from datasets import DownloadManager as dl_manager

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}

local_dl_path = dl_manager().download_and_extract(_URLS)

def get_size_in_bytes(start_path="."):
    # Works for both a single file and a directory tree.
    total_size = 0
    if not os.path.isdir(start_path):
        total_size = os.path.getsize(start_path)
    else:
        for dirpath, dirnames, filenames in os.walk(start_path):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                # skip if it is a symbolic link
                if not os.path.islink(fp):
                    total_size += os.path.getsize(fp)
    return total_size

sum_of_size = 0
for key, file in local_dl_path.items():
    size_per_dir = get_size_in_bytes(file)
    sum_of_size += size_per_dir
    print(f"total file size of {key}: {size_per_dir} byte(s)")

print(f"total file size: {sum_of_size} byte(s)")
[image: snippet output showing per-key and total file sizes]

Hope this answers your questions, and please let me know if I misunderstood them, @akhdanfadh (apologies for not replying sooner; a few months ago I turned off GitHub notifications in my email).

@akhdanfadh
Collaborator

akhdanfadh commented Mar 29, 2024

Update on my Q1. Label Mapping

It turns out that id in the .xml files does not correspond to the loaded label; it is just an identifier for each recognized character within a file. Thus there will be no mapping, since I haven't found any, and the dataloader will load the labels as is, i.e., integers from 0 to 110 with no character mapped to them.


For _LOCAL = True dataloaders, usually it's up to the user where to store the data. Later the user can use the data_dir flag/parameter to load the data.

@holylovenia Got it, that answers the question. I have two simple solutions for this Google Drive issue; which do you think I should go with? I am currently implementing the latter.

  1. Just use _LOCAL = True and let the user download the data manually
  2. Use the third-party library gdown (pip install gdown) and update requirements.py (sketch follows this list)
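
A minimal sketch of option 2, using one of the URLs from the _URLS dict above (the output filename is arbitrary):

import gdown

# gdown handles the large-file confirmation page that plain
# urllib/requests downloads trip over.
url = "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5"
gdown.download(url, output="train_data", quiet=False)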

My main question here is actually where should I store the downloaded dataset for local data? For context, datasets.DownloadManager stores them in a $HOME/.cache folder, and I can't use them for this.

@sabilmakbar What I meant by "them" there is the DownloadManager. I don't know how reliable downloading from Google Drive is with that function; see my result below. I even used the same URL format as yours.
[image: result of the download attempt]

@akhdanfadh akhdanfadh added the in-progress Assignee has given confirmation on progress and ETA label Mar 29, 2024
@akhdanfadh akhdanfadh added pr-ready A PR that closes this issue is Ready to be reviewed and removed in-progress Assignee has given confirmation on progress and ETA labels Mar 29, 2024
sabilmakbar pushed a commit that referenced this issue May 14, 2024
* init commit

* uncommenting unused features

* handle additional package using try-except

* modified requirements back to master

* modified constants.py back to master

* remove gitignore

* remove constants

* remove abstract and keywords

* remove mod variable name

* add labelling description

* add url comment