Create dataset loader for SleukRith Set #206
No idea what kind of binary file this is. I tried several packages (pickle, etc.) to load train/test, but failed.

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}
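(A comment further down reports that the train/test files decode to 113,206 48×48 images with 111 labels. Assuming a raw, headerless byte layout, which is an assumption here rather than a confirmed file format, a minimal numpy sketch to sanity-check one of the data files could look like this:)

import numpy as np

IMG_SIZE = 48          # 48x48 grayscale character images, per the thread
NUM_IMAGES = 113206    # total image count reported later in the thread

def try_raw_load(data_path, num_images=NUM_IMAGES):
    # Assumed layout: images stored back-to-back as raw uint8 pixels with
    # no header. If the real files carry a header, the size check fails.
    raw = np.fromfile(data_path, dtype=np.uint8)
    expected = num_images * IMG_SIZE * IMG_SIZE
    if raw.size != expected:
        raise ValueError(f"got {raw.size} bytes, expected {expected}; the file likely has a header")
    return raw.reshape(num_images, IMG_SIZE, IMG_SIZE)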
Hi @Gyyz, the data homepage specifies this data format. I'll copy it here for convenience.
Yes, thanks. They specified the data format, but not the file format, so I still have no idea how to load the data.
Hi, I managed to load the data. There are 113,206 48×48 images with 111 labels.

Q1. Label Mapping

The labels are given as numbers from 0 to 110. From the unextracted data, these numbers map to a character following the label attribute in the .xml files (see below). To get the full mapping dictionary, I need to iterate over those .xml files, because it seems they do not provide it (a sketch of this iteration follows after this comment). Should I proceed with this, or just load the labels as-is (numbers)?

<CharAnno>
    <Char id="0" label="យ" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        ...
    </Char>
    ...
</CharAnno>

Q2. Google Drive Wrapper

My main question here is actually: where should I store the downloaded dataset for local data? For context and further discussion: the data provided is in Google Drive and is large. As you may know, there is more to do when downloading large (>100 MB) files from Google Drive than just a simple curl command. I noticed that …
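(The sketch mentioned in Q1: a minimal pass over the annotation files, assuming they sit together in one directory and that the character is carried by the label attribute as in the snippet above. How the numeric labels 0-110 map onto these characters is not confirmed in the thread, so this only collects the distinct characters as a starting point:)

import glob
import xml.etree.ElementTree as ET

def collect_characters(anno_dir):
    # Gather the set of distinct characters from every <Char> label attribute.
    chars = set()
    for xml_path in sorted(glob.glob(f"{anno_dir}/*.xml")):
        root = ET.parse(xml_path).getroot()  # the <CharAnno> element
        for char_el in root.iter("Char"):
            chars.add(char_el.get("label"))
    return sorted(chars)  # expect 111 distinct characters, per the thread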
Wow, amazing that you can at least load the data! Wait, let me come back to this tonight.
Waiting for further instructions before GitHub marks this as a stale issue. @sabilmakbar @holylovenia
Sorry I missed this. 🙏 @sabilmakbar, do you have any suggestions on the download method? Alternatively, if downloading the dataset via the dataloader is too difficult, we can use …
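(One widely used workaround for large Google Drive files, offered here as an option rather than what the thread settled on, is the gdown package; it handles the virus-scan confirmation page that breaks a plain curl for files above roughly 100 MB. The output filename below is hypothetical:)

import gdown

# URL copied from the _URLS dict above (train_data); output name is made up.
url = "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5"
gdown.download(url, "train_data.bin", quiet=False)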
For no. 2, may I know why exactly the data can't be put in …? If you're curious about the download size, here is what I measured (extracted using the datasets download manager):

from datasets import DownloadManager
import os

_URLS = {
    "beta_image": "https://drive.google.com/uc?export=download&id=1Sdv0pPYS0dwBvJGCKthQIge6IthlEeNo",
    "beta_anno": "https://drive.google.com/uc?export=download&id=175eCHpbGSaNWqPcFY5f0LlX014MAarXK",
    "v100_image": "https://drive.google.com/uc?export=download&id=19JIxAjjXWuJ7mEyUl5-xRr2B8uOb-GKk",
    "v100_anno": "https://drive.google.com/uc?export=download&id=1Xi5ucRUb1e9TUU-nv2rCUYv2ANVsXYDk",
    "train_data": "https://drive.google.com/uc?export=download&id=1KXf5937l-Xu_sXsGPuQOgFt4zRaXlSJ5",
    "train_label": "https://drive.google.com/uc?export=download&id=1IbmLg-4l-3BtRhprDWWvZjCp7lqap0Z-",
    "test_data": "https://drive.google.com/uc?export=download&id=1KSt5AiRIilRryh9GBcxyUUhnbiScdQ-9",
    "test_label": "https://drive.google.com/uc?export=download&id=1GYcaUInkxtuuQps-qA38u-4zxK7HgrAB",
}

# Download (and extract, where applicable) every file, keyed as in _URLS.
local_dl_path = DownloadManager().download_and_extract(_URLS)

def get_size_in_bytes(start_path="."):
    # Size of a single file, or the summed size of all regular files
    # under a directory tree.
    total_size = 0
    if not os.path.isdir(start_path):
        total_size = os.path.getsize(start_path)
    else:
        for dirpath, dirnames, filenames in os.walk(start_path):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                if not os.path.islink(fp):  # skip symbolic links
                    total_size += os.path.getsize(fp)
    return total_size

sum_of_size = 0
for key, file in local_dl_path.items():
    size_per_dir = get_size_in_bytes(file)  # originally called get_size(), a NameError
    sum_of_size += size_per_dir
    print(f"total file size of {key}: {size_per_dir} byte(s)")
print(f"total file size: {sum_of_size} byte(s)")

Hope this answers your questions; please let me know if I misunderstood anything, @akhdanfadh. (Apologies for not replying sooner: a few months ago I turned off GitHub notifications in my email.)
Update on my Q1 (Label Mapping): it turns out …
@holylovenia Got it, that answers the question. I have two simple solutions for this Google Drive issue; what do you think I should do? I am currently implementing the latter.
@sabilmakbar What I meant by them here is the …
* init commit
* uncommenting unused features
* handle additional package using try-except
* modified requirements back to master
* modified constants.py back to master
* remove gitignore
* remove constants
* remove abstract and keywords
* remove mod variable name
* add labelling description
* add url comment
Dataloader name: sleukrith_ocr/sleukrith_ocr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?sleukrith_ocr