Bhashini-IITJ/BharatSceneTextDataset

Bharat Scene Text Dataset

Bharat Scene Text Dataset (BSTD) is a large, real-world Indian-language scene-text dataset covering 13 Indian languages and English. It consists of 6,582 scene-text images, with polygon bounding-box annotations for 1,20,560 words and ground-truth text annotations for 1,00,495 cropped words. The dataset is an effort towards scaling scene-text detection and recognition systems to Indian languages. The current version can be used to study scene-text detection and cropped scene-text word recognition.

Release updates:

  • [8/8/24] First Public Release.

Data Statistics:

Scene Text Detection

Total images #Total detection annotations #Train #Test
6,582 1,20,560 94,128 26,432

Cropped Word Recognition

Language #Total recognition annotations #Train #Test
Assamese 3,966 2,623 1,343
Bengali 6,129 4,968 1,161
English 36,891 28,778 8,113
Gujarati 2,649 1,956 693
Hindi 18,889 14,855 4,034
Kannada 2,934 2,241 693
Malayalam 2,975 2,408 567
Marathi 4,977 3,932 1,045
Meitei 494 381 113
Odia 4,198 3,176 1,022
Punjabi 11,104 8,544 2,560
Tamil 2,548 2,041 507
Telugu 2,709 2,227 482
Urdu 32 29 3
Total 1,00,495 78,159 22,336

Task 1: Scene text detection

Data Download:

Download the detection.zip from the link (zip file ~17 GB).

Annotations are in BSTD_release_v1.json

File structure

Detection/
│
├── A/
│   ├── image_xx.jpg
│   ├── ...
│   └── image_xx.jpg
├── B/
├── C/
├── ...
├── M/
└── BSTD_release_v1.json

Annotation Format (BSTD_release_v1.json):

Each word in an image is annotated as a polygon. The annotation file is a JSON file in the following format:

"folderName_image_id": {
    "annotations": 
    {
        "polygon_0":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        },
        ...,
        "polygon_n":
        {
            "coordinates":
                [
                    [x1, y1],
                    [x2, y2],
                    ...,
                    [xn, yn]
                ],
            "text": "text in the current polygon",
            "script_language" : "language of the word present in the polygon."
        }
    },
    "url": "url of the image",
    "image_name": "path to the image",
    "split" : "train/test split",
    "folderName": "folder of the image"
}
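
The format above can be read with Python's standard json module. The following is a minimal sketch, not part of this repository: it assumes the top level of the file is a single JSON object keyed by "folderName_image_id", as the format suggests. The helper names iter_words and polygon_to_bbox are hypothetical, and the sample record is made up so that the code runs without the dataset.

```python
import json

# Tiny made-up sample that mirrors the documented annotation layout.
SAMPLE = json.loads("""
{
  "A_image_1": {
    "annotations": {
      "polygon_0": {
        "coordinates": [[10, 20], [110, 22], [108, 60], [9, 58]],
        "text": "EXIT",
        "script_language": "english"
      }
    },
    "url": "",
    "image_name": "A/image_1.jpg",
    "split": "train",
    "folderName": "A"
  }
}
""")


def iter_words(annotations, split=None):
    """Yield (image_name, polygon, text, language) for each annotated word."""
    for record in annotations.values():
        if split is not None and record.get("split") != split:
            continue
        for poly in record["annotations"].values():
            yield (record["image_name"], poly["coordinates"],
                   poly["text"], poly["script_language"])


def polygon_to_bbox(coordinates):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in coordinates]
    ys = [y for _, y in coordinates]
    return min(xs), min(ys), max(xs), max(ys)


# With the real file, replace SAMPLE by the loaded annotations, e.g.:
#   with open("Detection/BSTD_release_v1.json", encoding="utf-8") as f:
#       annotations = json.load(f)
for image_name, polygon, text, language in iter_words(SAMPLE, split="train"):
    print(image_name, text, language, polygon_to_bbox(polygon))
```

Converting polygons to axis-aligned boxes is only needed for detectors that expect rectangles; polygon-based detectors can use the coordinates directly.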

Task 2: Cropped word recognition

Data Download:

Download the recognition.zip from the link (zip file ~774 MB).

File structure

Recognition/
│
├── train/
│   ├── assamese/
│   │   ├── X_image_name_xx_xx.jpg
│   │   ├── X_image_name_xx_xx.jpg
│   │   └── X_image_name_xx_xx.jpg
│   ├── bengali/
│   │   ├── ...
│   ├── ...
│   └── urdu/
├── test/
│   ├── assamese/
│   ├── bengali/
│   ├── ...
│   └── urdu/
├── train.csv
└── test.csv

Annotation Format (train.csv and test.csv):

Files: recognition/train.csv and recognition/test.csv

Each file contains one row per cropped word image, with comma-separated values in the following order:

path_to_the_cropped_word_image, recognition_annotation, script_language
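
A row layout like this can be parsed with Python's standard csv module. The sketch below is an illustration only, under stated assumptions: the files are UTF-8, have no header row, and contain exactly three columns in the order shown above. The helper name read_recognition_csv is hypothetical, and the sample rows are made up so that the code runs without the dataset.

```python
import csv
import io
from collections import Counter

# Made-up rows in the documented column order:
# path_to_the_cropped_word_image, recognition_annotation, script_language
SAMPLE_CSV = (
    "train/hindi/A_image_1_0_0.jpg,नमस्ते,hindi\n"
    "train/english/A_image_1_0_1.jpg,EXIT,english\n"
    "train/hindi/B_image_7_2_0.jpg,भारत,hindi\n"
)


def read_recognition_csv(handle):
    """Return a list of (image_path, text, language) tuples from an open CSV."""
    rows = []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank lines and rows that do not match the layout
        image_path, text, language = (field.strip() for field in row)
        rows.append((image_path, text, language))
    return rows


# With the real file, pass an open file instead of the sample, e.g.:
#   open("Recognition/train.csv", encoding="utf-8", newline="")
rows = read_recognition_csv(io.StringIO(SAMPLE_CSV))
per_language = Counter(language for _, _, language in rows)
print(per_language)
```

If the real files turn out to have a header row, it would need to be skipped before counting.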

Data Conversion:

To convert the recognition data into LMDB files, use utils/fetch_lmdb_format_data.py.

Usage
python fetch_lmdb_format_data.py --recognition_folder_path ~bstd/recognition/ --split train --language hindi --output_directory lmdb/hindi/train/real/hindi

For more details on the arguments, run:

python fetch_lmdb_format_data.py --help

Task 3: Script Identification

For script identification, we provide a dataset of cropped word images in three languages per folder: English, Hindi, and one regional language. This setup allows the evaluation of a model that classifies an image into one of these three classes. The choice reflects how commonly these three languages occur together in certain regions of India.

Folder #Language-Train #Language-Test #English-Train #English-Test #Hindi-Train #Hindi-Test
assamese_ 2623 1343 2623 1343 2623 1343
bengali_ 4968 1161 4968 1161 4968 1161
gujarati_ 1956 693 1956 693 1956 693
kannada_ 2241 693 2241 693 2241 693
malayalam_ 2408 567 2408 567 2408 567
marathi_ 3932 1045 3932 1045 3932 1045
meitei_ 381 113 381 113 381 113
odia_ 3176 1022 3176 1022 3176 1022
tamil_ 2041 507 2041 507 2041 507
telugu_ 2227 482 2227 482 2227 482
urdu_ 29 3 29 3 29 3
hindi_ - - 14855 4034 14855 4034

This dataset can be downloaded from this link. A script, utils/make_dataset_for_scriptIdentification.py, has also been added so that this dataset can be created directly from the recognition dataset described in the section above.

How to use

Each folder contains images in three languages. For example, the folder bengali_ contains cropped word images in Hindi, English, and Bengali. All image paths in the test/bengali_ folder are listed in test.csv, together with the correct language tag for each image. Similarly, all images in the train folder under each language-specific folder are listed in train.csv with their language tags.

Note: The hindi_ folder contains only cropped images of Hindi and English, with each image path listed in the CSV files.
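
For training a classifier, the language tags need to be mapped to class indices. The sketch below shows one possible mapping and is not taken from this repository: it assumes that each CSV row can be reduced to an (image_path, language_tag) pair and that tags are lowercase language names such as "bengali". The helper names build_label_map and encode_rows and the sample paths are hypothetical.

```python
def build_label_map(folder_name):
    """Map language tags to class indices for one script-identification folder.

    English and Hindi always get indices 0 and 1; the regional language of
    the folder, if any, gets index 2. The hindi_ folder has only two classes.
    """
    regional = folder_name.rstrip("_").lower()
    classes = ["english", "hindi"]
    if regional not in classes:
        classes.append(regional)
    return {name: index for index, name in enumerate(classes)}


def encode_rows(rows, folder_name):
    """Replace each language tag with its class index for the given folder."""
    label_map = build_label_map(folder_name)
    return [(path, label_map[tag.strip().lower()]) for path, tag in rows]


# Made-up rows; real paths and tags come from train.csv / test.csv.
sample_rows = [
    ("test/bengali_/img_0.jpg", "bengali"),
    ("test/bengali_/img_1.jpg", "english"),
    ("test/bengali_/img_2.jpg", "hindi"),
]
print(encode_rows(sample_rows, "bengali_"))
```

Keeping English and Hindi at fixed indices makes predictions comparable across folders; only the third class changes meaning from folder to folder.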

Image subset used in (Vaidya et al., ICPR 2024) Preprint

Data Download:

The BSTD image split used for Hindi to English scene text to scene text translation can be downloaded from the link.

The images used for Hindi to English scene text to scene text translation can be downloaded directly from the link.

Data Visualisation of Detection Annotations:

To visualise detection annotations, run the following command:

python3 visualise.py <image_path> <path_to_BSTD_release_v1.json>

For example:

python3 visualise.py D/image_141.jpg path_to_BSTD_release_v1.json

Data Annotation

  • All images were collected from Wikimedia Commons (under a Creative Commons licence, CC BY-SA 4.0).
  • The detection and recognition annotations were then created manually.

Related Indian Language Scene Text Recognition Toolkit

IndicPhotoOCR

Acknowledgement

This work was partly supported by MeitY, Government of India (Project Number: S/MeitY/AM/20210114) under NLTM-Bhashini.

Contact

For any queries, please contact us at:

Citation

@misc{BSTD,
  title        = {{B}harat {S}cene {T}ext {D}ataset},
  howpublished = {\url{https://github.com/Bhashini-IITJ/BharatSceneTextDataset}},
  year         = 2024,
}
