This page lists the datasets commonly used in text detection, text recognition, key information extraction and named entity recognition, along with their download links.
The structure of the text detection dataset directory is organized as follows.
```text
├── ctw1500
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
| Dataset | Images | Annotation File (training) | Annotation File (validation) | Annotation File (testing) |
| :--- | :--- | :--- | :--- | :--- |
| CTW1500 | homepage | - | - | - |
| ICDAR2015 | homepage | instances_training.json | - | instances_test.json |
| ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
| Synthtext | homepage | instances_training.lmdb | - | - |
| TextOCR | homepage | - | - | - |
| Totaltext | homepage | - | - | - |
Note: For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext datasets, some images contain orientation info in their EXIF data. The default OpenCV backend used in MMCV reads this info and applies the rotation to the images. However, the gold annotations are made on the raw pixels, and this inconsistency introduces false examples into the training set. Therefore, users should use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour (see DBNet's config for an example).
- For `icdar2015`:
  - Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip` and `Challenge4_Test_Task1_GT.zip` from homepage
  - Step2:

    ```bash
    mkdir icdar2015 && cd icdar2015
    mkdir imgs && mkdir annotations

    # For images,
    mv ch4_training_images imgs/training
    mv ch4_test_images imgs/test

    # For annotations,
    mv ch4_training_localization_transcription_gt annotations/training
    mv Challenge4_Test_Task1_GT annotations/test
    ```

  - Step3: Download instances_training.json and instances_test.json and move them to `icdar2015`
  - Or, generate `instances_training.json` and `instances_test.json` with the following command:

    ```bash
    python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
    ```
- For `icdar2017`:
  - Follow similar steps as above.
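Mirroring the `icdar2015` recipe, a minimal sketch for `icdar2017` follows; the archive names and the converter's split names are assumptions, so check the dataset homepage and the converter's help before running the commented steps.

```bash
# Create the skeleton expected by the directory tree above.
mkdir -p icdar2017/imgs icdar2017/annotations

# After downloading from the homepage, move images and ground truth into place.
# The archive/folder names below are assumptions:
# mv ch8_training_images icdar2017/imgs/training
# mv ch8_validation_images icdar2017/imgs/validation
# mv ch8_training_gt icdar2017/annotations/training
# mv ch8_validation_gt icdar2017/annotations/validation

# Then generate instances_training.json and instances_val.json, e.g.:
# python tools/data/textdet/icdar_converter.py /path/to/icdar2017 -o /path/to/icdar2017 -d icdar2017 --split-list training val

ls icdar2017
```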
- For `ctw1500`:
  - Step1: Download `train_images.zip`, `test_images.zip`, `train_labels.zip` and `test_labels.zip` from github

    ```bash
    mkdir ctw1500 && cd ctw1500
    mkdir imgs && mkdir annotations

    # For annotations
    cd annotations
    wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
    wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
    unzip train_labels.zip && mv ctw1500_train_labels training
    unzip test_labels.zip -d test
    cd ..

    # For images
    cd imgs
    wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
    wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
    unzip train_images.zip && mv train_images training
    unzip test_images.zip && mv test_images test
    ```

  - Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

    ```bash
    python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
    ```
- For `TextOCR`:
  - Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to `textocr/`.

    ```bash
    mkdir textocr && cd textocr

    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

    # For images
    unzip -q train_val_images.zip
    mv train_images train
    ```

  - Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

    ```bash
    python tools/data/textdet/textocr_converter.py /path/to/textocr
    ```
- For `Totaltext`:
  - Step1: Download `totaltext.zip` from github dataset and `groundtruth_text.zip` from github Groundtruth (our `totaltext_converter.py` supports ground truth in both `.mat` and `.txt` formats).

    ```bash
    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations

    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test

    # For annotations
    unzip groundtruth_text.zip
    cd Groundtruth
    mv Polygon/Train ../annotations/training
    mv Polygon/Test ../annotations/test
    ```

  - Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

    ```bash
    python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
    ```
The structure of the text recognition dataset directory is organized as follows.
```text
├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
```
| Dataset | Images | Annotation File (training) | Annotation File (test) |
| :--- | :--- | :--- | :--- |
| coco_text | homepage | train_label.txt | - |
| icdar_2011 | homepage | train_label.txt | - |
| icdar_2013 | homepage | train_label.txt | test_label_1015.txt |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | - | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | - | - | test_label.txt |
| Syn90k | homepage | shuffle_labels.txt, label.txt | - |
| SynthText | homepage | shuffle_labels.txt, instances_train.txt, label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |
| TextOCR | homepage | - | - |
| Totaltext | homepage | - | - |
- For `icdar_2013`:
  - Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from homepage
  - Step2: Download test_label_1015.txt and train_label.txt
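To match the `mixture/icdar_2013` layout in the directory tree above, the downloads can be arranged as in the sketch below; the unzip target names are assumptions based on the archive names.

```bash
mkdir -p mixture/icdar_2013

# Unpack the archives into folders matching the tree above (names are assumptions):
# unzip Challenge2_Training_Task3_Images_GT.zip -d mixture/icdar_2013/Challenge2_Training_Task3_Images_GT
# unzip Challenge2_Test_Task3_Images.zip -d mixture/icdar_2013/Challenge2_Test_Task3_Images
# mv train_label.txt test_label_1015.txt mixture/icdar_2013/

ls mixture
```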
- For `icdar_2015`:
  - Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from homepage
  - Step2: Download train_label.txt and test_label.txt
- For `IIIT5K`:
  - Step1: Download `IIIT5K-Word_V3.0.tar.gz` from homepage
  - Step2: Download train_label.txt and test_label.txt
- For `svt`:
  - Step1: Download `svt.zip` from homepage
  - Step2: Download test_label.txt
  - Step3:

    ```bash
    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    ```
- For `ct80`:
  - Step1: Download test_label.txt
- For `svtp`:
  - Step1: Download test_label.txt
- For `coco_text`:
  - Step1: Download from homepage
  - Step2: Download train_label.txt
- For `Syn90k`:
  - Step1: Download `mjsynth.tar.gz` from homepage
  - Step2: Download shuffle_labels.txt
  - Step3:

    ```bash
    mkdir Syn90k && cd Syn90k

    mv /path/to/mjsynth.tar.gz .
    tar -xzf mjsynth.tar.gz
    mv /path/to/shuffle_labels.txt .

    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/Syn90k Syn90k
    ```
- For `SynthText`:
  - Step1: Download `SynthText.zip` from homepage
  - Step2:

    ```bash
    mkdir SynthText && cd SynthText
    mv /path/to/SynthText.zip .
    unzip SynthText.zip
    mv SynthText synthtext
    mv /path/to/shuffle_labels.txt .

    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthText SynthText
    ```

  - Step3: Generate cropped images and labels:

    ```bash
    cd /path/to/mmocr
    python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
    ```
- For `SynthAdd`:

  ```bash
  mkdir SynthAdd && cd SynthAdd

  mv /path/to/SynthText_Add.zip .
  unzip SynthText_Add.zip
  mv /path/to/label.txt .

  # create soft link
  cd /path/to/mmocr/data/mixture
  ln -s /path/to/SynthAdd SynthAdd
  ```
Note: To convert a label file from `txt` format to `lmdb` format, run:

```bash
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
```

For example,

```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
- For `TextOCR`:
  - Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to `textocr/`.

    ```bash
    mkdir textocr && cd textocr

    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

    # For images
    unzip -q train_val_images.zip
    mv train_images train
    ```

  - Step2: Generate `train_label.txt` and `val_label.txt` and crop images using 4 processes with the following command:

    ```bash
    python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
    ```
- For `Totaltext`:
  - Step1: Download `totaltext.zip` from github dataset and `groundtruth_text.zip` from github Groundtruth (our `totaltext_converter.py` supports ground truth in both `.mat` and `.txt` formats).

    ```bash
    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations

    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test

    # For annotations
    unzip groundtruth_text.zip
    cd Groundtruth
    mv Polygon/Train ../annotations/training
    mv Polygon/Test ../annotations/test
    ```

  - Step2: Generate cropped images, `train_label.txt` and `test_label.txt` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/`):

    ```bash
    python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
    ```
The structure of the key information extraction dataset directory is organized as follows.
```text
└── wildreceipt
    ├── class_list.txt
    ├── dict.txt
    ├── image_files
    ├── test.txt
    └── train.txt
```
- Download wildreceipt.tar
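The archive is expected to unpack into the tree shown above; the extraction command below is an assumption about the archive layout, and the skeleton is created here only to illustrate the expected structure.

```bash
# Unpack the download (assumption: the archive contains a top-level wildreceipt/ directory):
# tar -xf wildreceipt.tar

# Illustrative skeleton matching the tree above:
mkdir -p wildreceipt/image_files
ls wildreceipt
```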
The structure of the named entity recognition dataset directory is organized as follows.
```text
└── cluener2020
    ├── cluener_predict.json
    ├── dev.json
    ├── README.md
    ├── test.json
    ├── train.json
    └── vocab.txt
```
- Download cluener_public.zip
- Download vocab.txt and move `vocab.txt` to `cluener2020`
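The two downloads above can be arranged into the tree shown earlier; a minimal sketch, assuming `cluener_public.zip` unpacks its JSON files directly (unpacking commands are commented since they depend on the downloads being present):

```bash
mkdir -p cluener2020

# Unpack the annotation files and move the vocabulary into place:
# unzip cluener_public.zip -d cluener2020
# mv vocab.txt cluener2020/

ls cluener2020
```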