Ground Truth dataset for French handwritten pages of Civil Registry "Tables Décennales"
150 images and their Alto XML files, divided into 3 sub-corpora.
Only first names, last names and dates are transcribed, and only in the birth sections of the documents.
The Alto files contain:
- Segmentation of the transcribed texts.
- Transcription of the texts.
- Polygonalization of the transcribed text zones (performed by kraken OCR solution).
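As a rough illustration of how the three kinds of information listed above could be read back from one file, here is a minimal Python sketch. It assumes the usual Alto layout, with `TextLine` elements holding `String` elements (transcription in the `CONTENT` attribute) and a `Shape`/`Polygon` element (coordinates in the `POINTS` attribute). The `SAMPLE` document, the function names and the values in it are hypothetical and only serve the example; the actual files in this corpus may be organised differently.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code does not depend on the Alto version."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text):
    """Return one dict per TextLine with its transcription and polygon points."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        # Transcription: join the CONTENT of every String inside the line.
        words = [s.get("CONTENT", "") for s in elem.iter() if local(s.tag) == "String"]
        # Polygon: keep the first Polygon found inside the line, if any.
        polygon = None
        for child in elem.iter():
            if local(child.tag) == "Polygon":
                polygon = child.get("POINTS")
                break
        lines.append({"text": " ".join(words), "polygon": polygon})
    return lines


# Hypothetical, hand-written sample (not taken from the corpus).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="l1">
      <Shape><Polygon POINTS="10 10 200 10 200 40 10 40"/></Shape>
      <String CONTENT="Martin"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for line in read_alto_lines(SAMPLE):
    print(line["text"], "|", line["polygon"])
```

Matching on local tag names rather than a fixed namespace is a deliberate choice here, so the same sketch can be tried on files exported with different Alto schema versions.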
| # | name | nb of images | GT for segmenter? | GT for recognizer? | link(s) to source images |
|---|---|---|---|---|---|
| 1 | sermaises | (69) | y | y | Archives départementales du Loiret (Sermaises) |
| 2 | rom-1883-1892 | (41) | y | y | Archives départementales de l'Aube (Romilly-sur-Seine) |
| 3 | rom-1893-1902 | (40) | y | y | Archives départementales de l'Aube (Romilly-sur-Seine) |
Superscripted portions of text are preceded by "^": for example, "1er" is transcribed as "1^er".
If several words are superscripted, each of them starts with a "^".
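The superscript convention above can be undone or converted when post-processing transcriptions. The following Python sketch is one possible reading of it: it assumes that a "^" opens a superscripted run that lasts until the next whitespace, which the description implies but does not state explicitly. The function names and the sample string "1^er janvier" are hypothetical.

```python
import re

# Assumption: "^" starts a superscripted run that ends at the next whitespace.
SUPERSCRIPT = re.compile(r"\^(\S+)")


def strip_superscript(text):
    """Drop the "^" markers and keep the plain text."""
    return SUPERSCRIPT.sub(r"\1", text)


def superscript_to_html(text):
    """Wrap each superscripted run in <sup>...</sup> tags."""
    return SUPERSCRIPT.sub(r"<sup>\1</sup>", text)


print(strip_superscript("1^er janvier"))    # 1er janvier
print(superscript_to_html("1^er janvier"))  # 1<sup>er</sup> janvier
```

Stripping the markers is useful when comparing against plain-text indexes, while the HTML form keeps the typographic information for display.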
This dataset was built by Jean-François Boutet and Jean-Pierre Merx.
The original works and their digitization are all copyright-free, but properly annotating a corpus takes time and is a task that should be recognized. If you use any item from this corpus of ground truth, cite the dataset using the following information:
title: 'GenAuto TD Corpus'
url: 'https://github.com/jpmjpmjpm/genauto-td-htr.git'
project-name: 'GenAuto'
project-website: ''
authors:
  - name: 'Boutet'
    surname: 'Jean-François'
    roles:
      - 'transcriber'
      - 'aligner'
  - name: 'Merx'
    surname: 'Jean-Pierre'
    roles:
      - 'transcriber'
      - 'aligner'
      - 'project-manager'
description: '150 transcribed images from "Tables Décennales" French Civil Registry.
  Those come from Sermaises and Romilly-sur-Seine municipalities. '
language: 'French'
#other-languages:
#  - "Optional"
script: 'Latin'
script-type: 'only-manuscript'
time: 1792--1902
hands:
  - count: 'less-than-11'
    precision: 'estimated'
license:
  - {name: 'CC-BY 4.0', url: 'https://creativecommons.org/licenses/by/4.0/'}
format: 'Alto-XML'
volume:
  - {count: "300", metric: "pages"}
  - {count: "150", metric: "images"}
This work is licensed under a Creative Commons Attribution 4.0 International License.