DataCatalogue is a research project jointly led by Inria Paris' ALMAnaCH research team, the Bibliothèque nationale de France (BnF), and the Institut national d'histoire de l'art (INHA). It is funded by Inria and the French Ministry of Culture. After an experimental phase in 2021-2022, the project has been renewed for a second phase in 2023-2024.
Our corpus consists of a sample of the sales catalogs from the collections of the BnF and the INHA (over 280,000 documents in total). The 713 catalogs in our sampled corpus are representative in terms of time periods (18th to 21st centuries) and types of sales (numismatics, books, antiquities, works of art, furniture, etc.). The vast majority of the catalogs are in French, but some are in English or German. We aim to design a complete and mostly automated workflow for processing sales catalogs, from their digitization to their online publication as augmented documents that can be queried like a database.
→ README for the DataCatalogue GitHub organization
datacat-object-detection-dataset → Development of the object detection model with YOLOv8
datacat-tei → TEI customization for sales catalogs
extraction-internship → Internship on information extraction with GROBID (Abdel Farhi, 2022)
grobid-datacat → GROBID module for catalogs
grobid-datacat-TrainingData → Training datasets for the GROBID "catalogues" module
publication-internship → Internship on publication with TEI Publisher (Jules Nuguet, 2022)
- Hugo Scheithauer, Sarah Bénière, Jean-Philippe Moreux, & Laurent Romary. (2023, November 29). DataCatalogue : rétro-structuration automatique des catalogues de vente. Webinaire Culture Inria.
- Thibault Clérice, Juliette Janès, Hugo Scheithauer, Sarah Bénière, Laurent Romary, & Benoît Sagot. (2024, August 6-9). Layout Analysis Dataset with SegmOnto. DH 2024 - Annual Conference of the Alliance of Digital Humanities Organizations, Washington, D.C., United States.
- Hugo Scheithauer, Sarah Bénière, & Laurent Romary. (2024, August 6-9). Automatic Retro-Structuration of Auction Sales Catalogs Layout and Content. DH 2024 - Annual Conference of the Alliance of Digital Humanities Organizations, Washington, D.C., United States.
Logo by Alix Chagué, inspired by Loading Artist.