This is a benchmark dataset for ontology matching in the field of Digital Humanities (DH), created by Felix Kraus (KIT). This repository contains all the general information needed to use the dataset at the OAEI.
For further information on the OAEI 2024 see also here.
This benchmark dataset facilitates the development of ontology matching systems for the Humanities, which face special obstacles that are at least partly addressed in this dataset:
- wide range of (historical) languages and writing systems
- domain-specific terms, at times used only by a small research community
- use of a data model suitable for easily creating knowledge organization systems
The dataset includes several test cases grouped into three sub-domains. Each test case consists of a source ontology, a target ontology and a manually compiled reference alignment. Only equivalence relations ("=") are targeted.
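A matcher's output for one test case is usually scored against the reference alignment with precision, recall and F1. The following is a minimal sketch of that comparison, assuming both alignments are available as sets of (source URI, target URI) pairs that all stand for "=" relations. The function name `evaluate` and the URIs are hypothetical and not part of this dataset.

```python
def evaluate(system, reference):
    """Return (precision, recall, F1) for two sets of equivalence pairs."""
    system, reference = set(system), set(reference)
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical URIs, for illustration only.
reference = {("src:site", "tgt:site"), ("src:grave", "tgt:tomb")}
system = {("src:site", "tgt:site"), ("src:wall", "tgt:tower")}

precision, recall, f1 = evaluate(system, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because only equivalence is targeted, a pair either appears in the reference alignment or it does not, so plain set intersection is enough here.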
Five different vocabularies were used for the four archaeology test cases:
- DEFC [1]
- About 800 terms
- Languages: Mainly English and German
- PACTOLS [2]
- Adapted version: Only narrower terms and direct ancestors of the concept "archaeological site" were used
- About 70 terms
- Languages: Arabic, Dutch, English, French, German, Italian, Spanish
- iDAI.world [3]
- Adapted version: Only narrower terms and direct ancestors of the concept "material things" were used
- About 2600 terms
- Major languages: Arabic, English, French, German, Italian
- Iron-Age-Danube [4]
- About 290 terms
- Languages: Croatian, English, German, Hungarian, Slovenian
- PARTHENOS [5]
- Adapted version: Only narrower terms and direct ancestors of the concept "place types" were used
- About 800 terms
- Language: English
- Source: DEFC
- Target: PACTOLS
- Reference: 800*70=56000 possible combinations, 11 true positives (~0.02%)
- Source: iDAI.world
- Target: PACTOLS
- Reference: 2600*70=182000 possible combinations, 18 true positives (~0.01%)
- Source: Iron-Age-Danube
- Target: PACTOLS
- Reference: 290*70=20300 possible combinations, 6 true positives (~0.03%)
- Source: PACTOLS
- Target: PARTHENOS
- Reference: 70*800=56000 possible combinations, 13 true positives (~0.02%)
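The "possible combinations" figures are the product of the two vocabulary sizes, and the percentage is the share of those pairs that are true positives. A short sketch that reproduces the arithmetic for the archaeology test cases from the approximate term counts listed above (the helper name `positive_rate` is only illustrative):

```python
# (source, target, source terms, target terms, true positives)
# Term counts are the approximate figures quoted in this README.
test_cases = [
    ("DEFC", "PACTOLS", 800, 70, 11),
    ("iDAI.world", "PACTOLS", 2600, 70, 18),
    ("Iron-Age-Danube", "PACTOLS", 290, 70, 6),
    ("PACTOLS", "PARTHENOS", 70, 800, 13),
]


def positive_rate(n_source, n_target, n_true):
    """Return (candidate pairs, share of true positives in percent)."""
    pairs = n_source * n_target
    return pairs, 100.0 * n_true / pairs


for source, target, n_source, n_target, n_true in test_cases:
    pairs, percent = positive_rate(n_source, n_target, n_true)
    print(f"{source} -> {target}: {pairs} pairs, {percent:.2f}% true positives")
```

The very low positive rates show how sparse the reference alignments are compared with the full search space.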
Three different vocabularies were used for the two cultural history test cases:
- iDAI.world [3]
- Adapted version: Only narrower terms and direct ancestors of the concept "chronology" were used
- About 270 terms
- Major languages: Arabic, English, French, German, Italian
- PARTHENOS [5]
- Adapted version: Only narrower terms and direct ancestors of the concept "Periods" were used
- About 200 terms
- Language: English
- OeAI [6]
- About 400 terms
- Languages: English, German
- Source: iDAI.world
- Target: PARTHENOS
- Reference: 270*200=54000 possible combinations, 53 true positives (~0.1%)
- Source: OeAI
- Target: PARTHENOS
- Reference: 400*200=80000 possible combinations, 48 true positives (~0.06%)
Three different vocabularies were used for the two DH/CS test cases:
- DHA Taxonomy [7]
- About 115 terms
- Language: English
- UNESCO [8]
- Adapted version: Only narrower terms and direct ancestors of the concept "Information and communication" were used
- About 490 terms
- Languages: Arabic, English, French, Russian, Spanish
- TaDiRAH [9]
- About 170 terms
- Main Language: English
- Source: DHA Taxonomy
- Target: UNESCO
- Reference: 115*490=56350 possible combinations, 12 true positives (~0.02%)
- Source: TaDiRAH
- Target: UNESCO
- Reference: 170*490=83300 possible combinations, 16 true positives (~0.02%)
The following criteria were used to select suitable controlled vocabularies (CVs) from all DH CVs that were found:
- The track should cover different subfields of Humanities
- Specific rather than general terminology is preferred, to pose a challenge to the OM systems
- The CV is in SKOS format (CVs published only as plain HTML, for example, could not be used)
- A combination of two CVs needs to have at least some true positives
- CVs with errors such as duplicate terms, terms outside the hierarchy or violations of SKOS were not considered
Some potential test cases are held back for future OAEI editions so that the systems always encounter unseen test cases.
To create a well-designed reference alignment, the following points were taken into account:
- Special care was taken to ensure that terms treated as similar are really identical in meaning and not merely closely related, see [10]
- Only terms with the same part of speech were considered similar (e.g. "analysis" and "to analyse" were not considered similar)
- Singular and plural forms were considered similar
- Large vocabularies were reduced to only the relevant terms. This means that the branch of the hierarchy containing terms of the desired topic was kept and the other branches were removed. This ensures that the reference alignment can be compiled manually. To create the manual alignment, the dataset creator went through each term of the source CV and searched thoroughly for similar terms in the target CV, using full-text search and the CV's hierarchy.
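The reduction of a large vocabulary to one branch can be pictured as follows. This is a minimal sketch, not the tooling actually used for the dataset. It assumes the hierarchy is given as a plain mapping from each term to a single broader term (SKOS itself allows several), and it reads "direct ancestors" as the chain of broader terms up to the top concept. The function name `extract_branch` and the toy terms are hypothetical.

```python
def extract_branch(broader, root):
    """Keep `root`, every term narrower than it, and its chain of ancestors.

    `broader` maps each term to its broader term (None for top concepts).
    """
    # Invert the child -> parent mapping into parent -> children.
    narrower = {}
    for term, parent in broader.items():
        narrower.setdefault(parent, []).append(term)

    # Collect root and all of its descendants.
    keep = set()
    stack = [root]
    while stack:
        term = stack.pop()
        if term not in keep:
            keep.add(term)
            stack.extend(narrower.get(term, []))

    # Walk upwards from root to the top concept.
    parent = broader.get(root)
    while parent is not None and parent not in keep:
        keep.add(parent)
        parent = broader.get(parent)
    return keep


# Toy hierarchy, for illustration only.
broader = {
    "thing": None,
    "place": "thing",
    "archaeological site": "place",
    "settlement": "archaeological site",
    "burial site": "archaeological site",
    "person": "thing",
    "ruler": "person",
}

kept = extract_branch(broader, "archaeological site")
print(sorted(kept))
```

In this toy example the "person" branch is dropped, while the ancestors "place" and "thing" stay so that the kept terms remain connected to the top of the hierarchy.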
[1]: DEFC [CC BY 4.0; Creators: Seta Štuhec, Anja Masur, Peter Andorfer, Ksenia Zaytseva, Edeltraud Aspöck]
[2]: PACTOLS (adapted) [ODbL v1.0; Creators: Groupe PACTOLS/FRANTIQ]
[3]: iDAI.world (adapted) [CC BY 4.0; Creators: Annika Kirscheneder, Camilla Colombi, Elenore Pape, Gabriele Rasbach, Henriette Senst, Lena Vitt, Matthias Block, Nina Dworschak, Reinhard Förtsch, Sabine Thänert]
[4]: Iron-Age-Danube [CC BY 4.0; Creator: Seta Štuhec]
[5]: PARTHENOS (adapted) [CC BY 4.0; Creators: PARTHENOS project]
[6]: OeAI (adapted) [CC BY 4.0; Creator: Micheline Welte]
[7]: DHA Taxonomy (adapted) [CC BY 4.0; Creators: ACDH-OEAW Team]
[8]: UNESCO (adapted) [CC BY-SA 3.0; Creators: UNESCO]
[9]: TaDiRAH (adapted) [CC0; Creators: Luise Borek, Canan Hastik, Vera Khramova, Jonathan Geiger]
[10]: Hill F, Reichart R, Korhonen A (2015) SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics 41:665–695. https://doi.org/10.1162/COLI_a_00237
- Creator: Felix Kraus
- Email (substitute accordingly): firstname.lastname (at) kit (dot) edu
- License owner: Karlsruhe Institute of Technology (KIT)
Development of this software product was funded by the research program “Engineering Digital Futures” of the Helmholtz Association of German Research Centers.