When knowledge workers fill in spreadsheets freely, the content can be rather unstructured. Such data is difficult to interpret properly for humans and especially for machines. If a data maintenance strategy is missing and user-generated data becomes "messy", constructing knowledge graphs from it is a challenging task. The following challenges can be observed:
1. Multiple Surface Forms: Entities can be mentioned in various ways.
2. Mixed Date Representations: Dates can be represented in various formats.
3. Acronyms and Symbols: To reduce typing effort, users tend to write acronyms or even single symbols.
4. Free Comments: In spreadsheets, users can write short free-text comments directly into cells.
5. Style Usage: Changing the cell style is a common way to express additional information (e.g. struck-out text).
6. Multiple Entities in a Cell: If a table is unnormalized, data redundancies can occur. Cell contents have to be split appropriately.
7. Multiple Types in a Table: A table can contain rows of different types, so types have to be assigned dynamically to the extracted row entities.
8. Implicit Relationship: Users tend to write identical texts (like IDs) in different cells to express an implicit relationship.
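To make the challenge of mixed date representations more concrete, the following minimal Python sketch tries a list of candidate formats and returns an ISO 8601 date. The function name `normalize_date` and the format list `CANDIDATE_FORMATS` are assumptions for illustration only and are not taken from the dataset or the proposed approach. Note that a value such as `05/12/2020` is ambiguous (month-first or day-first); the sketch simply takes the first format that matches.

```python
from datetime import datetime
from typing import Optional

# Hypothetical format list; a real spreadsheet may require other formats.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]


def normalize_date(cell: str) -> Optional[str]:
    """Return the cell text as an ISO 8601 date, or None if no format matches."""
    text = cell.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This format does not fit; try the next one.
            continue
    return None
```

Cells that return `None` are not dates in any of the assumed formats and would need separate handling, for example as free comments.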
Based on observations in a work that is currently under review, mapping approaches seem to struggle with (5.) Style Usage, (6.) Multiple Entities in a Cell and (8.) Implicit Relationship.
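As an illustration of what detecting an implicit relationship can involve, the sketch below groups cell positions by their normalized text and keeps only texts that occur in more than one place. The tuple layout `(sheet, row, text)` and the function name `link_by_identical_text` are hypothetical; this is a simplified sketch, not the approach from the paper under review.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, int, str]  # (sheet name, row index, cell text), assumed layout


def link_by_identical_text(cells: List[Cell]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each normalized text that occurs more than once to its positions."""
    positions: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for sheet, row, text in cells:
        key = text.strip().lower()  # ignore case and surrounding whitespace
        if key:
            positions[key].append((sheet, row))
    # Only texts shared by several cells hint at an implicit relationship.
    return {key: locs for key, locs in positions.items() if len(locs) > 1}
```

Whether two rows that share a text are actually related still has to be decided by the mapping; identical text is only a hint.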
For (6.), see also a related question on Stack Overflow: https://stackoverflow.com/questions/61751174/is-there-a-solution-in-rml-for-multiple-complex-entities-in-one-data-element-ce
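Splitting a cell that holds several entities could, in the simplest case, be sketched as a preprocessing step like the one below. The delimiter set (comma, semicolon, slash, line break) and the name `split_cell` are assumptions; real cells may use other separators, and values that legitimately contain a delimiter (such as "Smith, John") would be split incorrectly by this naive approach.

```python
import re
from typing import List

# Assumed delimiters: comma, semicolon, slash, or line break.
DELIMITERS = re.compile(r"[,;/\n]+")


def split_cell(cell: str) -> List[str]:
    """Split a cell into trimmed, non-empty entity mentions."""
    parts = (part.strip() for part in DELIMITERS.split(cell))
    return [part for part in parts if part]
```

Each resulting mention can then be mapped to its own entity instead of treating the whole cell as a single value.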
The dataset is private because it comes from an industrial scenario. The paper with the proposed approach is currently under review.