About
Statistical data on the web is often published as Excel or CSV sheets. These formats are easy for humans to read, but they cannot be queried efficiently, and they are difficult to integrate with other datasets that may be in different formats. It is therefore useful to convert the data into a single data model, RDF. In some cases, however, a single statistical value is described along several dimensions, so a simple row-based transformation is not possible. For this reason we use the RDF Data Cube vocabulary for the conversion, as it is designed specifically to represent multidimensional statistical data in RDF.
The RDF Data Cube vocabulary is based on the popular SDMX standard. A statistical dataset is treated as a multi-dimensional cube, characterized by a set of dimensions that define what the values apply to (e.g. time, country, population), along with metadata describing what was measured (e.g. death rate), how it was measured and how the observations are expressed (e.g. rate, status). A cube is thus organized according to a set of dimensions, attributes and measures, collectively called components. A set of dimensions is sufficient to identify a single observation. The measure describes the phenomenon that is reported. The attribute, on the other hand, qualifies and interprets the observed value, such as the status of the observation. The dimensions, attributes and measures are represented as RDF properties. Each is an instance of the abstract qb:ComponentProperty class, which in turn has the sub-classes qb:DimensionProperty, qb:AttributeProperty and qb:MeasureProperty.
Another feature of the Data Cube vocabulary is that it allows the structure of a dataset to be defined explicitly, which makes it possible to verify that the dataset matches the expected structure. The qb:DataStructureDefinition also lets a user determine which dimensions are available for querying. A data structure definition can be defined once and reused for similarly structured files. The Data Cube vocabulary also adopts the SDMX content-oriented guidelines (COG). The COGs define a set of common statistical concepts and associated code lists that can be reused across datasets.
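The component model and data structure definition described above can be illustrated with a small sketch that writes out Turtle. It uses only the Python standard library and is not part of the plugin. The `ex:` namespace, the property names (`refArea`, `refPeriod`, `deathRate`, `obsStatus`), the helper functions and the sample values are all hypothetical; only the `qb:` terms come from the Data Cube vocabulary.

```python
# Illustrative sketch only: the ex: namespace and every ex:* name are made up.
PREFIXES = """@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/stats/> .
"""

# Component kind -> (Data Cube property class, property linking it to the DSD).
COMPONENT_KINDS = {
    "dimension": ("qb:DimensionProperty", "qb:dimension"),
    "attribute": ("qb:AttributeProperty", "qb:attribute"),
    "measure": ("qb:MeasureProperty", "qb:measure"),
}


def structure_turtle(dsd, components):
    """Render component declarations and a qb:DataStructureDefinition.

    `components` maps a (hypothetical) property name to its kind and is
    expected to be non-empty.
    """
    decls = [
        f"ex:{name} a {COMPONENT_KINDS[kind][0]} ."
        for name, kind in components.items()
    ]
    specs = " ;\n".join(
        f"    qb:component [ {COMPONENT_KINDS[kind][1]} ex:{name} ]"
        for name, kind in components.items()
    )
    header = f"ex:{dsd}\n    a qb:DataStructureDefinition ;"
    return "\n".join(decls + [header, specs + " ."])


def observation_turtle(obs_id, dataset, values):
    """Render one qb:Observation.

    `values` maps a property name to an already formatted Turtle term.
    """
    pairs = [("a", "qb:Observation"), ("qb:dataSet", f"ex:{dataset}")]
    pairs += [(f"ex:{name}", term) for name, term in values.items()]
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    return f"ex:{obs_id}\n{body} ."


if __name__ == "__main__":
    print(PREFIXES)
    print(structure_turtle("dsd", {
        "refArea": "dimension",
        "refPeriod": "dimension",
        "deathRate": "measure",
        "obsStatus": "attribute",
    }))
    print("ex:dataset1 a qb:DataSet ; qb:structure ex:dsd .")
    # Made-up observation: the two dimensions identify it, the measure
    # carries the value, the attribute qualifies it.
    print(observation_turtle("obs1", "dataset1", {
        "refArea": "ex:DE",
        "refPeriod": '"2010"^^xsd:gYear',
        "deathRate": '"10.5"^^xsd:decimal',
        "obsStatus": "ex:estimated",
    }))
```

Because the structure is declared separately from the observations, the same `ex:dsd` could in principle be referenced by several datasets with the same layout.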
Using this plugin, when a spreadsheet containing multi-dimensional statistical data is imported, it is presented to the user as a table. In this view, the user can configure (1) dimensions, (2) attributes and (3) metrics by creating them manually and selecting all the relevant elements, as well as (4) the range of statistical items that are measured. When the user enters a word in the text box provided, the corresponding COG concepts are suggested automatically, using RDFa. Configurations can also be saved and reused for other spreadsheets that adhere to the same structure (e.g. data published in consecutive years). Once the user has configured the transformation, the plugin automatically transforms the spreadsheets into RDF.
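To make the idea of a reusable configuration concrete, here is a minimal Python sketch of how a cross-tabulated sheet might be unfolded into one record per observation. It is an assumption-laden illustration, not the plugin's implementation: the configuration keys, the `extract_observations` helper and the sample sheet (with made-up numbers) are all hypothetical, and the plugin's real saved format is not described in this README.

```python
import csv
import io

# Made-up sample sheet: countries in rows, years in columns.
SHEET = """country,2009,2010
Germany,10.4,10.5
France,8.5,
"""

# Hypothetical configuration; it could be reused for any sheet with the
# same layout, e.g. the next year's publication.
CONFIG = {
    "header_row": 0,                  # row holding the column-dimension labels
    "label_column": 0,                # column holding the row-dimension labels
    "row_dimension": "country",
    "column_dimension": "year",
    "measure": "deathRate",
    "data_range": ((1, 1), (2, 2)),   # (first row, first col), (last row, last col), inclusive
}


def extract_observations(csv_text, config):
    """Turn every non-empty cell in the data range into one observation."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[config["header_row"]]
    (first_row, first_col), (last_row, last_col) = config["data_range"]
    observations = []
    for r in range(first_row, last_row + 1):
        for c in range(first_col, last_col + 1):
            cell = rows[r][c].strip()
            if not cell:
                continue  # no value published for this combination
            observations.append({
                config["row_dimension"]: rows[r][config["label_column"]],
                config["column_dimension"]: header[c],
                config["measure"]: float(cell),
            })
    return observations


if __name__ == "__main__":
    for obs in extract_observations(SHEET, CONFIG):
        print(obs)
```

Each resulting record pairs one value with the dimension labels that identify it, which is the shape needed before emitting `qb:Observation` resources. This is why a plain row-by-row conversion is not enough for cross-tabulated data.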
The csvimport plugin is being developed by the following members of AKSW (Agile Knowledge Engineering and Semantic Web):
- Michael Martin (Principal Contact / Maintainer)
- Timofey Ermilov (Development)
- Amrapali Zaveri (Concept, Use Cases)
Further information about AKSW can be found here. AKSW is hosted by the Chair of Business Information Systems (BIS) of the Institute of Computer Science (IfI) at the University of Leipzig, as well as by the Institute for Applied Informatics (InfAI).