review doc and add links to IGNF/validator-schema (refs #155)
mborne committed Jul 15, 2020
1 parent ebaa3f7 commit ca7926b
Showing 21 changed files with 190 additions and 368 deletions.
29 changes: 21 additions & 8 deletions README.md
@@ -14,16 +14,15 @@ This program validates and normalizes the data found in a
* PDF files
* Directories (mainly to check that they are present)

Configuration relies on XML files describing:

* Table models (FeatureCatalogue: FeatureType/AttributeType)
* A file mapping (path, required/recommended/optional, type: pdf, table, directory, etc.)
Configuration relies on [JSON files describing file trees and tables](https://github.com/IGNF/validator-schema#ignfvalidator-schema).

It was developed as part of the [géoportail de l'urbanisme](https://www.geoportail-urbanisme.gouv.fr) to validate the [CNIG standards](https://www.geoportail-urbanisme.gouv.fr/standard/).

## Working principle

![Working principle](doc/principe.jpg)
The following diagram illustrates the [working principle of the document validator](doc/principe.md):

![Working principle](doc/img/principe.jpg)

## Main features

@@ -49,9 +48,13 @@ See [LICENCE.md](LICENCE.md)

## Technical documentation

* [Document modeling](doc/model.md)
* [Metadata validation](doc/metadata.md)
* [Character validation](doc/characters/index.md)
* [Document modeling (french)](doc/model.md)
* [List of error codes (json)](validator-core/src/main/resources/error-code.json)
* [Supported projections (json)](validator-core/src/main/resources/projection.json)
* [Metadata modeling (english)](doc/metadata.md)
* [Character validation (english)](doc/characters/index.md)
* [plugin-cnig - validation of IDURBA fields](doc/plugin-cnig/idurba.md)


## Use cases

@@ -61,6 +64,7 @@ This program was developed as part of the [géoportail de l'urbanisme](h

* java >= 11
* [ogr2ogr >= v2.3.0](doc/dependencies/ogr2ogr.md): used to read the input data and convert it to a pivot format (CSV) before validation
* [geotools](doc/dependencies/geotools.md)

## Compilation

@@ -109,3 +113,12 @@ java -jar validator-cli/target/validator-cli.jar metadata_to_json \

Example: [01.xml](validator-core/src/test/resources/metadata/01.xml) -> [01.json](validator-core/src/test/resources/metadata/01-expected.json)


## Extensibility

The validator supports plugins that run tasks at different stages of validation:

* Before files are matched against the model (e.g. changing a file extension)
* Before validation (e.g. detecting file encoding from the metadata)
* After validation (e.g. collecting metadata about the validated data, additional checks, etc.)
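
The three stages listed above could be modeled as hooks that a plugin overrides. The sketch below is only an illustration of that idea: `HookPipeline`, its nested `Hook` interface, and every method name are hypothetical and are not taken from the validator's code base.

```java
import java.util.ArrayList;
import java.util.List;

final class HookPipeline {

    // Hypothetical plugin contract: one no-op default method per stage,
    // so a plugin only overrides the stage it cares about.
    interface Hook {
        default void beforeMatching(List<String> trace) {}
        default void beforeValidation(List<String> trace) {}
        default void afterValidation(List<String> trace) {}
    }

    // Runs the stages in order and returns a trace of what happened.
    static List<String> run(List<Hook> hooks) {
        List<String> trace = new ArrayList<>();
        for (Hook hook : hooks) hook.beforeMatching(trace);
        trace.add("match files against the model");
        for (Hook hook : hooks) hook.beforeValidation(trace);
        trace.add("validate");
        for (Hook hook : hooks) hook.afterValidation(trace);
        return trace;
    }

    public static void main(String[] args) {
        // Example plugin acting only before validation.
        Hook encodingDetector = new Hook() {
            @Override
            public void beforeValidation(List<String> trace) {
                trace.add("detect encoding from metadata");
            }
        };
        System.out.println(run(List.of(encodingDetector)));
    }
}
```

Default methods keep each plugin small: a plugin that only collects statistics after validation does not need to implement the two earlier stages.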

26 changes: 11 additions & 15 deletions doc/characters/index.md → doc/characters.md
@@ -5,30 +5,26 @@
The charset used to read data is one of:

* The default value : UTF-8
* The value provided as a command line argument
* The value provided in a metadata file (```identificationInfo[1]/*/characterSet```)
* The value provided as a command line argument (`--encoding`)
* The value provided in a metadata file (`identificationInfo[1]/*/characterSet`)

(For now, the value provided in a companion file, such as .cpg for a shapefile, is ignored because it may contain a value such as "system")
For now, the value provided in a companion file, such as .cpg for a shapefile, is ignored because it may contain a value such as "system".

## Data charset validation

![validation process](img/process.png)
![Characters validation](img/characters-validation.png)

## Deep character validation

Deep character validation attempts to apply the following transforms. A validation error is raised if any of them modifies the string.

TODO :
* Create an error code per transform (currently, either an error or a warning is produced if the string is modified)
* Discuss separation between validation/normalization

### Double UTF-8 encoding

If a dataset is encoded in UTF-8 but declared as LATIN1, the reading process cannot detect an error. However, the resulting strings contain character sequences that rarely occur in real data.

The validator can optionally search for sequences of double-encoded UTF-8 characters and replace them with the original characters.

Command line option : ```--string-fix-utf8```
Command line option : `--string-fix-utf8`
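
The following sketch reproduces the problem and one possible repair. It is an illustration only: the class and method names are hypothetical, and it round-trips the whole string through ISO-8859-1, whereas the validator is described as searching for specific double-encoded sequences.

```java
import java.nio.charset.StandardCharsets;

final class DoubleUtf8Demo {

    // Simulates reading UTF-8 bytes while trusting a wrong LATIN1 declaration.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading. Only meaningful when every character of the
    // input fits in ISO-8859-1 and the resulting bytes are valid UTF-8.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";          // e with acute accent
        String garbled = misread(original);  // two characters instead of one
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

A single accented letter becomes a two-character sequence (here U+00C3 followed by U+00A9), which is the kind of pattern that rarely occurs in real data.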

### Character simplification

@@ -39,26 +35,26 @@ The validator optionally applies character replacement to produce normalized data

#### Common simplification

The file ```/validator-core/src/main/resources/simplify/common.csv``` defines these replacements.
The file [validator-core/src/main/resources/simplify/common.csv](../validator-core/src/main/resources/simplify/common.csv) defines these replacements.

Command line option : ```--string-simplify-common```
Command line option : `--string-simplify-common`

#### Charset specific simplification

The file ```/validator-core/src/main/resources/simplify/ISO-8859-1.csv``` defines these replacements for LATIN1.
The file [validator-core/src/main/resources/simplify/ISO-8859-1.csv](../validator-core/src/main/resources/simplify/ISO-8859-1.csv) defines these replacements for LATIN1.

Command line option : ```--string-simplify-charset <CHARSET>```
Command line option : `--string-simplify-charset <CHARSET>`

### Character escaping

#### Control characters

Non-standard control characters are detected and escaped in hexadecimal form (e.g. [\\u0092](http://www.fileformat.info/info/unicode/char/0092/index.htm)).

Command line option : ```--string-escape-controls```
Command line option : `--string-escape-controls`
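
A minimal sketch of this kind of escaping is shown below. The class name is hypothetical, and the definition of "non-standard" is an assumption: here it means any ISO control character (U+0000 to U+001F and U+007F to U+009F) other than tab, line feed and carriage return. The validator's actual rule may differ.

```java
final class ControlEscaper {

    // Assumption: tab, line feed and carriage return are considered standard.
    static boolean isNonStandardControl(char c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return false;
        }
        return Character.isISOControl(c);
    }

    // Replaces each non-standard control character by a backslash, the
    // letter u and four hexadecimal digits.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isNonStandardControl(c)) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a" + (char) 0x92 + "b";
        System.out.println(escape(input));
    }
}
```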

#### Characters not supported by specific charset

To ensure compatibility with a given charset, it is possible to escape unsupported characters too.

Command line option : ```--string-escape-charset <CHARSET>```
Command line option : `--string-escape-charset <CHARSET>`
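
One way to detect unsupported characters in Java is `CharsetEncoder.canEncode`. The sketch below assumes the same backslash-u escape form as for control characters; the class name and that output format are assumptions, not the validator's confirmed behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class CharsetEscaper {

    // Escapes every code point that the target charset cannot encode.
    static String escape(String input, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            String unit = new String(Character.toChars(codePoint));
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                // One escape per UTF-16 code unit of the unsupported code point.
                for (char c : unit.toCharArray()) {
                    out.append(String.format("\\u%04x", (int) c));
                }
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0153 (oe ligature) is not part of ISO-8859-1.
        System.out.println(escape("c\u0153ur", Charset.forName("ISO-8859-1")));
    }
}
```

Iterating by code point rather than by `char` avoids escaping the two halves of a surrogate pair when the target charset (UTF-8, for instance) can encode the full character.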
7 changes: 7 additions & 0 deletions doc/dependencies/geotools.md
@@ -1,6 +1,13 @@
# geotools

## Relation to java version

See [Geotools - Java Install](http://docs.geotools.org/latest/userguide/build/install/jdk.html#) to get information about java version support.

## Projection

Coordinate order behavior differs depending on whether `gt-epsg-hsql` or `gt-epsg-wkt` is used.

There is no way to manage the standard lat,lon order for EPSG:4326 with `gt-epsg-wkt`.

`gt-epsg-hsql` is therefore used, as `java -Dorg.geotools.referencing.forceXY=true` may still allow the non-standard lon,lat order.
23 changes: 3 additions & 20 deletions doc/dependencies/ogr2ogr.md
@@ -4,30 +4,13 @@

`ogr2ogr` from `gdal-bin` is used to convert input data to UTF-8 encoded CSV with WKT geometries.

Version **2.3.0** or later is recommended as it allows uniform charset handling for input data.
Version **2.3.0** or later is required as it allows uniform charset handling for input data.

## Tested versions

* 1.9.1
* 1.9.3
* 1.10.1
* 1.11.3
* 2.1.*
* 2.2.2
* 2.3.0
* 2.3.3
* 2.4.0

## Banned versions

* 1.9.0 : WKT bug (8000 chars limit)
* 2.4.2

## Notes

* Between GDAL 1.x and 2.x, GDAL introduced a regression in WKT precision management while converting to CSV (OGR_WKT_PRECISION is a global precision, not a number of decimals). However, coordinate accuracy is better with GDAL 2.x.
* Between GDAL 2.2 and 2.3, GDAL introduced a regression while converting LATIN1 TAB to CSV. Since GDAL 2.3.0, output CSV is UTF-8 encoded according to TAB declaration.
* Between GDAL 2.2 and 2.3, GDAL introduced a regression while converting LATIN1 TAB to CSV. Since GDAL 2.3.0, output CSV is UTF-8 encoded according to the charset declaration in TAB files.
16 changes: 0 additions & 16 deletions doc/geotools.md

This file was deleted.

File renamed without changes.
Binary file added doc/img/principe.jpg
Binary file added doc/img/principe.odg
Binary file not shown.
119 changes: 86 additions & 33 deletions doc/metadata/metadata.md → doc/metadata.md
@@ -1,39 +1,46 @@
# Metadata
# Metadata modeling

Metadata "attributes" with INSPIRE multiplicity for datasets according to INSPIRE_GUIDELINE_2017.
The following Metadata model is dedicated to validation according to INSPIRE and CNIG profiles. XML parsing is partial and based on XPath.

## Class diagram

The following profile is used to store metadata parsed from ISO 19115. Metadata attributes are based on INSPIRE requirements.

![Class diagram](uml/metadata.png)

| name | type | title | multiplicity |
| ------------------------- | ----------------------------- | ------------------------------------ | ------------ |
| fileIdentifier | String | File identifier | [0..1] |
| title | String | Resource title | [1] |
| abstract | String | Resource abstract | [1] |
| type | ScopeCode | Resource type | [1] |
| locators | OnlineResource[] | Resource locator | [1..*] |
| identifiers | String[] | Unique resource identifier | [1..*] |
| language | LanguageCode | Resource language | [0..*] (1) |
| topicCategory | TopicCategoryCode | Topic category | [1..*] (1) |
| keywords | Keywords | Keyword | [1..*] |
| extents | Extent[] | Extents with geographic bounding box | [1..*] |
| referenceSystemIdentifier | ReferenceSystemIdentifier | Coordinate Reference System | [0..*] (1) |
| dateOfPublication | Date | Date of publication | [0..*] (1) |
| dateOfLastRevision | Date | Date of last revision | [0..1] |
| dateOfCreation | Date | Date of creation | [0..1] |
| characterSet | CharacterSetCode | Character Encoding | [1..*] (1) |
| contraints | Contraint[] | Resource constraints | [0..*] |
| distributionFormats | Format | Encoding | [0..*] |
| spatialRepresentationType | SpatialRepresentationTypeCode | Spatial representation type | [1..*] (1) |
| lineage | String | Lineage | [1] |
| spatialResolutions | Resolution | Spatial resolution | [0..*] |
| specifications | Specification | Specification title and degree | [1..*] |
| contact | ResponsibleParty | Responsible party | [0..*] (1) |
| metadataContact | ResponsibleParty | Metadata point of contact | [1..*] (1) |
| metadataDate | Date | Metadata date | [1] |
| metadataLanguage | LanguageCode | Metadata language | [1] |


(1) multiplicity is adapted, only the first element is parsed
## Metadata properties

Metadata "attributes" with INSPIRE multiplicity for datasets according to INSPIRE_GUIDELINE_2017.

| name | type | title | multiplicity |
|---------------------------|---------------------------------|--------------------------------------|--------------|
| contraints | `Contraint[]` | Resource constraints | [0..*] |
| distributionFormats | `Format` | Encoding | [0..*] |
| spatialResolutions | `Resolution` | Spatial resolution | [0..*] |
| language | `LanguageCode` | Resource language | [0..*] (1) |
| referenceSystemIdentifier | `ReferenceSystemIdentifier` | Coordinate Reference System | [0..*] (1) |
| dateOfPublication | `Date` | Date of publication | [0..*] (1) |
| contact | `ResponsibleParty` | Responsible party | [0..*] (1) |
| fileIdentifier | `String` | File identifier | [0..1] |
| dateOfLastRevision | `Date` | Date of last revision | [0..1] |
| dateOfCreation | `Date` | Date of creation | [0..1] |
| locators | `OnlineResource[]` | Resource locator | [1..*] |
| identifiers | `String[]` | Unique resource identifier | [1..*] |
| keywords | `Keywords` | Keyword | [1..*] |
| extents | `Extent[]` | Extents with geographic bounding box | [1..*] |
| specifications | `Specification` | Specification title and degree | [1..*] |
| topicCategory | `TopicCategoryCode` | Topic category | [1..*] (1) |
| characterSet | `CharacterSetCode` | Character Encoding | [1..*] (1) |
| spatialRepresentationType | `SpatialRepresentationTypeCode` | Spatial representation type | [1..*] (1) |
| metadataContact | `ResponsibleParty` | Metadata point of contact | [1..*] (1) |
| title | `String` | Resource title | [1] |
| abstract | `String` | Resource abstract | [1] |
| type | `ScopeCode` | Resource type | [1] |
| lineage | `String` | Lineage | [1] |
| metadataDate | `Date` | Metadata date | [1] |
| metadataLanguage | `LanguageCode` | Metadata language | [1] |

> (1) multiplicity is adapted, only the first element is parsed
## fileIdentifier

@@ -75,7 +82,7 @@ identificationInfo[1]/*/citation/*/title
### References

* INSPIRE_GUIDELINE_2017 - 2.3 Identification info section / 2.3.1 Resource title (p14)
* INSPIRE_GUIDELINE_2013 - 2.2 Identification / 2.2.1 Resource title (p17)
* INSPIRE_GUIDELINE_2013 - 2.2 Identification / 2.2.1 Resource title (p17)
* CNIG_MD_DU - 1) Identification des données / Intitulé de la resource (p4)


@@ -595,3 +602,49 @@ language

* INSPIRE_GUIDELINE_2013 - 2.11.3 Metadata language / 2.11.3 Metadata langage (p60)
* CNIG_MD_DU - 10) Métadonnées concernant les métadonnées / Langue des métadonnées (p15)


## Resources

### Documents (english)

* INSPIRE_GUIDELINE_2017: [Technical Guidance for the implementation of INSPIRE dataset and service metadata based on ISO/TS 19139:2007](https://inspire.ec.europa.eu/id/document/tg/metadata-iso19139)

* INSPIRE_GUIDELINE_2013: [INSPIRE Metadata Implementing Rules: Technical Guidelines based on EN ISO 19115 and EN ISO 19119](https://inspire.ec.europa.eu/documents/inspire-metadata-implementing-rules-technical-guidelines-based-en-iso-19115-and-en-iso-1)

### Documents (french)

* CNIG_MD_INSPIRE - [Guide de saisie des éléments de métadonnées INSPIRE - juillet 2014](http://inspire.ec.europa.eu/documents/Metadata/MD_IR_and_ISO_20131029.pdf)

* CNIG_MD_DU - [CNIG - Consignes de saisie des Métadonnées INSPIRE pour les documents d’urbanisme - septembre 2017](http://cnig.gouv.fr/wp-content/uploads/2017/09/170914_consignes_saisie_metadonnees_DU_vprojet.pdf)

* CNIG_MD_SUP - [CNIG - Consignes de saisie des Métadonnées INSPIRE pour les servitudes d’utilité publique](http://cnig.gouv.fr/wp-content/uploads/2017/09/170914_consignes_saisie_metadonnees_SUP_vprojet.pdf)

### Normative references

* [ISO 19115-1:2014 - Geographic information -- Metadata -- Part 1: Fundamentals](https://www.iso.org/fr/standard/53798.html)
* [ISO/TS 19139:2007 - Geographic information -- Metadata -- XML schema implementation](https://www.iso.org/standard/32557.html)

### XSD schemas and resources

* [http://www.isotc211.org/2005/](http://www.isotc211.org/2005/)
* [http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml](http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml)

* [INSPIRE metadata code list register](http://inspire.ec.europa.eu/metadata-codelist/)
* [ISO 19115 and 19115-2 CodeList Dictionaries](https://geo-ide.noaa.gov/wiki/index.php?title=ISO_19115_and_19115-2_CodeList_Dictionaries)

### Third-party tools

* [http://www.isotc211.org/2005/gmd - schema explorer](http://www.datypic.com/sc/niem21/ns-gmd.html)
* [INSPIRE metadata validator](http://inspire-geoportal.ec.europa.eu/validator2/)