Merge pull request #58 from tegonal/format
fix formatting issues in components.md
robstoll authored Feb 17, 2021
2 parents 1dca6b4 + f1093b3 commit 5a8e8d5
7 changes: 5 additions & 2 deletions docs/components.md
## Data collection
Data is collected from a selection of repositories (see figure below), either on a schedule or, where possible, triggered by webhooks. For media files hosted on Wikimedia, only the metadata and a link to the media file are collected.
To increase transparency and to protect against data sources going offline, weekly backups of the data are made.
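The scheduled collection and weekly backup described above can be sketched as follows. This is a minimal illustration, not the project's actual collection code: the source names, the stub fetchers, and the `collect` function are all assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical registry of data sources; the stub lambdas stand in for
# real fetchers that would call each repository's API or a webhook payload.
SOURCES = {
    "wikidata": lambda: [{"id": "Q-placeholder", "name_fr": "fontaine"}],
    "zurich_ogd": lambda: [{"id": "brunnen-placeholder", "name": "Brunnen"}],
}

def collect(last_backup=None, now=None):
    """Fetch every configured source; back up the raw data once a week."""
    now = now or datetime.now(timezone.utc)
    snapshot = {name: fetch() for name, fetch in SOURCES.items()}
    if last_backup is None or now - last_backup >= timedelta(weeks=1):
        backup = json.dumps(snapshot)  # in practice: write to backup storage
        last_backup = now
    return snapshot, last_backup
```

In a real deployment the loop would be driven by a scheduler (cron or similar), with webhook-triggered runs bypassing the schedule.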

## Data assimilation
The data collected from the different sources is imported into a data structure (e.g. a database table, see "consolidation DB" in figure below) in which each row holds the information for a single fountain as read from a single data source. A single fountain can therefore have multiple rows, and the origin of each piece of information is recorded.
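The row-per-source layout can be sketched as a small relational table. The schema below is a hypothetical illustration of the consolidation DB; the table and column names are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical "consolidation DB": one row per (fountain, source) pair,
# so a fountain known to several sources occupies several rows and the
# origin of every value stays traceable.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE consolidation (
        fountain_key TEXT,  -- key used later for grouping/merging
        source       TEXT,  -- e.g. 'wikidata' or 'zurich_ogd'
        name_fr      TEXT,
        lat          REAL,
        lon          REAL
    )
""")
con.executemany(
    "INSERT INTO consolidation VALUES (?, ?, ?, ?, ?)",
    [
        ("f1", "wikidata",   "Fontaine du parc", 47.3769, 8.5417),
        ("f1", "zurich_ogd", "Fontaine du parc", 47.3770, 8.5416),
    ],
)
# The same fountain appears once per data source:
rows_for_f1 = con.execute(
    "SELECT COUNT(*) FROM consolidation WHERE fountain_key = 'f1'"
).fetchone()[0]
```

Keeping one row per source, rather than merging eagerly, is what lets the later export step choose between conflicting values by authority.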
Data is imported into the data structure (see "data importer" in figure below) by scripts and configuration files which:
1. map property names from the source to the data structure (e.g. "name_fr": "nom")
2. indicate no-data values (e.g. 'inconnu')
3. provide missing metadata (e.g. city = 'Geneva' or water_quality = 'excellent')
4. set the authority level of the data source (Zurich OGD is of higher authority than Wikidata), which is relevant for the merging process
5. provide information on the estimated accuracy of the fountain coordinates (e.g. +/- 1 m)
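The five configuration points above can be sketched as a per-source config plus a small translation function. Everything here is illustrative: the key names, the `GENEVA_CONFIG` values, and `import_row` are assumptions, not the project's actual configuration format.

```python
# Hypothetical importer configuration for one source, covering the five
# points above; key names and values are illustrative.
GENEVA_CONFIG = {
    "property_map": {"nom": "name_fr"},       # 1. rename source properties
    "nodata_values": ["inconnu", ""],         # 2. values treated as missing
    "defaults": {"city": "Geneva"},           # 3. metadata missing from source
    "authority": 5,                           # 4. trust level for merging
    "coord_accuracy_m": 1.0,                  # 5. coordinate accuracy (+/- m)
}

def import_row(raw, config):
    """Translate one raw source record into a consolidation-DB row."""
    row = dict(config["defaults"])
    for src_key, dst_key in config["property_map"].items():
        value = raw.get(src_key)
        if value is not None and value not in config["nodata_values"]:
            row[dst_key] = value
    row["authority"] = config["authority"]
    row["coord_accuracy_m"] = config["coord_accuracy_m"]
    return row
```

For example, `import_row({"nom": "Fontaine du parc"}, GENEVA_CONFIG)` yields a row with `name_fr` set, `city` filled in from the defaults, and the authority and accuracy attached for the merge step.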

## Data exporting/merging

The data served to the web app must meet certain quality standards (no duplicates, certain required fields). The data export step polishes the data quality and formats the data as JSON for the web app:
1. Merge duplicates:
- The rows of the data structure are grouped by similarity of location and of given name. For the location, a distance threshold can be defined. For the comparison of names, many algorithms are available: Hamming distance, Levenshtein distance, Damerau–Levenshtein distance, Jaro–Winkler distance. A smart combination of the name and location distances must be designed (e.g. if the name matches perfectly, then the location doesn't matter as much). It would be sensible to normalize the geometric distance by the estimated accuracy of the coordinates. Warning: two empty names must have a non-zero distance.
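One way the combined distance could look is sketched below, using Levenshtein distance for the names. The weighting scheme, the penalty for two empty names, and the function names are all assumptions for illustration, not the project's chosen design.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def combined_distance(name_a, name_b, metres_apart, accuracy_m=1.0,
                      empty_name_penalty=0.5):
    """Hypothetical combination of name and location distance.

    - The geometric distance is normalized by the estimated coordinate
      accuracy, as suggested above.
    - A perfect name match downweights the geometric term.
    - Two empty names get a fixed non-zero penalty, per the warning above.
    """
    if not name_a and not name_b:
        name_dist = empty_name_penalty  # empty names are not "identical"
    else:
        name_dist = levenshtein(name_a, name_b) / max(len(name_a), len(name_b), 1)
    geo_dist = metres_apart / accuracy_m
    weight = 0.1 if name_dist == 0 else 1.0  # exact match: location matters less
    return name_dist + weight * geo_dist
```

Pairs whose combined distance falls under a tuned threshold would then be merged, with conflicting values resolved by the authority levels set during import.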