From 667661c762c1a2b61b867d853857d30be62d55fb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mikel=20Ega=C3=B1a=20Aranguren?=
Date: Wed, 2 Jun 2021 09:31:16 +0200
Subject: [PATCH 1/2] Add minor fixes to MarkDown documentation

---
 doc/Plugins.md | 83 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/doc/Plugins.md b/doc/Plugins.md
index c8552830e8..6f0f9bd60b 100644
--- a/doc/Plugins.md
+++ b/doc/Plugins.md
@@ -1,4 +1,6 @@
-# Data Sources
+# Plugins
+
+## Data Sources

- Sparql Endpoints
- RDF files
@@ -7,10 +9,11 @@
- CSV files
- MySQL database

-# Similarity Measures
+## Similarity Measures
The following similarity measures are included:
-## Characterbased
+
+### Character based

Character-based distance measures compare strings on the character level. They are well suited for
handling typographical errors.

| qGrams([q: Int = '2'], [minChar: Char = '0'], [maxChar: Char = 'z']) | qGrams | String similarity based on q-grams (by default q=2). |
| substring([granularity: String = '3']) | SubString | Returns a value from 0 (strong similarity) to 1 (weak similarity) |
-## Tokenbased
+### Token based
+
+While character-based distance measures work well for typographical errors, there are a number of tasks where token-based distance measures are better suited:
-While character-based distance measures work well for typographical
-errors, there are a number of tasks where token-base distance measures are better suited:

- Strings where parts are reordered e.g. “John Doe” and “Doe, John”
- Texts consisting of multiple words

| jaccard() | Jaccard | Jaccard similarity coefficient. |
| softjaccard([maxDistance: Int = '1']) | Soft Jaccard | Soft Jaccard similarity coefficient. Same as Jaccard distance but values within a Levenshtein distance of 'maxDistance' are considered equivalent. |
| tokenwiseDistance([ignoreCase: Boolean = 'true'], [metricName: String = 'levenshtein'], [splitRegex: String = '[\s\d\p{Punct}]+'], [stopwords: String = ''], [stopwordWeight: Double = '0.01'], [nonStopwordWeight: Double = '0.1'], [useIncrementalIdfWeights: Boolean = 'false'], [matchThreshold: Double = '0.0'], [orderingImpact: Double = '0.0'], [adjustByTokenLength: Boolean = 'false']) | Token-wise Distance | Token-wise string distance using the specified metric |
-## Numeric
+
+### Numeric
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| date() | Date | The distance in days between two dates ('YYYY-MM-DD' format). |
| dateTime() | DateTime | Distance between two date time values ('YYYY-MM-DDTHH:mm:ss' format) in seconds. |
| insideNumericInterval([separator: String]) | Inside numeric interval | Checks if a number is contained inside a numeric interval, such as '1900 - 2000' |
| num([minValue: Double = '-Infinity'], [maxValue: Double = 'Infinity']) | Numeric similarity | Computes the numeric distance between two numbers. |
| wgs84([unit: String = 'km']) | Geographical distance | Computes the geographical distance between two points. Author: Konrad Höffner (MOLE subgroup of Research Group AKSW, University of Leipzig) |
-## Equality
+
+### Equality
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| constant([value: Double = '1.0']) | Constant | Always returns a constant similarity value. |
| inequality() | Inequality | Return 1 if strings are not equal, 0 otherwise. |
| lowerThan([orEqual: Boolean = 'false']) | LowerThan | Return 1 if the source value is lower than the target value, 0 otherwise. If both strings are numbers, numerical order is used for comparison. Otherwise, alphanumerical order is used |
| relaxedEquality() | RelaxedEquality | Return 1 if strings are equal, 0 otherwise.
Lower/upper case and differences like ö/o, n/ñ, c/ç etc. are treated as equal. | -## Asian + +### Asian + | Function and parameters | Name | Description | | --- | --- | --- | | cjkReadingDistance([minChar: Char = '0'], [maxChar: Char = 'z']) | CJK Reading Distance | CJK Reading Distance. | | korean TranslitDistance([minChar: Char = '0'], [maxChar: Char = 'z']) | Korean translit distance | Transliterated Korean distance. | | koreanPhonemeDistance([minChar: Char = '0'], [maxChar: Char = 'z']) | Korean phoneme distance | Korean phoneme distance. | -## Spatial Distances +### Spatial Distances + | Function and parameters | Name | Description | | --- | --- | --- | | CentroidDistanceMetric([blockingParameter: Double = 1.0]) | Centroid distance | Computes the distance between the centroids of two geometries in meters. | | MinDistanceMetric([blockingParameter: Double = 1.0]) | Min distance | Computes the minimum distance between two geometries in meters. | -## Spatial Relations +### Spatial Relations + | Function and parameters | Name | Description | | --- | --- | --- | | SContainsMetric([blockingParameter: Double = 1.0]) | Spatial Contains | Computes the relation "contains" between two geometries. | @@ -80,7 +91,8 @@ errors, there are a number of tasks where token-base distance measures are bette | TouchesMetric([blockingParameter: Double = 1.0]) | Touches | Computes the relation "touches" between two geometries. | | WithinMetric([blockingParameter: Double = 1.0]) | Within | Computes the relation "within" between two geometries. | -## Temporal Distances +### Temporal Distances + | Function and parameters | Name | Description | | --- | --- | --- | | DaysDistanceMetric([blockingParameter: Double = 1.0]) | Days distance | Computes the distance in days between two time periods or instants. 
| @@ -91,7 +103,8 @@ errors, there are a number of tasks where token-base distance measures are bette | SecsDistanceMetric([blockingParameter: Double = 1.0]) | Secs distance | Computes the distance in secs between two time periods or instants. | | YearsDistanceMetric([blockingParameter: Double = 1.0]) | Years distance | Computes the distance in years between two time periods or instants. | -## Temporal Relations +### Temporal Relations + | Function and parameters | Name | Description | | --- | --- | --- | | AfterMetric([blockingParameter: Double = 1.0]) | After | Computes the relation "after" between two time periods or instants. | @@ -108,21 +121,27 @@ errors, there are a number of tasks where token-base distance measures are bette | TOverlapsMetric([blockingParameter: Double = 1.0]) | Temporal Overlaps | Computes the relation "overlaps" between two time periods or instants. | | StartsMetric([blockingParameter: Double = 1.0]) | Starts | Computes the relation "starts" between two time periods or instants. | -# Transformations +## Transformations The following transform and normalization functions are included: -## Replace + +### Replace + | Function and parameters | Name | Description | | --- | --- | --- | | regexReplace(regex: String, replace: String) | Regex replace | Replace all occurrences of a regex "regex" with "replace" in a string. | | replace(search: String, replace: String) | Replace | Replace all occurrences of a string "search" with "replace" in a string. | -## Combine + +### Combine + | Function and parameters | Name | Description | | --- | --- | --- | | concat([glue: String = '']) | Concatenate | Concatenates strings from two inputs. | | concatMultiValues([glue: String = ''], [removeDuplicates: Boolean = 'false']) | ConcatenateMultipleValues | Concatenates multiple values received for an input. If applied to multiple inputs, yields at most one value per input. Optionally removes duplicate values. | | merge() | Merge | Merges the values of all inputs. 
| -## Normalize + +### Normalize + | Function and parameters | Name | Description | | --- | --- | --- | | alphaReduce() | Alpha reduce | Strips all non-alphabetic characters from a string. | @@ -134,7 +153,9 @@ The following transform and normalization functions are included: | stripUriPrefix() | Strip URI prefix | Strips the URI prefix and decodes the remainder. Leaves values unchanged which don't start with 'http:' | | trim() | Trim | Remove leading and trailing whitespaces. | | upperCase() | Upper case | Converts a string to upper case. | -## Linguistic + +### Linguistic + | Function and parameters | Name | Description | | --- | --- | --- | | NYSIIS([refined: Boolean = 'true']) | NYSIIS | NYSIIS phonetic encoding. Provided by the StringMetric library: http://rockymadden.com/stringmetric/ | @@ -143,7 +164,9 @@ The following transform and normalization functions are included: | soundex([refined: Boolean = 'true']) | Soundex | Soundex algorithm. Provided by the StringMetric library: http://rockymadden.com/stringmetric/ | | spotlight() | Spotlight | Concatenates all values to a string and gets a weighted entity vector from the Spotlight service. | | stem() | Stem | Stems a string using the Porter Stemmer. | -## Substring + +### Substring + | Function and parameters | Name | Description | | --- | --- | --- | | stripPostfix(postfix: String) | Strip postfix | Strips a postfix of a string. | @@ -152,29 +175,39 @@ The following transform and normalization functions are included: | substring([beginIndex: Int = '0'], [endIndex: Int = '0']) | Substring | Returns a substring between 'beginIndex' (inclusive) and 'endIndex' (exclusive). If 'endIndex' is 0 (default), it is ignored and the entire remaining string starting with 'beginIndex' is returned. If 'endIndex' is negative, -endIndex characters are removed from the end.' | | trim() | Trim | Remove leading and trailing whitespaces. 
| untilCharacter(untilCharacter: Char) | Until Character | Gives the substring up to the given character |
-## Conversion
+
+### Conversion
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| convertCharset([sourceCharset: String = 'ISO-8859-1'], [targetCharset: String = 'UTF-8']) | Convert Charset | Convert the string from "sourceCharset" to "targetCharset". |
-## Filter
+
+### Filter
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| filterByLength([min: Int = '0'], [max: Int = '2147483647']) | filter by length | Removes all strings that are shorter than 'min' characters or longer than 'max' characters. |
| filterByRegex(regex: String, [negate: Boolean = 'false']) | filter by regex | Removes all strings that do NOT match a regex. If 'negate' is true, only strings will be removed that match the regex. |
| removeEmptyValues() | Remove empty values | Removes empty values. |
| removeValues(blacklist: String) | Remove values | Removes values. |
-## Tokenization
+
+### Tokenization
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| camelcasetokenizer() | Camel Case Tokenizer | Tokenizes a camel case string. That is, it splits strings between a lower case character and an upper case character. |
| tokenize([regex: String = '\s']) | Tokenize | Tokenizes all input values. |
-## Numeric
+
+### Numeric
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| aggregateNumbers(operator: String) | Aggregate Numbers | Aggregates all numbers in this set using a mathematical operation. |
| compareNumbers([comparator: String = '<']) | Compare Numbers | Compares the numbers of two sets. |
| numReduce() | Numeric reduce | Strip all non-numeric characters from a string. |
-## Date
+
+### Date
+
| Function and parameters | Name | Description |
| --- | --- | --- |
| compareDates([comparator: String = '<']) | Compare Dates | Compares two dates. Returns 1 if the comparison yields true and 0 otherwise.
If there are multiple dates in both sets, the comparator must be true for all dates. e.g. {2014-08-02,2014-08-03} < {2014-08-03} yields 0 as not all dates in the first set are smaller than in the second. Accepts one parameter: comparator: One of '<', '<=', '=', '>=', '>' |
| datetoTimestamp() | Date to timestamp | Converts an xsd:date to a Unix timestamp |
| daysBetweenDates() | Days between dates | Computes the number of days between two dates |
| durationInSeconds() | Duration in Seconds | Converts an xsd:duration to seconds. |
| timeToDate() | Timestamp to date | Converts a Unix timestamp to xsd:date |

+### Spatial

-## Spatial
| Function and parameters | Name | Description |
| --- | --- | --- |
| AreaTransformer | Area Transformer | Returns the area of the input geometry. |

From 4aa7546d417614caab26ace37df8b111f2368adc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mikel=20Ega=C3=B1a=20Aranguren?=
Date: Wed, 2 Jun 2021 09:34:21 +0200
Subject: [PATCH 2/2] Fix Readme (Forgot it in last commit)

---
 doc/Readme.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/Readme.md b/doc/Readme.md
index 52f4d40815..4592e728b4 100644
--- a/doc/Readme.md
+++ b/doc/Readme.md
@@ -1,8 +1,10 @@
# Silk Documentation

-- [Overview of the Silk Workbench](Workbench.md)
+- [Overview of the Silk Workbench](Overview.md)
- [Linking with the Silk Workbench](Linking.md)
-  - [REST API](RestInterface.md)
+  - [Command Line](CommandLine.md)
+  - [Configuration](Configuration.md)
+  - [Internal API](Internal.md)
- [Linkage Rules](LinkageRules.md)
- [The Silk Link Specification Language](LinkSpecificationLanguage.md)
- [Available Plugins](Plugins.md)
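Note for reviewers: the first patch only reshuffles headings, so the semantics of the listed measures are unchanged. For readers new to them, here is a minimal, self-contained Python sketch of two measures named in the tables above: Levenshtein edit distance (the default metric of tokenwiseDistance) and the Jaccard coefficient (jaccard()). This is an independent illustration of the documented behaviour, not Silk's Scala implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0  # two empty token sets are trivially identical
    return len(a & b) / len(a | b)


# Reordered parts, as in the token-based example "John Doe" / "Doe, John":
left = set("John Doe".lower().split())
right = set("Doe John".lower().split())
print(levenshtein("kitten", "sitting"))  # 3
print(jaccard(left, right))              # 1.0
```

The token-based measure scores the reordered name as identical (1.0), whereas a character-level edit distance penalises the reordering heavily; this is exactly the trade-off the "Token based" section describes.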