A plugin bundle for Elasticsearch

This plugin contains 18 token filters, 4 tokenizers, 2 char filters, 5 analyzers, 4 field mappers, and 2 REST actions to enhance Elasticsearch.

It’s the combination of the following plugins:

elasticsearch-analysis-autophrase
elasticsearch-analysis-baseform
elasticsearch-analysis-concat
elasticsearch-analysis-decompound
elasticsearch-analysis-german
elasticsearch-analysis-hyphen
elasticsearch-analysis-icu
elasticsearch-analysis-naturalsort
elasticsearch-analysis-sortform
elasticsearch-analysis-standardnumber
elasticsearch-analysis-symbolname
elasticsearch-analysis-worddelimiter
elasticsearch-analysis-year
elasticsearch-mapper-crypt
elasticsearch-mapper-langdetect
elasticsearch-mapper-reference

The code of each individual plugin is equivalent to the code in this combined bundle plugin.

Table 1. Compatibility matrix

Plugin version	Elasticsearch version	Release date
5.1.1.0	5.1.1	Dec 31, 2016
2.3.4.0	2.3.4	Jul 30, 2016
2.3.3.0	2.3.3	May 23, 2016
2.3.2.0	2.3.2	May 11, 2016
2.2.0.6	2.2.0	Mar 25, 2016
2.2.0.3	2.2.0	Mar 6, 2016
2.2.0.2	2.2.0	Mar 3, 2016
2.2.0.1	2.2.0	Feb 22, 2016
2.2.0.0	2.2.0	Feb 8, 2016
2.1.1.2	2.1.1	Dec 30, 2015
2.1.1.0	2.1.1	Dec 21, 2015
2.1.0.0	2.1.0	Nov 27, 2015
2.0.0.0	2.0.0	Oct 28, 2015
1.6.0.0	1.6.0	Jun 30, 2015
1.5.2.1	1.5.2	Jun 30, 2015
1.5.2.0	1.5.2	Apr 27, 2015
1.5.1.0	1.5.1	Apr 23, 2015
1.5.0.0	1.5.0	Mar 31, 2015
1.4.4.0	1.4.4	Apr 26, 2015
1.4.0.6	1.4.0	Feb 23, 2015
1.4.0.5	1.4.0	Jan 28, 2015
1.4.0.4	1.4.0	Jan 19, 2015
1.4.0.3	1.4.0	Dec 16, 2014
1.4.0.1	1.4.0	Nov 10, 2014

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install 'http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.1.1.0/elasticsearch-plugin-bundle-5.1.1.0-plugin.zip'

or

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.1.1.0/elasticsearch-plugin-bundle-5.1.1.0-plugin.zip

Do not forget to restart the node after installing.

Elasticsearch 2.x

./bin/plugin install 'http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/2.3.3.0/elasticsearch-plugin-bundle-2.3.3.0-plugin.zip'

or

./bin/plugin install 'http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/2.3.4.0/elasticsearch-plugin-bundle-2.3.4.0-plugin.zip'

Do not forget to restart the node after installing.

Elasticsearch 1.x

./bin/plugin -install bundle -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/1.6.0.0/elasticsearch-plugin-bundle-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Examples

German normalizer

The german_normalize token filter is equivalent to the Elasticsearch german_normalization token filter. It performs umlaut treatment with vowel expansion, which is typical for the German language.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "filter": {
               "umlaut": {
                  "type": "german_normalize"
               }
            },
            "analyzer": {
               "umlaut": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [
                     "umlaut",
                     "lowercase"
                  ]
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "umlaut"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Jörg Prante"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "Jörg"
        }
    }
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "joerg"
        }
    }
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "jorg"
        }
    }
}
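To see why the three queries above can hit the same document, it helps to look at what the normalization does to a single token. The following Python sketch is only an approximation of the Lucene German normalization rules (ß becomes ss; ä, ö, ü become a, o, u; ae and oe become a and o; ue becomes u unless it follows a vowel or q). The function name german_normalize is made up for illustration, and lowercasing is folded into the function here, whereas the analyzer above uses a separate lowercase filter.

```python
import re

# Illustrative approximation only; not the plugin's or Lucene's actual code.
_UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def german_normalize(token: str) -> str:
    """Reduce a token to a form where umlaut spelling variants coincide."""
    t = token.lower()  # assumption: lowercase first, for simplicity
    t = t.replace("ß", "ss")
    # "ue" -> "u", except after a vowel or "q" (keeps "quelle", "treue")
    t = re.sub(r"(?<![aeiouyäöüq])ue", "u", t)
    t = t.replace("ae", "a").replace("oe", "o")
    return t.translate(_UMLAUTS)


for query in ("Jörg", "joerg", "jorg"):
    print(query, "->", german_normalize(query))
```

All three spellings reduce to the same token in this sketch, which is the behaviour the search requests above rely on.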

International components for Unicode

The plugin contains an extended version of the Lucene ICU functionality with a dependency on ICU 58.2.

The following components are available: icu_collation, icu_folding, icu_tokenizer, icu_numberformat, icu_transform.

icu_collation

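The icu_collation analyzer can apply ICU collation rules to a field, so that values sort in a locale-aware order. The example below tailors German collation with custom rules.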

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "icu_german_collate": {
                  "type": "icu_collation",
                  "language": "de",
                  "country": "DE",
                  "strength": "primary",
                  "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü"
               },
               "icu_german_collate_without_punct": {
                  "type": "icu_collation",
                  "language": "de",
                  "country": "DE",
                  "strength": "quaternary",
                  "alternate": "shifted",
                  "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fielddata" : true,
               "analyzer": "icu_german_collate"
            },
            "catalog_text" : {
               "type": "text",
               "fielddata" : true,
               "analyzer": "icu_german_collate_without_punct"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Göbel",
    "catalog_text" : "Göbel"
}

PUT /test/docs/2
{
    "text" : "Goethe",
    "catalog_text" : "G-oethe"
}

PUT /test/docs/3
{
    "text" : "Goldmann",
    "catalog_text" : "Gold*mann"
}

PUT /test/docs/4
{
    "text" : "Göthe",
    "catalog_text" : "Göthe"
}

PUT /test/docs/5
{
    "text" : "Götz",
    "catalog_text" : "Götz"
}


POST /test/docs/_search
{
    "query": {
        "match_all": {
        }
    },
    "sort" : {
        "text" : { "order" : "asc" }
    }
}

POST /test/docs/_search
{
    "query": {
        "match_all": {
        }
    },
    "sort" : {
        "catalog_text" : { "order" : "asc" }
    }
}
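As a rough illustration of what the tailoring rules in icu_german_collate aim for, the Python sketch below sorts the five "text" values with a key that treats ä, ö, ü like ae, oe, ue and ignores case. That is approximately how a primary-strength comparison with these rules would be expected to behave. It is a stand-in under stated assumptions: it does not call ICU, the helper name primary_key is hypothetical, and it does not model the "shifted" punctuation handling used for catalog_text.

```python
# Hypothetical stand-in for a primary-strength German collation key.
_EXPAND = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(value: str) -> str:
    """Build a sort key that ignores case and expands umlauts."""
    return "".join(_EXPAND.get(ch, ch) for ch in value.casefold())


names = ["Göbel", "Goethe", "Goldmann", "Göthe", "Götz"]
print(sorted(names, key=primary_key))
# ['Göbel', 'Goethe', 'Göthe', 'Götz', 'Goldmann']
```

"Goethe" and "Göthe" receive the same key in this sketch, so their relative order simply follows the input order.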

icu_folding

The icu_folding character filter folds characters in strings according to Unicode folding rules. UTR #30 has been withdrawn, but its folding rules are still used here.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "char_filter": {
               "my_icu_folder": {
                  "type": "icu_folding"
               }
            },
            "tokenizer": {
               "my_icu_tokenizer": {
                  "type": "icu_tokenizer"
               }
            },
            "filter": {
               "my_icu_folder_filter": {
                  "type": "icu_folding"
               },
               "my_icu_folder_filter_with_exceptions": {
                  "type": "icu_folding",
                  "name": "utr30",
                  "unicodeSetFilter": "[^åäöÅÄÖ]"
               }
            },
            "analyzer": {
               "my_icu_analyzer": {
                  "type": "custom",
                  "tokenizer": "my_icu_tokenizer",
                  "filter": [ "my_icu_folder_filter" ]
               },
               "my_icu_analyzer_with_exceptions": {
                  "type": "custom",
                  "tokenizer": "my_icu_tokenizer",
                  "filter": [ "my_icu_folder_filter_with_exceptions" ]
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fielddata" : true,
               "analyzer": "my_icu_analyzer"
            },
            "text2" : {
               "type": "text",
               "fielddata" : true,
               "analyzer": "my_icu_analyzer_with_exceptions"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Jörg Prante",
    "text2" : "Jörg Prante"
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text" : "jörg"
        }
    }
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text" : "jorg"
        }
    }
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text2" : "jörg"
        }
    }
}

// no hit for the next query: the unicodeSetFilter leaves ö unfolded in text2, so "jorg" does not match

POST /test/docs/_search
{
    "query": {
        "match": {
            "text2" : "jorg"
        }
    }
}
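The effect of the unicodeSetFilter exception can be imitated outside Elasticsearch. The Python sketch below is a simplified stand-in for folding, assuming that lowercasing plus stripping combining marks is close enough for this example; the real folding rules cover far more cases. The function name fold and its keep parameter are hypothetical and only mirror the idea of excluding åäöÅÄÖ from folding.

```python
import unicodedata


def fold(token: str, keep: str = "") -> str:
    """Lowercase and strip diacritics, leaving characters in `keep` untouched."""
    token = unicodedata.normalize("NFC", token)
    keep = unicodedata.normalize("NFC", keep)
    out = []
    for ch in token:
        if ch in keep:
            out.append(ch)  # excluded from folding, like the unicodeSetFilter
            continue
        decomposed = unicodedata.normalize("NFKD", ch.casefold())
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


exceptions = "åäöÅÄÖ"
print(fold("Jörg"))              # jorg
print(fold("Jörg", exceptions))  # jörg
print(fold("jorg", exceptions) == fold("Jörg", exceptions))  # False
```

With the exception set, the indexed token keeps its ö, which is why the query for "jorg" on text2 finds nothing while the query for "jörg" does.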

icu_tokenizer

The icu_tokenizer can load rules from a file. Here, we set up rules that prevent hyphenated words from being split into separate tokens.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "tokenizer": {
               "my_hyphen_icu_tokenizer": {
                  "type": "icu_tokenizer",
                  "rulefiles": "Latn:icu/Latin-dont-break-on-hyphens.rbbi"
               }
            },
            "analyzer" : {
               "my_icu_analyzer" : {
                   "type" : "custom",
                   "tokenizer" : "my_hyphen_icu_tokenizer"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "we do-not-break on hyphens"
}

POST /test/docs/_search?explain
{
    "query": {
        "term": {
            "text" : "do-not-break"
        }
    }
}
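The term query above is not analyzed, so it only matches if "do-not-break" was indexed as a single token. The difference between the two tokenization behaviours can be sketched in Python with regular expressions. This is a loose illustration, not the RBBI rule file referenced in the settings, and the variable names are made up for this example.

```python
import re

text = "we do-not-break on hyphens"

# Roughly what a default word tokenizer does: hyphens separate tokens.
default_tokens = re.findall(r"\w+", text)

# Keep runs of word characters joined by single hyphens as one token.
hyphen_tokens = re.findall(r"\w+(?:-\w+)*", text)

print(default_tokens)  # ['we', 'do', 'not', 'break', 'on', 'hyphens']
print(hyphen_tokens)   # ['we', 'do-not-break', 'on', 'hyphens']
```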

icu_numberformat

With the icu_numberformat filter, you can index numbers as they are spelled out in a language.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "filter": {
               "spellout_de": {
                  "type": "icu_numberformat",
                  "locale": "de",
                  "format": "spellout"
               }
            },
            "analyzer": {
               "my_icu_analyzer": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [ "spellout_de" ]
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Das sind 1000 Bücher"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
            "text" : "eintausend"
        }
    }
}

Baseform

{
 "index":{
    "analysis":{
        "filter":{
            "baseform":{
                "type" : "baseform",
                "language" : "de"
            }
        },
        "tokenizer" : {
            "baseform" : {
               "type" : "standard",
               "filter" : [ "baseform", "unique" ]
            }
        }
    }
 }
}

WordDelimiterFilter2

{
    "index":{
        "analysis":{
            "filter" : {
                "wd" : {
                   "type" : "worddelimiter2",
                   "generate_word_parts" : true,
                   "generate_number_parts" : true,
                   "catenate_all" : true,
                   "split_on_case_change" : true,
                   "split_on_numerics" : true,
                   "stem_english_possessive" : true
                }
            }
        }
    }
}

Example

In the index settings, use a token filter of type "decompound":

{
   "index" : {
      "analysis" : {
          "filter" : {
              "decomp" : {
                  "type" : "decompound"
              }
          },
          "tokenizer" : {
              "decomp" : {
                 "type" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"

It is recommended to add the Unique token filter (http://www.elasticsearch.org/guide/reference/index-modules/analysis/unique-tokenfilter.html) to skip tokens that occur more than once.
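To make the effect of removing repeated tokens concrete, the Python sketch below applies a simple order-preserving de-duplication to the token list from the sentence above. It only imitates the idea; the helper name unique_tokens is hypothetical and the code is not taken from the plugin or from Elasticsearch.

```python
def unique_tokens(tokens):
    """Drop repeated tokens, keeping the first occurrence of each."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


tokens = [
    "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der",
    "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf",
    "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff",
    "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer",
    "gekostet", "gekosten",
]

deduplicated = unique_tokens(tokens)
print(len(tokens), "->", len(deduplicated))  # 27 -> 20
print(deduplicated)
```

Words that cannot be decomposed are emitted twice by the decompounder, so de-duplication mainly removes those repeats and leaves the subwords in place.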

Also the Lucene German normalization token filter is provided:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "umlaut" : {
                  "type" : "german_normalize"
              }
          },
          "tokenizer" : {
              "umlaut" : {
                 "type" : "standard",
                 "filter" : "umlaut"
              }
          }
      }
   }
}

The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

Threshold

The decompounding algorithm uses a threshold to decide whether a word has been decomposed successfully. If the threshold is too low, words could silently disappear from the index. In that case, adjust the threshold until words no longer disappear.

The default threshold value is 0.51. You can modify it in the settings:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "decomp" : {
                  "type" : "decompound",
                  "threshold" : 0.51
              }
          },
          "tokenizer" : {
              "decomp" : {
                 "type" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}

Subwords

Sometimes only the decomposed subwords should be indexed. For this, set the parameter "subwords_only" to true:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "decomp" : {
                  "type" : "decompound",
                  "subwords_only" : true
              }
          },
          "tokenizer" : {
              "decomp" : {
                 "type" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}

Langdetect

curl -XDELETE 'localhost:9200/test'

curl -XPUT 'localhost:9200/test'

curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
  "article" : {
    "properties" : {
       "content" : { "type" : "langdetect" }
    }
  }
}
'

curl -XPUT 'localhost:9200/test/article/1' -d '
{
  "title" : "Some title",
  "content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}
'

curl -XPUT 'localhost:9200/test/article/2' -d '
{
  "title" : "Ein Titel",
  "content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}
'

curl -XPUT 'localhost:9200/test/article/3' -d '
{
  "title" : "Un titre",
  "content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}
'

curl -XGET 'localhost:9200/test/_refresh'

curl -XPOST 'localhost:9200/test/_search' -d '
{
   "query" : {
       "term" : {
            "content" : "en"
       }
   }
}
'
curl -XPOST 'localhost:9200/test/_search' -d '
{
   "query" : {
       "term" : {
            "content" : "de"
       }
   }
}
'

curl -XPOST 'localhost:9200/test/_search' -d '
{
   "query" : {
       "term" : {
            "content" : "fr"
       }
   }
}
'

Standardnumber

{
   "index" : {
      "analysis" : {
          "filter" : {
              "standardnumber" : {
                  "type" : "standardnumber"
              }
          },
          "analyzer" : {
              "standardnumber" : {
                  "tokenizer" : "whitespace",
                  "filter" : [ "standardnumber", "unique" ]
              }
          }
      }
   }
}

WordDelimiterFilter2: taken from Lucene
baseform: index also base forms of words (German, English)
decompound: decompose words if possible (German)
langdetect: find language code of detected languages
standardnumber: standard number entity recognition
hyphen: token filter for shingling and combining hyphenated words (German: Bindestrichwörter), the opposite of the decompound token filter
sortform: process string forms for bibliographical sorting, taking non-sort areas into account
year: token filter for 4-digit sequences
reference:

Crypt mapper

{
    "someType" : {
        "_source" : {
            "enabled": false
        },
        "properties" : {
            "someField":{ "type" : "crypt", "algo": "SHA-512" }
        }
    }
}

Issues

All feedback is welcome! If you find issues, please post them on [GitHub](https://github.com/jprante/elasticsearch-plugin-bundle/issues).

References

The decompounder is a derived work of the ASV toolbox http://asv.informatik.uni-leipzig.de/asv/methoden

The Compact Patricia Trie data structure can be found in

Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of ACM, 1968, 15(4):514–534

The compound splitter used for generating features for document classification is described in

Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland

The base form reduction step (for Norwegian) is described in

Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, 2006, Finland

License

elasticsearch-plugin-bundle - a compilation of useful plugins for Elasticsearch

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.
