Matching tokens #11

marbleman · 2015-02-10T19:04:26Z

Hi,

I am stuck on this issue and I am quite sure I am missing something essential:

I set up the analyzer as shown below and it works quite well:

GET /myIndex/_analyze?analyzer=german&text=Straßenbahnschienenritzenreiniger

gives me all kinds of tokens. But searching returns every document that contains just ONE of the tokens (as if they were combined with an OR operator), ranking documents containing "straße" higher than documents containing "reiniger" and ignoring multiple matches in the score. This is of course not what I intended...

However, I can see that an AND operator across all tokens would not do the right thing either... The operation that could work would be something like (tokens derived from "straße" combined with OR) AND (tokens derived from "bahn" combined with OR) AND (...)

I could run analyze from the external application and build the AND/OR query there, but that does not seem very elegant.
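For illustration, a minimal sketch of that external approach in Python. It assumes the application has already grouped the analyzer output by the part of the compound each token came from; how to obtain that grouping is not covered here. The helper name `build_grouped_query`, the field name `title`, and the sample groups are all hypothetical.

```python
import json

def build_grouped_query(field, token_groups):
    """AND together the groups; tokens inside one group are ORed."""
    must = []
    for group in token_groups:
        should = [{"term": {field: token}} for token in group]
        # At least one variant of this group has to match.
        must.append({"bool": {"should": should, "minimum_should_match": 1}})
    return {"query": {"bool": {"must": must}}}

# Hypothetical grouping: one group per compound part, each with its variants.
groups = [["straße", "strasse"], ["bahn"], ["schiene"]]
query = build_grouped_query("title", groups)
print(json.dumps(query, ensure_ascii=False, indent=2))
```

The resulting body could be sent to the search endpoint as-is; the downside is the extra round trip for analysis before every search.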

Is there another/better way?

"analysis": {
    "filter": {
       "baseform": {
          "type": "baseform",
          "language": "de"
       },
       "decomp": {
          "type": "decompound"
       }
    },
    "analyzer": {
       "german": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "custom",
          "tokenizer": "baseform"
       }
    },
    "tokenizer": {
       "baseform": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "standard"
       }
    }
 }


jprante · 2015-02-11T09:41:57Z

So far I have only tried the decompounder as an index analyzer, but I will have a look into the issue. It seems related to what happens when searching for synonyms with the synonym filter.

marbleman · 2015-02-16T21:07:23Z

I guess any filter that adds words has to deal with this in some way: as long as you search for just one word, adding synonyms with OR will be OK. But when searching for two words...
I'll set up a synonym filter in the next few days to cross-check.

marbleman · 2015-03-13T18:19:38Z

It took quite a while, but I promised to come back with some details, and here is what I found:

I used the explain API on a field that has a baseform filter applied (which adds a base form for verbs) and processed the phrase "hoch gezogen":

"query": {
"multi_match": {
"query": "hoch gezogen",
"fields": ["title"],
"operator": "and"
}
}

Result: "explanation": "+title:hoch +(title:gezog title:zieh)"

The query searches for "hoch" AND ("gezog" OR "zieh"), which is exactly what we expect.
The synonym filter will do the same thing.

However, when I use the decompounder and explain a search for the phrase "Abfall Kunststoff", the result is

"explanation": "+title:abfall +(title:kunststoff title:kunst title:stoff)"

As a result, we will find any document talking about "Abfall" and "Stoff", or any kind of "Kunst Abfall"... OK, one can find a lot of rubbish declared to be art ;) but that wasn't what our search was about...

The correct search should look like: +title:abfall +(title:kunststoff | (+title:kunst +title:stoff))
Forgive me if this is not syntactically correct: We want "kunststoff" or ("kunst" and "stoff")
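As a rough sketch, the intended structure could be written as an Elasticsearch bool query built in Python. This only shows the shape of the query that would have to be assembled by hand; it is not something the analyzer produces. The helper name `compound_or_subwords` is hypothetical.

```python
import json

def compound_or_subwords(field, compound, subwords):
    """Match the compound itself OR all of its subwords together."""
    all_subwords = {"bool": {"must": [{"term": {field: w}} for w in subwords]}}
    return {
        "bool": {
            "should": [{"term": {field: compound}}, all_subwords],
            "minimum_should_match": 1,
        }
    }

# "abfall" AND ("kunststoff" OR ("kunst" AND "stoff"))
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"title": "abfall"}},
                compound_or_subwords("title", "kunststoff", ["kunst", "stoff"]),
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```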

Ok, I admit the example is not too good... in fact "Kunststoff" should not be decompounded at all. But this is another issue...

So when you say you have never used the decompounder on the query side: I cannot see a way to get proper results if the decompounder is only applied at index time... In my understanding, the intention of decompounding "Hochfrequenzumkehrschraube" is to find documents talking about "Schrauben für die Umkehrung von Hochfrequenz". And this is where I am stuck...

fgrosse · 2015-11-19T08:23:28Z

I am running into exactly the same issue.

Let's say I index two documents where the text field is decompounded:

{ "_id" : 1, "text" : "...direkt im Stadtzentrum..." }
{ "_id" : 2, "text" : "... Forschungszentrum..." }

Stadtzentrum from document 1 is decompounded into stadt and zentrum.
Forschungszentrum from document 2 is decompounded into forschung and zentrum.

Then I run the following search:

{
    "query": {
        "multi_match": {
           "query": "Forschungszentrum",
           "operator": "and",
           "fields": [ "title", "text"]
        }
    }
}

Unfortunately this returns both documents even though I used the and operator.
I don't want to find everything that contains the term zentrum.

If the query were "Forschung zentrum" it would work as expected, but this is user input and cannot be controlled.
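To make the effect concrete, here is a small pure-Python simulation (no Elasticsearch involved). It assumes the query-side analysis emits the compound and its subwords at the same position, so they end up in a single OR group, as in the explain output quoted earlier in this thread. The token sets and the `matches` helper are illustrative assumptions.

```python
def matches(doc_tokens, required_groups):
    """True if every group has at least one token present in the document."""
    return all(any(t in doc_tokens for t in group) for group in required_groups)

# Indexed tokens per document (simplified, assumed analyzer output).
docs = {
    1: {"direkt", "im", "stadt", "zentrum"},
    2: {"forschung", "zentrum"},
}

# "Forschungszentrum" is a single query word, so "and" sees only ONE group
# whose members are ORed together.
query_groups = [["forschungszentrum", "forschung", "zentrum"]]

hits = [doc_id for doc_id, tokens in docs.items() if matches(tokens, query_groups)]
print(hits)  # [1, 2]
```

Document 1 matches through "zentrum" alone, because the and operator only applies between query words, not between the subwords of one word.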

Did you ever find a solution to this @marbleman ?
@jprante If you want I can open a new issue at jprante/elasticsearch-plugin-bundle

fgrosse · 2015-11-19T08:36:52Z

P.S. We cannot use the decompounder for indexing only. Consider the following use case:

{ "_id" : 1, "text" : "Krebsforschungszentrum" }

Search:

{
    "query": {
        "match": { "text": "Forschungszentrum" }
    }
}

In that case the search term needs to be decompounded so we can find the Krebsforschungszentrum

AndreKR · 2015-12-01T01:27:43Z

It's impossible for a TokenFilter to have an interpretation like "+title:abfall +(title:kunststoff | (+title:kunst +title:stoff))" because of the way QueryBuilder.analyzeMultiBoolean() works.
What we can have is an interpretation like "+title:abfall +title:kunst +title:stoff".

To get it, pull #19 and set only_subwords: true.
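A small pure-Python simulation of what that interpretation does with the documents from this thread, assuming both index and query analysis emit subwords only. The token sets are assumed analyzer output and `matches_all` is a hypothetical helper.

```python
def matches_all(doc_tokens, query_tokens):
    """AND semantics: every query token must be present in the document."""
    return all(t in doc_tokens for t in query_tokens)

# Assumed subword-only tokens for each indexed word.
docs = {
    "Stadtzentrum": {"stadt", "zentrum"},
    "Forschungszentrum": {"forschung", "zentrum"},
    "Krebsforschungszentrum": {"krebs", "forschung", "zentrum"},
}

# Query "Forschungszentrum" analyzed to subwords only.
query_tokens = ["forschung", "zentrum"]

hits = sorted(name for name, tokens in docs.items() if matches_all(tokens, query_tokens))
print(hits)  # ['Forschungszentrum', 'Krebsforschungszentrum']
```

"Stadtzentrum" no longer matches, while "Krebsforschungszentrum" is still found.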

jprante · 2015-12-01T07:39:01Z

@AndreKR thanks for fixing only_subwords.

Good analysis of QueryBuilder.analyzeMultiBoolean: only one boolean operator can be used for the whole clause list. I think that for improved token stream analysis on subwords, the whole query must be rewritten with transformed boolean operators so that groups of AND and OR can be handled. At the moment this has to be done before token stream analysis within Lucene, because Lucene does not offer a good API for query transformations.

AndreKR · 2015-12-01T13:13:20Z

Honestly, I would even remove the only_subwords option and make it default to true. What is the use of getting the compound word along with its subwords? If we just get the subwords, the analyzed token stream can be freely used in whatever combination of queries.

jprante · 2015-12-01T14:44:37Z

@AndreKR you are right, with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html it is possible to keep the compound word anyway. I will change the behavior in a new version.

marbleman · 2015-12-02T08:58:46Z

I am glad to see that this wasn't just a lack of understanding on my side ;-) And no: I did not find a workaround for it yet except building the query in another step. Since I did not find the time to walk through the code myself yet, I really appreciate a solution to this issue!

However, after getting around this one, there might be another related one: There are lots of compound words such as "Straßenbahn" or "Kugelbolzen" for example that must not be decompounded at all...

Let me know if you are interested in some exchange of experience

AndreKR · 2015-12-02T13:07:42Z

What's the harm in having Straßenbahn decompounded during indexing and searching? Anyway, there is a (currently undocumented) option respect_keywords that you can set to true and then you can block words from being decompounded in the same way as with other filters.

fgrosse · 2015-12-02T14:11:53Z

See #14 for respect_keywords pull request.

I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

fgrosse · 2015-12-03T13:41:04Z

@jprante will you merge that change into https://github.com/jprante/elasticsearch-plugin-bundle/ as well and release a new version? I switched to elasticsearch-plugin-bundle as you recommended earlier. If not I will switch back to this repository.

jprante · 2015-12-03T15:54:59Z

Merged into bundle plugin release 2.1.0.1

AndreKR · 2015-12-03T23:01:34Z

> I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

@fgrosse Who were you talking to? Anyway, my profile now has an email address.

fgrosse · 2016-01-07T13:54:16Z

Since the name of the configuration option has been mixed up here twice as only_subwords, I want to point out that the correct option is called subwords_only.
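For reference, a minimal sketch of analysis settings using that option name, written as a Python dict that can be serialized into the index settings. The analyzer name `german_decomp` is hypothetical, the filter chain is reduced to the decompounder only, and whether your plugin version accepts the option should be checked against its documentation.

```python
import json

analysis = {
    "filter": {
        "decomp": {
            "type": "decompound",
            "subwords_only": True,  # option name as corrected in this thread
        }
    },
    "analyzer": {
        "german_decomp": {  # hypothetical analyzer name
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decomp"],
        }
    },
}

# JSON body for the "analysis" section of the index settings.
settings_json = json.dumps({"analysis": analysis}, indent=2)
print(settings_json)
```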

AndreKR mentioned this issue Dec 1, 2015

Respect "and" operator from match query when using only_subwords #19

Merged

jprante added a commit to jprante/elasticsearch-plugin-bundle that referenced this issue Dec 3, 2015

applying fix from jprante/elasticsearch-analysis-decompound#11

0ba56ea
