Matching tokens #11

Open
marbleman opened this issue Feb 10, 2015 · 16 comments

Comments

@marbleman

Hi,

I am stuck with this issue and I am quite sure I am missing something essential:

I set up the analyzer as shown below, and it works quite well:

GET /myIndex/_analyze?analyzer=german&text=Straßenbahnschienenritzenreiniger

gives me all kinds of tokens. But searching returns all documents containing just ONE of the tokens (as if they were combined with an OR operator), ranking documents containing "straße" higher than documents containing "reiniger" and ignoring multiple matches in the score. This is of course not what I intended...

However, I can see that an AND operator for tokens would not do the right thing either... In fact, the operation that could work would be something like (tokens derived from "straße" combined with OR) AND (tokens derived from "bahn" combined with OR) AND (...)

I could run analyze from the external application and build the AND/OR query there, but this does not seem very elegant.

Is there another/better way?

"analysis": {
    "filter": {
       "baseform": {
          "type": "baseform",
          "language": "de"
       },
       "decomp": {
          "type": "decompound"
       }
    },
    "analyzer": {
       "german": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "custom",
          "tokenizer": "baseform"
       }
    },
    "tokenizer": {
       "baseform": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "standard"
       }
    }
 }
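
A minimal sketch of the client-side workaround mentioned above (building the AND/OR query outside of Elasticsearch). It assumes the application has already grouped the analyzed tokens per sub-word; the function name and the token groups are hypothetical and only illustrate the query shape "(OR within a group) AND (across groups)".

```python
def build_grouped_query(field, token_groups):
    """Build a bool query: OR within each token group, AND across groups."""
    must = []
    for group in token_groups:
        # Tokens are assumed to be analyzed already, hence term queries.
        should = [{"term": {field: token}} for token in group]
        must.append({"bool": {"should": should, "minimum_should_match": 1}})
    return {"query": {"bool": {"must": must}}}


# Hypothetical groups, e.g. assembled by the client from _analyze output.
groups = [["straße", "strasse"], ["bahn"], ["schiene"]]
query = build_grouped_query("title", groups)
```

The resulting dictionary can be sent as the body of a search request. The drawback remains the one noted above: the grouping logic lives in the application rather than in the analyzer.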
@jprante
Owner

jprante commented Feb 11, 2015

So far I have only tried the decompounder as an index analyzer. But I will have a look into the issue. It seems related to the issue that occurs when searching for synonyms using the synonym filter.

@marbleman
Author

I guess any filter adding words has to deal with that in some way: as long as you just search for one word, adding synonyms with OR will be ok. But when searching for two words...
I'll set up a synonym filter in the next few days to cross-check.

@marbleman
Author

It took quite a while, but I promised to come back with some details, and here is what I found:

I used the explain API on a field with a baseform filter applied, which adds a base form for verbs, and processed the phrase "hoch gezogen":

"query": {
    "multi_match": {
        "query": "hoch gezogen",
        "fields": ["title"],
        "operator": "and"
    }
}

Result: "explanation": "+title:hoch +(title:gezog title:zieh)"

As expected, the query will search for "hoch" AND ("gezog" OR "zieh").
The synonym filter will do the same thing.

However, when I use the decompounder and explain a search for the phrase "Abfall Kunststoff", the result is

"explanation": "+title:abfall +(title:kunststoff title:kunst title:stoff)"

As a matter of fact, we will find any document talking about "Abfall" and "Stoff" or any kind of "Kunst Abfall"... Ok, one can find a lot of rubbish declared to be art... ;) but that wasn't what our search was all about...

The correct search should look like: +title:abfall +(title:kunststoff | (+title:kunst +title:stoff))
Forgive me if this is not syntactically correct: We want "kunststoff" or ("kunst" and "stoff")

Ok, I admit the example is not too good... in fact "Kunststoff" should not be decompounded at all. But this is another issue...

So when you say you've never used the decompounder on the query side: I cannot see a way to get proper results if the decompounder is only applied to the index... In my understanding, the intention of decompounding "Hochfrequenzumkehrschraube" is to find documents talking about "Schrauben für die Umkehrung von Hochfrequenz" (screws for reversing high frequency). And this is where I am stuck...
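
To make the difference between the two query shapes concrete, here is a toy evaluation in plain Python. It does not call Elasticsearch or Lucene; the documents and their token sets are made up for illustration, and the two functions simply encode the reported explanation and the intended one from the comment above.

```python
def matches_reported(tokens):
    # +title:abfall +(title:kunststoff title:kunst title:stoff)
    return "abfall" in tokens and bool(tokens & {"kunststoff", "kunst", "stoff"})


def matches_intended(tokens):
    # +title:abfall +(title:kunststoff | (+title:kunst +title:stoff))
    return "abfall" in tokens and (
        "kunststoff" in tokens or {"kunst", "stoff"} <= tokens
    )


# Hypothetical indexed documents, represented as sets of analyzed tokens.
docs = {
    "plastic waste": {"abfall", "kunststoff", "kunst", "stoff"},
    "art from waste": {"abfall", "kunst"},
    "waste and fabric": {"abfall", "stoff"},
}

reported = [name for name, tokens in docs.items() if matches_reported(tokens)]
intended = [name for name, tokens in docs.items() if matches_intended(tokens)]
```

In this toy setup the reported shape matches all three documents, while the intended shape matches only the one that actually talks about plastic waste.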

@fgrosse
Contributor

fgrosse commented Nov 19, 2015

I am running into exactly the same issue.

Let's say I index two documents where the text field is decompounded:

{ "_id" : 1, "text" : "...direkt im Stadtzentrum..." }
{ "_id" : 2, "text" : "... Forschungszentrum..." }

Stadtzentrum from document 1 is decompounded into stadt and zentrum.
Forschungszentrum from document 2 is decompounded into forschung and zentrum.

Then I run the following search:

{
    "query": {
        "multi_match": {
           "query": "Forschungszentrum",
           "operator": "and",
           "fields": [ "title", "text"]
        }
    }
}

Unfortunately this returns both documents even though I used the and operator.
I don't want to find everything that contains the term zentrum.

If the query is "Forschung zentrum", it works as expected, but this is user input and cannot be controlled.

Did you ever find a solution to this @marbleman ?
@jprante If you want I can open a new issue at jprante/elasticsearch-plugin-bundle

@fgrosse
Contributor

fgrosse commented Nov 19, 2015

P.S. We cannot use the decompounder for indexing only. Consider the following use case:

{ "_id" : 1, "text" : "Krebsforschungszentrum" }

Search:

{
    "query": {
        "match": { "text": "Forschungszentrum" }
    }
}

In that case, the search term needs to be decompounded so that we can find the Krebsforschungszentrum.

@AndreKR
Contributor

AndreKR commented Dec 1, 2015

It's impossible for a TokenFilter to have an interpretation like "+title:abfall +(title:kunststoff | (+title:kunst +title:stoff))" because of the way QueryBuilder.analyzeMultiBoolean() works.
What we can have is an interpretation like "+title:abfall +title:kunst +title:stoff".

To get it, pull #19 and set only_subwords: true.
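
A toy model of the subwords-only interpretation, applied to the Stadtzentrum / Forschungszentrum example from earlier in this thread. The lookup table below is a hypothetical stand-in for the decompounder (it is not the plugin's algorithm), and matching is modelled as "all query sub-words must be present", i.e. the AND operator.

```python
# Hypothetical decompounding table; the real filter derives sub-words itself.
SUBWORDS = {
    "stadtzentrum": ["stadt", "zentrum"],
    "forschungszentrum": ["forschung", "zentrum"],
    "krebsforschungszentrum": ["krebs", "forschung", "zentrum"],
}


def analyze(text):
    """Lowercase, split on whitespace, and emit only the sub-words."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(SUBWORDS.get(word, [word]))
    return tokens


def search(index, query_text):
    """Return ids of documents containing every analyzed query token (AND)."""
    required = set(analyze(query_text))
    return [doc_id for doc_id, tokens in index.items() if required <= tokens]


index = {
    1: set(analyze("direkt im Stadtzentrum")),
    2: set(analyze("Forschungszentrum")),
    3: set(analyze("Krebsforschungszentrum")),
}

hits = search(index, "Forschungszentrum")  # [2, 3]
```

Under these assumptions the query no longer matches document 1 merely because it contains "zentrum", while the Krebsforschungszentrum document is still found.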

@jprante
Owner

jprante commented Dec 1, 2015

@AndreKR thanks for fixing only_subwords.

Good analysis of QueryBuilder.analyzeMultiBoolean: there is only one boolean operator that can be used for the clause list. I think for improved token stream analysis on subwords, the whole query must be rewritten with transformed boolean operators so that groups of AND and OR can be handled. At the moment, this is something that should be done before token stream analysis within Lucene, because Lucene does not offer a good API for query transformations.

@AndreKR
Contributor

AndreKR commented Dec 1, 2015

Honestly, I would even remove the only_subwords option and make it default to true. What is the use of getting the compound word along with its subwords? If we just get the subwords, the analyzed token stream can be freely used in whatever combination of queries.

@jprante
Owner

jprante commented Dec 1, 2015

@AndreKR you are right, with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html it is possible to keep the compound word anyway. I will change the behavior in a new version.

@marbleman
Author

I am glad to see that this wasn't just a lack of understanding on my side ;-) And no: I did not find a workaround for it yet except building the query in another step. Since I did not find the time to walk through the code myself yet, I really appreciate a solution to this issue!

However, after getting around this one, there might be another related one: There are lots of compound words such as "Straßenbahn" or "Kugelbolzen" for example that must not be decompounded at all...

Let me know if you are interested in exchanging experiences.

@AndreKR
Contributor

AndreKR commented Dec 2, 2015

What's the harm in having Straßenbahn decompounded during indexing and searching? Anyway, there is a (currently undocumented) option respect_keywords that you can set to true and then you can block words from being decompounded in the same way as with other filters.
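
A sketch of what such a configuration might look like, written as a Python dictionary for the index settings. It assumes that "the same way as with other filters" refers to Elasticsearch's keyword_marker token filter placed before the decompounder; the filter and analyzer names, the keyword list, and the use of the standard tokenizer are illustrative choices, and respect_keywords is the undocumented option named above.

```python
analysis_settings = {
    "analysis": {
        "filter": {
            # Marks the listed tokens as keywords so later filters can skip them.
            "protect_compounds": {
                "type": "keyword_marker",
                "keywords": ["straßenbahn", "kugelbolzen"],
            },
            # Decompounder asked to leave keyword-marked tokens untouched.
            "decomp": {
                "type": "decompound",
                "respect_keywords": True,
            },
        },
        "analyzer": {
            "german": {
                "type": "custom",
                "tokenizer": "standard",
                # lowercase runs first, so the keyword list is lowercase too.
                "filter": ["lowercase", "protect_compounds", "decomp"],
            }
        },
    }
}
```

The ordering matters in this sketch: the marker filter has to run before the decompounder, otherwise the tokens would already be split when they are marked.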

@fgrosse
Contributor

fgrosse commented Dec 2, 2015

See #14 for respect_keywords pull request.

I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

@fgrosse
Contributor

fgrosse commented Dec 3, 2015

@jprante will you merge that change into https://github.com/jprante/elasticsearch-plugin-bundle/ as well and release a new version? I switched to elasticsearch-plugin-bundle as you recommended earlier. If not I will switch back to this repository.

jprante added a commit to jprante/elasticsearch-plugin-bundle that referenced this issue Dec 3, 2015
@jprante
Owner

jprante commented Dec 3, 2015

Merged into bundle plugin release 2.1.0.1

@AndreKR
Contributor

AndreKR commented Dec 3, 2015

I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

@fgrosse Who were you talking to? Anyway, my profile now has an email address.

@fgrosse
Contributor

fgrosse commented Jan 7, 2016

Since the name of the configuration option has been mixed up here twice as only_subwords, I want to point out that the correct name is subwords_only.
