Matching tokens #11
Comments
So far I have only tried the decompounder as an index analyzer, but I will look into the issue. It seems related to the problem that arises when searching for synonyms using the synonym filter. |
I guess any filter that adds words has to deal with this in some way: as long as you search for just one word, adding synonyms with OR is fine. But when searching for two words... |
It took quite a while, but I promised to come back with some details, and here is what I found.
I used the explain API on a field with a baseform filter applied, which adds a base form for verbs, and processed the phrase "hoch gezogen": "query": { Result: "explanation": "+title:hoch +(title:gezog title:zieh)"
The query searches for "hoch" AND ("gezog" OR "zieh"), which is exactly what we expect.
However, when I use the decompounder to explain a search for the phrase "Abfall Kunststoff", the result is "explanation": "+title:abfall +(title:kunststoff title:kunst title:stoff)"
As a consequence, we will find any document talking about "Abfall" and "Stoff", or any kind of "Kunst Abfall"... OK, one can find a lot of rubbish declared to be art... ;) but that wasn't what our search was about. The correct search should look like: +title:abfall +(title:kunststoff | (+title:kunst +title:stoff))
I admit the example is not very good: in fact, "Kunststoff" should not be decompounded at all. But that is another issue.
So when you say you have never used the decompounder on the query side: I cannot see a way to get proper results if the decompounder is applied to the index only. In my understanding, the intention of decompounding "Hochfrequenzumkehrschraube" is to find documents talking about "Schrauben für die Umkehrung von Hochfrequenz". And this is where I am stuck... |
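The "correct search" structure described in the comment above can be sketched as an Elasticsearch `bool` query built in client code. This is only an illustration, not something the decompounder plugin produces: the helper names are made up for this example, the split of "kunststoff" into "kunst" and "stoff" is supplied by hand, and using `term` clauses assumes the index analyzer emits exactly these lowercase tokens.

```python
import json

def term(field, value):
    """Exact-match clause on an already analyzed token."""
    return {"term": {field: value}}

def compound_clause(field, compound, parts):
    """Match the whole compound OR all of its parts together."""
    return {
        "bool": {
            "should": [
                term(field, compound),
                {"bool": {"must": [term(field, p) for p in parts]}},
            ],
            "minimum_should_match": 1,
        }
    }

def build_query(field, plain_tokens, compounds):
    """AND together the plain tokens and one clause per compound.

    `compounds` maps a compound token to its list of subword tokens
    (hypothetical input; it must be provided by the caller).
    """
    must = [term(field, t) for t in plain_tokens]
    must += [compound_clause(field, c, parts) for c, parts in compounds.items()]
    return {"query": {"bool": {"must": must}}}

# +title:abfall +(title:kunststoff | (+title:kunst +title:stoff))
query = build_query("title", ["abfall"], {"kunststoff": ["kunst", "stoff"]})
print(json.dumps(query, indent=2))
```

The resulting body requires "abfall" and, in addition, either "kunststoff" or both "kunst" and "stoff", so a document that mentions only "Abfall" and "Stoff" no longer matches.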
I am running into exactly the same issue. Let's say I index two documents where the
Then I run the following search: {
"query": {
"multi_match": {
"query": "Forschungszentrum",
"operator": "and",
"fields": [ "title", "text"]
}
}
} Unfortunately this returns both documents even though I used the If the query were Did you ever find a solution to this @marbleman ? |
P.S. We cannot use the decompounder for indexing only. Consider the following use case: { "_id" : 1, "text" : "Krebsforschungszentrum" } Search: {
"query": {
"match": { "text": "Forschungszentrum" }
}
} In that case the search term needs to be decompounded so we can find the |
It's impossible for a TokenFilter to have an interpretation like "+title:abfall +(title:kunststoff | (+title:kunst +title:stoff))" because of the way To get it, pull #19 and set |
@AndreKR thanks for fixing Good analysis of |
Honestly, I would even remove the |
@AndreKR you are right, with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html it is possible to keep the compound word anyway. I will change the behavior in a new version. |
I am glad to see that this wasn't just a lack of understanding on my side ;-) And no: I did not find a workaround for it yet except building the query in another step. Since I did not find the time to walk through the code myself yet, I really appreciate a solution to this issue! However, after getting around this one, there might be another related one: There are lots of compound words such as "Straßenbahn" or "Kugelbolzen" for example that must not be decompounded at all... Let me know if you are interested in some exchange of experience |
What's the harm in having Straßenbahn decompounded during indexing and searching? Anyway, there is a (currently undocumented) option |
See #14 for I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :) |
@jprante will you merge that change into https://github.com/jprante/elasticsearch-plugin-bundle/ as well and release a new version? I switched to |
Merged into bundle plugin release 2.1.0.1 |
@fgrosse Who were you talking to? Anyway, my profile now has an email address. |
Since the name of the configuration option has been mixed up twice here as only_subwords, I want to point out that the correct option is called subwords_only
Hi,
I am stuck with this issue, and I am quite sure I am missing something essential:
I set up the analyzer as below and it works quite well:
GET /myIndex/_analyze?analyzer=german&text=Straßenbahnschienenritzenreiniger
gives me all kinds of tokens. But searching returns all documents containing just ONE of the tokens (as if combined with an OR operator), ranking documents containing "straße" higher than documents containing "reiniger" and ignoring multiple matches in the score. This is of course not what I intended...
However, I can see that an AND operator across all tokens would not do the right thing either. The operation that could work would be something like (tokens derived from "straße" combined with OR) AND (tokens derived from "bahn" combined with OR) AND (...)
I could run analyze from the external application and build the AND/OR query there, but this does not seem very elegant.
Is there another/better way?
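For reference, the external approach mentioned above could look roughly like the following sketch. It assumes the application has already grouped the analyzer output by source word; how to obtain that grouping from the analyze response is not covered here. The field name `text`, the helper names, and the token lists are illustrative only and are not real analyzer output.

```python
import json

def or_group(field, tokens):
    """At least one token of the group must match."""
    return {
        "bool": {
            "should": [{"term": {field: t}} for t in tokens],
            "minimum_should_match": 1,
        }
    }

def and_of_or_groups(field, groups):
    """Every group must match: (a OR b) AND (c OR d) AND ..."""
    return {"query": {"bool": {"must": [or_group(field, g) for g in groups]}}}

# Hypothetical grouping: one list of variants per subword.
groups = [["straße", "strasse"], ["bahn"], ["schiene", "schienen"]]
body = and_of_or_groups("text", groups)
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Because each group sits under `must`, a document has to match at least one variant of every subword, instead of matching on a single token as with a plain OR across all tokens.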