Can not get decompound to work #13

Open
Xaratas opened this issue May 15, 2015 · 6 comments

Xaratas commented May 15, 2015

I am running the current Elasticsearch version (1.5.2) and tried to set up decompound following the rather thin README, but I did not get the expected results.

PUT /leads
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "decomp": {
            "type": "decompound"
          }
        },
        "tokenizer": {
          "decomp": {
            "type": "standard",
            "filter": [
              "decomp"
            ]
          }
        }
      }
    }
  }
}

Tested with:
GET leads/_analyze?
{Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet}
This results in the following, which is not the same as shown in the README:

{
   "tokens": [
      {
         "token": "die",
         "start_offset": 1,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "jahresfeier",
         "start_offset": 5,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "der",
         "start_offset": 17,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "rechtsanwaltskanzleien",
         "start_offset": 21,
         "end_offset": 43,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "auf",
         "start_offset": 44,
         "end_offset": 47,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "dem",
         "start_offset": 48,
         "end_offset": 51,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "donaudampfschiff",
         "start_offset": 52,
         "end_offset": 68,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "hat",
         "start_offset": 69,
         "end_offset": 72,
         "type": "<ALPHANUM>",
         "position": 8
      },
      {
         "token": "viel",
         "start_offset": 73,
         "end_offset": 77,
         "type": "<ALPHANUM>",
         "position": 9
      },
      {
         "token": "ökosteuer",
         "start_offset": 78,
         "end_offset": 87,
         "type": "<ALPHANUM>",
         "position": 10
      },
      {
         "token": "gekostet",
         "start_offset": 88,
         "end_offset": 96,
         "type": "<ALPHANUM>",
         "position": 11
      }
   ]
}

An equivalent setup via the Java API did not change the outcome.

        final XContentBuilder mappingBuilder2 = jsonBuilder()
            .startObject()
                .startObject("index") // decompound filter
                    .startObject("analysis")
                        .startObject("filter")
                            .startObject("decomp").field("type", "decompound").endObject()
                        .endObject()
                        .startObject("tokenizer")
                            .startObject("decomp").field("type", "standard")
                            .startArray("filter")
                                .value("decomp")
                            .endArray()
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject();


        final CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
        createIndexRequestBuilder.setSettings(ImmutableSettings.settingsBuilder().loadFromSource(mappingBuilder2.string()));

I also tried your bundle of plugins, with the same result.
And yes, I did restart my test Elasticsearch server; otherwise it would have refused to create a filter of type decompound.

jprante (Owner) commented May 15, 2015

Can you show the mapping where you use the decomp tokenizer?

Xaratas commented May 15, 2015

I haven't set it in my mapping yet, as I expected the "raw" request to the analyzer to show me that all is running fine.
I also thought the extra "index" in the "settings" block applies it in general for this index, or am I wrong? I could only find documentation showing settings and then analyzers, without index in between.
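A raw request that targets the custom tokenizer and filter directly would presumably look like this (a sketch, assuming the 1.x tokenizer/filters parameters of the analyze API):

GET leads/_analyze?tokenizer=decomp&filters=decomp&text=Donaudampfschiff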

jprante (Owner) commented May 15, 2015

If you want to change the default analyzer, you have to declare a standard analyzer in the settings that uses the decomp tokenizer. But I recommend setting up a custom analyzer and enabling decompounding only on the fields that need it, because it is expensive.
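Roughly like this (a minimal sketch; the index name, type, field name, and analyzer name are just placeholders):

PUT /leads
{
  "settings": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "analyzer": {
        "decomp_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "decomp",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "decomp_analyzer"
        }
      }
    }
  }
}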

Xaratas commented May 15, 2015

Then what is the "index" block for?

jprante (Owner) commented May 15, 2015

The state of this project is very old; it is for Elasticsearch 1.0.0, not 1.5. In 1.5, the index level was dropped, see https://www.elastic.co/guide/en/elasticsearch/guide/master/configuring-analyzers.html
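For the record, the difference is only the extra nesting level (sketch):

1.0.x style:
{ "settings": { "index": { "analysis": { ... } } } }

1.5 style, per the linked guide:
{ "settings": { "analysis": { ... } } }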

Xaratas commented May 18, 2015

OK, reading the documentation back and forth, I found a combination of directives that works.
For documentation purposes, here is what worked, in JSON and in Java. It uses the default analyzer, so it applies to all fields. Some fields, such as person names, should maybe be excluded, or else "brand" also finds "Hildebrand", "Makebrand", etc. Another consideration: add the "lowercase" filter to make it case-independent (sketched after the Java snippet below).

The first indexing of ~72k documents took 5 min instead of < 1 min; search time is unaffected.

PUT /<indexname>
{
  "index": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "decomp",
          "filter": [
            "decomp",
            "unique"
          ]
        }
      },
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "tokenizer": {
        "decomp": {
          "type": "standard",
          "filter": [
            "decomp"
          ]
        }
      }
    }
  }
}
        final XContentBuilder settingsBuilder = jsonBuilder()
        .startObject()
            .startObject("index")
                .startObject("analysis")
                    .startObject("analyzer")
                        .startObject("default")  // = generell und für alle felder
                            .field("tokenizer", "decomp")
                            .field("filter", new String[] {"decomp", "unique"})
                        .endObject()
                    .endObject()
                    .startObject("filter")
                        .startObject("decomp")
                            .field("type", "decompound")
                        .endObject()
                    .endObject()
                    .startObject("tokenizer")
                        .startObject("decomp")
                            .field("type", "standard")
                            .startArray("filter")
                                .value("decomp")
                            .endArray()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject();
        final CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
        createIndexRequestBuilder.setSettings(ImmutableSettings.settingsBuilder().loadFromSource(settingsBuilder.string()));
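For the lowercase consideration above, the built-in lowercase token filter can presumably just be put in front of the decompounder in the default analyzer (untested sketch):

"analyzer": {
  "default": {
    "tokenizer": "decomp",
    "filter": [
      "lowercase",
      "decomp",
      "unique"
    ]
  }
}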
