match_phrase not highlighting when stopwords removed #27

Open
rpedela opened this issue Feb 25, 2017 · 3 comments

Comments

rpedela commented Feb 25, 2017

When searching for an exact phrase whose terms include a stopword, on a field whose analyzer removes stopwords, the experimental highlighter does not highlight the phrase. However, ES finds the phrase, and the plain highlighter highlights it correctly. I don't think stemming and word_delimiter have anything to do with the problem, but they are part of the real-world analyzer where I found it. Below is a complete Node.js test case.

OS: Ubuntu 14.04
ES version: 2.4.1

var async = require('async');
var es = require('elasticsearch');

var INDEX_NAME = 'test_word_delimiter';
var SEARCH_TERMS = 'board of directors';
var TEST_SENTENCE = '\
    On February 9, 2017 in Form 8-K/A, the Board of Directors (the “Board”) of Tractor \
    Supply Company ("the Company"), amended and restated the Company’s \
    Fourth Amended and Restated By-laws (the “By-laws” and, as amended \
    and restated, the “Amended By-laws”). The following is a brief summary \
    of the material changes effected by adoption of the Amended By-laws, \
    which is qualified in its entirety by reference to the Amended By-laws \
    filed as Exhibit 3.1(i) hereto.';

var esClient = new es.Client({
    apiVersion: '2.4',
    hosts: [ 'localhost:9200' ],
});

async.waterfall([
    function (callback) {

        var params = {
            index: INDEX_NAME,
        };

        esClient.indices.delete(params, function (err) {

            if (err && err.response) {
                var res = JSON.parse(err.response);
                if (res.error && res.error.type === 'index_not_found_exception') {
                    return callback(null);
                }
            }

            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            body: {
                mappings: {
                    default: {
                        _all: { enabled: false },
                        properties: {
                            text: {
                                analyzer: 'word_delimiter_stopword_stem',
                                type: 'string',
                            },
                        },
                    },
                },
                settings: {
                    analysis: {
                        char_filter: {
                            single_quotes: {
                                type: 'mapping',
                                mappings: [
                                    '\\u0091=>\\u0027',
                                    '\\u0092=>\\u0027',
                                    '\\u2018=>\\u0027',
                                    '\\u2019=>\\u0027',
                                    '\\u201B=>\\u0027'
                                ],
                            },
                        },
                        filter: {
                            en_US: {
                                type: 'stemmer',
                                language: 'english',
                            },
                            english_stopwords: {
                                type: 'stop',
                                stopwords: '_english_',
                            },
                            word_delimiter: {
                                type: 'word_delimiter',
                                catenate_all: true,
                                generate_number_parts: false,
                                generate_word_parts: false,
                                preserve_original: false,
                                split_on_case_change: false,
                                split_on_numerics: false,
                                stem_english_possessive: true,
                            },
                        },
                        analyzer: {
                            word_delimiter_stopword_stem: {
                                char_filter: [ 'single_quotes' ],
                                filter: [
                                    'lowercase',
                                    'word_delimiter',
                                    'english_stopwords',
                                    'en_US',
                                ],
                                tokenizer: 'whitespace',
                            },
                        },
                    },
                },
            },
        };

        esClient.indices.create(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            type: 'default',
            id: 1,
            body: {
                text: TEST_SENTENCE,
            },
            refresh: true,
        };

        esClient.index(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  No highlight returned using experimental highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match_phrase: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {
                            type: 'experimental',
                        },
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  Correctly highlighted using plain highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match_phrase: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {},
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
],
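function (err) {
    esClient.close();
    if (err) {
        // err.response is only set for HTTP-level errors; guard it so that
        // JSON.parse does not throw on connection errors and the like.
        if (err.response) {
            console.error(JSON.stringify(JSON.parse(err.response),null,4));
        }
        console.error(err.stack);
    }
});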
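
For readers who want to see why the stopword matters, here is a minimal sketch of what the lowercase and stop filter steps do to token positions. It is not Elasticsearch code: `analyzeWithPositions` and the short `STOPWORDS` list are hypothetical stand-ins, stemming and word_delimiter are left out, and it assumes that a removed stopword still consumes a position (the default behavior of Lucene's stop filter).

```javascript
// Illustrative subset of English stopwords (assumption: not the full _english_ list).
var STOPWORDS = ['a', 'an', 'and', 'in', 'of', 'the', 'to'];

// Hypothetical helper: whitespace-tokenize, lowercase, drop stopwords,
// and keep each surviving term's original position.
function analyzeWithPositions(text) {
    var tokens = [];
    text.toLowerCase().split(/\s+/).filter(Boolean).forEach(function (term, position) {
        if (STOPWORDS.indexOf(term) === -1) {
            tokens.push({ term: term, position: position });
        }
    });
    return tokens;
}

// 'of' is removed, so the remaining terms sit at positions 0 and 2, not 0 and 1.
console.log(analyzeWithPositions('board of directors'));
```

The gap between the two surviving positions is the part a phrase matcher has to account for.
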

rpedela commented Feb 25, 2017

I attempted to upgrade to 5.1.2 and noticed phrase highlighting problems there as well, so the above example may also fail on 5.x. However, I haven't fully investigated the 5.x problems yet, so it is possible the cause is in my code.

rpedela commented Mar 2, 2017

I tested this using 5.2.2 and see the same behavior.

nomoa (Member) commented Mar 2, 2017

Yes, it's a bug: the phrase matching does not keep track of term positions properly, so when a "hole" appears due to a removed stopword it fails to detect the phrase. I'll try to find some time to fix it.
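
The failure mode described above can be sketched roughly as follows. This is not the highlighter's actual code: `matchPhrase`, the `{ term, position }` token shape, and the `useQueryPositions` flag are all hypothetical, and the token lists are written out by hand on the assumption that removed stopwords leave a position gap. With the flag off, the matcher expects query terms at adjacent positions and misses the phrase. With the flag on, it honors the gap recorded in the query tokens.

```javascript
// Hypothetical phrase matcher over tokens shaped like { term, position }.
function matchPhrase(docTokens, queryTokens, useQueryPositions) {
    if (queryTokens.length === 0) {
        return false;
    }
    var first = queryTokens[0];
    // Try every document occurrence of the first query term as a phrase start.
    return docTokens.some(function (start) {
        if (start.term !== first.term) {
            return false;
        }
        return queryTokens.every(function (q, i) {
            // Position-aware: use the gap recorded in the query tokens.
            // Naive: assume the i-th query term sits i positions after the start.
            var offset = useQueryPositions ? q.position - first.position : i;
            return docTokens.some(function (d) {
                return d.term === q.term && d.position === start.position + offset;
            });
        });
    });
}

// "the Board of Directors" after stopword removal and stemming (assumed positions).
var doc = [ { term: 'board', position: 1 }, { term: 'director', position: 3 } ];
// "board of directors" analyzed the same way (assumed positions).
var query = [ { term: 'board', position: 0 }, { term: 'director', position: 2 } ];

console.log(matchPhrase(doc, query, false)); // naive: the hole breaks the match
console.log(matchPhrase(doc, query, true));  // position-aware: the phrase is found
```
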
