Add and/or make content relevant to people and planet more discoverable #1717

Open
1 of 4 tasks
jpmckinney opened this issue Nov 15, 2024 · 1 comment
Labels
Focus - Documentation Includes corrections, clarifications, new guidance, and UI/UX issues

jpmckinney commented Nov 15, 2024

  • Ask the OCP team about other keywords to cover
  • Check whether these keywords return relevant results
  • Add Elasticsearch synonyms where possible
    • gender -> women
    • green -> sustainable
    • SPP -> sustainable
    • SME -> small business
    • inclusion -> ?
  • Otherwise, add new content referencing our resources on these subjects, e.g. under the Design phase
@jpmckinney jpmckinney added the Focus - Documentation Includes corrections, clarifications, new guidance, and UI/UX issues label Nov 15, 2024

jpmckinney commented Nov 15, 2024

Analyze words to tokens

First, convert the synonyms to tokens. For example (change analyzer to "spanish" if needed, and change the text to the phrase to analyze):

curl -n -H "Content-Type: application/json" -H "Accept: application/json" \
  https://standard.open-contracting.org/search/_analyze \
  --data '{"analyzer":"english","text":"sustainable"}'
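
To analyze every candidate term in one pass, the request above can be wrapped in a loop. This is a minimal sketch rather than something from the original comment: the analyze_payload helper and the RUN_LIVE guard are hypothetical names, and the term list is just the keywords from the task list. By default it only prints the request bodies; set RUN_LIVE=1 to send them.

```shell
#!/bin/sh
# Hypothetical helper: build the _analyze request body for one term.
# Terms are assumed to contain no double quotes or backslashes.
analyze_payload() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Only contact the server when RUN_LIVE=1 (hypothetical guard);
# otherwise print each request body for inspection.
for term in gender women green sustainable SPP SME "small business"; do
  body=$(analyze_payload english "$term")
  if [ "${RUN_LIVE:-0}" = 1 ]; then
    curl -n -H "Content-Type: application/json" -H "Accept: application/json" \
      https://standard.open-contracting.org/search/_analyze \
      --data "$body"
    echo
  else
    echo "$body"
  fi
done
```

The tokens in each response are what belong in the synonym rules, since the synonym filter runs after stemming in the analyzer configured later in this comment.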

Create a synonyms set

Run this as root on the server to create an English synonyms set:

curl -n -X PUT -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/_synonyms/ocdssynonyms_en" \
  --data '{"synonyms_set":[{"synonyms":"gender, women"},{"synonyms":"green, spp, sustain"},{"synonyms":"sme => small busi"}]}'

Check that the synonyms set exists:

curl -n "https://standard.open-contracting.org/search/_synonyms/ocdssynonyms_en"

Note: "sme, small busi" wasn't working, though the _validate API looked fine:

$ curl -n -X GET -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_validate/query?rewrite=true" \
  --data '{"query":{"bool":{"must":{"simple_query_string":{"query":"sme","fields":["text","title^3"],"default_operator":"and"}}}}}'
{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":[{"index":"ocdsindex_en","valid":true,"explanation":"text:\"small busi\" text:sme (title:\"small busi\" title:sme)^3.0"}]}

"sme => small busi" works, but the analyzer changes the query to not search for "sme" at all:

curl -n -X GET -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_validate/query?rewrite=true" \
  --data '{"query":{"bool":{"must":{"simple_query_string":{"query":"sme","fields":["text","title^3"],"default_operator":"and"}}}}}'
{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":[{"index":"ocdsindex_en","valid":true,"explanation":"(+text:small +text:busi) (+title:small +title:busi)^3.0"}]}

OCDS 1.1 doesn't mention "SME" and OCDS 1.2 always expands it on the same page, so this behavior is fine.

Choose an approach for configuring the search analyzer

As described at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-with-synonyms.html#synonyms-synonym-token-filters, we want a "synonym graph", not a "synonym" token filter, because we want multi-word synonyms (like "small business").

Per the note at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html, the "synonym graph" filter can be applied "as part of a search analyzer only," not during indexing.

The search analyzer is determined as described at https://www.elastic.co/guide/en/elasticsearch/guide/current/_controlling_analysis.html#_default_analyzers (NOT as described at https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-search-analyzer).

One option is to set the search query's "simple_query_string":{"analyzer": "default_search",...}. However, this requires different logic in search.js in standard_theme for English and for other languages (unless we define a default_search for those other languages).

Instead, we'll set the search_analyzer mapping parameter for the field (ocds-index doesn't set this, only analyzer), which can be updated via https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html. This parameter defaults to the field's analyzer parameter. ocds-index uses language analyzers, and defaults to "standard" if the language isn't recognized. The tokenizer for all language analyzers is "standard" except for thai (which uses the "thai" tokenizer). See english, spanish.

While trying to figure this out, I set the analysis.analyzer.default_search index setting. The query below also sets analysis.analyzer.default, because there is a note that "If a search analyzer is provided, a default index analyzer must also be specified using the analysis.analyzer.default setting." We'll reuse this analyzer when setting the search_analyzer mapping parameters.

Configure a synonym graph search analyzer in the index settings

I checked the index settings, and the default_search analyzer isn't presently set:

curl -n "https://standard.open-contracting.org/search/ocdsindex_en/_settings"

I updated the index settings for the English index (based on the english language analyzer):

curl -n -X POST "https://standard.open-contracting.org/search/ocdsindex_en/_close"

curl -n -X PUT -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_settings" \
  --data '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" 
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"] 
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        },
        "english_synonyms": {
          "type": "synonym_graph",
          "synonyms_set": "ocdssynonyms_en",
          "updateable": true
        }
      },
      "analyzer": {
        "default": {
          "type": "standard"
        },
        "default_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer",
            "english_synonyms"
          ]
        }
      }
    }
  }
}'

curl -n -X POST "https://standard.open-contracting.org/search/ocdsindex_en/_open"

updateable is documented at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html#analysis-synonym-graph-configure-sets

For Spanish, change:

Configure search_analyzer mapping parameters

curl -n -X PUT -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_mapping" \
  --data '{
    "properties": {
      "title": {"type": "text", "analyzer": "english", "search_analyzer": "default_search"},
      "text": {"type": "text", "analyzer": "english", "search_analyzer": "default_search"}
    }
  }'

Test the synonyms

Check that the settings are applied:

curl -n "https://standard.open-contracting.org/search/ocdsindex_en/_settings"

Check that the mapping is updated:

curl -n "https://standard.open-contracting.org/search/ocdsindex_en/_mapping"

Check that the synonyms work when applying the filter only:

$ curl -n -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_analyze" \
  --data '{"tokenizer":"standard","filter":["english_synonyms"],"text":"gender"}'
{"tokens":[{"token":"women","start_offset":0,"end_offset":6,"type":"SYNONYM","position":0},{"token":"gender","start_offset":0,"end_offset":6,"type":"<ALPHANUM>","position":0}]}

Check that the synonyms work when performing a search:

$ curl -n -X GET -H "Content-Type: application/json" -H "Accept: application/json" \
  "https://standard.open-contracting.org/search/ocdsindex_en/_validate/query?rewrite=true" \
  --data '{"query":{"bool":{"must":{"simple_query_string":{"query":"gender","fields":["text","title^3"]}}}}}'
{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":[{"index":"ocdsindex_en","valid":true,"explanation":"(Synonym(title:gender title:women))^3.0 Synonym(text:gender text:women)"}]}
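
The same check can be repeated for each left-hand term in the synonyms set. The following is a sketch under assumptions: validate_payload and the RUN_LIVE guard are hypothetical names, and the term list mirrors the rules created earlier. It prints the request bodies unless RUN_LIVE=1 is set.

```shell
#!/bin/sh
# Hypothetical helper: build the simple_query_string body used in the
# check above for one search term (no double quotes or backslashes).
validate_payload() {
  printf '{"query":{"bool":{"must":{"simple_query_string":{"query":"%s","fields":["text","title^3"]}}}}}' "$1"
}

# Only contact the server when RUN_LIVE=1 (hypothetical guard);
# otherwise print each request body for inspection.
for term in gender green spp sme; do
  body=$(validate_payload "$term")
  if [ "${RUN_LIVE:-0}" = 1 ]; then
    curl -n -X GET -H "Content-Type: application/json" -H "Accept: application/json" \
      "https://standard.open-contracting.org/search/ocdsindex_en/_validate/query?rewrite=true" \
      --data "$body"
    echo
  else
    echo "$body"
  fi
done
```

Each "explanation" in the responses should mention the synonym as well as (or, for "sme", instead of) the original term.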

Automation notes

Since ocds-index only creates a given index once, I think it's easier to just update the synonyms manually as written here. Since the synonyms filter is updateable, it should be possible to just update the synonyms set without doing the rest. Of course, we should test that.
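
A synonyms-only update along those lines might look like the sketch below. It is untested against this server: synonyms_payload and the RUN_LIVE guard are hypothetical names, not part of ocds-index, and the rules shown are simply the ones already in the set. A PUT to the synonyms set replaces the whole set, so every rule to keep has to be resent. Elasticsearch's documentation for this API says search analyzers that use the set are reloaded automatically, which the final _analyze request is meant to confirm.

```shell
#!/bin/sh
# Hypothetical helper: build a synonyms_set body from one rule per argument.
# Rules are assumed to contain no double quotes or backslashes.
synonyms_payload() {
  sep=''
  printf '{"synonyms_set":['
  for rule in "$@"; do
    printf '%s{"synonyms":"%s"}' "$sep" "$rule"
    sep=','
  done
  printf ']}'
}

body=$(synonyms_payload "gender, women" "green, spp, sustain" "sme => small busi")

# Only contact the server when RUN_LIVE=1 (hypothetical guard);
# otherwise print the request body for inspection.
if [ "${RUN_LIVE:-0}" = 1 ]; then
  # Replace the whole synonyms set.
  curl -n -X PUT -H "Content-Type: application/json" -H "Accept: application/json" \
    "https://standard.open-contracting.org/search/_synonyms/ocdssynonyms_en" \
    --data "$body"
  echo
  # Check the filter output without touching index settings or mappings.
  curl -n -H "Content-Type: application/json" -H "Accept: application/json" \
    "https://standard.open-contracting.org/search/ocdsindex_en/_analyze" \
    --data '{"tokenizer":"standard","filter":["english_synonyms"],"text":"gender"}'
  echo
else
  echo "$body"
fi
```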
