Add lowercase token filter docs #8162

AntonEliatra · 2024-09-03T14:42:58Z

Description

Add lowercase token filter docs

Issues Resolved

Closes #8154

Version

all

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Anton Rubin <[email protected]>

…#8154 Signed-off-by: Anton Rubin <[email protected]>

github-actions · 2024-09-03T14:43:12Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

vagimeli · 2024-09-03T17:09:02Z

@udabhas Will you review this PR for technical accuracy? Thank you.

Signed-off-by: AntonEliatra <[email protected]>

vagimeli · 2024-10-03T16:26:07Z

@udabhas @varun-lodaya This is over a month old. We need tech review approval to move it forward in the documentation process. Please review this week or provide a peer who can review it. Thank you.

Signed-off-by: Fanit Kolchina <[email protected]>

natebower

@kolchfa-aws A few comments and changes. Thanks!

_analyzers/token-filters/index.md

natebower · 2024-11-26T11:58:05Z

@@ -38,7 +38,7 @@ Token filter | Underlying Lucene token filter|  Description
 `kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
 `length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. 
 `limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count.
-`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
+[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
 `min_hash` | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: <br> 1. Hashes each token in the stream. <br> 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. <br> 3. Outputs the smallest hash from each bucket as a token stream.


Line 41: Confirm the accuracy of my changes.

Not quite: the default specifies no language and processes English. There are also a couple of other language options. Reworded.
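
For readers wondering why the `language` options exist at all, the Turkish case is a useful illustration: generic lowercasing maps `I` to `i`, whereas Turkish casing maps `I` to the dotless `ı` and `İ` to `i`. The following is a minimal Python sketch of that difference. The `turkish_lower` helper is hypothetical and only mimics the casing rule; it is not how Lucene's TurkishLowerCaseFilter is implemented.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for the letter I.

    Hypothetical helper for illustration only; it is not part of
    OpenSearch or Lucene.
    """
    # Map the two Turkish-specific uppercase letters first:
    #   U+0130 (dotted capital I)  -> "i"
    #   "I"    (dotless capital I) -> U+0131 (dotless small i)
    # and then lowercase everything else with the generic rules.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    word = "IŞIK"  # Turkish for "light"
    print(word.lower())         # generic lowercasing: işik
    print(turkish_lower(word))  # Turkish lowercasing: ışık
```

With generic lowercasing, a Turkish query for `ışık` would not match text indexed from `IŞIK`, which is the kind of mismatch a language-specific lowercase filter is meant to avoid.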

_analyzers/token-filters/lowercase.md

natebower · 2024-11-26T12:01:13Z

+Parameter | Required/Optional | Description
+:--- | :--- | :---
+ `language` | Optional | Specifies a language-specific token filter to use for lowercasing. Valid values are: <br>- [`greek`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html) <br>-  [`irish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html) <br>-  [`turkish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html). <br> Default is [Lucene’s LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html). 
+


Line 18: Is English also an option, or is that the default used by the Lucene filter?

It is the default, used when no language is specified.
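
To make the outcome of this thread concrete, here is a minimal Python sketch that builds an index settings body using the `lowercase` token filter. It assumes the usual custom-analyzer settings layout; the names `custom_lowercase` and `custom_lowercase_analyzer` and the helper function are hypothetical and chosen only for illustration. Omitting `language` leaves the filter on the default Lucene LowerCaseFilter.

```python
import json
from typing import Optional

# Language values listed in the parameter table under review.
SUPPORTED_LANGUAGES = {"greek", "irish", "turkish"}


def lowercase_filter_settings(language: Optional[str] = None) -> dict:
    """Build an index settings body that uses the `lowercase` token filter.

    Hypothetical helper: the filter and analyzer names are illustrative.
    """
    if language is not None and language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    token_filter = {"type": "lowercase"}
    if language is not None:
        # Only set `language` when a language-specific filter is wanted;
        # leaving it out selects the default behavior.
        token_filter["language"] = language

    return {
        "settings": {
            "analysis": {
                "filter": {"custom_lowercase": token_filter},
                "analyzer": {
                    "custom_lowercase_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["custom_lowercase"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent when creating an index.
    print(json.dumps(lowercase_filter_settings("greek"), indent=2))
```

The printed JSON is the kind of body that would accompany an index creation request; calling the helper with no argument produces the same body without the `language` key.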

Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>

Signed-off-by: kolchfa-aws <[email protected]>

* add lowercase token filter
  Signed-off-by: Anton Rubin <[email protected]>
* adding examples in greek to lowercase token filter #8154
  Signed-off-by: Anton Rubin <[email protected]>
* Update lowercase.md
  Signed-off-by: AntonEliatra <[email protected]>
* Doc review
  Signed-off-by: Fanit Kolchina <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <[email protected]>
  Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit c0d158f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

AntonEliatra added 2 commits September 3, 2024 14:30

add lowercase token filter

5dddda8

Signed-off-by: Anton Rubin <[email protected]>

adding examples in greek to lowercase token filter opensearch-project…

167d063

…#8154 Signed-off-by: Anton Rubin <[email protected]>

AntonEliatra requested review from kolchfa-aws, Naarcha-AWS, vagimeli, AMoo-Miki, natebower, dlvenable, stephen-crawford and epugh as code owners September 3, 2024 14:42

github-actions bot assigned kolchfa-aws Sep 3, 2024

kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Sep 3, 2024

vagimeli added the `3 - Tech review` (PR: Tech review in progress) and `Needs SME` (Waiting on input from subject matter expert) labels Sep 3, 2024

Update lowercase.md

9a46d9a

Signed-off-by: AntonEliatra <[email protected]>

vagimeli added the `Content gap` and `analyzers` labels Sep 30, 2024

Doc review

f39b925

Signed-off-by: Fanit Kolchina <[email protected]>

kolchfa-aws assigned kolchfa-aws and unassigned vagimeli Nov 15, 2024

kolchfa-aws added the `5 - Editorial review` (PR: Editorial review in progress) and `backport 2.18` (PR: Backport label for 2.18) labels and removed the `3 - Tech review` (PR: Tech review in progress) and `Needs SME` (Waiting on input from subject matter expert) labels Nov 15, 2024

natebower reviewed Nov 26, 2024

Apply suggestions from code review

b0ce7c1

Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>

kolchfa-aws approved these changes Dec 2, 2024

Merge branch 'main' into add-lowercase-token-filter-docs

0a98c60

Signed-off-by: kolchfa-aws <[email protected]>

kolchfa-aws merged commit c0d158f into opensearch-project:main Dec 2, 2024
5 checks passed

opensearch-trigger-bot bot mentioned this pull request Dec 2, 2024

[Backport 2.18] Add lowercase token filter docs #8834

Merged

github-actions bot pushed a commit that referenced this pull request Dec 2, 2024

Add lowercase token filter docs (#8162) (#8834)

e65268f
