discussing changes to LanguageAnalyzer #788
base: main
Conversation
2. adding methods for other languages in the Analyzer.java
Signed-off-by: brentam <[email protected]>
@@ -68,7 +69,7 @@ private LanguageAnalyzer(Builder builder) {
        this.version = builder.version;
        this.language = ApiTypeHelper.requireNonNull(builder.language, this, "language");
-       this.stemExclusion = ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
+       this.stemExclusion = builder.stemExclusion; // ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
stemExclusion should not be mandatory
It is supported by most analyzers but not all
But it is reasonable to make it optional for those that support this property, I agree.
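For illustration, a minimal sketch of that distinction in the constructor, assuming ApiTypeHelper.unmodifiable(...) is the non-required counterpart of the unmodifiableRequired(...) call shown in the diff above (the sketch class below is not the real generated code):

import java.util.List;
import org.opensearch.client.util.ApiTypeHelper; // package assumed

// Sketch only: "language" stays mandatory, "stemExclusion" becomes optional.
class LanguageAnalyzerSketch {
    final String language;            // required
    final List<String> stemExclusion; // optional, may be null

    LanguageAnalyzerSketch(String language, List<String> stemExclusion) {
        this.language = ApiTypeHelper.requireNonNull(language, this, "language");
        this.stemExclusion = ApiTypeHelper.unmodifiable(stemExclusion);
    }
}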
@@ -589,8 +592,11 @@ public Analyzer build() {
    protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {
        for (Language value : Language.values()) {
            op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
            // TODO should we lowercase in the Language Enum?
The Language jsonValue has to match the Kind jsonValue, so we set the deserializers for all the languages here.
Arabic("Arabic"),
Armenian("Armenian"),
Arabic("arabic"),
The JSON value has to be lowercase, to match the values OpenSearch expects.
@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
     */
    @Override
    public Analyzer.Kind _analyzerKind() {
-       return Analyzer.Kind.Language;
+       return Analyzer.Kind.valueOf(this.language.name());
    }
We have to determine the Analyzer.Kind at runtime. Note that the Language values have to match the Kind values.
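A minimal illustration of why that coupling matters (this is just plain java.lang.Enum behaviour, shown with the Kind constants this PR adds):

Language lang = Language.German;
// valueOf() is a lookup by constant name, so this only works if Analyzer.Kind
// declares a constant named exactly "German" as well.
Analyzer.Kind kind = Analyzer.Kind.valueOf(lang.name());
// A Language constant without a matching Kind constant would fail here at runtime
// with IllegalArgumentException instead of being caught at compile time.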
I believe the Cjk analyzer was added because it does not support stem_exclusion; not every language analyzer supports it, actually (per the documentation).
        assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
        assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
    }
    // temporary commenting this test until we decide if we keep the CjkAnalyser
I think we should keep CjkAnalyzer since it does not support stem_exclusion.
@reta sorry if the comment was not clear. What I meant is that if we change the LanguageAnalyzer, the CjkAnalyzer will be deleted, same as DutchAnalyzer, since all language analyzers would be constructed using LanguageAnalyzer + Language.
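For illustration, the two patterns being compared would look roughly like this (a sketch; the stopword lists are just examples and the exact builder methods may differ slightly):

// Pattern 1: one dedicated analyzer class per language (DutchAnalyzer, CjkAnalyzer, ...).
Analyzer perLanguage = DutchAnalyzer.of(d -> d.stopwords(Arrays.asList("de", "het")))._toAnalyzer();

// Pattern 2: the single LanguageAnalyzer parameterized by the Language enum.
Analyzer viaEnum = LanguageAnalyzer.of(l -> l.language(Language.Dutch))._toAnalyzer();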
I am unsure what you mean.
The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers. If we have one exception for Cjk, we will keep supporting two patterns. In that case I would rather keep all the 33 analyzers created in the other PR, since the LanguageAnalyzer at the moment is not correct and the analyzer-per-language pattern is working fine.
The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers.
This is correct; we do, though, need to acknowledge that not all LanguageAnalyzers support the same customization capabilities: CJK does not support stem_exclusion (there may be others as well, but this one for sure). Making stem_exclusion optional won't solve the problem: the server side will (or at least should) reject the CJK analyzer with stem_exclusion.
Looking into how we could accommodate that.
Ok, @brentam I think I have a solution for that:
- we keep CjkAnalyzer, but
- we remove Cjk from the Language enumeration
This is not ideal, but the upside is that it prevents users from referring to unsupported features at compile time (vs. removing CjkAnalyzer but failing at runtime).
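In code, the trade-off would look roughly like this (a sketch, assuming CjkAnalyzer keeps its current stopwords support; the stopword list is just an example):

// CJK keeps its dedicated builder, which simply has no stem_exclusion setter:
Analyzer cjk = CjkAnalyzer.of(c -> c.stopwords(Arrays.asList("a", "an", "the")))._toAnalyzer();

// With Cjk removed from the Language enum, the unsupported combination cannot
// even be written; the following would no longer compile:
// LanguageAnalyzer.of(l -> l.language(Language.Cjk).stemExclusion(Arrays.asList("foo")));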
I found other big issues with the analyzers API while testing and looking at the code. I was thinking about creating separate tickets because this discussion is becoming quite long, but I will add them in this thread so you have an idea of the minimum I think needs to be done. They are based on the use case we need supported at work.
Here it is.
There seem to be a lot of problems with the IndexSettingsAnalysis.Builder. (This can be the subject of another PR.)
Check this example:
Map<String, Analyzer> map = new HashMap<>();
map.put("some_analyzer",
        LanguageAnalyzer.of(l -> l.language(Language.German)
                .stemExclusion(new ArrayList<>()))
                ._toAnalyzer());
IndexSettingsAnalysis settings = IndexSettingsAnalysis.of(
        anl -> anl
                .analyzer(map)
                .charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
);
System.out.println(toJson(settings));
This will produce this JSON:
{"analyzer":{"some_analyzer":{"type":"language","language":"German","stem_exclusion":[]}},"char_filter":{"some_char_filter":"html_strip"}}
Now, we should be able to specify the char_filter under "some_analyzer" too, not only under "analyzer" as it is here.
These two jsons are perfectly fine and we are actually using this in production:
"english_analyzer": {
"type": "english",
"char_filter": [
"html_strip"
],
"tokenizer": "standard"
}
"german_analyzer" : {
"filter" : [ "lowercase", "german_normalization", "german_stop", "german_synonyms", "german_decompounder", "german_synonyms", "possessive_stemmer", "german_stemmer" ],
"char_filter" : [ "html_strip" ],
"tokenizer" : "standard"
},
So some redesign is needed; the easiest would be to modify the language deserializer to accept "char_filter" and some other clauses.
Here is the deserialization function for IndexSettingsAnalysis:
protected static void setupIndexSettingsAnalysisDeserializer(ObjectDeserializer<IndexSettingsAnalysis.Builder> op) {
op.add(Builder::analyzer, JsonpDeserializer.stringMapDeserializer(Analyzer._DESERIALIZER), "analyzer");
op.add(Builder::charFilter, JsonpDeserializer.stringMapDeserializer(CharFilter._DESERIALIZER), "char_filter");
op.add(Builder::filter, JsonpDeserializer.stringMapDeserializer(TokenFilter._DESERIALIZER), "filter");
op.add(Builder::normalizer, JsonpDeserializer.stringMapDeserializer(Normalizer._DESERIALIZER), "normalizer");
op.add(Builder::tokenizer, JsonpDeserializer.stringMapDeserializer(Tokenizer._DESERIALIZER), "tokenizer");
}
I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for "char_filter", "filter", "normalizer" and "tokenizer".
The CharFilter.Builder seems to be incomplete too.
There is no way to construct an array of names, as is expected here:
"english_analyzer": {
"type": "english",
"char_filter": [
"html_strip"
],
"tokenizer": "standard"
}
It only allows a map:
.charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
which will result in
{"some_char_filter":"html_strip"}
Now, I was looking at the APIs related to the CharFilter builders. There are a lot of classes and it is hard to understand which are the right ones to use, so refactoring that will be a pain...
I suggest the simple fix would be having a "references" method in the builder...
We would set it like this:
CharFilter.of(c -> c.references(new String[]{"html_strip","some_custom_charset_defined_elsewhere"}))
Thanks @brentam, I will try to answer based on the documentation.

Aren't we being too pedantic here? Is the way to construct the language analyzer using the LanguageAnalyzer Builder really that important?

This model is generated from the API specification, so either it is supposed to work or the API is completely wrong. The fact that stem_exclusion is not enforced is probably not good; I suspect that when it is specified, the expectation is that it should be applied during analysis (I will look into it shortly).

That is another reason I am in favour of keeping the 33 new analyzers I have created in the other PR and not touching the LanguageAnalyzer at all.

As you discovered, the LanguageAnalyzer does not work at all at the moment. I believe that with the changes from this pull request it should work the way it is supposed to, right?

I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for "char_filter", "filter", "normalizer" and "tokenizer".

No, I think we should not: this belongs to custom analyzers [1]
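For reference, a sketch of expressing something like the english_analyzer above as a custom analyzer instead, assuming the CustomAnalyzer builder exposes tokenizer/charFilter/filter by name like the other generated builders do (the filter chain here is only illustrative):

// A custom analyzer carries the char_filter/filter/tokenizer customization itself.
Analyzer englishLike = CustomAnalyzer.of(c -> c
        .tokenizer("standard")
        .charFilter("html_strip")
        .filter(Arrays.asList("lowercase", "porter_stem")))
    ._toAnalyzer();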
@brentam quick update (my apologies, learning some things along with you), this is my understanding of all the pieces so far:
- Language analyzers: OpenSearch uses some predefined language analyzers that come with Apache Lucene; those can be used by specifying "type": "<language>" when defining analyzers.
- All of those predefined language analyzers support stop words, but not all support stem_exclusion. The way it works, though, is very unexpected: there is no validation, and the unsupported properties (like stem_exclusion in the case of Cjk) are silently ignored. More than that, to my understanding, it equally applies to char_filter & co: they are not taken into account, since predefined language analyzers do not support this customization (that is why custom has to be used). In practice it means that what you think you configure is not what you get:

"english_analyzer": { "type": "english", "char_filter": [ "html_strip" ], "tokenizer": "standard" }

is equivalent to

"english_analyzer": { "type": "english" }

and sadly, is also equivalent to (since char_filter and tokenizer are ignored completely)

"english_analyzer": {
  "type": "english",
  "char_filter": [
    "garbage"
  ],
  "tokenizer": "garbage"
}
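One way to see that "silently ignored" behaviour from the Java client is to run the same text through the _analyze API and compare the tokens; a sketch, where client, the index name and the text are placeholders (IOException handling omitted):

// The token stream comes back identical whether or not the language analyzer
// definition carries char_filter / tokenizer entries.
AnalyzeResponse response = client.indices().analyze(a -> a
        .index("my-index")               // placeholder: index where "english_analyzer" is defined
        .analyzer("english_analyzer")
        .text("<p>The Quick Foxes</p>"));
response.tokens().forEach(t -> System.out.println(t.token()));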
All in all, I think we are doing a very good change (thanks to you): first of all, fixing the client libraries, and secondly, uncovering the gaps we currently have on the server side. Lastly, the documentation will need a large chunk of work there as well.
omg, that is sad... lol. I will have to review our analyzer definitions at work now... lol
@@ -92,6 +90,74 @@ public enum Kind implements JsonEnum {
        Cjk("cjk"),
        Arabic("arabic"),
Curious, why do we need to mirror the Language enumeration here?
@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
     */
    @Override
    public Analyzer.Kind _analyzerKind() {
-       return Analyzer.Kind.Language;
+       return Analyzer.Kind.valueOf(this.language.name());
I think we don't need to change the analyzer kind here; we need to change serialization (type -> language) and deserialization (the Analyzer class):
for (final Language language: Language.values()) {
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, language.jsonValue());
}
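And on the serialization side that would amount to writing the language's JSON value as the type directly; a sketch of what the generated serializeInternal could look like (imports and the remaining properties omitted):

protected void serializeInternal(JsonGenerator generator, JsonpMapper mapper) {
    // emit the concrete language as the analyzer type, e.g. "type": "german"
    generator.write("type", this.language.jsonValue());
    // stopwords, stopwords_path, stem_exclusion, version written as before
}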
-       generator.write("type", "language");
+       generator.write("type", this._analyzerKind().jsonValue());
👍
@@ -589,8 +592,11 @@ public Analyzer build() {
    protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {
        for (Language value : Language.values()) {
            op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
You already did that, right?
-           op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
+           op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue());
        // assertTrue(analyzer2.isCjk());
        // assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
        // assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
        // }
@Test
public void testLanguage_Analyzer() {
    for (final Language language : Language.values()) {
        final Analyzer analyzer = new Analyzer.Builder()
            .language(b -> b.language(language).stopwords(Arrays.asList("a", "b", "c")).stopwordsPath("path"))
            .build();
        String str = toJson(analyzer);
        assertEquals("{\"type\":\"" + language.jsonValue() + "\",\"stopwords\":[\"a\",\"b\",\"c\"],\"stopwords_path\":\"path\"}", str);
        Analyzer analyzer2 = fromJson(str, Analyzer._DESERIALIZER);
        assertEquals(analyzer.language().stopwords(), analyzer2.language().stopwords());
        assertEquals(analyzer.language().stopwordsPath(), analyzer2.language().stopwordsPath());
        assertEquals(analyzer.language().language(), language);
    }
}
Thanks! I'll need to go through this a bit. I'm not clear on why the client needs to be aware of what analyzers exist server-side (especially since that can vary based on what plugins are installed on the server). I'm going to read through until I understand what the client is doing.
@reta @msfroh
@brentam sure, please check if we already have an issue for that; f.e. #377 may be the one you are looking for.
Description
This is a temporary/draft PR to discuss the viability of changes to the LanguageAnalyzer needed to correct some issues.
The original PR is here: #779
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.