All OCR formats supported by this plugin have the possibility of encoding alternative readings for a given word. These can either come from the OCR engine itself and consist of other high-confidence readings for a given sequence of characters, or they could come from a manual or semi-automatic OCR correction system.
Note

The markup used for alternative readings differs by format:

- hOCR: <span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span> (see the hOCR specification)
- ALTO: <String …><ALTERNATIVE>...</ALTERNATIVE></String> (see AlternativeType in the ALTO schema)
- MiniOCR: ⇿ (U+21FF) (see the MiniOCR documentation)
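For illustration, a hypothetical hOCR word carrying one alternative reading could look like this (the bounding box and the readings are made up):

<span class="ocrx_word" title="bbox 103 215 331 258">
  <span class="alternatives">
    <ins class="alt">Gefahren</ins>
    <del class="alt">Gefangen</del>
  </span>
</span>

With alternative expansion enabled (see below), both Gefahren and Gefangen end up indexed at the same text position, so a query for either form finds this word.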
In any case, these alternative readings can improve your users' search experience by allowing us to index multiple forms for a given text position. This enables users to find more matching passages for a given query than if only a single form was indexed for every word. This is a form of index-time term expansion, similar in concept to e.g. the Synonym Graph Filter that ships with Solr.
To enable the indexing of alternative readings, you have to make some modifications to your OCR field's index analysis chain.

First, you need to enable alternative expansion in the OcrCharFilterFactory by setting the expandAlternatives attribute to true:
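<charFilter
  class="solrocr.OcrCharFilterFactory"
  expandAlternatives="true"
/>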
Next, you need to add a new OcrAlternativesFilterFactory token filter component to your analysis chain. This component must be placed after the tokenizer:
<fieldType name="text_ocr" class="solr.TextField">
  <!-- .... -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solrocr.OcrAlternativesFilterFactory"/>
  <!-- .... -->
</fieldType>
A full field definition for an OCR field with alternative expansion could look like this:
<fieldType name="text_ocr" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solrocr.ExternalUtf8ContentFilterFactory"/>
    <charFilter
      class="solrocr.OcrCharFilterFactory"
      expandAlternatives="true"
    />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solrocr.OcrAlternativesFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Highlighting matches on alternative forms
During highlighting, you will only see the matching alternative form in the snippet if the match is on a single word, or if it is at the beginning or the end of a phrase match. This is because Lucene's highlighting machinery does not give us access to the offsets of matching terms inside a phrase match.
Unsupported tokenizers
The OcrAlternativesFilterFactory works with almost all tokenizers shipping with Solr, except for the ClassicTokenizer. This is because we use the WORD JOINER (U+2060) character to denote alternative forms in the character stream, and the classic tokenizer splits tokens on this character (contrary to Unicode rules). This also means that if you use a custom tokenizer, you need to make sure that it does not split tokens on U+2060.
Non-alphabetic characters in alternatives
Some of Solr's built-in tokenizers split tokens on special characters like - that occur inside of words. When such characters occur within tokens that have alternatives, the alternatives are severed from the original token and the plugin will not index them. To avoid this, either use a tokenizer that doesn't split on these characters (like the WhitespaceTokenizerFactory, see the sketch below) or consider customizing your tokenizer of choice to not split on these characters when a token includes alternative readings. Note that this can lead to less precise results: e.g. when alpha-numeric is not split, only a query like alphanumeric or alpha-numeric will match (depending on the analysis chains), but not alpha or numeric alone, nor an "alpha numeric" phrase query.
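As a sketch of the first option, the index analyzer from the full field definition above could simply swap the standard tokenizer for the whitespace tokenizer; everything else stays the same:

<analyzer type="index">
  <charFilter class="solrocr.ExternalUtf8ContentFilterFactory"/>
  <charFilter
    class="solrocr.OcrCharFilterFactory"
    expandAlternatives="true"
  />
  <!-- splits on whitespace only, so tokens containing characters like "-" stay intact -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solrocr.OcrAlternativesFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>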
Consider increasing the standard maxTokenLength of 255
When your OCR contains a large number of alternatives for tokens, or when these alternatives can get quite long, consider increasing the maximum token length in your tokenizer's configuration. For most of Solr's tokenizers this can be done with the maxTokenLength parameter, which defaults to 255. When the plugin encounters a case where this leads to truncated alternatives, it will print a warning to the Solr log. Consider increasing the value to 512 or 1024. This will come at the expense of an increase in memory usage during indexing, but will preserve as many of your alternative readings as possible.
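For the standard tokenizer used in the field definition above, this would look something like the following (1024 chosen as an example value):

<tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>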
{"use strict";/*!
+ * escape-html
+ * Copyright(c) 2012-2013 TJ Holowaychuk
+ * Copyright(c) 2015 Andreas Lubbe
+ * Copyright(c) 2015 Tiancheng "Timothy" Gu
+ * MIT Licensed
+ */var Va=/["'&<>]/;qn.exports=za;function za(e){var t=""+e,r=Va.exec(t);if(!r)return t;var o,n="",i=0,s=0;for(i=r.index;i