Commit
Update docs, use material theme
- Updated CI config to publish documentation to gh-pages
- Updated namespaces in examples to reflect recent change
- Updated changelog to always have a link to the GitHub release page
- Added references to the new package management repository
- mkdocs-material allows us to highlight individual lines in code
  blocks, which should help with the readability of the configuration
  snippets.
jbaiter committed Jun 1, 2022
1 parent ee36b6f commit d6c29d7
Showing 10 changed files with 246 additions and 71 deletions.
47 changes: 47 additions & 0 deletions .github/workflows/ci.yml
@@ -39,3 +39,50 @@ jobs:
        run: chmod -R a+rw ./target
      - name: Run integration tests
        run: ./integration-tests/run.sh

  publish_mkdocs_latest:
    if: github.event_name == 'push' && contains(github.ref, 'refs/tags/')
    runs-on: ubuntu-latest
    needs: build
    env:
      GIT_COMMITTER_NAME: mkdocs-mike
      GIT_COMMITTER_EMAIL: [email protected]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: '3.10'
          cache: 'pip'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install mkdocs singledispatch mkdocs-material mike
      - name: Deploy documentation with mike
        run: mike deploy ${{github.ref_name}} latest
      - name: Push gh-pages branch
        run: git push origin gh-pages:gh-pages


  publish_mkdocs_wip:
    if: github.event_name == 'push' && contains(github.ref, 'main')
    runs-on: ubuntu-latest
    needs: build
    env:
      GIT_COMMITTER_NAME: mkdocs-mike
      GIT_COMMITTER_EMAIL: [email protected]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: '3.10'
          cache: 'pip'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install mkdocs singledispatch mkdocs-material mike
      - name: Deploy documentation with mike
        run: mike deploy wip
      - name: Push gh-pages branch
        run: git push origin gh-pages:gh-pages
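
A note on the `python-version` value used in both jobs above: YAML resolves an unquoted `3.10` as a floating-point number, so the trailing zero is lost before the action sees it, while a quoted `'3.10'` stays a string. The Python sketch below only mimics that scalar resolution with a hypothetical helper (`resolve_plain_scalar`); it is a simplified illustration, not a YAML parser.

```python
def resolve_plain_scalar(text):
    """Roughly mimic how a YAML loader resolves an unquoted scalar.

    Simplifying assumption: try integer, then float, otherwise keep the string.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return text


unquoted = resolve_plain_scalar("3.10")  # corresponds to: python-version: 3.10
quoted = "3.10"                          # corresponds to: python-version: '3.10'

print(repr(unquoted))  # a float whose text form no longer ends in "10"
print(repr(quoted))    # the string is passed through unchanged
```

Quoting the version is therefore the safer spelling whenever a version number ends in a zero.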
2 changes: 1 addition & 1 deletion docs/alternatives.md
@@ -43,7 +43,7 @@ chain. This component must be placed **after the tokenizer**:

A full field definition for an OCR field with alternative expansion could look like this:

```xml
```xml hl_lines="3 4 5 6 7 9"
<fieldType name="text_ocr" class="solr.TextField">
<analyzer type="index">
<charFilter class="solrocr.ExternalUtf8ContentFilterFactory"/>
```
69 changes: 69 additions & 0 deletions docs/changes.md
@@ -1,16 +1,67 @@
## 0.8.0 (2022-05-??)
The major improvement in this version is compatibility with Solr 9.

Due to a number of API changes in Solr and Lucene, we now have to ship two separate releases,
one for Solr 7 and 8 and one for Solr 9, so please take extra care when downloading to pick
the correct release. In the Package Repository, the Solr 7/8 release will always have a version
with the suffix `-solr78`.

We also **changed the package namespaces** for all user-facing components so they are easier
to identify and write. What this means is that you will need to change the `class="..."`
attributes in your `solrconfig.xml` and `schema.xml` to match the new package namespaces.
Whenever you previously had `de.digitalcollections.solrocr.<other stuff>.ClassName`, you
now have to simply write `solrocr.ClassName`.

**New Features:**

- For users running Solr in SolrCloud mode, the plugin can now be installed via Solr's
[Package Manager](https://solr.apache.org/guide/solr/latest/configuration-guide/package-manager.html):
```
$ bin/solr package add-repo dbmdz.github.io https://dbmdz.github.io/solr
$ bin/solr package install ocrhighlighting # For Solr 9
$ bin/solr package install ocrhighlighting:0.8.0-solr78 # For Solr 7 and 8
```
Note that Solr 7/8 users need to manually specify the version.

**API changes:**

- Changed deployment process to use two separate packages, one for Solr 9 and later and one for Solr 7/8, with a `-solr78.jar` suffix
- Changed namespace of all user-facing components to simply `solrocr` and moved all
user-facing component classes to it:
* `de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory`<br>
`solrocr.OcrCharFilterFactory`
* `de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory`<br>
`solrocr.ExternalUtf8ContentFilterFactory`
* `de.digitalcollections.solrocr.lucene.OcrAlternativesFilterFactory`<br>
`solrocr.OcrAlternativesFilterFactory`
* `de.digitalcollections.solrocr.lucene.OcrHighlightComponent`<br>
`solrocr.OcrHighlightComponent`
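
The renames listed above all follow one pattern, so existing `solrconfig.xml` and `schema.xml` files could be migrated mechanically. The following is a minimal Python sketch under the assumption that every class reference below `de.digitalcollections.solrocr` in your configuration is one of the user-facing components that moved; `migrate_class_names` and the regular expression are illustrative and not part of the plugin. Review the result before deploying it.

```python
import re

# Matches any class name below the old namespace and captures the final
# ClassName segment. Assumption: every match is a user-facing component
# that now lives directly in the `solrocr` namespace.
OLD_NAMESPACE = re.compile(r"de\.digitalcollections\.solrocr\.(?:\w+\.)*(\w+)")


def migrate_class_names(xml_text):
    """Rewrite old fully-qualified plugin class names to `solrocr.ClassName`."""
    return OLD_NAMESPACE.sub(r"solrocr.\1", xml_text)


old_line = (
    '<charFilter class="de.digitalcollections.solrocr.lucene.filters.'
    'OcrCharFilterFactory"/>'
)
print(migrate_class_names(old_line))
```

Running the function a second time leaves the text unchanged, since the new names no longer match the old prefix.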

**Bugfixes**

- Fix handling of quoted property values in hOCR title tags. We deviate a bit from the spec
to be more compatible with existing real-world data: Values like `x_source` can now either
be quoted in single or double quotes, or not at all; the parser will handle every case.
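
To make the three accepted spellings concrete, here is a small Python sketch of tolerant property parsing for an hOCR `title` attribute. It is not the plugin's parser (which is written in Java); `parse_hocr_title` is a hypothetical helper, and it makes the simplifying assumption that property values never contain a `;`.

```python
def parse_hocr_title(title):
    """Split an hOCR `title` attribute into a {property: value} mapping.

    Values may be wrapped in single quotes, double quotes, or nothing at all.
    Simplifying assumption: values do not contain ';'.
    """
    properties = {}
    for chunk in title.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The property name is the first token, the rest is its value.
        name, _, value = chunk.partition(" ")
        value = value.strip()
        # Drop one pair of matching single or double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        properties[name] = value
    return properties


for title in (
    'bbox 0 0 100 200; x_source "/data/p1.png"',
    "bbox 0 0 100 200; x_source '/data/p1.png'",
    "bbox 0 0 100 200; x_source /data/p1.png",
):
    print(parse_hocr_title(title)["x_source"])
```

All three inputs yield the same `x_source` value, which is the behaviour the fix describes.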

## 0.7.2 (2022-03-22)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.7.2)

And yet another bugfix release.

**Bugfixes:**

- Fixed using single quotes in MiniOCR input; previously these files were not recognized as valid MiniOCR files
([#247](https://github.com/dbmdz/solr-ocrhighlighting/pull/247), thanks @mspalti for the fix!)
- Fixed `OutOfBoundsException` when using alternatives with very long tokens (
[#230](https://github.com/dbmdz/solr-ocrhighlighting/pull/230), thanks @fd17 for the report and review)

## 0.7.1 (2021-09-24)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.7.1)

Another bugfix release, upgrading is recommended.

**Bugfixes:**

- Fix text display and "number of snippets" slider in demo setup
- Fix instances where we were using Java SDK methods that relied on a default locale, which led to
hard-to-debug issues in some locales
@@ -20,6 +71,8 @@ Another bugfix release, upgrading is recommended.
- Fix issue with namespaced ALTO documents

## 0.7.0 (2021-07-12)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.7.0)

This is a bugfix release; users with ALTO files are especially encouraged to upgrade. Other than
bugfixes, this is the first release to support Solr 8.9.

@@ -30,6 +83,8 @@ bugfixes, this is the first release to support Solr 8.9.
- Fix issue when an hOCR file had empty OCR boxes

## 0.6.0 (2021-05-11)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.6.0)

This is a major new release with significant improvements in stability, accuracy and most importantly performance.
Updating is **highly** recommended, especially for ALTO users, who can expect a speed-up in indexing of up to
**6000% (i.e. 60x as fast)**. We also recommend updating your JVM to at least Java 11 (LTS), since Java 9 introduced
@@ -77,6 +132,7 @@ significantly.


**API changes:**

- **No more need for an explicit `hl.fl` parameter for highlighting non-OCR fields.** By default,
if highlighting is enabled and no `hl.fl` parameter is passed by the user, Solr falls back to
highlighting every stored field in the document. Previously this did not work with the plugin and
@@ -93,6 +149,7 @@ significantly.
See the above section under *New Features* for an explanation of this flag.

**Bugfixes:**

- **Improved tolerance for incomplete bounding boxes.** Previously the occurrence of an incomplete
bounding box in a snippet (i.e. with one or more missing coordinates) would crash the whole query.
We now simply insert a `0` default value in these cases.
@@ -103,16 +160,20 @@ significantly.
the OCR parsers would try to either load a file from the empty string or parse OCR markup from it.

## 0.5.0 (2020-10-07)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.5.0)

No breaking changes this time around, but a few essential bugfixes, more stability and a new feature.

**API changes:**

- **Snippets are now sorted by their descending score/relevancy.** Previously the order was non-deterministic, which
broke the use case for dynamically fetching more snippets.
- **Add a new boolean `hl.ocr.alignSpans` parameter to align text and image spans.** This new option (disabled by
default) ensures that the spans in text and image match, i.e. it forces the `<em>...</em>` in the highlighted text
to correspond to actual OCR word boundaries.
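
As a rough illustration of how the new parameter might be passed, the Python sketch below builds a Solr query string with `hl.ocr.alignSpans` enabled. The helper `build_highlight_query`, the field name `ocr_text`, and the use of `hl.ocr.fl` to select the OCR field are assumptions that are not taken from this changelog; adapt them to your own setup.

```python
from urllib.parse import urlencode


def build_highlight_query(query, ocr_field, align_spans=True):
    """Return URL-encoded request parameters for an OCR highlighting query."""
    params = {
        "q": query,
        "hl": "true",
        # Assumption: the OCR field to highlight is selected via `hl.ocr.fl`.
        "hl.ocr.fl": ocr_field,
        # Disabled by default according to the changelog entry above.
        "hl.ocr.alignSpans": "true" if align_spans else "false",
    }
    return urlencode(params)


print(build_highlight_query("ocr_text:dixon", "ocr_text"))
```

The resulting string can be appended to the URL of whichever request handler has the OCR highlighting component registered.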

**Bugfixes:**

- **Fix regular highlighting in distributed setup.** Regular, non-OCR highlighting was broken in previous versions due
to a bad check in the shard response collection phase if users only requested regular highlighting, but not for OCR
fields
@@ -125,9 +186,12 @@ No breaking changes this time around, but a few essential bugfixes, more stabili


## 0.4.1 (2020-06-02)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.4.1)

This is a patch release with a fix for excessive memory usage during indexing.

## 0.4.0 (2020-05-11)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.4.0)

This is a major release with a focus on compatibility and performance.

@@ -136,6 +200,7 @@ This is a major release with a focus on compatibility and performance.
the future.

**Breaking API changes:**

- **Add new `pages` key to snippet response with page dimensions**. This can be helpful if you need to calculate
the snippet coordinates relative to the page image dimensions.
- **Replace `page` key on regions and highlights with `pageIdx`**. That is, instead of a string with the
@@ -146,11 +211,13 @@ This is a major release with a focus on compatibility and performance.
disjunct parts of the page or even multiple pages.
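
The use case mentioned for the `pages` key, calculating snippet coordinates relative to the page dimensions, could look roughly like the Python sketch below. It assumes that `pageIdx` is an index into the `pages` list and that each page entry carries `width` and `height` keys; those key names, the sample values and the helper `region_relative_to_page` are illustrative assumptions rather than documented API.

```python
def region_relative_to_page(region, pages):
    """Express a region's pixel coordinates as fractions of its page size."""
    # Assumption: `pageIdx` indexes into the `pages` list of the snippet.
    page = pages[region["pageIdx"]]
    width, height = page["width"], page["height"]
    return {
        "x": region["ulx"] / width,
        "y": region["uly"] / height,
        "w": (region["lrx"] - region["ulx"]) / width,
        "h": (region["lry"] - region["uly"]) / height,
    }


# Made-up sample data, shaped like one snippet of a highlighting response.
pages = [{"id": "page_1", "width": 2000, "height": 3000}]
region = {"ulx": 200, "uly": 1500, "lrx": 1200, "lry": 1800, "pageIdx": 0}

print(region_relative_to_page(region, pages))
```

Relative values like these can be applied directly to a scaled page image, for example as CSS percentages.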

**Format changes:**

- hocr: Add support for retrieving page identifier from `x_source` and `ppageno` properties
- hocr: Strip out title tag during indexing and highlighting
- ALTO: The plugin now supports ALTO files with coordinates expressed as floating point numbers (thanks to @mspalti!)

**Performance:**

- Add concurrent preloading for highlighting target files. This can result in a nice performance boost, since by the
time the plugin gets to actually highlighting the files, their contents are already in the OS' page cache. See
the [Performance Tuning section in the docs](https://dbmdz.github.io/solr-ocrhighlighting/performance/) for more
@@ -159,12 +226,14 @@ This is a major release with a focus on compatibility and performance.
compared to previous versions.

**Miscellaneous:**

- Log warnings during source pointer parsing
- Filter out empty files during indexing
- Add new documentation section on performance tuning
- Empty regions or regions with only whitespace are no longer included in the output

## 0.3.1 (2019-07-26)
[GitHub Release](https://github.com/dbmdz/solr-ocrhighlighting/releases/tag/0.3.1)

This is a patch release that fixes compatibility with Solr/Lucene 8.2.

23 changes: 13 additions & 10 deletions docs/example.md
@@ -44,17 +44,17 @@ To run the example setup yourself, you will need:
## Solr Configuration Walkthrough

[`solrconfig.xml`](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/solr/cores/ocr/conf/solrconfig.xml)
```xml
```xml hl_lines="7 10 11 21"
<config>
<luceneMatchVersion>7.6</luceneMatchVersion>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
<schemaFactory class="ClassicIndexSchemaFactory"/>
<luceneMatchVersion>9.0</luceneMatchVersion>

<!-- Load the plugin JAR from the contrib directory -->
<!-- Load the plugin JAR from the contrib directory.
NOTE: Not needed when running with SolrCloud and Package Manager.
-->
<lib dir="../../../contrib/ocrsearch/lib" regex=".*\.jar" />

<!-- Define a search component that takes care of OCR highlighting -->
<searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent"
<searchComponent class="solrocr.OcrHighlightComponent"
name="ocrHighlight" />

<!-- Add the OCR Highlighting component to the request handler -->
@@ -73,11 +73,14 @@ To run the example setup yourself, you will need:
```

[`schema.xml`](https://github.com/dbmdz/solr-ocrhighlighting/blob/master/example/solr/cores/ocr/conf/schema.xml)
```xml
<fieldtype name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
```xml hl_lines="4 5 6 7"
<fieldtype
name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory" />
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<charFilter
class="solrocr.ExternalUtf8ContentFilterFactory" />
<charFilter
class="solrocr.OcrCharFilterFactory" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
```
2 changes: 1 addition & 1 deletion docs/formats.md
@@ -56,7 +56,7 @@ You should use this format when:

- you want to store the OCR in the index (to keep the index size as low as possible)
- reusing the existing OCR files is not possible or practical (to keep occupied disk space low)
- you want the best possible performance, highlighting MiniOCR is ~25% faster than ALTO and ~50% faster than hOCR
- you want the best possible performance: highlighting MiniOCR is ~25% faster than ALTO and ~50% faster than hOCR (in an artificial benchmark that is purely CPU-bound)

A basic example looks like this:

12 changes: 6 additions & 6 deletions docs/index.md
@@ -1,4 +1,4 @@
# Solr OCR Highlighting
# Solr OCR Highlighting Plugin

![Highlighted OCR snippet](img/snippet.png)

@@ -21,8 +21,8 @@ and its position on the page at query time:
{ "ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0 }
],
"highlights":[
[{ "text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036, "lry": 145,
"parentRegionIdx": 0},
[{ "text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036,
"lry": 145, "parentRegionIdx": 0},
{ "text": "Dixon,", "ulx": 1, "uly": 167, "lrx": 119, "lry": 204,
"parentRegionIdx": 0 }]
]
@@ -41,15 +41,15 @@ does not interfere with Solr's standard highlighting component, i.e. it works
transparently with non-OCR fields and just lets the default implementation handle
those.

The plugin **works with all Solr versions >= 7.x**.
The plugin **works with all Solr versions >= 7.x up to 9.0**.

## Features
- Index various [OCR formats](formats.md) directly without any pre-processing
* [hOCR](formats.md#hocr)
* [ALTO](formats.md#alto)
* [MiniOCR](formats.md#miniocr)
- Retrieve all the information needed to render a highlighted snippet view
directly from Solr, without post-processing
directly from Solr, without post-processing on the client side
- Keep your index size manageable by optionally re-using OCR documents on disk for
highlighting

@@ -62,5 +62,5 @@ If you want to see the **plugin in action**, you can play around with the
[example setup](example.md) hosted at [https://ocrhl.jbaiter.de](https://ocrhl.jbaiter.de)

Should you want to **run the example on your own computer** and play around with the
settings, the [Docker-based setup is available on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/master/example)
settings, the [Docker-based setup is available on GitHub](https://github.com/dbmdz/solr-ocrhighlighting/tree/main/example)
and instructions for using it are in the [Example Setup chapter](example.md)
12 changes: 7 additions & 5 deletions docs/indexing.md
@@ -1,10 +1,10 @@
# Indexing OCR documents

**If you want to store the OCR in the index itself** you can skip this section: Just put the OCR
content in the field and submit it to Solr for indexing. We recommend using the space-efficient
[MiniOCR format](./formats.md#miniocr) if you decide to go this way.
!!! note "If you want to store the OCR in the index itself you can all but _skip this section_"
Just put the OCR content in the field and submit it to Solr for indexing. We recommend using the space-efficient
[MiniOCR format](./formats.md#miniocr) if you decide to go this way.

Indexing OCR documents without storing the actual content in the index is also relatively simple:
Indexing OCR documents without storing the actual content in the index is relatively simple:
When building the index document, instead of putting the actual OCR content into the field, you use
a **source pointer**. This pointer will tell the plugin from which location to load the OCR content
during indexing and highlighting.
@@ -26,6 +26,8 @@ the (again, potentially very large) contents themselves in the index.
account. To signal to the plugin that a given source path is encoded in ASCII, include the `{ascii}`
string after the path, e.g. `/mnt/data/ocrdoc.xml{ascii}[31337:41337]`.

For even more advice on performance tuning, refer to the [corresponding documentation section](./performance.md).

The structure of the source pointers depends on how your actual OCR files on disk map to documents in the Solr
index.

@@ -111,7 +113,7 @@ The format of the regions is inspired by [Python's slicing syntax](https://docs.
- `start:end` → Everything between the byte offsets `start` (inclusive) and `end` (exclusive)
- `:end` → Everything from the start of the file to byte offset `end` (exclusive)

!!! caution "Region Requirements""
!!! caution "Region Requirements"
- The concatenated content of your regions must be a half-way valid XML structure. While we
tolerate *unclosed tags or unmatched closing tags* (they often can't be avoided), other
errors such as partial tags (i.e. a missing `<` or `>`) will lead to an error during indexing.