Version 3 Solr 7 notes
At the Royal Danish Library, the Solr 7 schema from webarchive-discovery 3.0-alpha was used in the beginning of 2018 for a full re-index of 24 billion web resources from the Danish Net Archive. The old index used the Solr 4 schema from webarchive-discovery 2.0. This document captures technical differences between 2.0 and 3.0-alpha as well as observations from the upgrade.
The Royal Danish Library uses a setup with static and fully optimized sub-collections of ~900GB / 280M documents: when a sub-collection reaches this size, it is fully optimized. A new sub-collection is then created and the old sub-collection is never updated again. Solr's alias mechanism is used to provide unified search across the sub-collections, making them appear (nearly) as a single collection.
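As a rough illustration of the alias mechanism, the sketch below builds a Solr Collections API `CREATEALIAS` request that points one alias at several sub-collections. The host, the alias name `netarchive` and the collection names `ns1` to `ns3` are hypothetical placeholders, not the names used at the Royal Danish Library.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base, alias, collections):
    """Build a Collections API URL that points `alias` at the given collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # Solr expects a comma-separated list of collection names.
        "collections": ",".join(collections),
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Hypothetical host, alias and sub-collection names.
alias_url = create_alias_url(
    "http://localhost:8983/solr", "netarchive", ["ns1", "ns2", "ns3"]
)
print(alias_url)
```

When a new sub-collection is added, the same call can be repeated with the extended list of collections; searches against the alias then cover the new sub-collection as well.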
At the server level, 4 machines are used, each with 380GB of RAM and 16 CPU cores (x2 with Hyperthreading). Storage is 25 individually mounted Samsung 930GB SSDs on each machine, 1 SSD per sub-collection. Each sub-collection is handled by a separate Solr node with an 8GB heap.
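A back-of-the-envelope check of these numbers, as a sketch only. It assumes that every SSD hosts one running Solr node and that all sub-collections are filled to the stated ~280M documents; neither is confirmed above.

```python
import math

# Figures taken from the text above.
DOCS_TOTAL = 24_000_000_000
DOCS_PER_SUBCOLLECTION = 280_000_000
MACHINES = 4
SSDS_PER_MACHINE = 25
RAM_GB_PER_MACHINE = 380
HEAP_GB_PER_NODE = 8

# Sub-collections needed to hold the full index.
subcollections_needed = math.ceil(DOCS_TOTAL / DOCS_PER_SUBCOLLECTION)

# Available slots, given 1 SSD per sub-collection.
subcollection_slots = MACHINES * SSDS_PER_MACHINE

# Assumption: all 25 Solr nodes on a machine are running at the same time.
heap_gb_per_machine = SSDS_PER_MACHINE * HEAP_GB_PER_NODE
ram_gb_outside_heap = RAM_GB_PER_MACHINE - heap_gb_per_machine

print(subcollections_needed, subcollection_slots)
print(heap_gb_per_machine, ram_gb_outside_heap)
```

Under these assumptions the 24 billion documents need 86 of the 100 available sub-collection slots, and the Solr heaps take 200GB of each machine's 380GB of RAM.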
General changes to the processing done in webarchive-discovery are not covered here. See the webarchive-discovery changelog for that. New features are reflected in new fields in the Solr index, covered below.
A general change to the Solr schema has been a switch away from `stored` fields, replacing them with `docValues`. `docValues` allows for low-overhead faceting, sorting, grouping and exporting. The price is increased retrieval time when returning documents.
Observation: In the old 2.0 setup with mostly `stored` fields, the number of fields in the returned documents had little impact on response time. Consequently the default setting was to return all possible fields. Simple document searches took ½-2 seconds. In the 3.0-alpha setup, returning all fields takes ½-1 second per document, increasing response time to 10 seconds for simple searches. Limiting to the 5 fields relevant to the Royal Danish Library's test-GUI brought response times back down to the old ½-2 second range.
Recommendation: Only request the fields that are to be used.
If limiting the fields is unacceptable, the schema can be updated to enable `stored` for all `docValues` fields. This will increase index size markedly (qualified guess: 10-30%) and require a full re-index.
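The recommendation to request only the needed fields maps to Solr's `fl` parameter. The sketch below builds such a select URL. The host, the collection name and the choice of five fields are hypothetical; the fields actually used by the test-GUI are not listed here.

```python
from urllib.parse import urlencode


def select_url(solr_base, collection, query, fields, rows=10):
    """Build a select URL that only asks for the listed fields."""
    params = {
        "q": query,
        # fl limits the fields returned per document.
        "fl": ",".join(fields),
        "rows": rows,
    }
    return f"{solr_base}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection and field selection.
search_url = select_url(
    "http://localhost:8983/solr",
    "netarchive",
    "horses",
    ["url_norm", "crawl_date", "status_code", "type", "host_surt"],
)
print(search_url)
```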
- `exif_location` with geo-coordinates from images
- `host_surt` with the host name elements in reversed order using the SURT standard
- `index_time` the index time for the document
- `links_hosts_surts` outgoing links to hosts in SURT form
- `links_images` links to images shown in HTML pages
- `links_norm` outgoing links from HTML pages
- `redirect_to_norm` HTTP 3xx redirect support
- `status_code` the HTTP status code
- `type` human readable type akin to `content_type_norm`
- `url_norm` normalised and un-ambiguated version of the URL
- `url_path` the path part of the url, sans-host
- `url_search` human-query searchable variant of the URL
- `warc_key_id` the ID specified in the WARC entry
Please see the JavaDoc for the webarchive-discovery Solr 7 schema for further details and examples of use for the different fields.
- Using Solr Collapse across multiple sub-collections tied together with an alias will treat entries in separate sub-collections as different, even though their field values are the same. Fortunately Solr Grouping works fine, and adding `group.format=simple` makes the result nearly the same as for collapsing.
- The `crawl_date` field uses the default Solr DatePointField, which is documented to have millisecond precision. This works well for standard sorting (`sort=crawl_date desc`), but when it is used for temporal proximity sorting (`sort=abs(sub(ms(2018-01-01T18:03:20Z), crawl_date)) asc`) there is jitter in the ordering, which indicates a coarser (5+ seconds) granularity or a bug somewhere. It can be bypassed somewhat by over-provisioning and re-sorting in the client, but that is a frail kludge.
- Solr 7.2 tightened security for local parameters, meaning that queries such as `q={!qf='title, text...'}horses` no longer work. Blacklight uses this syntax. At the Royal Danish Library this was fixed by setting `defType=edismax` and setting the `type` in the local parameters with `{!type=edismax qf='title, text...'}horses`. The problem is known at Blacklight #1838.
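The three workarounds above can be sketched as small request helpers. This is a minimal illustration, not the Royal Danish Library's client code: the function names, the grouping field `url_norm`, the over-provisioning factor of 5 and the `qf` value `title text` are all assumptions made for the example.

```python
from datetime import datetime, timezone


def grouped_params(query, group_field, rows=10):
    """Solr Grouping with the flat 'simple' response format."""
    return {
        "q": query,
        "group": "true",
        "group.field": group_field,  # hypothetical choice of field
        "group.format": "simple",
        "rows": rows,
    }


def proximity_params(query, target, wanted, overprovision=5):
    """Temporal proximity sort, asking for more rows than needed."""
    return {
        "q": query,
        "sort": f"abs(sub(ms({target}), crawl_date)) asc",
        "rows": wanted * overprovision,  # assumed over-provisioning factor
        "fl": "url_norm,crawl_date",
    }


def parse_solr_date(value):
    """Parse a Solr UTC timestamp such as 2018-01-01T18:03:20Z."""
    return datetime.fromisoformat(value.rstrip("Z")).replace(tzinfo=timezone.utc)


def resort_by_proximity(docs, target, wanted):
    """Re-sort over-provisioned documents by exact distance to target."""
    t = parse_solr_date(target)
    ranked = sorted(
        docs,
        key=lambda d: abs((parse_solr_date(d["crawl_date"]) - t).total_seconds()),
    )
    return ranked[:wanted]


def edismax_params(user_query, qf):
    """Name the parser explicitly in defType and in the local parameters."""
    return {
        "defType": "edismax",
        "q": f"{{!type=edismax qf='{qf}'}}{user_query}",
    }


# Documents as they might come back slightly out of order from Solr.
returned = [
    {"url_norm": "http://example.com/a", "crawl_date": "2018-01-01T18:03:30Z"},
    {"url_norm": "http://example.com/b", "crawl_date": "2018-01-01T18:03:21Z"},
    {"url_norm": "http://example.com/c", "crawl_date": "2018-01-01T18:03:15Z"},
]
closest = resort_by_proximity(returned, "2018-01-01T18:03:20Z", wanted=2)
print([d["url_norm"] for d in closest])
print(edismax_params("horses", "title text")["q"])
```

The client-side re-sort only corrects the order within the rows that were fetched. A document that Solr ranks outside the over-provisioned window is still missed, which is why this remains a frail workaround.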