Version 3 Solr 7 notes
At the Royal Danish Library, the Solr 7 schema from webarchive-discovery 3.0-alpha was used in the beginning of 2018 for a full re-index of 24 billion web resources from the Danish Net Archive. The old index used the Solr 4 schema from webarchive-discovery 2.0. This document captures technical differences between 2.0 and 3.0-alpha as well as observations from the upgrade.
The Royal Danish Library uses a setup with static and fully optimized sub-collections of ~900GB / 280M documents: when a sub-collection reaches this size, it is fully optimized. A new sub-collection is then created and the old sub-collection is never updated again. Solr's alias mechanism is used to provide unified search across the sub-collections, making them appear (nearly) as a single collection.
At the server level, 4 machines are used, each with 380GB of RAM and 16 CPU cores (x2 with Hyperthreading). Storage is 25 individually mounted Samsung 930GB SSDs on each machine, 1 SSD per sub-collection. Each sub-collection is handled by a separate Solr node with an 8GB heap.
General changes to the processing done in webarchive-discovery are not covered here. See the webarchive-discovery changelog for that. New features are reflected in new fields in the Solr index, covered below.
A general change to the Solr schema has been a switch away from stored fields, replacing them with docValues. docValues allows for low-overhead faceting, sorting, grouping and exporting. The price is increased retrieval time when returning documents.
Observation: In the old 2.0 setup with mostly stored fields, the number of fields in the returned documents had little impact on response time. Consequently the default setting was to return all possible fields. Simple document searches took ½-2 seconds. In the 3.0-alpha setup, returning all fields takes ½-1 second per document, increasing response time to 10 seconds for simple searches. Limiting to 5 fields relevant to the Royal Danish Library's test-GUI brought response times back down to the old ½-2 second range.
Recommendation: Only request the fields that are to be used.
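The recommendation can be illustrated with a minimal sketch that builds a select URL using Solr's fl parameter. The endpoint, the alias name netarchive and the five field names are assumptions chosen for illustration; they are not the fields used by the test-GUI.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and alias name; substitute your own Solr setup.
SOLR_SELECT = "http://localhost:8983/solr/netarchive/select"


def build_search_url(query, fields, rows=10):
    """Build a select URL that asks Solr for the listed fields only."""
    params = {
        "q": query,
        "rows": rows,
        # fl restricts the returned fields, so fewer docValues lookups
        # are needed per returned document.
        "fl": ",".join(fields),
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Illustrative field selection (an assumption, not the library's actual list).
url = build_search_url(
    "horses", ["url", "crawl_date", "title", "content_type_norm", "hash"]
)
print(url)
```

Leaving fl out, or setting it to *, asks for every field, which is the slow case described in the observation.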
If limiting the fields is unacceptable, the schema can be updated to enable stored for all docValues fields. This will increase index size markedly (qualified guess: 10-30%) and require a full re-index.
Update 2018-07-03: It seems that the DocValues impact on performance is caused by the way DocValues are represented in Solr 7. There is a decent chance of improving DocValues performance considerably, without re-indexing. Keep an eye on LUCENE-8374.
A resource that has been de-duplicated in the harvester is represented with record_type:revisit. Unfortunately the WARC header WARC-Refers-To is not indexed in 3.0-alpha, so locating the real record instance for a revisited record is quite convoluted: q=url:"<revisit_url>" AND hash:"<revisit_hash>" NOT record_type:revisit&rows=1&fq=crawl_date:[* TO <revisit_date>}&sort=crawl_date desc.
If crawl_date and HTTP-header information are not relevant for the task, the fairly heavy query above can be reduced to q=url:"<revisit_url>" AND hash:"<revisit_hash>" NOT record_type:revisit&rows=1. This will return any of the instances where the content matches the revisited record.
- exif_location: geo-coordinates from images
- host_surt: the host name elements in reversed order using the SURT standard
- index_time: the index time for the document
- links_hosts_surts: outgoing links to hosts in SURT form
- links_images: links to images shown in HTML pages
- links_norm: outgoing links from HTML pages
- redirect_to_norm: HTTP 3xx redirect support
- status_code: the HTTP status code
- type: human readable type akin to content_type_norm
- url_norm: normalised and un-ambiguated version of the URL
- url_path: the path part of the url, sans-host
- url_search: human-query searchable variant of the URL
- warc_key_id: the ID specified in the WARC entry
Please see the JavaDoc for the webarchive-discovery Solr 7 schema for further details and examples of use for the different fields.
- Using Solr Collapse across multiple sub-collections tied together with an alias will treat entries in separate sub-collections as different, even though their field values are the same. Fortunately Solr Grouping works fine, and adding group.format=simple makes the result nearly the same as for collapsing.
- Asking for the number of unique groups with group.ngroups=true is highly discouraged. On a distributed web-scale index, this operation is extremely heavy (think minutes and Out Of Memory). Instead, an approximate count of unique groups for e.g. url can be calculated at relatively low cost with stats=true&stats.field={!cardinality=true}url.
- The crawl_date field uses the default Solr DatePointField, which is documented to have millisecond precision. This works well for standard sorting (sort=crawl_date desc), but when using it for temporal proximity sort (sort=abs(sub(ms(2018-01-01T18:03:20Z), crawl_date)) asc) there is jitter in the ordering, which indicates a coarser (5+ seconds) granularity or a bug somewhere. It can be bypassed somewhat by over-provisioning and re-sorting in the client, but that is a frail kludge.
- Solr 7.2 tightened security for local parameters, meaning that queries such as q={!qf='title, text...'}horses no longer work. Blacklight uses this syntax. At the Royal Danish Library this was fixed by setting defType=edismax and setting the type in the local parameters with {!type=edismax qf='title, text...'}horses. The problem is known at Blacklight #1838.
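The grouping, cardinality and re-sorting workarounds from the list above can be sketched as follows. The function names are hypothetical and the sketch assumes that crawl_date comes back as a UTC timestamp string such as 2018-01-01T18:03:20Z. It only builds parameters and re-sorts already fetched documents; it is not the library's actual client code.

```python
from datetime import datetime, timezone


def grouped_search_params(query, group_field="url", rows=10):
    """Grouping parameters that give a flat, collapse-like result list."""
    return {
        "q": query,
        "rows": rows,
        "group": "true",
        "group.field": group_field,
        "group.format": "simple",
        # group.ngroups is deliberately left out: it is too heavy on a
        # large distributed index.
    }


def approximate_unique_params(query, field="url"):
    """Parameters for an approximate unique count via the stats component."""
    return {
        "q": query,
        "rows": 0,
        "stats": "true",
        "stats.field": "{!cardinality=true}" + field,
    }


def parse_solr_date(value):
    """Parse a UTC timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in value else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def resort_by_proximity(docs, target, keep):
    """Re-sort over-fetched docs by exact distance to target, keep the closest."""
    origin = parse_solr_date(target)

    def distance(doc):
        delta = parse_solr_date(doc["crawl_date"]) - origin
        return abs(delta.total_seconds())

    return sorted(docs, key=distance)[:keep]


# Made-up documents, as if returned in slightly jittered order.
docs = [
    {"crawl_date": "2018-01-01T18:03:30Z"},
    {"crawl_date": "2018-01-01T18:03:21Z"},
    {"crawl_date": "2018-01-01T18:02:00Z"},
]
closest = resort_by_proximity(docs, "2018-01-01T18:03:20Z", keep=2)
print([d["crawl_date"] for d in closest])
```

For the re-sort to help, the request has to ask for more rows than are finally shown, and how many extra rows are needed depends on how far the server-side order can be off. That dependency is why the approach remains a kludge.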