On INSPIRE we have roughly 20 collections defined as queries: https://github.com/inspirehep/inspire-next/blob/5b7207c8ee090658f23b818168fcef31d846e139/inspirehep/config.py#L150-L235. Using `invenio-collections` has led to a 3x slowdown in the migration of our nightly (http://inspire-nightly.cern.ch/), which means we don't get a full set of migration errors each night.

Here's a migration profile of a task adding 2100 records: https://cernbox.cern.ch/index.php/s/D0rPrCM68FaqGPh. The culprit is clearly `get_record_collections`, partly because of `_build_cache`, partly because of `_find_matching_collections_internally`. The issue with `_build_cache` has been fixed by #69, so only the other one remains.

Now, if we go inside `_find_matching_collections_internally`, no method appears to be particularly slow: it's just that creating a query and checking whether a record matches it are quite expensive operations, and doing that 20 times per record is what makes it slow.

Truth be told, running a query to copy a value from `collections.primary` to `_collections` seems a little silly to me, so I'd like to inject my own behaviour here and fall back to `_find_matching_collections_internally` only for the actual queries.

NB: `_find_matching_collections_externally` has better performance, but it puts a lot of pressure on ES, which makes it lose record inserts...

Proposal

Add a `COLLECTIONS_MATCHER` configuration variable to override the code at invenio-collections/invenio_collections/receivers.py, lines 102 to 107 in a8aec24.
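
To make the proposal concrete, here is a minimal sketch of what an injected matcher could look like. It is not the existing `invenio-collections` API: `get_path`, `fast_matcher`, `load_matcher`, the `trivial_names` argument and the dotted-path format for the config value are all hypothetical names chosen for illustration. The idea is to copy names straight out of `collections.primary` without building any query, and to call the slow query-based matcher once for the collections that really are queries.

```python
from importlib import import_module


def get_path(record, dotted_key):
    """Collect every value found at a dotted key such as 'collections.primary'."""
    values = [record]
    for part in dotted_key.split("."):
        next_values = []
        for value in values:
            # A list at any level is traversed element by element.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and part in item:
                    next_values.append(item[part])
        values = next_values
    # Flatten one level so that list-valued leaves yield their elements.
    flat = []
    for value in values:
        flat.extend(value if isinstance(value, list) else [value])
    return flat


def fast_matcher(record, trivial_names, query_collections, slow_matcher):
    """Hypothetical matcher: cheap copy first, slow queries only where needed.

    trivial_names: names of collections whose membership is exactly
        "the name appears in collections.primary" (an assumption).
    query_collections: collections that need a real query.
    slow_matcher: callable standing in for the query-based matcher;
        it receives (record, query_collections) and returns names.
    """
    names = [
        name
        for name in get_path(record, "collections.primary")
        if name in trivial_names
    ]
    for name in slow_matcher(record, query_collections):
        if name not in names:
            names.append(name)
    return names


def load_matcher(dotted_path):
    """Resolve a 'package.module.attribute' string to the object it names.

    Sketch of how a COLLECTIONS_MATCHER config value could be turned into
    a callable; the dotted-path format is an assumption.
    """
    module_name, _, attribute = dotted_path.rpartition(".")
    return getattr(import_module(module_name), attribute)
```

With something like this, the receiver would call whatever `COLLECTIONS_MATCHER` resolves to instead of hard-coding the internal matcher, and the expensive path would run over the real queries only rather than over all 20 collections for every record.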