Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

receivers: bad performance of _find_matching_collections_internally #72

Open
jacquerie opened this issue Oct 27, 2016 · 2 comments
Open
Milestone

Comments

@jacquerie
Copy link
Contributor

jacquerie commented Oct 27, 2016

On INSPIRE we have roughly 20 collections defined as queries: https://github.com/inspirehep/inspire-next/blob/5b7207c8ee090658f23b818168fcef31d846e139/inspirehep/config.py#L150-L235. Using invenio-collections has led to a 3x slowdown in the migration of our nightly (http://inspire-nightly.cern.ch/), which means we don't get a full set of migration errors each night.

Here's a migration profile of a task adding 2100 records: https://cernbox.cern.ch/index.php/s/D0rPrCM68FaqGPh. The culprit is clearly get_record_collections, partly because of _build_cache, partly because of _find_matching_collections_internally. The issue with _build_cache has been fixed by #69, so only the other one remains.

Now, if we go inside _find_matching_collections_internally no method appears to be particularly slow: it's just that creating a query and seeing if a record matches are quite expensive operations, and doing it 20 times per record is what makes it slow.

Truth to be told, running a query for copying a value from collections.primary to _collections seems a little silly to me, so I'd like to inject my own behaviour here, and falling back to _find_matching_collections_internally only for the actual queries.

NB: _find_matching_collections_externally has better performance, but it puts a lot of pressure on ES, which makes it lose record inserts...

Proposal

  • Add a COLLECTIONS_MATCHER configuration variable to override
    app = app or current_app
    if not app.config['COLLECTIONS_USE_PERCOLATOR']:
    self.matcher = _find_matching_collections_internally
    else:
    from .percolator import _find_matching_collections_externally
    self.matcher = _find_matching_collections_externally
    , and provide your own matcher.
@jirikuncar
Copy link
Member

👍 for COLLECTIONS_MATCHER - consider moving it to extension state as matcher property.

@jacquerie
Copy link
Contributor Author

jacquerie commented Aug 20, 2017

Alternate Proposal

Add an extra parameter copy to the query configuration. If set to True then it short circuits this line, skipping the expensive query building:

if _build_query(data['query']).match(record):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants