migration: performance issues #1640

Closed
3 of 6 tasks
jacquerie opened this issue Oct 12, 2016 · 4 comments

Comments

@jacquerie
Contributor

jacquerie commented Oct 12, 2016

For the past week the nightly migration has been unusually slow. Previously the migration of all records finished around 3:00 AM; now, by the time we get back to the office, only half of it is done.

Here's a profile of migrate('dumps/all_XXX.xml.gz', wait_for_results=True), obtained with the techniques described in 9b0f4a8: https://cernbox.cern.ch/index.php/s/D0rPrCM68FaqGPh

It's clear that the problem lies in get_record_collections: partly in _build_cache and partly in _find_matching_collections_internally.
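For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of collecting a profile with the standard library's cProfile and pstats. It is not necessarily the technique from 9b0f4a8, which is not reproduced here. The names profile_call, fake_migrate and slow_lookup are hypothetical stand-ins and are not part of the INSPIRE code base.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so that expensive call chains show up first.
    stats.sort_stats('cumulative').print_stats(10)
    return result, stream.getvalue()


def slow_lookup(n):
    # Hypothetical stand-in for an expensive per-record call.
    return sum(i * i for i in range(n))


def fake_migrate(records):
    # Hypothetical stand-in for migrate(); does no real migration work.
    return [slow_lookup(1000) for _ in range(records)]


result, report = profile_call(fake_migrate, 50)
print(report)
```

In the real case one would pass migrate and its arguments to profile_call instead of the stand-ins, and look for get_record_collections near the top of the cumulative-time listing.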

@jacquerie
Contributor Author

By manually adding a cache to invenio-collections with

>>> from inspirehep.modules.cache import current_cache
>>> app.extensions['invenio-collections'].cache = current_cache

we were able to work around the performance issue in _build_cache. Here's a new profile that proves it: https://cernbox.cern.ch/index.php/s/XfnXKM9i8Qggd0c

This workaround has to be turned into a proper solution.

@jacquerie
Contributor Author

> This workaround has to be turned into a proper solution.

@jirikuncar suggests:

> [...] you can make the COLLECTION_CACHE configurable using a pattern similar to the one we use in Invenio-Access (https://github.com/inveniosoftware/invenio-access/blob/master/invenio_access/ext.py#L54-L59).
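A rough sketch of what such a configurable cache could look like, assuming the extension state resolves the cache lazily from a config key that holds either an object or a dotted import path. This is not the actual Invenio-Access or invenio-collections code: the class CollectionsState, the helper import_string and the key name COLLECTION_CACHE (borrowed from the quote above) are all assumptions made for illustration.

```python
from importlib import import_module


def import_string(dotted_path):
    """Import an object from a dotted path such as 'package.module.attr'."""
    module_path, _, attr = dotted_path.rpartition('.')
    return getattr(import_module(module_path), attr)


class CollectionsState(object):
    """Hypothetical extension state; not the real invenio-collections class."""

    def __init__(self, config, cache=None):
        self.config = config  # stands in for app.config
        self._cache = cache   # explicit cache object, if one was passed in

    @property
    def cache(self):
        # An explicitly passed cache wins; otherwise fall back to the config.
        cache = self._cache
        if cache is None:
            cache = self.config.get('COLLECTION_CACHE')
        # The config value may be a dotted import path instead of an object.
        if isinstance(cache, str):
            cache = import_string(cache)
        return cache


# Usage: the dotted path below is only an importable placeholder, not a cache.
state = CollectionsState({'COLLECTION_CACHE': 'collections.OrderedDict'})
resolved = state.cache
```

With something along these lines, the manual assignment in the workaround could be replaced by a config entry pointing at inspirehep.modules.cache.current_cache.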

@jacquerie
Contributor Author

jacquerie commented Oct 12, 2016

Despite what the docstring says, setting COLLECTIONS_USE_PERCOLATOR = True appears to have improved performance a lot. Here's a newer profile that proves it: https://cernbox.cern.ch/index.php/s/klvHbciBjgx2TqE (note that this is also using the workaround described in #1640 (comment)).

Now I just manually started a nightly build. Let's see how many records are going to be there at 9.

@jacquerie
Contributor Author

jacquerie commented Oct 27, 2016

> Now I just manually started a nightly build. Let's see how many records are going to be there at 9.

The problem with this is that the percolator puts too much pressure on ES, and record inserts are lost, so we cannot go this route.

I outlined an alternative fix here: inveniosoftware/invenio-collections#72

@kaplun kaplun assigned jacquerie and unassigned Kjili Dec 6, 2016
@jacquerie jacquerie removed their assignment May 11, 2017
@jacquerie jacquerie self-assigned this Aug 10, 2017
@ghost ghost added the Status: WIP label Sep 3, 2017
@ghost ghost removed the Status: WIP label Sep 3, 2017
@jacquerie jacquerie removed their assignment Sep 29, 2017