-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve performance issues with large datasets #10383
Conversation
…n search API endpoint
This comment has been minimized.
This comment has been minimized.
…atas call in SolrSearchResult
src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
7 similar comments
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
371a5f0
to
8f01fc7
Compare
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
This is really awesome. For example, with the dataset |
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
For future reference, some examples of the command lines used on the perf. cluster for benchmarking:
And, just for the record, this one is not even our largest real-life prod. dataset by the number of files. |
All tests passed after the branch was synced up with develop. Merging. |
This reverts commit eaec499.
What this PR does / why we need it
Resolves performance issues for API calls involving heavy datasets and their files.
The main reason for the detected issues is due to the use of the
getFileMetadatas
method from theDatasetVersion
class in different areas of the application.This method queries the database for returning all files present in a version, which for small datasets is not an expensive operation but generates a performance bottleneck for big datasets like the heavy one on beta (10000) files.
Issue 1) Slow collections page / search API endpoint
Although the search API endpoint uses Solr to quickly search for results, there was a performance bottleneck when composing the json object returned by the API when one of the returned elements was a heavy dataset.
In particular, the json converter method of the
SolrSearchResult
class was calling thegetFileMetadatas
method for obtaining the total number of files. See:https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L574
I have replaced this expensive call with a custom query call, which was already present in the code (
DatasetVersionFilesServiceBean
):dataverse/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java
Line 52 in 63a09cb
I also reorganized the code a bit and added a general cleanup.
Performance monitoring
I have tested the affected Search endpoint, requesting the first page ordered by date (desc), forcing the heavy dataset to appear in the results. This is the same call js-dataverse uses.
The performance improvement obtained after the change is presented below:
Before optimization
0.140740 + 8.130421 = 8.131392 seconds
After optimization
0.134426 + 0.508241 = 0.509249 seconds
Achieved optimization: ~x16 times faster.
Considerations
While solving this problem, I tried to optimize the index search, to see if we can improve the performance in that part too. However, I did not achieve any noticeable improvement.
For example, I tested the search operation after configuring the dateSort field (used to sort the collection page results by date) to use docValues, which is a recommended mechanism for efficient sorting and faceting in Solr. But as I mentioned above, I found no significant improvements.
Issue 2) Slow Files Tab / API endpoints using PermissionsServiceBean
The GetDataFileCommand command is a widely used command in the API to obtain a file. This command handles permission checks to verify that the calling user has permissions to access the requested file.
The permissions checking logic is located in the PermissionsServiceBean class.
We discovered possible performance bottlenecks in this class, especially when dealing with files belonging to large datasets. In particular, within the isPublicallyDownloadable method, a call to getFileMetadatas was made with a for-loop iteration over the files that caused a significant performance downgrade.
I developed a new native query to replace this behavior. This new query checks if a datafile is present in a specific dataset version. In the case of this particular scenario, it checks if the datafile is present in the released dataset version. See:
https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/PermissionServiceBean.java#L455
https://github.com/IQSS/dataverse/blob/solr-date-sort-optimization/src/main/java/edu/harvard/iq/dataverse/DatasetVersionFilesServiceBean.java#L209
Performance monitoring
To test GetDataFileCommand, I used the
getFileData
endpoint for an affected datafile.The performance improvement obtained after the change is presented below:
Before optimization
0.194373 + 8.678860 = 8.679027 seconds
After optimization
0.139459 + 0.443956 = 0.444019 seconds
Achieved optimization: ~x19 times faster.
Conclusions
Observing the nature of the issues found, we can affirm that the use of the
getFileMetadatas
method should be avoided, or at least, meticulously controlled, to ensure that we do not generate performance bottlenecks in the code.In all cases where we have found this problem, it has been possible to replace the call to this method with a custom database query. We must keep in mind that a custom query designed for the particular use case will always be much more optimal than this method + the associated post filtering code.
Which issue(s) this PR closes
Is there a release notes update needed for this change?
Yes, attached.