
Resolve performance issues with large datasets #10383

Merged — 22 commits merged into develop on Apr 23, 2024

Conversation

@GPortas (Contributor) commented Mar 18, 2024

What this PR does / why we need it

Resolves performance issues for API calls involving heavy datasets and their files.

The detected issues stem mainly from the use of the getFileMetadatas method of the DatasetVersion class in different areas of the application.

This method queries the database to return all files present in a version. For small datasets this is not an expensive operation, but it creates a performance bottleneck for big datasets, such as the heavy one on beta (10,000 files).
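
For illustration, a minimal sketch of the expensive pattern (the class name and scaffolding below are hypothetical; getFileMetadatas is the real accessor):

import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.FileMetadata;
import java.util.List;

// Hedged sketch: counting files by materializing every FileMetadata entity.
// For a version with 10,000 files, this pulls 10,000 rows just to get a number.
public class SlowCountSketch {
    static long countFilesTheSlowWay(DatasetVersion version) {
        List<FileMetadata> all = version.getFileMetadatas(); // one entity per file
        return all.size();
    }
}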

Issue 1) Slow collections page / search API endpoint

Although the search API endpoint uses Solr to find results quickly, there was a performance bottleneck when composing the JSON object returned by the API if one of the returned elements was a heavy dataset.

In particular, the JSON converter method of the SolrSearchResult class was calling the getFileMetadatas method to obtain the total number of files. See:
https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L574

I replaced this expensive call with a custom query method that was already present in the code (DatasetVersionFilesServiceBean):

public long getFileMetadataCount(DatasetVersion datasetVersion, FileSearchCriteria searchCriteria) {
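
For context, a minimal sketch of what such a dedicated count query can look like in plain JPQL; the actual DatasetVersionFilesServiceBean implementation may differ in detail, and search-criteria filtering is omitted here:

import edu.harvard.iq.dataverse.DatasetVersion;
import jakarta.persistence.EntityManager;

// Hedged sketch: let the database do the counting instead of materializing entities.
public class CountQuerySketch {
    static long countFiles(EntityManager em, DatasetVersion version) {
        return em.createQuery(
                "SELECT COUNT(fm) FROM FileMetadata fm WHERE fm.datasetVersion = :version",
                Long.class)
            .setParameter("version", version)
            .getSingleResult();
    }
}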

I also reorganized the code a bit and did some general cleanup.

Performance monitoring

I tested the affected search API endpoint, requesting the first page ordered by date (descending) and forcing the heavy dataset to appear in the results. This is the same call that js-dataverse uses.

curl -H "XXXXXXXXXXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=root"

The performance improvement obtained after the change is presented below:

Before optimization

0.140740 + 8.130421 = 8.131392 seconds

After optimization

0.134426 + 0.508241 = 0.509249 seconds

Achieved speedup: ~16×.

Considerations

While solving this problem, I also tried to optimize the index search itself, to see whether performance could be improved on that side too. However, I did not achieve any noticeable improvement.

For example, I tested the search operation after configuring the dateSort field (used to sort the collection page results by date) to use docValues, a mechanism Solr recommends for efficient sorting and faceting. But as mentioned above, I found no significant improvement.
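
For reference, enabling docValues on a Solr sort field is a one-attribute change in schema.xml; the field name comes from this PR, while the type and other attributes below are assumptions, not the actual Dataverse schema:

<!-- Hedged sketch: only docValues="true" is the point here. -->
<field name="dateSort" type="pdate" indexed="true" stored="false" docValues="true"/>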

Issue 2) Slow Files Tab / API endpoints using PermissionServiceBean

GetDataFileCommand is widely used in the API to obtain a file. This command handles permission checks to verify that the calling user has permission to access the requested file.

The permission-checking logic is located in the PermissionServiceBean class.

We discovered possible performance bottlenecks in this class, especially when dealing with files belonging to large datasets. In particular, the isPublicallyDownloadable method called getFileMetadatas and then iterated over the files in a for loop, causing significant performance degradation.

I developed a new native query to replace this behavior. The query checks whether a datafile is present in a specific dataset version; in this particular scenario, whether the datafile is present in the released dataset version.
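
A minimal sketch of such an existence check as a native query; the table and column names follow the Dataverse schema (filemetadata links a datafile to a datasetversion), but the query actually added in this PR may be written differently:

import jakarta.persistence.EntityManager;

// Hedged sketch: check membership with a single indexed lookup instead of
// loading and iterating over every file in the version.
public class FileInVersionSketch {
    static boolean isFileInVersion(EntityManager em, long dataFileId, long versionId) {
        Number matches = (Number) em.createNativeQuery(
                "SELECT COUNT(*) FROM filemetadata "
                + "WHERE datafile_id = ?1 AND datasetversion_id = ?2")
            .setParameter(1, dataFileId)
            .setParameter(2, versionId)
            .getSingleResult();
        return matches.longValue() > 0;
    }
}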

Performance monitoring

To test GetDataFileCommand, I used the getFileData endpoint for an affected datafile.

curl -H "X-Datavese-Key:XXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/files/16588"

The performance improvement obtained after the change is presented below:

Before optimization

0.194373 + 8.678860 = 8.679027 seconds

After optimization

0.139459 + 0.443956 = 0.444019 seconds

Achieved speedup: ~19×.

Conclusions

Given the nature of the issues found, we can affirm that use of the getFileMetadatas method should be avoided, or at least meticulously controlled, to ensure that it does not create performance bottlenecks in the code.

In every case where we found this problem, it was possible to replace the call to this method with a custom database query. We should keep in mind that a custom query designed for the particular use case will always perform far better than this method plus the associated post-filtering code.

Which issue(s) this PR closes

Closes: Analyze and find a solution for SPA performance issues on heavy datasets

Is there a release notes update needed for this change?

Yes, attached.

@coveralls commented Mar 19, 2024

Coverage Status

coverage: 20.656% (-0.003%) from 20.659% when pulling aa13c91 on solr-date-sort-optimization into 30666f9 on develop.

@GPortas changed the title from "[PoC PR] Solr date sort optimization" to "Resolve performance issues with large datasets on the SPA" Mar 19, 2024

@GPortas force-pushed the solr-date-sort-optimization branch from 371a5f0 to 8f01fc7 on March 20, 2024 13:38

@GPortas added labels Mar 22, 2024: Size: 30 (a percentage of a sprint; 21 hours; formerly size:33), pm.GREI-d-2.7.1 (NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows), pm.GREI-d-2.7.2 (NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows), SPA (these changes are required for the Dataverse SPA)
@GPortas GPortas self-assigned this Mar 22, 2024
@GPortas GPortas marked this pull request as ready for review March 25, 2024 10:31
@GPortas GPortas removed their assignment Mar 25, 2024

@GPortas changed the title from "Resolve performance issues with large datasets on the SPA" to "Resolve performance issues with large datasets" Mar 25, 2024
@qqmyers (Member) left a comment:

Looks good.

@GPortas added the Size: 10 label (a percentage of a sprint; 7 hours) and removed the Size: 30 label Mar 27, 2024
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this pull request Apr 3, 2024
@landreev landreev self-assigned this Apr 10, 2024
@landreev (Contributor) commented Apr 23, 2024

This is really awesome.
For the record, testing on the IQSS prod. db clone, the results are even more spectacular than what's described in the opening comment.

For example, with the dataset doi:10.7910/DVN/3CTMKP (25K files in 2 versions), the "before" vs. "after" on test 1) (collection search) is 80+ seconds vs. 1 second; on test 2) (the /files API on a DataFile from the dataset above), it's 80+ seconds vs. a fraction of a second.

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:solr-date-sort-optimization
ghcr.io/gdcc/configbaker:solr-date-sort-optimization

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@landreev (Contributor) commented Apr 23, 2024

For future reference, some examples of the command lines used on the perf. cluster for benchmarking:

  1. The search API:
    (after updating the releasetime stamp on doi:10.7910/DVN/3CTMKP and reindexing it, making sure it appears on the first 10-item page of the output)
curl -H "X-Dataverse-key:xxxxx" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "http://localhost:8080/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=harvard"
{"status":"OK","data":{"q":"*","total_count":129376,"start":0,"spelling_alternatives":{},"items":[{"name":"30 m Resolution Global Annual Burned Area Product","type":"dataset","url":"https://doi.org/10.7910/DVN/3CTMKP","global_id":"doi:10.7910/DVN/3CTMKP", 
... truncated ... 
"count_in_response":10}}

0.000530 + 0.866643 = 0.869349
  2. The /files API (random file picked out of the 25K files in the dataset above):
curl -H "X-Dataverse-key:xxxxx" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "http://localhost:8080/api/v1/files/4577909"
{"status":"OK","data":{"label":"S15W175_burn_class.tif","restricted":false,"directoryLabel":"Burned area_2020","version":1,"datasetVersionId":238908,"dataFile":{"id":4577909,"persistentId":"","filename":"S15W175_burn_class.tif","contentType":"image/tiff","friendlyType":"TIFF Image","filesize":3245850,"storageIdentifier":"s3://dvn-cloud:178f85a87f1-fa4c745ca807","rootDataFileId":-1,"md5":"7763bc66a09f9e625b8845a81e7ddc6d","checksum":{"type":"MD5","value":"7763bc66a09f9e625b8845a81e7ddc6d"},"tabularData":false,"creationDate":"2021-04-22","publicationDate":"2021-05-09","fileAccessRequest":false}}}

0.000542 + 0.136391 = 0.136420

And, just for the record, this one is not even our largest real-life prod. dataset by the number of files.

@landreev (Contributor) commented:

All tests passed after the branch was synced up with develop. Merging.
Thank you again @GPortas.

@landreev landreev merged commit 9bda7dd into develop Apr 23, 2024
19 checks passed
@landreev landreev deleted the solr-date-sort-optimization branch April 23, 2024 15:41
@landreev landreev removed their assignment Apr 23, 2024
@pdurbin pdurbin added this to the 6.3 milestone Apr 23, 2024
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this pull request May 15, 2024