
Resolve performance issues with large datasets #10383

Merged — 22 commits merged into develop on Apr 23, 2024

Conversation

@GPortas (Contributor) commented Mar 18, 2024

What this PR does / why we need it

Resolves performance issues for API calls involving heavy datasets and their files.

The detected issues stem mainly from the use of the getFileMetadatas method of the DatasetVersion class in different areas of the application.

This method queries the database to return all files present in a version. For small datasets this is not an expensive operation, but it creates a performance bottleneck for big datasets, such as the heavy one on beta (10,000 files).
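
For illustration, a minimal sketch of the expensive pattern (the class name and scaffolding below are hypothetical; getFileMetadatas is the real accessor):

import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.FileMetadata;
import java.util.List;

// Hedged sketch: counting files by materializing every FileMetadata entity.
// For a version with 10,000 files, this pulls 10,000 rows just to get a number.
public class SlowCountSketch {
    static long countFilesTheSlowWay(DatasetVersion version) {
        List<FileMetadata> all = version.getFileMetadatas(); // one entity per file
        return all.size();
    }
}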

Issue 1) Slow collections page / search API endpoint

Although the search API endpoint uses Solr to find results quickly, there was a performance bottleneck when composing the JSON object returned by the API if one of the returned elements was a heavy dataset.

In particular, the JSON converter method of the SolrSearchResult class was calling the getFileMetadatas method to obtain the total number of files. See:
https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L574

I replaced this expensive call with a custom query method that was already present in the code (DatasetVersionFilesServiceBean):

public long getFileMetadataCount(DatasetVersion datasetVersion, FileSearchCriteria searchCriteria) {
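
For context, a minimal sketch of what such a dedicated count query can look like in plain JPQL; the actual DatasetVersionFilesServiceBean implementation may differ in detail, and search-criteria filtering is omitted here:

import edu.harvard.iq.dataverse.DatasetVersion;
import jakarta.persistence.EntityManager;

// Hedged sketch: let the database do the counting instead of materializing entities.
public class CountQuerySketch {
    static long countFiles(EntityManager em, DatasetVersion version) {
        return em.createQuery(
                "SELECT COUNT(fm) FROM FileMetadata fm WHERE fm.datasetVersion = :version",
                Long.class)
            .setParameter("version", version)
            .getSingleResult();
    }
}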

I also reorganized the code a bit and did some general cleanup.

Performance monitoring

I tested the affected search API endpoint, requesting the first page ordered by date (descending) and forcing the heavy dataset to appear in the results. This is the same call that js-dataverse uses.

curl -H "XXXXXXXXXXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=root"

The performance improvement obtained after the change is presented below:

Before optimization

0.140740 + 8.130421 = 8.131392 seconds

After optimization

0.134426 + 0.508241 = 0.509249 seconds

Achieved speedup: ~16×.

Considerations

While solving this problem, I also tried to optimize the index search itself, to see whether performance could be improved on that side too. However, I did not achieve any noticeable improvement.

For example, I tested the search operation after configuring the dateSort field (used to sort the collection page results by date) to use docValues, a mechanism Solr recommends for efficient sorting and faceting. But as mentioned above, I found no significant improvement.
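
For reference, enabling docValues on a Solr sort field is a one-attribute change in schema.xml; the field name comes from this PR, while the type and other attributes below are assumptions, not the actual Dataverse schema:

<!-- Hedged sketch: only docValues="true" is the point here. -->
<field name="dateSort" type="pdate" indexed="true" stored="false" docValues="true"/>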

Issue 2) Slow Files Tab / API endpoints using PermissionServiceBean

GetDataFileCommand is widely used in the API to obtain a file. This command handles permission checks to verify that the calling user has permission to access the requested file.

The permission-checking logic is located in the PermissionServiceBean class.

We discovered possible performance bottlenecks in this class, especially when dealing with files belonging to large datasets. In particular, the isPublicallyDownloadable method called getFileMetadatas and then iterated over the files in a for loop, causing significant performance degradation.

I developed a new native query to replace this behavior. The query checks whether a datafile is present in a specific dataset version; in this particular scenario, whether the datafile is present in the released dataset version.
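
A minimal sketch of such an existence check as a native query; the table and column names follow the Dataverse schema (filemetadata links a datafile to a datasetversion), but the query actually added in this PR may be written differently:

import jakarta.persistence.EntityManager;

// Hedged sketch: check membership with a single indexed lookup instead of
// loading and iterating over every file in the version.
public class FileInVersionSketch {
    static boolean isFileInVersion(EntityManager em, long dataFileId, long versionId) {
        Number matches = (Number) em.createNativeQuery(
                "SELECT COUNT(*) FROM filemetadata "
                + "WHERE datafile_id = ?1 AND datasetversion_id = ?2")
            .setParameter(1, dataFileId)
            .setParameter(2, versionId)
            .getSingleResult();
        return matches.longValue() > 0;
    }
}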

Performance monitoring

To test GetDataFileCommand, I used the getFileData endpoint for an affected datafile.

curl -H "X-Datavese-Key:XXXXXX" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "https://beta.dataverse.org/api/v1/files/16588"

The performance improvement obtained after the change is presented below:

Before optimization

0.194373 + 8.678860 = 8.679027 seconds

After optimization

0.139459 + 0.443956 = 0.444019 seconds

Achieved speedup: ~19×.

Conclusions

Given the nature of the issues found, we can affirm that use of the getFileMetadatas method should be avoided, or at least meticulously controlled, to ensure that it does not create performance bottlenecks in the code.

In every case where we found this problem, it was possible to replace the call to this method with a custom database query. We should keep in mind that a custom query designed for the particular use case will always perform far better than this method plus the associated post-filtering code.

Which issue(s) this PR closes

Closes: Analyze and find a solution for SPA performance issues on heavy datasets

Is there a release notes update needed for this change?

Yes, attached.

@coveralls commented Mar 19, 2024

Coverage Status

coverage: 20.656% (-0.003%) from 20.659% when pulling aa13c91 on solr-date-sort-optimization into 30666f9 on develop.

@GPortas changed the title from "[PoC PR] Solr date sort optimization" to "Resolve performance issues with large datasets on the SPA" Mar 19, 2024

@GPortas force-pushed the solr-date-sort-optimization branch from 371a5f0 to 8f01fc7 on March 20, 2024 13:38

@GPortas added labels Mar 22, 2024: Size: 30 (a percentage of a sprint; 21 hours; formerly size:33), pm.GREI-d-2.7.1 (NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows), pm.GREI-d-2.7.2 (NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows), SPA (these changes are required for the Dataverse SPA)
@GPortas GPortas self-assigned this Mar 22, 2024
@GPortas GPortas marked this pull request as ready for review March 25, 2024 10:31
@GPortas GPortas removed their assignment Mar 25, 2024

@GPortas changed the title from "Resolve performance issues with large datasets on the SPA" to "Resolve performance issues with large datasets" Mar 25, 2024
@qqmyers (Member) left a comment:

Looks good.

@GPortas added the Size: 10 label (a percentage of a sprint; 7 hours) and removed the Size: 30 label Mar 27, 2024
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this pull request Apr 3, 2024
@landreev landreev self-assigned this Apr 10, 2024
@landreev (Contributor) commented Apr 23, 2024

This is really awesome.
For the record, testing on the IQSS prod. db clone, the results are even more spectacular than what's described in the opening comment.

For example, with the dataset doi:10.7910/DVN/3CTMKP (25K files in 2 versions), the "before" vs. "after" on test 1) (collection search) is 80+ seconds vs. 1 second; on test 2) (the /files API on a DataFile from the dataset above), it's 80+ seconds vs. a fraction of a second.

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:solr-date-sort-optimization
ghcr.io/gdcc/configbaker:solr-date-sort-optimization

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@landreev (Contributor) commented Apr 23, 2024

For future reference, some examples of the command lines used on the perf. cluster for benchmarking:

  1. The search API:
    (after updating the releasetime stamp on doi:10.7910/DVN/3CTMKP and reindexing it, making sure it appears on the first 10-item page of the output)
curl -H "X-Dataverse-key:xxxxx" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "http://localhost:8080/api/v1/search?q=*&type=dataset&sort=date&order=desc&per_page=10&start=0&subtree=harvard"
{"status":"OK","data":{"q":"*","total_count":129376,"start":0,"spelling_alternatives":{},"items":[{"name":"30 m Resolution Global Annual Burned Area Product","type":"dataset","url":"https://doi.org/10.7910/DVN/3CTMKP","global_id":"doi:10.7910/DVN/3CTMKP", 
... truncated ... 
"count_in_response":10}}

0.000530 + 0.866643 = 0.869349
  2. The /files API (random file picked out of the 25K files in the dataset above):
curl -H "X-Dataverse-key:xxxxx" -w "\n\n%{time_connect} + %{time_starttransfer} = %{time_total}\n" "http://localhost:8080/api/v1/files/4577909"
{"status":"OK","data":{"label":"S15W175_burn_class.tif","restricted":false,"directoryLabel":"Burned area_2020","version":1,"datasetVersionId":238908,"dataFile":{"id":4577909,"persistentId":"","filename":"S15W175_burn_class.tif","contentType":"image/tiff","friendlyType":"TIFF Image","filesize":3245850,"storageIdentifier":"s3://dvn-cloud:178f85a87f1-fa4c745ca807","rootDataFileId":-1,"md5":"7763bc66a09f9e625b8845a81e7ddc6d","checksum":{"type":"MD5","value":"7763bc66a09f9e625b8845a81e7ddc6d"},"tabularData":false,"creationDate":"2021-04-22","publicationDate":"2021-05-09","fileAccessRequest":false}}}

0.000542 + 0.136391 = 0.136420

And, just for the record, this one is not even our largest real-life prod. dataset by the number of files.

@landreev (Contributor) commented:

All tests passed after the branch was synced up with develop. Merging.
Thank you again @GPortas.

@landreev landreev merged commit 9bda7dd into develop Apr 23, 2024
19 checks passed
@landreev landreev deleted the solr-date-sort-optimization branch April 23, 2024 15:41
@landreev landreev removed their assignment Apr 23, 2024
@pdurbin pdurbin added this to the 6.3 milestone Apr 23, 2024
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this pull request May 15, 2024