
Optimization of QLever File Configuration for High-Speed Index Creation #37

Open
arcangelo7 opened this issue May 21, 2024 · 2 comments

@arcangelo7

Hello QLever Team,

I am currently working on configuring QLever for a project that involves indexing a substantial dataset with nearly 5 billion triples. The objective is to maximize the speed of the indexing process on a high-performance server with ~1 TB RAM.

Given the high RAM capacity, I am seeking advice on the combination of parameters that would most effectively speed up index creation. Specifically, I would like guidance on the following (a configuration sketch follows the list):

  • Number of Triples per Batch: What would be the ideal setting considering the server's high RAM capacity?
  • STXXL Memory: How can we best utilize the available 1 TB of RAM?
  • Any Additional Parameters: Are there other settings or parameters that can be adjusted to further enhance the indexing speed?
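For reference, here is a minimal sketch of the [index] section of a Qleverfile where these parameters live. This is an illustration, not our actual configuration: the input pattern and values are placeholders, and the key names are those used by qlever-control at the time of writing.

```ini
[index]
# Placeholder input pattern; CAT_INPUT_FILES streams the files to the parser.
INPUT_FILES     = *.nt.gz
CAT_INPUT_FILES = zcat ${INPUT_FILES}
# Parser batch size; larger batches trade more RAM for fewer batch boundaries.
SETTINGS_JSON   = { "num-triples-per-batch": 1000000 }
# Memory budget for STXXL's external sorting during the index build.
STXXL_MEMORY    = 10G
```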

The speed of the indexing process is a crucial factor for us. Any insights or recommendations on how to best leverage our server's capabilities would be greatly appreciated.

Thank you in advance for your support!

@hannahbast (Member)

@arcangelo7 Interesting question. As a baseline, can you run the index build with "num-triples-per-batch": 10000000 and STXXL_MEMORY = 10G and send us the output of qlever index-stats and maybe also attach the full index-log.txt file?
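In Qleverfile terms (assuming the standard [index] keys from qlever-control), that baseline would look roughly like this:

```ini
SETTINGS_JSON = { "num-triples-per-batch": 10000000 }
STXXL_MEMORY  = 10G
```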

I assume you are using fast NVMe SSDs, correct?

@arcangelo7 (Author)

Yes, we are using NVMe SSDs. I initially set "num-triples-per-batch" to 1,000,000 but got exit code 129, which indicates a SIGHUP, even though I launched qlever index with nohup. I am not yet sure whether the failure was caused by the batch size or by some other parameter; I have not fully investigated it.
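For completeness, here is one way to make the build more robust against terminal hangups; a sketch assuming a bash-like shell, with a placeholder name for the output file. nohup alone only ignores SIGHUP, but redirecting stdin and detaching the job with disown also protects the process if the shell exits or the SSH session drops:

```bash
# Run the index build in the background, immune to SIGHUP:
# - nohup ignores the hangup signal,
# - stdin/stdout/stderr are detached from the terminal,
# - disown removes the job from the shell's job table.
nohup qlever index > qlever-index.out 2>&1 < /dev/null &
disown
```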

However, setting "num-triples-per-batch" to 100,000 resolved the issue. I also set STXXL_MEMORY to 128G. Here are the results of index-stats:

Command: index-stats

Breakdown of the time used for building the index, based on the timestamps for key lines in "oc_meta.index-log.txt"

Parse input           :   31.0 min
Build vocabularies    :   15.6 min
Convert to global IDs :    2.9 min
Permutation SPO & SOP :   10.1 min
Permutation OSP & OPS :   20.9 min
Permutation PSO & POS :   19.7 min
Text index            :   93.2 min

TOTAL time            :  193.4 min

Breakdown of the space used for building the index

Files index.*         :   59.8 GB
Files vocabulary.*    :   23.1 GB
Files text.*          :   25.7 GB

TOTAL size            :  108.6 GB

The content of the index-log.txt file is attached.

oc_meta.index-log.txt

I am genuinely very pleased to be able to rebuild the entire index, including the text index, in a little over three hours. If I achieve better results, I will report them in this issue.
