
Optimization of QLever File Configuration for High-Speed Index Creation #37

Open
arcangelo7 opened this issue May 21, 2024 · 2 comments

@arcangelo7

Hello QLever Team,

I am currently working on configuring QLever for a project that involves indexing a substantial dataset with nearly 5 billion triples. The objective is to maximize the speed of the indexing process on a high-performance server with ~1 TB RAM.

Given the high RAM capacity, I am seeking advice on the combination of parameters that would most effectively speed up index creation. Specifically, I would like guidance on the following (a configuration sketch follows the list):

  • Number of Triples per Batch: What would be the ideal setting considering the server's high RAM capacity?
  • STXXL Memory: How can we best utilize the available 1 TB of RAM?
  • Any Additional Parameters: Are there other settings or parameters that can be adjusted to further enhance the indexing speed?
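For reference, here is a minimal sketch of the [index] section of a Qleverfile where these parameters live. This is an illustration, not our actual configuration: the input pattern and values are placeholders, and the key names are those used by qlever-control at the time of writing.

```ini
[index]
# Placeholder input pattern; CAT_INPUT_FILES streams the files to the parser.
INPUT_FILES     = *.nt.gz
CAT_INPUT_FILES = zcat ${INPUT_FILES}
# Parser batch size; larger batches trade more RAM for fewer batch boundaries.
SETTINGS_JSON   = { "num-triples-per-batch": 1000000 }
# Memory budget for STXXL's external sorting during the index build.
STXXL_MEMORY    = 10G
```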

The speed of the indexing process is a crucial factor for us. Any insights or recommendations on how to best leverage our server's capabilities would be greatly appreciated.

Thank you in advance for your support!

@hannahbast (Member)

@arcangelo7 Interesting question. As a baseline, can you run the index build with "num-triples-per-batch": 10000000 and STXXL_MEMORY = 10G and send us the output of qlever index-stats and maybe also attach the full index-log.txt file?
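In Qleverfile terms (assuming the standard [index] keys from qlever-control), that baseline would look roughly like this:

```ini
SETTINGS_JSON = { "num-triples-per-batch": 10000000 }
STXXL_MEMORY  = 10G
```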

I assume you are using fast NVMe SSDs, correct?

@arcangelo7 (Author)

Yes, we are using NVMe SSDs. I initially set "num-triples-per-batch" to 1,000,000 but got exit code 129, which indicates a SIGHUP, even though I launched qlever index with nohup. I am not yet sure whether the failure was caused by the batch size or by some other parameter; I have not fully investigated it.
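For completeness, here is one way to make the build more robust against terminal hangups; a sketch assuming a bash-like shell, with a placeholder name for the output file. nohup alone only ignores SIGHUP, but redirecting stdin and detaching the job with disown also protects the process if the shell exits or the SSH session drops:

```bash
# Run the index build in the background, immune to SIGHUP:
# - nohup ignores the hangup signal,
# - stdin/stdout/stderr are detached from the terminal,
# - disown removes the job from the shell's job table.
nohup qlever index > qlever-index.out 2>&1 < /dev/null &
disown
```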

However, setting "num-triples-per-batch" to 100,000 resolved the issue. I also set STXXL_MEMORY to 128G. Here are the results of index-stats:

Command: index-stats

Breakdown of the time used for building the index, based on the timestamps for key lines in "oc_meta.index-log.txt"

Parse input           :   31.0 min
Build vocabularies    :   15.6 min
Convert to global IDs :    2.9 min
Permutation SPO & SOP :   10.1 min
Permutation OSP & OPS :   20.9 min
Permutation PSO & POS :   19.7 min
Text index            :   93.2 min

TOTAL time            :  193.4 min

Breakdown of the space used for building the index

Files index.*         :   59.8 GB
Files vocabulary.*    :   23.1 GB
Files text.*          :   25.7 GB

TOTAL size            :  108.6 GB

The content of the index-log.txt file is attached.

oc_meta.index-log.txt

I am genuinely very pleased to be able to rebuild the entire index, including the text index, in a little over three hours. If I achieve better results, I will report them in this issue.
