
Optimize hash table insertion #2107

Merged · 5 commits · Oct 4, 2023

Conversation

olivier-matz-6wind
Contributor

Hello,

This is a draft pull request that solves issue #2106.

The idea is to allow inserting elements into the hash table in O(1), even in the case of collisions. First, the hash table is reworked to avoid browsing records to find a free one. Then, a new API is added to allow an insertion without checking for duplicates.

Feedback is of course welcome!

Note: the first patch breaks the validation test, but it is fixed by the second. From what I understand, the browsing order is expected to be the same as the insertion order, but I think this was not always guaranteed by the previous implementation.
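To illustrate the problem being solved, here is a toy sketch (not libyang code) of the previous scheme, in which a collision makes insertion probe the following records one by one until a free slot is found, which is O(size) in the worst case:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SIZE 8

struct toy_rec {
    uint32_t hash;
    int used;
    int val;
};

/* Toy model of linear probing: scan successive records until a free
 * one is found.  Returns the number of probes needed, or -1 if full. */
static int toy_probe_insert(struct toy_rec *recs, uint32_t hash, int val)
{
    uint32_t i, idx;

    for (i = 0; i < TOY_SIZE; i++) {
        idx = (hash + i) % TOY_SIZE;
        if (!recs[idx].used) {
            recs[idx].hash = hash;
            recs[idx].val = val;
            recs[idx].used = 1;
            return (int)i;
        }
    }
    return -1;
}
```

With many identical hashes, each new insert probes one record more than the previous one, which is the quadratic behavior the pull request removes.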

@olivier-matz-6wind
Contributor Author

I noticed the test failures, I'll fix them if we move forward with this draft.

Member

@michalvasko michalvasko left a comment


Thanks for the effort, in general it seems fine but I have made some comments. I am also not completely sure I understand what exact inefficiencies you solved so please correct me if I have it wrong.

  1. Skipping the check that a value is already in the hash table for all data manipulation where it is completely redundant (interestingly, it does seem redundant and I could not think of a use-case where skipping it could lead to problems).
  2. Avoid linear search for an empty record in case the hash-based record is taken. Implemented using a "bucket" for each record that holds the information about the first/last filled record for the hash.

It would be great if something like this (if I am right) were mentioned in the hash table description. Also, from what I remember, buckets were used as overflow space for collisions, to avoid filling additional hash table records in some hash table collision algorithm, so I am not sure the best name was chosen for these structures. Lastly, I believe the size of the hash tables has increased by size * 8 B, which is not ideal but should probably be fine.

Review comments were left on src/hash_table.h, src/hash_table_internal.h, and src/hash_table.c (all resolved).
@olivier-matz-6wind
Contributor Author

Thanks @michalvasko for the comments.

Thanks for the effort, in general it seems fine but I have made some comments. I am also not completely sure I understand what exact inefficiencies you solved so please correct me if I have it wrong.

1. Skipping the check that a value is already in the hash table for all data manipulation where it is completely redundant (interestingly, it does seem redundant and I could not think of a use-case where skipping it could lead to problems).

Yes, this is achieved by the new API lyht_insert_no_check().

2. Avoid linear search for an empty record in case the hash-based record is taken. Implemented using a "bucket" for each record that holds the information about the first/last filled record for the hash.

Yes.

Would be great if something like this (if I am right) is mentioned in the hash table description.

OK, I'll update the hash table description with more details.

Also, from what I remember, buckets were used as an overflow space for collisions to not fill additional hash table records in some hash table collision algorithm so I am not sure the best name was chosen for these structures.

I can rename it as "hlist" if you feel it's more relevant.

Lastly, I believe the size of hash tables has increased by size * 8 B, which is not ideal but should probably be fine.

Yes. It is +(size * 4) bytes after the first commit and +(size * 8) bytes after the second commit.
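The quoted numbers can be checked against a hypothetical layout (this is only an illustration, not the actual libyang structures): one 32-bit list-head index per hash slot accounts for +4 B per slot after the first commit, and a first/last pair for +8 B per slot after the second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-slot list heads, sized to match the quoted
 * overhead: 4 bytes per slot, then 8 bytes per slot. */
struct head_v1 {
    uint32_t first;               /* first record for this hash */
};

struct head_v2 {
    uint32_t first;               /* first record for this hash */
    uint32_t last;                /* last record for this hash */
};
```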

While there, do you have an idea why the second commit would be needed to fix the validation unit test?

@michalvasko
Member

I can rename it as "hlist" if you feel it's more relevant.

Sure, just to avoid people assuming they are buckets from the other algorithm.

While there, do you have an idea why the second commit would be needed to fix the validation unit test?

Sorry, I am working on something else and would prefer not to leave it in the middle of things unless something else that is critical appears. Just do what you can and incorporate the feedback I give you, I will then take a look later if you will not be able to solve some issues.

@olivier-matz-6wind
Contributor Author

changelog between v1 and v2:

  • rename bucket as hlist
  • revert to val[1] instead of val[] in record structure
  • revert to packed record structure
  • better document ly_ht structure (how the hash table works)
  • move lyht_insert_no_check() declaration in the proper patch
  • fix API documentation of lyht_insert_no_check()
  • do not use __builtin_clz(), which is not available on Windows
  • call ly_ht_insert_no_check() on resize
  • use lyht_get_fixed_size(u) in patch "optimize creation of children htable"
  • fix minor format and cosmetic issues

@olivier-matz-6wind olivier-matz-6wind marked this pull request as ready for review September 28, 2023 12:12
@olivier-matz-6wind
Contributor Author

call ly_ht_insert_no_check() on resize

Thinking a bit more about it, I will remove this change, and revert to what I had in v1.

There is certainly something I don't understand about the need to specify a val_eq_cb when resizing. My naive idea was that we don't need to check anything at resize, since the elements were already there. But there is an explicit API to set the callback at resize, so my idea is probably wrong...

@olivier-matz-6wind
Contributor Author

changelog between v2 and v3:

  • call ly_ht_insert_no_check() on resize only if the resize is triggered by a ly_ht_insert_no_check(), as it was done in v1

The current implementation manages collision by browsing the next
records to find an unused one.

This has the following consequences:
- when there are a lot of collisions, the insertion can take a lot
  of time before finding an empty entry.
- the table may need to be rehashed to get rid of invalid records
- the code that handles the collisions is not trivial

This commit reworks the hash table to use a per-hash list of records.

It prepares the work for inserting into the hash table in O(1) even
when there are hash collisions. This commit is not sufficient for that
yet, since we still always check for duplicates at insertion. See the
introduction of ly_ht_insert_no_check() in a later commit.

Note: this change breaks the validation unit test. It is fixed by the
next commit.

Signed-off-by: Olivier Matz <[email protected]>
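The per-hash record list described in this commit message can be sketched in plain C (illustrative names only, not the actual libyang implementation): each hash slot keeps the head of a singly linked list of record indices, and free records sit on their own list, so an insert is a constant-time pop from the free list plus a push onto the slot's list.

```c
#include <assert.h>
#include <stdint.h>

#define HT_SIZE 8
#define HT_NONE UINT32_MAX

/* A record links to the next record hashed to the same slot. */
struct sk_rec {
    uint32_t next;                /* next record index, or HT_NONE */
    int val;
};

struct sk_ht {
    uint32_t heads[HT_SIZE];      /* first record for each hash slot */
    uint32_t free_head;           /* first free record */
    struct sk_rec recs[HT_SIZE];
};

static void sk_init(struct sk_ht *ht)
{
    uint32_t i;

    /* All slots empty; chain every record onto the free list. */
    for (i = 0; i < HT_SIZE; i++) {
        ht->heads[i] = HT_NONE;
        ht->recs[i].next = (i + 1 < HT_SIZE) ? i + 1 : HT_NONE;
    }
    ht->free_head = 0;
}

/* O(1) insert: pop a free record, push it on the slot's list head. */
static int sk_insert(struct sk_ht *ht, uint32_t hash, int val)
{
    uint32_t slot = hash % HT_SIZE;
    uint32_t rec = ht->free_head;

    if (rec == HT_NONE)
        return -1;                /* table full */
    ht->free_head = ht->recs[rec].next;
    ht->recs[rec].val = val;
    ht->recs[rec].next = ht->heads[slot];
    ht->heads[slot] = rec;
    return 0;
}
```

Note that pushing at the head yields last-inserted-first browsing order, which is exactly the behavior the next commit changes.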
Change the type of hlist head: instead of only referencing the first
record, reference both first and last records. Therefore we can add
new elements at the tail of the list.

This impacts how the records of a hlist will be browsed in case of
collisions:
- before this commit: last inserted is browsed first
- after this commit: first inserted is browsed first

It solves the validation unit test that was broken by the previous
commit.

Signed-off-by: Olivier Matz <[email protected]>
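A minimal sketch of the first/last head change (illustrative names, assuming records are linked through a `next` index as in the previous commit): keeping a tail reference makes appending O(1) while preserving insertion order when browsing.

```c
#include <assert.h>
#include <stdint.h>

#define HT_NONE UINT32_MAX

/* Head now references both ends of the per-hash list. */
struct fifo_head {
    uint32_t first;
    uint32_t last;
};

struct fifo_rec {
    uint32_t next;
    int val;
};

/* Append record 'rec' at the tail so iteration from 'first' visits
 * records in insertion order. */
static void fifo_append(struct fifo_head *h, struct fifo_rec *recs,
                        uint32_t rec, int val)
{
    recs[rec].val = val;
    recs[rec].next = HT_NONE;
    if (h->first == HT_NONE)
        h->first = rec;           /* empty list: rec is both ends */
    else
        recs[h->last].next = rec; /* link after the current tail */
    h->last = rec;
}
```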
When lyht_insert() is called, it first searches for an existing object
that matches the previously specified val_equal() callback. In this
situation, the callback is invoked with mod=1.

In case there are a lot of collisions, this check can take some time.

Introduce a new API that bypasses this lookup operation; it will be
used in a later commit to optimize keyless list element insertions.

Signed-off-by: Olivier Matz <[email protected]>
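The difference between the two insertion paths can be sketched as follows (hypothetical names and signatures, not the libyang API): a checked insert must first walk the slot's collision list calling the equality callback on every record, while the no-check variant links the record directly. The caller of the no-check variant guarantees the value is not already present.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*val_equal_cb)(const void *a, const void *b);

struct nc_rec {
    struct nc_rec *next;
    const void *val;
};

/* Checked insert: O(collisions) duplicate scan before linking. */
static int insert_checked(struct nc_rec **slot, struct nc_rec *rec,
                          const void *val, val_equal_cb eq)
{
    struct nc_rec *it;

    for (it = *slot; it != NULL; it = it->next) {
        if (eq(it->val, val))
            return -1;            /* duplicate found */
    }
    rec->val = val;
    rec->next = *slot;
    *slot = rec;
    return 0;
}

/* No-check insert: always O(1), no equality callback needed. */
static void insert_no_check(struct nc_rec **slot, struct nc_rec *rec,
                            const void *val)
{
    rec->val = val;
    rec->next = *slot;
    *slot = rec;
}

/* Trivial equality callback for the example: compare pointers. */
static int ptr_eq(const void *a, const void *b)
{
    return a == b;
}
```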
When inserting a node into the children hash table, we check that
it was not already added, using a pointer comparison.

This check can take a lot of time if there are a lot of collisions
in the hash table, which is the case for keyless lists.

Use the new API lyht_insert_no_check() to optimize the insertion.

Signed-off-by: Olivier Matz <[email protected]>
Here, we know the number of children that will be added in the hash
table, so create the hash table with the correct number of elements to
avoid automatic resizes.

Signed-off-by: Olivier Matz <[email protected]>
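The changelog mentions a lyht_get_fixed_size() helper; its actual semantics are not shown here, so the sizing below is only a hypothetical stand-in illustrating the idea: when the number of children is known up front, pick a capacity large enough (here, the next power of two at or above the element count, an assumption) so the table never resizes while being filled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizing helper: smallest power of two >= items, so a
 * table created with this capacity can absorb all known children
 * without triggering an automatic resize. */
static uint32_t sketch_fixed_size(uint32_t items)
{
    uint32_t size = 1;

    while (size < items)
        size <<= 1;
    return size;
}
```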
@olivier-matz-6wind
Contributor Author

changelog between v3 and v4:

  • fix format issues (sorry, I didn't realize I had to install uncrustify 0.77 to pass the test)

About the ABI compatibility check failure, the problem identified by abi-dumper is in lyd_node_inner, which includes a pointer to the children hash table:

Medium Severity
Base type of field children_ht has been changed from struct hash_table to struct ly_ht of different format.
This field may be incorrectly initialized or accessed by applications.

My thinking is that this field is never accessed by an application, only by the library (moreover the hash table type was not public until quite recently), so I think it should not be an issue. Please let me know if that's not the case.

@michalvasko michalvasko merged commit c0feb68 into CESNET:devel Oct 4, 2023
11 of 12 checks passed
@michalvasko
Member

Thanks, the ABI problem should be fine.

@olivier-matz-6wind
Contributor Author

Thank you!
