
Optimize hash table insertion #2107

Merged · 5 commits · Oct 4, 2023

Conversation

olivier-matz-6wind
Contributor

Hello,

This is a draft pull request that solves issue #2106.

The idea is to allow inserting elements into the hash table in O(1), even in the case of collisions. First, the hash table is reworked to avoid browsing records to find a free one. Then, a new API is added to allow an insertion without checking for duplicates.

Feedback is of course welcome!

Note: the first patch breaks the validation test, but it is fixed by the second. From what I understand, the browsing order is expected to be the same as the insertion order, but I think this was not always guaranteed by the previous implementation.
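To illustrate the problem being solved, here is a toy sketch (not libyang code) of the previous scheme, in which a collision makes insertion probe the following records one by one until a free slot is found, which is O(size) in the worst case:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SIZE 8

struct toy_rec {
    uint32_t hash;
    int used;
    int val;
};

/* Toy model of linear probing: scan successive records until a free
 * one is found.  Returns the number of probes needed, or -1 if full. */
static int toy_probe_insert(struct toy_rec *recs, uint32_t hash, int val)
{
    uint32_t i, idx;

    for (i = 0; i < TOY_SIZE; i++) {
        idx = (hash + i) % TOY_SIZE;
        if (!recs[idx].used) {
            recs[idx].hash = hash;
            recs[idx].val = val;
            recs[idx].used = 1;
            return (int)i;
        }
    }
    return -1;
}
```

With many identical hashes, each new insert probes one record more than the previous one, which is the quadratic behavior the pull request removes.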

@olivier-matz-6wind
Contributor Author

I noticed the test failures, I'll fix them if we move forward with this draft.

Member

@michalvasko michalvasko left a comment


Thanks for the effort, in general it seems fine but I have made some comments. I am also not completely sure I understand what exact inefficiencies you solved so please correct me if I have it wrong.

  1. Skipping the check that a value is already in the hash table for all data manipulation where it is completely redundant (interestingly, it does seem redundant and I could not think of a use-case where skipping it could lead to problems).
  2. Avoid linear search for an empty record in case the hash-based record is taken. Implemented using a "bucket" for each record that holds the information about the first/last filled record for the hash.

It would be great if something like this (if I am right) were mentioned in the hash table description. Also, from what I remember, buckets were used as overflow space for collisions, to avoid filling additional hash table records in some hash table collision algorithm, so I am not sure the best name was chosen for these structures. Lastly, I believe the size of the hash tables has increased by size * 8 B, which is not ideal but should probably be fine.

Review comments were left on src/hash_table.h, src/hash_table_internal.h, and src/hash_table.c (all resolved).
@olivier-matz-6wind
Contributor Author

Thanks @michalvasko for the comments.

Thanks for the effort, in general it seems fine but I have made some comments. I am also not completely sure I understand what exact inefficiencies you solved so please correct me if I have it wrong.

1. Skipping the check that a value is already in the hash table for all data manipulation where it is completely redundant (interestingly, it does seem redundant and I could not think of a use-case where skipping it could lead to problems).

Yes, this is achieved by the new API lyht_insert_no_check().

2. Avoid linear search for an empty record in case the hash-based record is taken. Implemented using a "bucket" for each record that holds the information about the first/last filled record for the hash.

Yes.

Would be great if something like this (if I am right) is mentioned in the hash table description.

OK, I'll update the hash table description with more details.

Also, from what I remember, buckets were used as an overflow space for collisions to not fill additional hash table records in some hash table collision algorithm so I am not sure the best name was chosen for these structures.

I can rename it as "hlist" if you feel it's more relevant.

Lastly, I believe the size of hash tables has increased by size * 8 B, which is not ideal but should probably be fine.

Yes. It is +(size * 4) bytes after the first commit and +(size * 8) bytes after the second commit.
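The quoted numbers can be checked against a hypothetical layout (this is only an illustration, not the actual libyang structures): one 32-bit list-head index per hash slot accounts for +4 B per slot after the first commit, and a first/last pair for +8 B per slot after the second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-slot list heads, sized to match the quoted
 * overhead: 4 bytes per slot, then 8 bytes per slot. */
struct head_v1 {
    uint32_t first;               /* first record for this hash */
};

struct head_v2 {
    uint32_t first;               /* first record for this hash */
    uint32_t last;                /* last record for this hash */
};
```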

While there, do you have an idea why the second commit would be needed to fix the validation unit test?

@michalvasko
Member

I can rename it as "hlist" if you feel it's more relevant.

Sure, just to avoid people assuming they are buckets from the other algorithm.

While there, do you have an idea why the second commit would be needed to fix the validation unit test?

Sorry, I am working on something else and would prefer not to leave it in the middle of things unless something else that is critical appears. Just do what you can and incorporate the feedback I give you, I will then take a look later if you will not be able to solve some issues.

@olivier-matz-6wind
Contributor Author

changelog between v1 and v2:

  • rename bucket as hlist
  • revert to val[1] instead of val[] in record structure
  • revert to packed record structure
  • better document ly_ht structure (how the hash table works)
  • move lyht_insert_no_check() declaration in the proper patch
  • fix API documentation of lyht_insert_no_check()
  • do not use __builtin_clz(), which is not available on Windows
  • call ly_ht_insert_no_check() on resize
  • use lyht_get_fixed_size(u) in patch "optimize creation of children htable"
  • fix minor format and cosmetic issues

@olivier-matz-6wind olivier-matz-6wind marked this pull request as ready for review September 28, 2023 12:12
@olivier-matz-6wind
Contributor Author

call ly_ht_insert_no_check() on resize

Thinking a bit more about it, I will remove this change, and revert to what I had in v1.

There is certainly something I don't understand about the need to specify a val_eq_cb when resizing. My naive idea was that we don't need to check anything at resize, since the elements were already there. But there is an explicit API to set the callback at resize, so my idea is probably wrong...

@olivier-matz-6wind
Contributor Author

changelog between v2 and v3:

  • call ly_ht_insert_no_check() on resize only if the resize is triggered by a ly_ht_insert_no_check(), as it was done in v1

The current implementation manages collision by browsing the next
records to find an unused one.

This has the following consequences:
- when there are a lot of collisions, the insertion can take a lot
  of time before finding an empty entry.
- the table may need to be rehashed to get rid of invalid records
- the code that handles the collisions is not trivial

This commit reworks the hash table to use a per-hash list of records.

It prepares the work for inserting into the hash table in O(1) even
when there are hash collisions. This commit is not sufficient for that
yet, since we still always check for duplicates at insertion. See the
introduction of ly_ht_insert_no_check() in a later commit.

Note: this change breaks the validation unit test. It is fixed by the
next commit.

Signed-off-by: Olivier Matz <[email protected]>
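The per-hash record list described in this commit message can be sketched in plain C (illustrative names only, not the actual libyang implementation): each hash slot keeps the head of a singly linked list of record indices, and free records sit on their own list, so an insert is a constant-time pop from the free list plus a push onto the slot's list.

```c
#include <assert.h>
#include <stdint.h>

#define HT_SIZE 8
#define HT_NONE UINT32_MAX

/* A record links to the next record hashed to the same slot. */
struct sk_rec {
    uint32_t next;                /* next record index, or HT_NONE */
    int val;
};

struct sk_ht {
    uint32_t heads[HT_SIZE];      /* first record for each hash slot */
    uint32_t free_head;           /* first free record */
    struct sk_rec recs[HT_SIZE];
};

static void sk_init(struct sk_ht *ht)
{
    uint32_t i;

    /* All slots empty; chain every record onto the free list. */
    for (i = 0; i < HT_SIZE; i++) {
        ht->heads[i] = HT_NONE;
        ht->recs[i].next = (i + 1 < HT_SIZE) ? i + 1 : HT_NONE;
    }
    ht->free_head = 0;
}

/* O(1) insert: pop a free record, push it on the slot's list head. */
static int sk_insert(struct sk_ht *ht, uint32_t hash, int val)
{
    uint32_t slot = hash % HT_SIZE;
    uint32_t rec = ht->free_head;

    if (rec == HT_NONE)
        return -1;                /* table full */
    ht->free_head = ht->recs[rec].next;
    ht->recs[rec].val = val;
    ht->recs[rec].next = ht->heads[slot];
    ht->heads[slot] = rec;
    return 0;
}
```

Note that pushing at the head yields last-inserted-first browsing order, which is exactly the behavior the next commit changes.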
Change the type of hlist head: instead of only referencing the first
record, reference both first and last records. Therefore we can add
new elements at the tail of the list.

This impacts how the records of a hlist will be browsed in case of
collisions:
- before this commit: last inserted is browsed first
- after this commit: first inserted is browsed first

It solves the validation unit test that was broken by the previous
commit.

Signed-off-by: Olivier Matz <[email protected]>
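A minimal sketch of the first/last head change (illustrative names, assuming records are linked through a `next` index as in the previous commit): keeping a tail reference makes appending O(1) while preserving insertion order when browsing.

```c
#include <assert.h>
#include <stdint.h>

#define HT_NONE UINT32_MAX

/* Head now references both ends of the per-hash list. */
struct fifo_head {
    uint32_t first;
    uint32_t last;
};

struct fifo_rec {
    uint32_t next;
    int val;
};

/* Append record 'rec' at the tail so iteration from 'first' visits
 * records in insertion order. */
static void fifo_append(struct fifo_head *h, struct fifo_rec *recs,
                        uint32_t rec, int val)
{
    recs[rec].val = val;
    recs[rec].next = HT_NONE;
    if (h->first == HT_NONE)
        h->first = rec;           /* empty list: rec is both ends */
    else
        recs[h->last].next = rec; /* link after the current tail */
    h->last = rec;
}
```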
When lyht_insert() is called, it first searches for an existing object
that matches the previously specified val_equal() callback. In this
situation, the callback is invoked with mod=1.

In case there are a lot of collisions, this check can take some time.

Introduce a new API that bypasses this lookup operation; it will be
used in a later commit to optimize keyless list element insertions.

Signed-off-by: Olivier Matz <[email protected]>
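The difference between the two insertion paths can be sketched as follows (hypothetical names and signatures, not the libyang API): a checked insert must first walk the slot's collision list calling the equality callback on every record, while the no-check variant links the record directly. The caller of the no-check variant guarantees the value is not already present.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*val_equal_cb)(const void *a, const void *b);

struct nc_rec {
    struct nc_rec *next;
    const void *val;
};

/* Checked insert: O(collisions) duplicate scan before linking. */
static int insert_checked(struct nc_rec **slot, struct nc_rec *rec,
                          const void *val, val_equal_cb eq)
{
    struct nc_rec *it;

    for (it = *slot; it != NULL; it = it->next) {
        if (eq(it->val, val))
            return -1;            /* duplicate found */
    }
    rec->val = val;
    rec->next = *slot;
    *slot = rec;
    return 0;
}

/* No-check insert: always O(1), no equality callback needed. */
static void insert_no_check(struct nc_rec **slot, struct nc_rec *rec,
                            const void *val)
{
    rec->val = val;
    rec->next = *slot;
    *slot = rec;
}

/* Trivial equality callback for the example: compare pointers. */
static int ptr_eq(const void *a, const void *b)
{
    return a == b;
}
```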
When inserting a node into the children hash table, we check that
it was not already added, using a pointer comparison.

This check can take a lot of time if there are a lot of collisions
in the hash table, which is the case for keyless lists.

Use the new API lyht_insert_no_check() to optimize the insertion.

Signed-off-by: Olivier Matz <[email protected]>
Here, we know the number of children that will be added in the hash
table, so create the hash table with the correct number of elements to
avoid automatic resizes.

Signed-off-by: Olivier Matz <[email protected]>
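The changelog mentions a lyht_get_fixed_size() helper; its actual semantics are not shown here, so the sizing below is only a hypothetical stand-in illustrating the idea: when the number of children is known up front, pick a capacity large enough (here, the next power of two at or above the element count, an assumption) so the table never resizes while being filled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizing helper: smallest power of two >= items, so a
 * table created with this capacity can absorb all known children
 * without triggering an automatic resize. */
static uint32_t sketch_fixed_size(uint32_t items)
{
    uint32_t size = 1;

    while (size < items)
        size <<= 1;
    return size;
}
```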
@olivier-matz-6wind
Contributor Author

changelog between v3 and v4:

  • fix format issues (sorry, I didn't realize I had to install uncrustify 0.77 to pass the test)

About the ABI compatibility check failure, the problem identified by abi-dumper is in lyd_node_inner, which includes a pointer to the children hash table:

Medium Severity
Base type of field children_ht has been changed from struct hash_table to struct ly_ht of different format.
This field may be incorrectly initialized or accessed by applications.

My thinking is that this field is never accessed by an application, only by the library (moreover the hash table type was not public until quite recently), so I think it should not be an issue. Please let me know if that's not the case.

@michalvasko michalvasko merged commit c0feb68 into CESNET:devel Oct 4, 2023
11 of 12 checks passed
@michalvasko
Member

Thanks, the ABI problem should be fine.

@olivier-matz-6wind
Contributor Author

Thank you!
