-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the connector hashing and extend condesc_t #1528
Conversation
The second invocation of connector_list_equal(slot->clist, c) is not needed, because at this point we know that its result was "true" 2 lines before. So replace it by "true". This is only a very slight speedup, if any, because the compiler (in -O3) knows it is a "pure" function and thus returns its previous result.
Due to changes in include files I got a compiler warning that it shadows the global verbosity variable.
Keep condesc_t at 32 bytes after adding con_num, which will be used in connector hash and might later be used for radix sort.
Knuth's Method / Fibonacci Hashing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the circular pointers seem unusual. Are you sure you meant that? Merging anyway.
@@ -424,7 +424,7 @@ static bool sort_condesc_by_uc_constring(Dictionary dict) | |||
condesc_t *condesc = dict->contable.hdesc[n].desc; | |||
|
|||
if (NULL == condesc) continue; | |||
calculate_connector_info(condesc); | |||
calculate_connector_info(&dict->contable.hdesc[n]); | |||
sdesc[i++] = dict->contable.hdesc[n].desc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be same as condesc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condesc
is dict->contable.hdesc[n].desc
, not dict->contable.hdesc[n]
.
To be clearer, these lines can be changed to (not tested):
hdesc_t *hdesc = &dict->contable.hdesc[n];
if (NULL == hdesc->desc) continue;
calculate_connector_info(hdesc);
sdesc[i++] = hdesc->desc;
{ | ||
lc_enc_t lc_letters; | ||
lc_enc_t lc_mask; | ||
hdesc_t *more; /* More information, for keeping small struct size. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So hdesc
points at condesc
, and condesc
points at hdesc
. Interesting ... the old code did not have this loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hdesc
are the slots of the connector hash table. Each one that is used, has a desc
pointer to a condesc_t
element. Up to here it is a standard implementation of an open hash table.
The more
elements of condesc_t
need to point somewhere, and it was convenient to use the hdesc
slots, which are already allocated (it requires less code). However, this is memory inefficient (and indirectly CPU inefficiently due to cache trashing) because most of the hdesc
elements are empty (NULL desc
).
The solution is to use a memory pool for the more
elements. I will eventually send a cleanup PR for that (and the sort_condesc_by_uc_constring()
code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Ah! OK. I think I understand. This wasn't obvious from casual perusal, so I thought I'd point it out.
This patch enumerates the connector descriptors and uses this number in the connector hash functions.
(I also plan to use it as an array index in a future parse speedup.)
It also extends the information obtained through Connector::desc without increasing its 32-bit size by adding a pointer to
condesc_t
pointing to extra data. This will allow us to add more data per connector type, e.g., costs for multi-connectors (#1351, #1352) and a cost table per connection length (I think mentioned in #632, and a placeholder added in 2c1d979). The idea is that the data needed during parsing for easy_match is still directly available, while obtaining other data before and after the parsing may need an extra redirection.The hashing functions have also improved, and they should now keep the collisions low, hopefully without pathological cases (the current hash doesn't differentiate between connectors with more than 4 UP and 4 LC characters, and increasing
connector_hash_t
tosize_t
turned out to be too costly).I handled two types of collisions:
I've also included a few minor efficiency tweaks. I tested the speed on the corpus batches and the
big-dict
(from #1479), and it seems there is a small speedup (but this patch is not about speedup).