I thought it might be interesting to do a profile to see which parts actually have the largest impacts on performance.
The setup was fairly simple: I wrote a small test program which parsed the first 5 million entries from a table dump and then exited. This was compiled in release mode with debug symbols using bgpkit-parser 6055612, and I used Intel VTune to perform the profile.
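For reference, the test program amounted to little more than the sketch below (the file name is a placeholder, and the exact constructor/iterator API is from memory rather than checked against commit 6055612):

```rust
use bgpkit_parser::BgpkitParser;

fn main() {
    // Placeholder path to a local MRT table dump; the updates-file run
    // later in this post just swaps out this path.
    let parser = BgpkitParser::new("rib.example.bz2").unwrap();

    // Parse the first 5 million elems, discard them, and exit.
    for _elem in parser.into_iter().take(5_000_000) {
        // no-op: we only care about the parsing cost itself
    }
}
```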
Here was the result of that run. I included the image for context, but much of it is unreadable without clicking the various segments.
Here are a couple of the parts I found interesting:
`Elementor::record_to_elems` took up 11.9% of the total CPU time, but the vast majority (67.7%) of that time was spent waiting on the system allocator. From a quick glance, all of these cases involved using `Vec`.
The function that took the most CPU time (42.0%) was `ReadUtils::read_nlri_prefix`. This is not that surprising given the type of file being parsed, but it looks like there are a number of ways it could be improved.
26.1% of the entire application runtime was spent allocating and freeing memory.
Because viewing a table dump leads to somewhat biased results, I also ran it again on one of the largest updates files I could find for rrc15 (`updates.20230124.0750.gz`, 31 MB). The test code was exactly the same except for swapping out the file path.
In this case, the majority (59.1%) of the CPU time was spent allocating and freeing memory using the system allocator. This is a bit alarming, since it means more time was spent waiting on allocations than actually performing any meaningful processing. An additional 7.8% of the CPU time was spent in `memcpy`. It is a bit harder to tell if `memcpy` is being overused, but roughly a third of that time seems to involve values being cloned in `bgp_update_to_elems`.
An easy way to get a sizable performance boost might be to use a crate like `smallvec`, `tinyvec`, or `arrayvec`. With some slight variations, they all provide vec-like data structures that reserve a certain amount of space on the stack before allocating space on the heap. This could have a massive impact on performance for cases where you need the flexibility of a `Vec`, but know that in most cases it will only hold a small number of elements. In fact, if you enable the `union` feature for `smallvec`, it can use the space a `Vec` would normally use for its base pointer and capacity to start storing values instead. This means that if the inline storage fits within 2 machine words (16 bytes on x64), then the type will be exactly the same size as a `Vec` would be, minus the heap allocation. A quick sketch of the idea follows.
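To illustrate (the inline capacity of 4 here is arbitrary, and nothing below is tied to bgpkit-parser itself):

```rust
use smallvec::SmallVec;

// A vec-like container that stores up to 4 u32s inline before spilling
// to the heap. With smallvec's `union` feature enabled, this type should
// be the same size as Vec<u32>, since the 16-byte inline buffer fits in
// the space Vec uses for its base pointer and capacity.
fn collect_small(values: &[u32]) -> SmallVec<[u32; 4]> {
    let mut out = SmallVec::new();
    for &v in values {
        out.push(v); // no heap allocation until the 5th element
    }
    out
}

fn main() {
    let inline = collect_small(&[1, 2, 3]);
    let heap = collect_small(&[1, 2, 3, 4, 5]);
    assert!(!inline.spilled()); // still using the inline buffer
    assert!(heap.spilled());    // spilled to a heap allocation
}
```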
Originally posted by @jmeggitt in #81 (comment)