
[WIP] malloc optimizations #84

Open · wants to merge 20 commits into main
Conversation

@pulsejet (Collaborator) commented Dec 25, 2024

Compiler optimizations don't seem to kick in.

  1. The Component struct is copied by its value methods, and the Typ struct is also allocated on the heap.
  2. BufferReader causes too many allocations because decoding recurses (use a pool).
  3. Stop storing a copy of the Interest name in each in-record and out-record (it is unused; unclear why NFD wants it).
  4. Reoptimize (1) in a different way. Go's escape analysis is very conservative about escaping to the heap ...
  5. Optimize encoding allocation: allocate the entire wire as one array rather than in pieces.
  6. Add a sync.Pool for PIT entries and nodes.
  7. Save an allocation for the outgoing frame in NDNLP: once the wire plan is known, reuse an existing buffer.
  8. The incoming face is not optional, so remove that allocation.
  9. Remove allocations in the incoming-Interest pipeline. The cost is a cap on the number of possible next hops, which should be fine.

Allocation profiles observed after 2 GB of transfers:

/// Before ---
Showing nodes accounting for 15672805, 58.55% of 26766559 total
Dropped 26 nodes (cum <= 133832)
Showing top 10 nodes out of 55
      flat  flat%   sum%        cum   cum%
   4685894 17.51% 17.51%    4685894 17.51%  github.com/named-data/ndnd/std/encoding.Component.HashInto
   2916439 10.90% 28.40%    2916439 10.90%  github.com/named-data/ndnd/std/encoding.NewBufferReader (inline)
   1409045  5.26% 33.67%    1409045  5.26%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)

/// After (1) ---
Showing nodes accounting for 12262292, 49.65% of 24698426 total
Dropped 42 nodes (cum <= 123492)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
   2719825 11.01% 11.01%    2719825 11.01%  github.com/named-data/ndnd/std/encoding.NewBufferReader (inline)
   1671193  6.77% 17.78%    1671193  6.77%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)
   1370863  5.55% 23.33%    4038895 16.35%  github.com/named-data/ndnd/fw/table.(*PitCsTree).InsertInterest

/// After (2) ---
Showing nodes accounting for 9799349, 43.79% of 22380348 total
Dropped 33 nodes (cum <= 111901)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1703961  7.61%  7.61%    1703961  7.61%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)
   1130529  5.05% 12.67%    1130529  5.05%  github.com/named-data/ndnd/fw/table.(*pitCsTreeNode).findLongestPrefixEntryEnc
    953999  4.26% 16.93%    6472328 28.92%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (3) ---
Showing nodes accounting for 8818522, 42.69% of 20657586 total
Dropped 32 nodes (cum <= 103287)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1063225  5.15%  5.15%    6802693 32.93%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame
    999454  4.84%  9.99%     999454  4.84%  github.com/named-data/ndnd/fw/table.(*pitCsTreeNode).findLongestPrefixEntryEnc
    983054  4.76% 14.74%    1802266  8.72%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketEncoder).Init
    891584  4.32% 19.06%    2607587 12.62%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

/// After (4) ---
Showing nodes accounting for 9105158, 46.91% of 19409998 total
Dropped 39 nodes (cum <= 97049)
Showing top 10 nodes out of 48
      flat  flat%   sum%        cum   cum%
   1071792  5.52%  5.52%    1661629  8.56%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
   1015823  5.23% 10.76%    1015823  5.23%  github.com/named-data/ndnd/std/ndn/spec_2022.(*LpPacketEncoder).Init
    957636  4.93% 15.69%    7100475 36.58%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (5) ---
Showing nodes accounting for 8862399, 46.54% of 19043804 total
Dropped 37 nodes (cum <= 95219)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
   1298507  6.82%  6.82%    2815241 14.78%  github.com/named-data/ndnd/fw/table.(*PitCsTree).InsertInterest
   1026745  5.39% 12.21%    6777586 35.59%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest
    917524  4.82% 17.03%     917524  4.82%  github.com/named-data/ndnd/std/encoding.(*BufferReader).ReadWire
    905178  4.75% 21.78%    1429477  7.51%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    830141  4.36% 26.14%    5915832 31.06%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    825578  4.34% 30.48%    2658913 13.96%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

/// After (6) ---
Showing nodes accounting for 8411073, 45.00% of 18689948 total
Dropped 46 nodes (cum <= 93449)
Showing top 10 nodes out of 51
      flat  flat%   sum%        cum   cum%
   1048606  5.61%  5.61%    1671226  8.94%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketEncoder).Encode
   1048599  5.61% 11.22%    1048599  5.61%  github.com/named-data/ndnd/std/encoding.(*BufferReader).ReadWire
    985027  5.27% 16.49%    1356405  7.26%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    964911  5.16% 21.65%    7107327 38.03%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (7) ---
Showing nodes accounting for 7759487, 43.84% of 17698855 total
Dropped 37 nodes (cum <= 88494)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
    951635  5.38%  5.38%    1519626  8.59%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    910302  5.14% 10.52%    6697571 37.84%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame
    895678  5.06% 15.58%    5783173 32.68%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    808291  4.57% 20.15%    6202039 35.04%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest

/// After (8) ---
Showing nodes accounting for 7892206, 45.93% of 17181875 total
Dropped 47 nodes (cum <= 85909)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1135975  6.61%  6.61%    5773439 33.60%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest
   1067935  6.22% 12.83%    1832538 10.67%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
   1004907  5.85% 18.68%    1004907  5.85%  github.com/named-data/ndnd/std/encoding.(*BufferReader).Range
    862911  5.02% 23.70%     862911  5.02%  github.com/named-data/ndnd/std/ndn/spec_2022.(*MetaInfoParsingContext).Parse

/// After (9) ---
Showing nodes accounting for 7559774, 45.75% of 16522296 total
Dropped 43 nodes (cum <= 82611)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1092291  6.61%  6.61%    6165224 37.31%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    895678  5.42% 12.03%     895678  5.42%  github.com/named-data/ndnd/std/encoding.(*BufferReader).Range
    883180  5.35% 17.38%    1473017  8.92%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    775530  4.69% 22.07%     775530  4.69%  github.com/named-data/ndnd/std/ndn/spec_2022.(*MetaInfoParsingContext).Parse
    755538  4.57% 26.64%    2623373 15.88%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

@pulsejet added the labels enhancement (New feature or request) and fw (YaNFD issues) on Dec 25, 2024
@pulsejet self-assigned this on Dec 25, 2024
@pulsejet (Collaborator, Author):

At this point most allocations come from the TLV decoder. mallocgc overhead has gone down ~33% after this change (still needs more testing). If pool operations become a bottleneck, it might make more sense to switch to a lock-free per-thread free list later ...

/// CPU time before
      flat  flat%   sum%        cum   cum%
    3480ms 17.50% 17.50%     3480ms 17.50%  internal/runtime/syscall.Syscall6
    1070ms  5.38% 22.88%     1070ms  5.38%  runtime.memmove
    1060ms  5.33% 28.21%     4320ms 21.72%  runtime.mallocgc
     680ms  3.42% 31.62%      860ms  4.32%  runtime.findObject

/// CPU time after
Showing top 10 nodes out of 183
      flat  flat%   sum%        cum   cum%
    3620ms 18.67% 18.67%     3620ms 18.67%  internal/runtime/syscall.Syscall6
    1010ms  5.21% 23.88%     1010ms  5.21%  runtime.memmove
     670ms  3.46% 27.33%      850ms  4.38%  runtime.findObject
     610ms  3.15% 30.48%     2850ms 14.70%  runtime.mallocgc
     610ms  3.15% 33.63%     2260ms 11.66%  runtime.scanobject

Review thread on the generator code (diff excerpt):

"value.%[1]s, err = context.%[1]s_context.Parse(reader.Delegate(int(l)), ignoreCritical)",
f.name,
), nil
code := fmt.Sprintf("drdr := reader.Delegate(int(l))\n")

Member:
If you create a new variable, I think it is better to surround it with {...} to avoid potential conflict, like:

g.printlnf("{")
g.printlnf("tempVal := uint64(0)")
g.printlne(GenNaturalNumberDecode("tempVal"))
g.printlnf("value.%s = &tempVal", f.name)
g.printlnf("}")

Other review threads (resolved):
std/encoding/name_pattern.go
std/encoding/name_pattern.go (outdated)
fw/face/link-service.go
fw/fw/multicast.go
@zjkmxy (Member) commented Dec 25, 2024

Based on the benchmark https://github.com/zjkmxy/ndn-encoding-test, this PR makes decoding slower for now. If you have time, please take a look.

@pulsejet (Collaborator, Author):

> Based on the benchmark https://github.com/zjkmxy/ndn-encoding-test This PR makes decoding slower for now. If you have time, please take a look.

I'll set up a benchmark and report 👍🏻

I do expect a slowdown in certain cases, e.g. if the benchmark is single-threaded and/or there is no GC pressure. Any benefits would likely show up only when the GC itself is contending with decoding for CPU time.

@pulsejet (Collaborator, Author) commented Dec 26, 2024

@zjkmxy So I ran the benchmark, and it does behave as I expected.

The issue is that this isn't easy to test. Since the "test" packets in the benchmark sit on the live heap, they inflate the target heap size (see), so GC practically never happens during the benchmark.

  1. I tested with a single decoding test, 1000000 x 1000 B, i.e. a heap size of 1 GB.
  2. In gondnDecode, add _ = make([]byte, 4000) at the top to simulate unrelated allocations (equivalent to the forwarder reading a packet from the transport).
  3. Use a single thread so GC collides with the benchmark, and set the target footprint to ~10 MB (GOGC=1, i.e. 1%). With the extra 4000 B allocation every cycle, GC now happens often enough.
GOGC=1 GOMAXPROCS=1 ./ndn-encoding-test.exe

With this, decoding time goes from ~4.50s to ~3.95s (average of 5 runs).

This said, the gains do seem somewhat academic here - contrary to my assumption, malloc itself is extremely fast. I'm OK with abandoning bufferReaderPool and Free() if the maintenance overhead seems high. What do you think?

BTW - this benchmark is great! I'll explore integrating it as an automated test here if/when time permits.

PS: yes I'm testing on Windows :p

@zjkmxy (Member) commented Dec 26, 2024

> This said, the gains do seem somewhat academic here - unlike my assumption, the malloc itself is extremely fast. I'm OK to abandon bufferReaderPool and Free() if the maintenance overhead seems high. What do you think?

I don't know, to be honest. The benchmark covers encoding and decoding only and does not account for the forwarder pipeline. If possible, I think we still need a forwarding stress test.

> PS: yes I'm testing on Windows :p

All the data I put into that repo were collected on Bruins. It can still reproduce the old numbers if you roll back to an old commit that uses zjkmxy/go-ndn.


var pitCsTreeNodePool = sync.Pool{New: newPitCsTreeNode}

func newPitCsTreeNode() interface{} {
Member:

nit: I think any is preferred over interface{} for new code.
