
[WIP] malloc optimizations #84

Open · wants to merge 20 commits into main
Conversation

@pulsejet (Collaborator) commented Dec 25, 2024

Compiler optimizations don't seem to kick in.

  1. The Component struct is copied by its value methods, and the Typ struct is also allocated on the heap.
  2. BufferReader causes too many allocations because decoding recurses (use a pool).
  3. Stop storing a copy of the Interest name in each in-record and out-record (it is unused; unclear why NFD wants it).
  4. Reoptimize (1) in a different way. Go's escape analysis is very conservative about escaping to the heap ...
  5. Optimize encoding allocation: allocate the entire wire as one array rather than in pieces.
  6. Add a sync.Pool for PIT entries and nodes.
  7. Save an allocation for the outgoing frame in NDNLP: once the wire plan is known, reuse an existing buffer.
  8. The incoming face is not optional, so remove that allocation.
  9. Remove allocations in the incoming-Interest pipeline. The cost is a cap on the number of possible next hops, which should be fine.

Allocation profiles observed after 2 GB of transfers:

/// Before ---
Showing nodes accounting for 15672805, 58.55% of 26766559 total
Dropped 26 nodes (cum <= 133832)
Showing top 10 nodes out of 55
      flat  flat%   sum%        cum   cum%
   4685894 17.51% 17.51%    4685894 17.51%  github.com/named-data/ndnd/std/encoding.Component.HashInto
   2916439 10.90% 28.40%    2916439 10.90%  github.com/named-data/ndnd/std/encoding.NewBufferReader (inline)
   1409045  5.26% 33.67%    1409045  5.26%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)

/// After (1) ---
Showing nodes accounting for 12262292, 49.65% of 24698426 total
Dropped 42 nodes (cum <= 123492)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
   2719825 11.01% 11.01%    2719825 11.01%  github.com/named-data/ndnd/std/encoding.NewBufferReader (inline)
   1671193  6.77% 17.78%    1671193  6.77%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)
   1370863  5.55% 23.33%    4038895 16.35%  github.com/named-data/ndnd/fw/table.(*PitCsTree).InsertInterest

/// After (2) ---
Showing nodes accounting for 9799349, 43.79% of 22380348 total
Dropped 33 nodes (cum <= 111901)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1703961  7.61%  7.61%    1703961  7.61%  github.com/named-data/ndnd/std/encoding.Component.Clone (inline)
   1130529  5.05% 12.67%    1130529  5.05%  github.com/named-data/ndnd/fw/table.(*pitCsTreeNode).findLongestPrefixEntryEnc
    953999  4.26% 16.93%    6472328 28.92%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (3) ---
Showing nodes accounting for 8818522, 42.69% of 20657586 total
Dropped 32 nodes (cum <= 103287)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1063225  5.15%  5.15%    6802693 32.93%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame
    999454  4.84%  9.99%     999454  4.84%  github.com/named-data/ndnd/fw/table.(*pitCsTreeNode).findLongestPrefixEntryEnc
    983054  4.76% 14.74%    1802266  8.72%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketEncoder).Init
    891584  4.32% 19.06%    2607587 12.62%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

/// After (4) ---
Showing nodes accounting for 9105158, 46.91% of 19409998 total
Dropped 39 nodes (cum <= 97049)
Showing top 10 nodes out of 48
      flat  flat%   sum%        cum   cum%
   1071792  5.52%  5.52%    1661629  8.56%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
   1015823  5.23% 10.76%    1015823  5.23%  github.com/named-data/ndnd/std/ndn/spec_2022.(*LpPacketEncoder).Init
    957636  4.93% 15.69%    7100475 36.58%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (5) ---
Showing nodes accounting for 8862399, 46.54% of 19043804 total
Dropped 37 nodes (cum <= 95219)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
   1298507  6.82%  6.82%    2815241 14.78%  github.com/named-data/ndnd/fw/table.(*PitCsTree).InsertInterest
   1026745  5.39% 12.21%    6777586 35.59%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest
    917524  4.82% 17.03%     917524  4.82%  github.com/named-data/ndnd/std/encoding.(*BufferReader).ReadWire
    905178  4.75% 21.78%    1429477  7.51%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    830141  4.36% 26.14%    5915832 31.06%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    825578  4.34% 30.48%    2658913 13.96%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

/// After (6) ---
Showing nodes accounting for 8411073, 45.00% of 18689948 total
Dropped 46 nodes (cum <= 93449)
Showing top 10 nodes out of 51
      flat  flat%   sum%        cum   cum%
   1048606  5.61%  5.61%    1671226  8.94%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketEncoder).Encode
   1048599  5.61% 11.22%    1048599  5.61%  github.com/named-data/ndnd/std/encoding.(*BufferReader).ReadWire
    985027  5.27% 16.49%    1356405  7.26%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    964911  5.16% 21.65%    7107327 38.03%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame

/// After (7) ---
Showing nodes accounting for 7759487, 43.84% of 17698855 total
Dropped 37 nodes (cum <= 88494)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
    951635  5.38%  5.38%    1519626  8.59%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    910302  5.14% 10.52%    6697571 37.84%  github.com/named-data/ndnd/fw/face.(*NDNLPLinkService).handleIncomingFrame
    895678  5.06% 15.58%    5783173 32.68%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    808291  4.57% 20.15%    6202039 35.04%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest

/// After (8) ---
Showing nodes accounting for 7892206, 45.93% of 17181875 total
Dropped 47 nodes (cum <= 85909)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1135975  6.61%  6.61%    5773439 33.60%  github.com/named-data/ndnd/fw/fw.(*Thread).processIncomingInterest
   1067935  6.22% 12.83%    1832538 10.67%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
   1004907  5.85% 18.68%    1004907  5.85%  github.com/named-data/ndnd/std/encoding.(*BufferReader).Range
    862911  5.02% 23.70%     862911  5.02%  github.com/named-data/ndnd/std/ndn/spec_2022.(*MetaInfoParsingContext).Parse

/// After (9) ---
Showing nodes accounting for 7559774, 45.75% of 16522296 total
Dropped 43 nodes (cum <= 82611)
Showing top 10 nodes out of 50
      flat  flat%   sum%        cum   cum%
   1092291  6.61%  6.61%    6165224 37.31%  github.com/named-data/ndnd/std/ndn/spec_2022.(*PacketParsingContext).Parse
    895678  5.42% 12.03%     895678  5.42%  github.com/named-data/ndnd/std/encoding.(*BufferReader).Range
    883180  5.35% 17.38%    1473017  8.92%  github.com/named-data/ndnd/std/ndn/spec_2022.(*InterestParsingContext).Parse
    775530  4.69% 22.07%     775530  4.69%  github.com/named-data/ndnd/std/ndn/spec_2022.(*MetaInfoParsingContext).Parse
    755538  4.57% 26.64%    2623373 15.88%  github.com/named-data/ndnd/std/ndn/spec_2022.(*DataParsingContext).Parse

@pulsejet added the labels enhancement (New feature or request) and fw (YaNFD issues) on Dec 25, 2024
@pulsejet self-assigned this on Dec 25, 2024
@pulsejet (Collaborator, Author):

At this point most allocations come from the TLV decoder. mallocgc overhead has gone down ~33% after this change (still needs more testing). If pool operations become a bottleneck, it might make more sense to switch to a lock-free per-thread free list later ...

/// CPU time before
      flat  flat%   sum%        cum   cum%
    3480ms 17.50% 17.50%     3480ms 17.50%  internal/runtime/syscall.Syscall6
    1070ms  5.38% 22.88%     1070ms  5.38%  runtime.memmove
    1060ms  5.33% 28.21%     4320ms 21.72%  runtime.mallocgc
     680ms  3.42% 31.62%      860ms  4.32%  runtime.findObject

/// CPU time after
Showing top 10 nodes out of 183
      flat  flat%   sum%        cum   cum%
    3620ms 18.67% 18.67%     3620ms 18.67%  internal/runtime/syscall.Syscall6
    1010ms  5.21% 23.88%     1010ms  5.21%  runtime.memmove
     670ms  3.46% 27.33%      850ms  4.38%  runtime.findObject
     610ms  3.15% 30.48%     2850ms 14.70%  runtime.mallocgc
     610ms  3.15% 33.63%     2260ms 11.66%  runtime.scanobject

Review thread on the generator code (diff excerpt):

"value.%[1]s, err = context.%[1]s_context.Parse(reader.Delegate(int(l)), ignoreCritical)",
f.name,
), nil
code := fmt.Sprintf("drdr := reader.Delegate(int(l))\n")

Member:
If you create a new variable, I think it is better to surround it with {...} to avoid potential conflict, like:

g.printlnf("{")
g.printlnf("tempVal := uint64(0)")
g.printlne(GenNaturalNumberDecode("tempVal"))
g.printlnf("value.%s = &tempVal", f.name)
g.printlnf("}")

Other review threads (resolved):
std/encoding/name_pattern.go
std/encoding/name_pattern.go (outdated)
fw/face/link-service.go
fw/fw/multicast.go
@zjkmxy (Member) commented Dec 25, 2024

Based on the benchmark https://github.com/zjkmxy/ndn-encoding-test, this PR makes decoding slower for now. If you have time, please take a look.

@pulsejet (Collaborator, Author):

> Based on the benchmark https://github.com/zjkmxy/ndn-encoding-test This PR makes decoding slower for now. If you have time, please take a look.

I'll set up a benchmark and report 👍🏻

I do expect a slowdown in certain cases, e.g. if the benchmark is single-threaded and/or there is no GC pressure. Any benefits would likely show up only when the GC itself is contending with decoding for CPU time.

@pulsejet (Collaborator, Author) commented Dec 26, 2024

@zjkmxy So I ran the benchmark, and it does behave as I expected.

The issue is that this isn't easy to test. Since the "test" packets in the benchmark sit on the live heap, they inflate the target heap size (see), so GC practically never happens during the benchmark.

  1. I tested with a single decoding test, 1000000 x 1000 B, i.e. a heap size of 1 GB.
  2. In gondnDecode, add _ = make([]byte, 4000) at the top to simulate unrelated allocations (equivalent to the forwarder reading a packet from the transport).
  3. Use a single thread so GC collides with the benchmark, and set the target footprint to ~10 MB (GOGC=1, i.e. 1%). With the extra 4000 B allocation every cycle, GC now happens often enough.
GOGC=1 GOMAXPROCS=1 ./ndn-encoding-test.exe

With this, decoding time goes from ~4.50s to ~3.95s (average of 5 runs).

This said, the gains do seem somewhat academic here - contrary to my assumption, malloc itself is extremely fast. I'm OK with abandoning bufferReaderPool and Free() if the maintenance overhead seems high. What do you think?

BTW - this benchmark is great! I'll explore integrating it as an automated test here if/when time permits.

PS: yes I'm testing on Windows :p

@zjkmxy (Member) commented Dec 26, 2024

> This said, the gains do seem somewhat academic here - unlike my assumption, the malloc itself is extremely fast. I'm OK to abandon bufferReaderPool and Free() if the maintenance overhead seems high. What do you think?

I don't know, to be honest. The benchmark covers encoding and decoding only and does not account for the forwarder pipeline. If possible, I think we still need a forwarding stress test.

> PS: yes I'm testing on Windows :p

All the data I put into that repo were collected on Bruins. It can still reproduce the old numbers if you roll back to an old commit that uses zjkmxy/go-ndn.


var pitCsTreeNodePool = sync.Pool{New: newPitCsTreeNode}

func newPitCsTreeNode() interface{} {
Member:

nit: I think any is preferred over interface{} for new code.
