suggestions about its performance / addition to the benchmarks #8
Hi @sugawarayuuta, thanks for the discussion! For the benchmarks in "github.com/go-json-experiment/jsonbench", I was able to run your module on a 1st generation Macbook Air with an M1 chipset and got:
All JSONv2 numbers are normalized to 1, while "sonnet" is shown as relative to JSONv2. Higher number is slower, lower number is faster. Analysis:
Staged pooling only works if you know the expected size up-front or can accurately predict it. One of the most difficult cases is figuring out what size of buffer to use as the output for …
An early prototype of this module actually used bit-wise operations for processing JSON strings. In isolation, it is faster, but has several issues:
As much as possible, we'd like to avoid re-inventing the wheel. If a custom hashmap is better than the builtin Go map, …
Thank you for the serious consideration. I am honestly happy that it did not give bad results with concrete typed encoding/decoding, which was my top priority.
I think what you're saying applies only if you do the parsing in two stages, tokenizing and then the actual decoding, which requires iterating twice. In contrast, I only do it once (although I don't know whether this is a good idea; remember, this is still in heavy development). It's possible by first finding one of the double quotes, control characters, or backslashes (escape characters). In this way, you can read 8 bytes at once as long as the buffer doesn't end at the cursor, which it usually doesn't.
The perfect hash function (what I use) is possible because it assumes that the key/value pairs are known from the start (before the first Get/Put), as far as I know, and is not something that can easily be merged into the Go standard library.
Empirically, most JSON strings have no escape characters, so they can be "decoded" by simply truncating off the surrounding double-quotes. The JSONv2 implementation has a single bit to track whether a consumed JSON string contained any escape characters. If none are present, then the decode phase does the simple and fast O(1) operation.
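A minimal sketch of that fast path (illustrative only; consumeString and decodeSimple are made-up names here, not the actual JSONv2 code):

// consumeString scans a JSON string starting at src[0] == '"'. It returns
// the number of bytes consumed (including both quotes) and a single bit
// recording whether any backslash escape was seen. It returns n == 0 if
// the string is unterminated. (UTF-8 and control-character validation is
// elided to keep the sketch short.)
func consumeString(src []byte) (n int, hasEscapes bool) {
	if len(src) == 0 || src[0] != '"' {
		return 0, false
	}
	for i := 1; i < len(src); i++ {
		switch src[i] {
		case '"':
			return i + 1, hasEscapes
		case '\\':
			hasEscapes = true
			i++ // skip the escaped byte; real code validates \uXXXX etc.
		}
	}
	return 0, false
}

// If hasEscapes is false, "decoding" is an O(1) operation: trim the quotes.
// Only strings that actually contain escapes go through the slow unescaper.
func decodeSimple(raw []byte) []byte {
	return raw[1 : len(raw)-1]
}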
Performance is one of our goals, but not our main goal. Our primary goal is actually correctness and functionality, in which case maintainability becomes more important. Once the API and semantics have stabilized, we could consider custom hashmap implementations. We're more open to optimizations that buy a decent gain but have relatively little maintainability cost. As an aside, people care a lot about performance, but developer surveys have consistently shown that security and reliability rank as even more important. This is one of the reasons why we made the decision to forgo any use of "unsafe", even though it unfortunately leaves some performance on the table.
Thanks again for your answer. I hope I'm not wasting your time.
Then the situation is not that far from mine, and perhaps the SWAR approach can be implemented easily. When consuming strings, like you said you would, after consuming the first byte (always a double quote) you could instead search for the special characters (this can be implemented like the following):

import (
	"encoding/binary"
	"unicode/utf8"
)

const (
	// mapper has 1 mapped at every byte
	mapper uint64 = 0x0101010101010101
)

var (
	notSimpleTab = [256]bool{
		'\\': true,
		'"':  true,
	}
)

func init() {
	// Mark every byte outside the simple ASCII range: the loop starts at
	// utf8.RuneSelf (0x80), wraps around 0xFF to 0x00, and stops at ' '
	// (0x20), so it covers all non-ASCII bytes and all control characters.
	for char := byte(utf8.RuneSelf); char != ' '; char++ {
		notSimpleTab[char] = true
	}
}

// consumeSimpleString consumes the next JSON string per RFC 7159, section 7,
// but is limited to the grammar for an ASCII string without escape sequences.
// It returns 0 if the string is invalid or more complicated than a simple
// string, in which case consumeString should be called.
func consumeSimpleString(src []byte) int {
	var pos int
	if len(src) != 0 && src[0] == '"' {
		// starting quote
		pos++
		for {
			if len(src)-pos < 8 {
				// look for special characters OAAT (one at a time)
				for pos < len(src) && !notSimpleTab[src[pos]] {
					pos++
				}
				break
			}
			u64 := binary.LittleEndian.Uint64(src[pos:])
			// include u64 itself to find characters that are > unicode.MaxASCII
			u64 |= (u64 - (mapper * ' ')) | (u64 ^ (mapper * '"') - mapper) | (u64 ^ (mapper * '\\') - mapper)
			// algo: https://graphics.stanford.edu/~seander/bithacks.html#ValueInWord
			u64 &= mapper * 0x80
			if u64 != 0 {
				// index of the first special byte in this 8-byte chunk
				pos += int(((u64-1)&mapper*mapper)>>(64-8) - 1)
				break
			}
			pos += 8
		}
		if pos < len(src) && src[pos] == '"' {
			// ending quote
			pos++
			// first thing we found was a quote, valid and simple enough
			return pos
		}
	}
	// invalid or complicated string
	return 0
}

So I benchmarked this (before.txt is for the inlined approach and after.txt is for SWAR); the tested JSON files are here.
Sorry for the wall of text. BTW, is this considered a significant maintenance cost?
Ah yes, good point.
A ~2-8% gain is nice, but it also comes with a non-trivial amount of cost. My judgment is that we should hold off on adding this for now. The main focus at the present moment is figuring out the compatibility story between v1 and v2 (which is turning out to be a harder problem than everything done so far). In working on that, I've unfortunately needed to work around and/or temporarily undo some optimizations that were previously added. As mentioned before, the main focus is correctness and functionality. What are your thoughts on shelving this discussion until that work actually lands? I like performance gains, but I don't think now is the right time to focus on them.
Of course. Glad if it helped in any way. I am going to create a version that doesn't use unsafe.
That will be interesting for showing the performance difference of unsafe "reflect"-like operations vs using "reflect" itself.
Sure, no problem.
Hello. Sorry for being late. I burnt out after I made the version I posted months ago, but I finally made a version without unsafe/linkname(s). Although this is not a fair comparison against the previous version (I made some tweaks), it performed better than JSONv2 as well as the previous version. (Generally, "better" here means its runtime was shorter, for concrete types.)
Hi @sugawarayuuta. I hope you're doing well. I'm sorry to hear about the burnout; I've been in that situation myself before. I hope I'm not adding to your burden, but I was trying to run Sonnet through the latest iteration of the benchmarks and I'm getting test failures. If you patch go-json-experiment/jsonbench@b81ba2b with:

diff --git a/bench_test.go b/bench_test.go
index e9bf517..33dd43d 100644
--- a/bench_test.go
+++ b/bench_test.go
@@ -30,6 +30,7 @@ import (
gojson "github.com/goccy/go-json"
jsoniter "github.com/json-iterator/go"
segjson "github.com/segmentio/encoding/json"
+ sonnetjson "github.com/sugawarayuuta/sonnet"
"github.com/google/go-cmp/cmp"
"github.com/google/go-cmp/cmp/cmpopts"
@@ -118,6 +119,13 @@ var arshalers = []struct {
unmarshal: sonicjson.Unmarshal,
marshalWrite: func(w io.Writer, v any) error { return sonicenc.NewStreamEncoder(w).Encode(v) },
unmarshalRead: func(r io.Reader, v any) error { return sonicdec.NewStreamDecoder(r).Decode(v) },
+}, {
+ name: "SonnetJSON",
+ pkgPath: "github.com/sugawarayuuta/sonnet",
+ marshal: sonnetjson.Marshal,
+ unmarshal: sonnetjson.Unmarshal,
+ marshalWrite: func(w io.Writer, v any) error { return sonnetjson.NewEncoder(w).Encode(v) },
+ unmarshalRead: func(r io.Reader, v any) error { return sonnetjson.NewDecoder(r).Decode(v) },
}}
func TestRoundtrip(t *testing.T) {

I run into failures like:
Thank you very much for testing!
You don't need to test/benchmark this again if you have already done so, but I updated the repository to make the string validator for skipped strings more strict, resulting in …
Hey @sugawarayuuta, I apologize for the delay, but we've been focusing on getting the discussion for "encoding/json/v2" kickstarted. Great work with fixing your package to pass most of the tests! At the moment, we're focused on getting the API and correctness right, and then we'll come back to revisit performance. Unfortunately, performance is hard to nail down when we're making large-scale refactors to the entire package. For example, the change to split out "jsontext" caused a 5% performance regression across the board even though there was minimal logic change; just a lot of code movement, and I haven't had the chance to dig into the compiled code to understand exactly why. It is the right split from a code-maintenance perspective, even if it hurts performance. Given that the public discussion could result in yet more major changes to the API, I'd prefer to have the performance optimizations come after possible refactors. If you're up for it, it'd be good to revisit this after the dust has settled and see how we could incorporate some of your ideas. That said, we do prioritize readability and maintainability, so even something like SWAR would need to be carefully explained or written in a way that aids readability.
I came up with one particularly useful technique while writing my JSON parser. It was simple to implement, and importantly many optimizations followed naturally from it. I'm struggling to find any other JSON parser that does it, so I'm left in a confused state. The idea is simply that the String parser should always return a slice with cap >= 16:

if len(s) <= 16 {
	extended := s[:16:16]              // string parser guarantees that cap >= 16
	masks := masksForLens0to16[len(s)] // masks for zeroing the s[len:16] bytes
	var key uint64pair
	key.lo = binary.LittleEndian.Uint64(extended[0:]) & masks.lo
	key.hi = binary.LittleEndian.Uint64(extended[8:]) & masks.hi
	fieldParser, ok = fieldParsers16B.Get(key) // perfect hash table for fields with len(name) <= 16
}
if !ok { /* fallback lookup path */ }

This requires making the String parser start off by checking whether the remaining byte buffer has cap >= 16, and if not then it should … This idea of ensuring that the buffer has at least N accessible bytes also simplified handling "null", "infinity", "\uXXXX" and other multi-byte sequences. To wrap it up, this enabled a modest performance improvement while also simplifying parts of the parsing code.
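For reference, the two helpers the snippet relies on could look like this (a guess at a plausible definition; uint64pair and masksForLens0to16 are the names used above, everything else is assumed):

import "encoding/binary"

// uint64pair packs the first 16 bytes of a field name into two words so
// the pair can be used directly as a fixed-size hash-table key.
type uint64pair struct{ lo, hi uint64 }

// masksForLens0to16[n] keeps the first n bytes of a 16-byte chunk and
// zeroes the rest, so names shorter than 16 bytes compare equal even if
// the padding bytes behind them contain leftover buffer contents.
var masksForLens0to16 = func() (tab [17]uint64pair) {
	for n := 0; n <= 16; n++ {
		var buf [16]byte
		for i := 0; i < n; i++ {
			buf[i] = 0xFF
		}
		tab[n].lo = binary.LittleEndian.Uint64(buf[0:])
		tab[n].hi = binary.LittleEndian.Uint64(buf[8:])
	}
	return tab
}()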
Hello! This is my first issue ever, so go easy on me.
I have published a JSON library that I was thinking about during my high school breaks, and I hope to share the knowledge I have gained from it. If possible, I would also like to ask you to add it to your benchmarks.
The following articles describe the general idea:
The following is my repository:
There are three major optimizations I used, excluding unsafe things.

(Staged) sync.Pool
By setting up stages in the pool, slices etc. can be reused more reliably.
The idea is to create an array of pools and to specify the size of slices that each pool accepts.
My implementation is here.
This eliminates waste and at the same time filters out objects that are too large or too small.
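To make the staged-pool idea concrete, here is a minimal sketch (illustrative only; the power-of-two size classes and the names getBuf/putBuf are assumptions, not the actual sonnet code):

import "sync"

const maxClass = 20 // size classes hold buffers of 1<<0 .. 1<<(maxClass-1) bytes

// bufPools holds one sync.Pool per size class, so a buffer is only ever
// reused by requests of a similar size.
var bufPools [maxClass]sync.Pool

// class returns the smallest size class whose capacity (1<<c) fits n bytes.
func class(n int) int {
	c := 0
	for 1<<c < n && c < maxClass-1 {
		c++
	}
	return c
}

// getBuf returns an empty slice with a capacity of at least n.
func getBuf(n int) []byte {
	c := class(n)
	if 1<<c < n {
		return make([]byte, 0, n) // larger than the biggest class: don't pool
	}
	if v := bufPools[c].Get(); v != nil {
		return v.([]byte)[:0]
	}
	return make([]byte, 0, 1<<c)
}

// putBuf stores b in the largest class it can fully satisfy, so anything
// taken from that pool later is guaranteed to be big enough, and a small
// request never receives a wastefully large buffer.
func putBuf(b []byte) {
	if cap(b) == 0 {
		return
	}
	c := 0
	for c+1 < maxClass && 1<<(c+1) <= cap(b) {
		c++
	}
	bufPools[c].Put(b[:0])
}

The point is that each pool only hands out buffers that are guaranteed to fit its class, which is what "too large or too small" refers to above.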
Bit-wise operations (SWAR)
This can be used for multiple things, but I use it for byte searching. You can search multiple characters at once, so this can be efficient if assembly is not available.
My implementation is here.
This uses unsafe; however, you can still replace it with encoding/binary. The algorithms are described here.
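As an illustration of that point, here is the classic SWAR byte search written with encoding/binary instead of unsafe (a generic sketch, not the linked implementation; indexByteSWAR is a made-up name):

import (
	"encoding/binary"
	"math/bits"
)

const ones uint64 = 0x0101010101010101 // 0x01 repeated in every byte

// indexByteSWAR returns the index of the first occurrence of c in b,
// or -1 if c is not present. It inspects 8 bytes per iteration by
// loading them into a uint64 with encoding/binary.
func indexByteSWAR(b []byte, c byte) int {
	pos := 0
	for ; len(b)-pos >= 8; pos += 8 {
		w := binary.LittleEndian.Uint64(b[pos:])
		// XOR makes matching bytes zero; the classic "has zero byte"
		// trick then sets the high bit of every zero byte.
		m := w ^ (ones * uint64(c))
		zero := ((m - ones) &^ m) & (ones * 0x80)
		if zero != 0 {
			// The lowest set bit marks the first match; each byte is
			// 8 bits wide, so divide the bit position by 8.
			return pos + bits.TrailingZeros64(zero)/8
		}
	}
	for ; pos < len(b); pos++ { // tail: fewer than 8 bytes left
		if b[pos] == c {
			return pos
		}
	}
	return -1
}

For example, indexByteSWAR(data, '"') finds the next double quote eight bytes at a time, the same kind of scan that consumeSimpleString above performs for several special characters at once.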
Custom data structures
One thing I noticed while profiling the decoder is that lookups tend to be the bottleneck (in this case, checking whether a field with a given name exists). So I first created a hash map with the Robin Hood hashing algorithm, which was faster than the standard map because it's quite strong at handling keys that don't exist. I also tried cuckoo hashing, which has interesting insertion and O(1) lookup, but I ended up using a perfect hash function, since the key (field name) / value (field information) pairs are known from the start and don't change afterwards. I also used an OAAT FNV hash function to lowercase the field name and hash it at the same time. My implementation is (although it's probably too simple) here.
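A sketch of that fold-while-hashing idea, using the 64-bit FNV-1a constants (illustrative; the actual hash and table layout in sonnet may differ):

// hashFold hashes an ASCII-folded version of name in a single pass, so a
// case-insensitive field lookup needs neither a lowered copy of the key
// nor a second loop over its bytes.
func hashFold(name []byte) uint64 {
	const (
		offset64 uint64 = 14695981039346656037 // FNV-1a 64-bit offset basis
		prime64  uint64 = 1099511628211        // FNV-1a 64-bit prime
	)
	h := offset64
	for _, c := range name {
		if 'A' <= c && c <= 'Z' {
			c += 'a' - 'A' // fold to lowercase while hashing
		}
		h ^= uint64(c)
		h *= prime64
	}
	return h
}

Because the set of field names is fixed when the decoder for a struct type is built, the same function can be used both to construct the perfect hash table and to hash incoming keys during decoding.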
These are the optimizations I used.
Next, about the addition to the benchmarks: I created a repository for reference. The results of benchmarks taken on darwin/amd64 with an Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz are here; the code for the benchmark is in the same repository. Check it out if you're interested.
As you may already see, the library is quite strong at stream-decoding JSON thanks to the techniques I described above.
By the way, I made the library compatible with the standard encoding/json, so the usage is the same. Thank you for reading and have a good day.