Parsing into Map #49

piotrrzysko · 2024-05-18T04:59:11Z

Introduction

Sometimes users want to parse a JSON object into a map.

Let's assume that we have the following example object:

{
    "intKey": 123,
    "objKey": {
        "key1": "abc",
        "key2": false
    },
    "arrayKey": [1, 2, 3]
}

We expect the parser to produce a Map<String, Object> from which we should be able to extract the object's fields in the following way:

Map<String, Object> map = parser.parse(bytes, bytes.length, Map.class);

int intValue = (int) map.get("intKey");

Map<String, Object> obj = (Map<String, Object>) map.get("objKey");
String value1 = (String) obj.get("key1");
boolean value2 = (boolean) obj.get("key2");

List<Object> array = (List<Object>) map.get("arrayKey");

Question

Let’s assume that the parser exposes an API like:

Map<String, Object> map = parser.parse(bytes, bytes.length, Map.class);

The returned map is immutable.

JSON parsing benchmarks often show that, in Java, creating new strings takes a significant portion of the time. So, the question is: at which stage should this happen? I see two options:

Option 1

Map<String, Object> map = parser.parse(bytes, bytes.length, Map.class);

// at this point all Strings are created

String value1 = (String) map.get("key"); // this doesn’t create a new one
String value2 = (String) map.get("key"); // this doesn’t create a new one either

Option 2

Map<String, Object> map = parser.parse(bytes, bytes.length, Map.class);

// at this point, the map only holds its own copy of a byte array with all parsed strings, but no instance of String has been created so far

String value1 = (String) map.get("key"); // this creates a new instance of String
String value2 = (String) map.get("key"); // this also creates a new instance of String

I suppose the second option is far more efficient in situations where someone wants to access only a small set of all fields and they want to do so only once.
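To make Option 2 more concrete, here is a minimal sketch of what such a lazy map could look like. It is not the library's implementation: `LazyStringMap` and `register` are hypothetical names, only string values are handled, and keys are stored as ordinary `String`s for brevity (a real implementation would probably avoid that too).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of Option 2: values stay as UTF-8 bytes until they are requested. */
final class LazyStringMap {
    // The map's own copy of the bytes that hold all parsed string values.
    private final byte[] buffer;
    // key -> {offset, length} of the value inside buffer.
    private final Map<String, int[]> slices = new HashMap<>();

    LazyStringMap(byte[] source) {
        // Copy so that later changes to the caller's array do not affect this map.
        this.buffer = Arrays.copyOf(source, source.length);
    }

    /** Records where the value for a key lives; no String is decoded here. */
    void register(String key, int offset, int length) {
        slices.put(key, new int[] {offset, length});
    }

    /** Decodes the value on every call, so each call allocates a new String. */
    String get(String key) {
        int[] slice = slices.get(key);
        if (slice == null) {
            return null;
        }
        return new String(buffer, slice[0], slice[1], StandardCharsets.UTF_8);
    }
}
```

With this shape, parsing pays for one array copy up front, and the decoding cost is paid only for the fields that are actually read.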

@ZhaiMo15 @zekronium since you reported this topic, what are your thoughts? I’d like to understand your use cases better to be able to choose a more suitable option or come up with something else.


ZhaiMo15 · 2024-05-20T03:19:51Z

Regarding Option 2:

at this point, the map only holds its own copy of a byte array with all parsed strings

Does this byte array mean buffer and stringBuffer in JsonValue? If so, I think #36 would occur. In my case, as I mentioned in #47, UDFJson has a cache that stores parsing results, which saves parsing time when the same JSON string arrives again.

For example:

bytes1 = 
{
    "intKey": 123,
    "objKey": {
        "key1": "abc",
        "key2": false
    },
    "arrayKey": [1, 2, 3]
}

bytes2 =
{
    "intKey": 987,
    "objKey": {
        "key1": "zyx",
        "key2": true
    },
    "arrayKey": [9, 8, 7]
}

Map<String, Object> map1 = parser.parse(bytes1, bytes1.length, Map.class);
cache.put(bytes1, map1);
Map<String, Object> map2 = parser.parse(bytes2, bytes2.length, Map.class);
cache.put(bytes2, map2);

// the second time bytes1 arrives
map1 = cache.get(bytes1);
// map1.get("xxx"), etc.

For the second lookup of bytes1 to return the correct result, I think the byte array would need to be saved separately for each input. However, wouldn't copying the byte array cost about as much time as Option 1?
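The concern can be illustrated with a small stand-alone sketch. `BufferReuseDemo` is a hypothetical stand-in for a parser that reuses one internal buffer across calls; it is not the library's code, and the fixed 16-byte buffer is only for the demo.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a parser that reuses a single internal buffer. */
final class BufferReuseDemo {
    // Reused across calls; inputs longer than 16 bytes are not supported in this demo.
    private final byte[] shared = new byte[16];

    /** Returns the shared buffer itself: a cached result is overwritten by the next parse. */
    byte[] parseShared(byte[] input) {
        System.arraycopy(input, 0, shared, 0, input.length);
        return shared;
    }

    /** Returns an independent copy: safe to cache, at the cost of one array copy per parse. */
    byte[] parseCopied(byte[] input) {
        System.arraycopy(input, 0, shared, 0, input.length);
        return Arrays.copyOf(shared, input.length);
    }
}
```

If a cached result were backed by the shared buffer, parsing bytes2 would change what the cached result for bytes1 contains. The copy avoids that, and it is a plain array copy rather than a String allocation per value.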

On one hand, Option 1 plus a cache reduces the number of parses (from x to 1 if there are no cache misses), but that single parse could be costly, since all Strings are created. On the other hand, Option 2 reduces the cost of each parse, since no String instances are created, but the same JSON string may be parsed more than once.

But in the end, I'm convinced by

I suppose the second option is far more efficient in situations where someone wants to access only a small set of all fields and they want to do so only once.

I vote for Option 2.

zekronium · 2024-05-20T14:41:21Z

Ideally it would be nice to be able to do both, but in the spirit of SIMD, lazy parsing seems more appropriate, especially once full stream parsing is available; that could be the underlying harness for most things if need be.

I think Option 2 is better initially, but maybe it should cache the result? I think Jackson does this a lot and it works well.
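A rough sketch of "lazy, but cache the result" could look like the following. `MemoizingLazyMap` and `register` are hypothetical names, only string values are covered, and the class is not thread-safe; it is an illustration of the idea, not a proposal for the actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: decode a value on first access, then reuse the decoded String. */
final class MemoizingLazyMap {
    private final byte[] buffer;                                // own copy of the string bytes
    private final Map<String, int[]> slices = new HashMap<>();  // key -> {offset, length}
    private final Map<String, String> decoded = new HashMap<>(); // values decoded so far

    MemoizingLazyMap(byte[] source) {
        this.buffer = Arrays.copyOf(source, source.length);
    }

    void register(String key, int offset, int length) {
        slices.put(key, new int[] {offset, length});
    }

    String get(String key) {
        String cached = decoded.get(key);
        if (cached != null) {
            return cached; // already decoded: no new String is created
        }
        int[] slice = slices.get(key);
        if (slice == null) {
            return null;
        }
        String value = new String(buffer, slice[0], slice[1], StandardCharsets.UTF_8);
        decoded.put(key, value);
        return value;
    }
}
```

The trade-off is extra memory for the decoded values and one more map lookup per access, in exchange for repeated reads of the same field returning the same instance.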

arouel · 2024-11-07T17:32:37Z

@piotrrzysko I agree with @zekronium: Option 2 would be preferred, but even Option 1 would be a good start.

I wonder if we want to use a new TypeToken<Map<String, Object>>() {} to avoid the unchecked cast to Map<String, Object>. This is the approach Gson follows.
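For reference, the underlying trick is that an anonymous subclass keeps its generic superclass in the class metadata, so the full type can be read back via reflection. Below is a minimal stand-alone sketch; `TypeRef` is a hypothetical name used here to avoid confusion with Gson's own `TypeToken`, and how the parser would consume the captured `Type` is left open.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

/** Hypothetical type-capturing helper, analogous in spirit to Gson's TypeToken. */
abstract class TypeRef<T> {
    private final Type type;

    protected TypeRef() {
        // For "new TypeRef<Map<String, Object>>() {}" this is TypeRef<Map<String, Object>>.
        Type superclass = getClass().getGenericSuperclass();
        if (!(superclass instanceof ParameterizedType)) {
            throw new IllegalStateException("TypeRef must be created with a type argument");
        }
        this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
    }

    /** The captured type argument, e.g. Map<String, Object> as a ParameterizedType. */
    Type getType() {
        return type;
    }
}
```

A `parse` overload that accepts such an object could then be declared with `T` as its return type, so the caller gets a `Map<String, Object>` back without a cast.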
