Add nesting depth limit to SpanFormatter #16

Open · baroquebobcat wants to merge 7 commits into airbnb-main
Conversation

baroquebobcat

Summary

Currently the preprocessor will generate fields for all of the nested keys in a given document, even if they are deeply nested. This can cause problems if there are many keys, or if keys represent IDs with high cardinality. The index node's equivalent code has a limit on how deep it will nest, but the preprocessor's code path just had a TODO.

This patch adds a depth limit so that fields that are too deeply nested are handled differently.

Currently it just ignores them, but it could do something else.

Rather than deciding on a place to put the depth limit in config, I hard-coded it in a constant, but I'm thinking it'd make sense to add it to the schema config.
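
For illustration, here is a minimal, self-contained sketch of the recursion shape this patch adds. The flattener helper below is hypothetical (the real method in SpanFormatter builds Trace.KeyValue tags), but the idea is the same: each level of nesting consumes one unit of the remaining depth budget, and subtrees deeper than the limit are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a depth-limited flattener; the real SpanFormatter
// produces Trace.KeyValue tags, but the recursion shape is the same.
public class DepthLimitedFlattener {
  // Hard-coded for now; the PR discussion suggests moving this to schema config later.
  public static final int DEPTH_LIMIT = 3;

  public static List<String> flatten(String prefix, Object value, int depthLimit) {
    List<String> tags = new ArrayList<>();
    if (value instanceof Map) {
      // Too deep: skip the whole subtree (the current behaviour of the patch).
      if (depthLimit <= 0) {
        return tags;
      }
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        // Each level of nesting consumes one unit of the remaining depth budget.
        tags.addAll(flatten(prefix + "." + entry.getKey(), entry.getValue(), depthLimit - 1));
      }
    } else {
      tags.add(prefix + "=" + value);
    }
    return tags;
  }

  public static void main(String[] args) {
    Map<String, Object> doc = Map.of("a", Map.of("b", Map.of("c", Map.of("d", 1))));
    // "a.b.c.d" is four levels deep, so with DEPTH_LIMIT = 3 it is dropped.
    System.out.println(flatten("doc", doc, DEPTH_LIMIT));
  }
}
```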

Requirements

@@ -21,6 +21,7 @@ public class SpanFormatter {

   public static final String DEFAULT_LOG_MESSAGE_TYPE = "INFO";
   public static final String DEFAULT_INDEX_NAME = "unknown";
+  public static final int DEPTH_LIMIT = 3;

We can add a TODO for now to move it to schema config.
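
Not part of this patch, but for illustration, a hedged sketch of what moving the limit into schema config might look like. The schema-config lookup shown here is hypothetical and not an existing Astra API.

```java
import java.util.Map;

class DepthLimitConfig {
  // Hypothetical: read the limit from a schema-config map if present,
  // otherwise fall back to the hard-coded SpanFormatter.DEPTH_LIMIT default.
  static int resolveDepthLimit(Map<String, String> schemaConfig) {
    String configured = schemaConfig.get("depthLimit");
    return configured != null ? Integer.parseInt(configured) : 3; // SpanFormatter.DEPTH_LIMIT
  }

  public static void main(String[] args) {
    System.out.println(resolveDepthLimit(Map.of("depthLimit", "5"))); // 5
    System.out.println(resolveDepthLimit(Map.of()));                  // 3
  }
}
```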

  List<Trace.KeyValue> tags = new ArrayList<>();
  if (value instanceof Map) {
    // todo - consider adding a depth param to prevent excessively nested fields
    if (depthLimit <= 0) {
      return tags;

Instead of returning an empty list of tags, can we instead return a string blob that is un-indexed? Since logging systems are often used as debugging systems, it would be ideal to ingest the data where possible.

baroquebobcat (Author)

Argh, yeah. We'll need to do that now, but it'll be sort of broken. I'd assumed that the _source field would retain the original JSON, but this change broke that: https://github.com/slackhq/astra/pull/816/files#diff-3e3f1e6e9032267dd01c35b0f82f031bd633e1f08355002149316079bbe5711b. It got rid of the code that reified the original document in favor of flattening everything. I'm not sure it was intended, because it isn't mentioned in the PR and there are no tests that cover the change.

Kinda like this:
Before:
{"a":{"b":1}} => {"a.b":1,"_source":{"a":{"b":1}}}
After:
{"a":{"b":1}} => {"a.b":1,"_source":{"a.b":1}}

baroquebobcat (Author)

Hm. Maybe not exactly that PR, but certainly from that series of changes.

baroquebobcat (Author)

> can we instead return a string blob that is un-indexed?

Yes. I'll mark it as BINARY, which will cause it to be un-indexed but still be injected into the document correctly.
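
For illustration, a hedged sketch of that fallback: serialize the too-deep subtree to JSON and attach it as a BINARY-typed tag. The Trace.KeyValue builder and field names below follow a Jaeger-style proto and may not match the exact names in this codebase.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.protobuf.ByteString;
import java.util.Map;

class DeepSubtreeFallback {
  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

  // Sketch only: serialize the too-deeply-nested subtree to a JSON string and
  // store it as a BINARY-typed tag so it is not indexed but still ends up in
  // the ingested document. Builder/field names are assumed, not verified.
  static Trace.KeyValue binaryBlobTag(String key, Map<?, ?> subtree) throws Exception {
    String json = OBJECT_MAPPER.writeValueAsString(subtree);
    return Trace.KeyValue.newBuilder()
        .setKey(key)
        .setVType(Trace.ValueType.BINARY)
        .setVBinary(ByteString.copyFromUtf8(json))
        .build();
  }
}
```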
