AST of validations/matches as debug info to investigate false positives in validation #788

zokrezyl · 2024-08-16T17:32:58Z


UPDATED

Hi everyone,

I've recently been using JSON Schema and its validation mechanisms in Python for language modeling. During this process, I noticed that there doesn't seem to be any tooling or implementation that can help build an Abstract Syntax Tree (AST)-like construct from the matched input instance.

Having such a tool would be incredibly valuable because I've observed that, especially with complex or recursive schemas, it's possible to have incorrect validations (false positives) that are silently accepted. This can lead to significant issues, particularly when the validation doesn't align with the intended logic.

I'm not sure if anyone else has experimented with this idea, but I'd like to bring it to the community's attention. I'm also interested in contributing to the implementation or standardization of such an extension or approach.

To clarify, let's consider an example where I'm implementing a templating language similar to JSON Logic. I might write a schema like the following (note that this is incomplete, with irrelevant fields omitted):

{
  "$ref": "#/definitions/MyLang",
  "definitions": {
    "MyLang": {
      "oneOf": [
        { "$ref": "#/definitions/And" },
        { "$ref": "#/definitions/Any" }
      ]
    },
    "And": {
      "additionalProperties": false,
      "required": ["and"],
      "properties": {
        "and": {
          "type": "array",
          "items": {
            "oneOf": [
              { "type": "string" },
              { "type": "boolean" }
            ]
          }
        }
      }
    },
    "Any": {
      "oneOf": [
        { "type": "string" },
        {
          "type": "object",
          "not": {
            "required": ["and"]
          }
        }
      ]
    }
  }
}

Now, consider an input like this:

{
  " and": [true, false]  // Note the space before "and"
}

In this case, the input is silently validated as correct under the schema, even though there is an unintended space before "and". While the validator might consider this a true positive according to the schema's rules, it is a false positive in the context of my intended logic.
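To make the silent acceptance concrete, here is a minimal sketch in plain Python. It is a toy evaluator that understands only the keywords used in the schema above ($ref, type, oneOf, not, required, properties, additionalProperties, items); it is not a conforming JSON Schema implementation, and the helper names (resolve, is_valid, matched_branches) are made up for this illustration. It shows that both spellings validate, but through different branches of MyLang.

```python
SCHEMA = {
    "$ref": "#/definitions/MyLang",
    "definitions": {
        "MyLang": {"oneOf": [{"$ref": "#/definitions/And"},
                             {"$ref": "#/definitions/Any"}]},
        "And": {
            "additionalProperties": False,
            "required": ["and"],
            "properties": {"and": {
                "type": "array",
                "items": {"oneOf": [{"type": "string"}, {"type": "boolean"}]},
            }},
        },
        "Any": {"oneOf": [
            {"type": "string"},
            {"type": "object", "not": {"required": ["and"]}},
        ]},
    },
}

# Only the types that appear in the schema above.
TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def resolve(ref, root):
    """Follow a local "#/a/b" reference (no escaping, no remote refs)."""
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def is_valid(instance, schema, root):
    """Toy evaluation of the handful of keywords used in SCHEMA."""
    if "$ref" in schema:  # siblings of $ref are ignored in this sketch
        return is_valid(instance, resolve(schema["$ref"], root), root)
    if "type" in schema and not isinstance(instance, TYPES[schema["type"]]):
        return False
    if "oneOf" in schema:
        if sum(is_valid(instance, s, root) for s in schema["oneOf"]) != 1:
            return False
    if "not" in schema and is_valid(instance, schema["not"], root):
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        for key, value in instance.items():
            if key in props:
                if not is_valid(value, props[key], root):
                    return False
            elif schema.get("additionalProperties", True) is False:
                return False
    if isinstance(instance, list) and "items" in schema:
        if not all(is_valid(v, schema["items"], root) for v in instance):
            return False
    return True


def matched_branches(instance, root=SCHEMA):
    """Return the $ref of every MyLang branch that accepts the instance."""
    branches = resolve(root["$ref"], root)["oneOf"]
    return [b["$ref"] for b in branches if is_valid(instance, b, root)]


typo = {" and": [True, False]}      # note the leading space
intended = {"and": [True, False]}

print(is_valid(typo, SCHEMA, SCHEMA), matched_branches(typo))
print(is_valid(intended, SCHEMA, SCHEMA), matched_branches(intended))
```

Both inputs are accepted; the only difference is which branch matched (Any for the typo, And for the intended input), and that is exactly the information a plain valid/invalid result hides.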

If I could obtain the validation result in the form of an AST, I could more easily debug and determine whether the "match" was a false positive. The AST would provide a clearer view of the validation path, making it easier to spot mismatches between the expected and actual input structure.

For an input like

[
  {
    "and": [
      true,
      { "not": [true] }
    ]
  },
  {
    "not": [true]
  }
]

The JSON representation of the AST would look like

{
  "type": "Root",
  "children": [
    {
      "type": "AndExpression",
      "children": [
        {
          "type": "BooleanLiteral",
          "value": true
        },
        {
          "type": "NotExpression",
          "children": [
            {
              "type": "BooleanLiteral",
              "value": true
            }
          ]
        }
      ]
    },
    {
      "type": "NotExpression",
      "children": [
        {
          "type": "BooleanLiteral",
          "value": true
        }
      ]
    }
  ]
}

or

Root
│
├── AndExpression
│   ├── BooleanLiteral: true
│   └── NotExpression
│       └── BooleanLiteral: true
│
└── NotExpression
    └── BooleanLiteral: true

The Root node represents the entire input structure.
The AndExpression node represents the "and" operation at the first level of the array.
Each BooleanLiteral node corresponds to a true value in the input.
The NotExpression nodes represent the "not" operations.

This AST clearly shows the structure and relationships within the input, making it easier to see whether the input conforms to the intended logic. For instance, if a validation issue arises, you could trace through this tree to see where the input deviates from the expected pattern.
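As a rough sketch of how such a tree could be produced, the following plain-Python snippet builds the AST directly from the input, independently of any validator. The node names follow the example above; the "Unrecognized" node type and the function names (build_node, build_ast, render) are assumptions added for this illustration.

```python
import json

# Mapping from operator key to AST node type, following the example above.
NODE_TYPES = {"and": "AndExpression", "not": "NotExpression"}


def build_node(value):
    """Turn one JSON value into an AST node."""
    if isinstance(value, bool):
        return {"type": "BooleanLiteral", "value": value}
    if isinstance(value, dict) and len(value) == 1:
        key, operands = next(iter(value.items()))
        if key in NODE_TYPES and isinstance(operands, list):
            return {"type": NODE_TYPES[key],
                    "children": [build_node(v) for v in operands]}
        # Hypothetical fallback: keep the odd key so it shows up in the tree.
        return {"type": "Unrecognized", "key": key}
    return {"type": "Unrecognized", "value": value}


def build_ast(instance):
    """Wrap the top-level array in a Root node."""
    return {"type": "Root", "children": [build_node(v) for v in instance]}


def render(node, prefix="", is_last=True, is_root=True):
    """Return the tree as a list of text lines using box-drawing characters."""
    label = node["type"]
    if "value" in node:
        label += f": {json.dumps(node['value'])}"
    if "key" in node:
        label += f": {node['key']!r}"
    lines = [label if is_root else prefix + ("└── " if is_last else "├── ") + label]
    child_prefix = "" if is_root else prefix + ("    " if is_last else "│   ")
    children = node.get("children", [])
    for i, child in enumerate(children):
        lines += render(child, child_prefix, i == len(children) - 1, False)
    return lines


instance = [
    {"and": [True, {"not": [True]}]},
    {"not": [True]},
]
ast = build_ast(instance)
print("\n".join(render(ast)))
```

Because anything that is not a known operator becomes an Unrecognized node, a misspelled key would stand out in the printed tree instead of disappearing into a permissive schema branch.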

If I could obtain the validation result in the form of an AST like this, it would be much easier to debug and determine whether a validation outcome was truly intended or a false positive.

For instance, an input such as

[
  {
    "and": [
      true,
      { "not": [true] }
    ]
  },
  {
    "  not": [true] # note the mistaken leading spaces: still valid (a false positive), but hard to spot
  }
]

will produce a "false positive" validation.
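Until something like an AST output exists, one cheap workaround is a lint pass over the instance that looks for keys that are almost, but not exactly, a known operator. The sketch below is plain Python and assumes the operator set is just "and" and "not"; the function name suspicious_keys is made up, and the reported locations are JSON-Pointer-like paths without escaping.

```python
# Assumption for this sketch: the language only has these two operators.
KNOWN_OPERATORS = {"and", "not"}


def suspicious_keys(value, path=""):
    """Return (location, intended_key) pairs for near-miss operator keys.

    A key is a near miss when it is not a known operator as written but
    becomes one after stripping whitespace and lower-casing.
    """
    findings = []
    if isinstance(value, dict):
        for key, child in value.items():
            location = f"{path}/{key}"
            normalized = key.strip().lower()
            if key not in KNOWN_OPERATORS and normalized in KNOWN_OPERATORS:
                findings.append((location, normalized))
            findings += suspicious_keys(child, location)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            findings += suspicious_keys(child, f"{path}/{index}")
    return findings


instance = [
    {"and": [True, {"not": [True]}]},
    {"  not": [True]},  # the mistaken leading spaces
]

for location, intended in suspicious_keys(instance):
    print(f"{location!r}: did you mean {intended!r}?")
```

This does not replace validation; it only flags the specific class of mistake shown in the example, where the schema's catch-all branch accepts the misspelled key.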

I hope this explanation is clear, and I'd love to hear your thoughts or see if anyone is interested in exploring this concept further.

gregsdennis · 2024-08-16T19:25:29Z

gregsdennis
Aug 16, 2024
Maintainer

build an AST like construct from the matched input instance.

I'm not sure I understand what you're suggesting. Can you edit your post to include an example of what you're looking for?


zokrezyl Aug 18, 2024
Author

Indeed, there was a mistake in the schema. I removed the "type"; it should have been oneOf.

While rewriting my post I had removed the example that causes the "false positive". I have now added it to the post as well:

[
  {
    "and": [
      true,
      { "not": [true] }
    ]
  },
  {
    "  not": [true] # note the mistaken leading spaces: still valid (a false positive), but hard to spot
  }
]

The output you pasted is also produced by the Python-based check-jsonschema. The problem is that it produces a lot of noise and still does not point out the false positive. A simple AST-like representation would help to see whether the desired output is generated. Thank you anyway for your effort.

gregsdennis Aug 18, 2024
Maintainer

I think I found another issue with the schema. The MyLang definition is checking for either And or Any, but the data looks like it should be checking for an array of And or Any. Here's the change:

"MyLang": {
  "type": "array", // new
  "items": {       // new
    "oneOf": [
      { "$ref": "#/definitions/And" },
      { "$ref": "#/definitions/Any" }
    ]
  }
}

The problem is that it produces a lot of noise

Have a read through this blog post which explains a bit about output and why the noise is there, especially when you have branching schemas (ones that use keywords like anyOf or if/then/else).

It also looks like my implementation (which powers https://json-everything.net) is having an issue outputting the proper instance location. I'll work on that.

It could be valuable to have the output include instance locations which aren't evaluated. To do that, I suggest adding an unevaluatedProperties: true to the schema in the MyLang definition:

"MyLang": {
  "type": "array",
  "items": {
    "oneOf": [
      { "$ref": "#/definitions/And" },
      { "$ref": "#/definitions/Any" }
    ],
    "unevaluatedProperties": true
  } 
}

To illustrate this, here's a simple example:

// schema
{
  "properties": {
    "and": true
  },
  "unevaluatedProperties": true
}

// instance
{
  "and": 42,
  "  not": []
}

// output
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything.net/d2c87ccdb7#",
  "instanceLocation": "",
  "annotations": {
    "properties": [
      "and"
    ],
    "unevaluatedProperties": [
      "  not"                      // here's an annotation
    ]
  },
  "details": [
    {
      "valid": true,
      "evaluationPath": "/properties/and",
      "schemaLocation": "https://json-everything.net/d2c87ccdb7#/properties/and",
      "instanceLocation": "/and"
    },
    {
      "valid": true,
      "evaluationPath": "/unevaluatedProperties",
      "schemaLocation": "https://json-everything.net/d2c87ccdb7#/unevaluatedProperties",
      "instanceLocation": "/  not"  // here's an output node
    }
  ]
}

You get an annotation and an output node showing that the "  not" property was validated by unevaluatedProperties, which means that it wasn't picked up by /properties/not.
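To cut through the rest of the output, a small post-processing step can pull out just the locations that were handled by unevaluatedProperties. The sketch below is plain Python operating on an output document shaped like the one above (with "evaluationPath", "instanceLocation", and nested "details"); the function name unevaluated_locations is made up for this illustration, and other implementations or output formats may need different field names.

```python
# The output document from the example above, as a Python dict.
output = {
    "valid": True,
    "evaluationPath": "",
    "schemaLocation": "https://json-everything.net/d2c87ccdb7#",
    "instanceLocation": "",
    "annotations": {
        "properties": ["and"],
        "unevaluatedProperties": ["  not"],
    },
    "details": [
        {
            "valid": True,
            "evaluationPath": "/properties/and",
            "schemaLocation": "https://json-everything.net/d2c87ccdb7#/properties/and",
            "instanceLocation": "/and",
        },
        {
            "valid": True,
            "evaluationPath": "/unevaluatedProperties",
            "schemaLocation": "https://json-everything.net/d2c87ccdb7#/unevaluatedProperties",
            "instanceLocation": "/  not",
        },
    ],
}


def unevaluated_locations(node):
    """Collect instance locations whose output node came from unevaluatedProperties."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.get("evaluationPath", "").endswith("/unevaluatedProperties"):
            found.append(current["instanceLocation"])
        # Nested results live under "details" in this output shape.
        stack.extend(current.get("details", []))
    return sorted(found)


print(unevaluated_locations(output))
```

Anything reported here was accepted only by the catch-all, so in a schema where every legitimate key is declared under properties, a non-empty result is a hint that a key was misspelled.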

zokrezyl Aug 18, 2024
Author

You are right about Any. But as the name says, the intention was for it to also cover an array of anything except the other types. I did not want to make the schema examples too complicated. The unevaluatedProperties suggestion is a great hint.

Indeed, most validator implementations do not give hints about the position in the file. I am working on a generic wrapper for Python jsonschema that will load the file itself and give location hints about the errors. Maybe this should also be part of future standards, as a requirement or at least as a recommendation for implementers.

I completely understand the role of the "noise" (please do not take it as an offence); I was just thinking about a less verbose output in the case of successful validation. If the validation fails, the "noise" is obviously extremely useful.

gregsdennis Aug 19, 2024
Maintainer

The unevaluatedProperties suggestion is a great hint.

Glad that worked for you.

I completely understand the role of the "noise" (please do not take it as an offence); I was just thinking about a less verbose output in the case of successful validation. If the validation fails, the "noise" is obviously extremely useful.

In most cases, it's impossible to resolve that verbosity. Isolating which errors are the "most pertinent" is generally an impossible task because the evaluator can only know the schema; it can't know the intent of the schema.

If you have a schema that uses an anyOf to say a value can be either a string or a number, and the instance is an array, which subschema produced the "best" error? The evaluator can't know that, so it gives you both.
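A tiny plain-Python illustration of that point, assuming an anyOf made only of "type" checks: the helper any_of_type_errors is hypothetical and stands in for what a real evaluator does when every branch fails.

```python
# Type predicates for the two branches in the example (bool is excluded
# from "number" because JSON booleans are not numbers).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def any_of_type_errors(instance, type_names):
    """Return one error per failing branch, or [] if any branch matches."""
    errors = [
        f"anyOf/{index}: {instance!r} is not of type {name!r}"
        for index, name in enumerate(type_names)
        if not TYPE_CHECKS[name](instance)
    ]
    # anyOf only fails when every branch failed; then all errors are reported.
    return errors if len(errors) == len(type_names) else []


for message in any_of_type_errors([1, 2], ["string", "number"]):
    print(message)
```

For the array instance both branches fail, and nothing in the schema says whether the author "meant" a string or a number, so both messages are equally relevant from the evaluator's point of view.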

zokrezyl Aug 19, 2024
Author

I think the best practice is to avoid anyOf where possible and use only oneOf. I learned that the hard way.

jdesrosiers · 2024-08-16T20:13:59Z

jdesrosiers
Aug 16, 2024
Maintainer

I'm not sure this is what you're looking for exactly, and it's not Python, so you probably can't use it anyway, but ...

My implementation uses a JSON AST that includes errors and annotations that apply to each node based on validation.


zokrezyl Aug 17, 2024
Author

This sounds promising, but I was rather looking for a simple tool/protocol. I have updated my initial post.
