Support document-oriented XML #7

Open
nichtich opened this issue Jul 14, 2019 · 5 comments
Labels
format:xml kind:feature Request for new functionality

Comments

@nichtich
Contributor

The XML encoding in JSON used by oq (described here, it should have a name) does not preserve element order in mixed-content XML elements, so oq only supports a subset of XML. This makes sense for most applications, where XML is used much like JSON and YAML. Full support of XML (minus less-used features such as DTDs that happen to be part of the XML specification) should nevertheless be possible. The XML encoding in JSON to do so is MicroXML. How about this:

  • add input and output format mxml (aka microxml)
  • add jq methods to convert between "xml" (the current encoding) and microxml (this can also be implemented in jq language)
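
The second bullet could look roughly like the following sketch, written in Python rather than jq for readability. It converts a MicroXML JSON model into the current "xml" encoding. The function name `micro_to_pragmatic` is hypothetical, and the handling of repeated names, `#text`, and mixed content is an assumption about the current encoding, not oq's actual behavior.

```python
def micro_to_pragmatic(node):
    """Convert a MicroXML node [name, attrs, children] to the '@attr' / '#text' style.

    Assumption: character data next to child elements is dropped, and
    sibling order across different element names is lost.
    """
    name, attrs, children = node
    obj = {"@" + key: value for key, value in attrs.items()}
    elements = [c for c in children if not isinstance(c, str)]
    text = "".join(c for c in children if isinstance(c, str))

    for child in elements:
        key, value = child[0], micro_to_pragmatic(child)
        if key in obj:
            # Repeated element names are collected into a list.
            if not isinstance(obj[key], list):
                obj[key] = [obj[key]]
            obj[key].append(value)
        else:
            obj[key] = value

    if not elements:
        if not obj:
            return text          # plain leaf element becomes a string
        if text:
            obj["#text"] = text  # leaf with attributes keeps its text
    return obj


person = ["person", {}, [
    ["age", {"scale": "months"}, ["289"]],
    ["name", {}, ["Jim"]],
    ["favorite_numbers", {}, [
        ["number", {}, ["1"]],
        ["number", {}, ["2"]],
        ["number", {}, ["3"]],
    ]],
]]
print(micro_to_pragmatic(person))
```

The reverse direction is not unique, because the "xml" encoding no longer records the order of differently named siblings.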

An example from your blog post in MicroXML:

[[
   "people", {}, [
    [ "person", {}, [
      [ "age", { "scale" : "months" }, [ "289" ] ],
      [ "name", {}, [ "Jim" ] ],
      [ "favorite_numbers", {}, [
          [ "number", {}, [ "1" ] ],
          [ "number", {}, [ "2" ] ],
          [ "number", {}, [ "3" ] ]
         ]
      ]
     ]
    ],
    [ "person", {}, [
      [ "age", { "scale" : "years" }, [ "51" ] ],
      [ "name", {}, [ "Bob" ] ],
      [ "favorite_numbers", {}, [
          [ "number", {}, [ "4" ] ],
          [ "number", {}, [ "5" ] ],
          [ "number", {}, [ "6" ] ]
         ]
      ]
     ]
    ],
    [ "person", {}, [
      [ "age", { "scale" : "days" }, [ "31025" ] ],
      [ "name", {}, [ "Susan" ] ],
      [ "favorite_numbers", {}, [
          [ "number", {}, [ "7" ] ],
          [ "number", {}, [ "8" ] ],
          [ "number", {}, [ "9" ] ]
         ]
      ]
     ]
    ]
   ]
]]
@Blacksmoke16
Owner

@nichtich I'm a bit confused. In your snippet there, is that actually the syntax of MicroXML, or is that just the representation of it in JSON?

The link you provided says:

This is an example of a small but complete MicroXML document exhibiting all syntactic features:

<comment lang="en" date="2012-09-11">
I <em>love</em> &#xB5;<!-- MICRO SIGN -->XML!<br/>
It's so clean &amp; simple.</comment>

The abstract data model of this document in the JSON syntax described in Section 2.1 is:

[ "comment",
  {  "date": "2012-09-11", "lang": "en" },
  [ "\nI ",
    ["em", {}, ["love"]],
    " \u03BCXML!",
    ["br", {}, []],
    "\nIt's so clean & simple."
  ]
]

This led me to think that MicroXML is simply a structure similar to XML but with fewer features, which also provides a different way to serialize it into JSON?

@nichtich
Contributor Author

MicroXML defines a simplified subset of XML and an encoding/serialization/representation in JSON. The current implementation of oq supports Goessner's "pragmatic XML", another JSON encoding of a subset of MicroXML (every XML document encodable in pragmatic XML is also encodable in MicroXML).
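
To make the JSON representation concrete, here is a minimal sketch in Python that maps a parsed XML element to the `[name, attributes, children]` model, keeping text and child elements in document order. The helper name `to_microxml` is hypothetical. It relies on the standard `xml.etree.ElementTree` parser, which drops comments by default, and it makes no attempt to reject XML features outside the MicroXML subset.

```python
import json
import xml.etree.ElementTree as ET


def to_microxml(elem):
    """Return [name, attributes, children] for an ElementTree element."""
    children = []
    if elem.text:
        children.append(elem.text)   # text before the first child element
    for child in elem:
        children.append(to_microxml(child))
        if child.tail:
            children.append(child.tail)  # text following this child
    return [elem.tag, dict(elem.attrib), children]


doc = ET.fromstring("<p>I <em>love</em> it!<br/>So clean.</p>")
model = to_microxml(doc)
print(json.dumps(model))
```

Because strings and child arrays share one ordered list, mixed content survives the conversion.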

There is a third useful subset of XML with a JSON encoding that does not differentiate between attribute names and element names, so there is no @ in front of names (I'd call this "Simple XML", after the Perl module XML::Simple). The JSON encoding of Simple XML is compatible with the current Pragmatic XML: just don't use XML attributes (or map attributes to elements), so there are no @name and #text fields.

The sets of XML documents that can be expressed in each XML model form a chain of proper subsets, each contained in the next:

  • Simple XML (is already supported by oq implicitly)
  • Pragmatic XML (currently supported by oq)
  • MicroXML (this issue, should also be supported by oq)
  • Full XML (no need to support this)

Full support is more relevant for reading XML documents because it needs to be decided which JSON structure the XML structure is mapped to.

@Blacksmoke16
Owner

Blacksmoke16 commented Jul 14, 2019

Ah ok I think I get it now.

So this would involve adding an extra format, mxml as you said: running oq -i mxml . would output the serialization style used by MicroXML, while running oq -i xml . would output JSON as it currently does.

Then, on the other side of things, -o mxml and -o xml would output pretty much the same thing, minus the features that MicroXML does not support.

The main challenge here would be the conversion of XML style format into JSON to pass to jq. Supporting XML as an input format is next on my list, so it should be possible to support this format as well.

@nichtich
Contributor Author

nichtich commented Jul 15, 2019

Given this input document

<root><x a="1"><a>2</a></x><y b="3">4</y></root>

I'd expect oq -i xml . to emit

{
  "x": { "@a": "1", "a": "2" },
  "y": { "@b": "3", "#text": "4" }
}

and oq -i sxml . (combining elements and attributes) to emit

{
  "x": { "a": [ "1", "2" ] },
  "y": { "b": "3", "#text": "4" }
}

and oq -i mxml . to emit

[ "root", {}, [ 
   [ "x", { "a": "1" }, [ [ "a", {}, ["2" ] ] ] ],
   [ "y", { "b": "3" }, [ "4" ] ] 
] ]

I'd silently ignore all character data in mixed content elements for input format xml and sxml, so <root>x<y>z</y></root> is read as { "y": "z" } instead of { "y": { "#text": "x", "y": "z" } }.
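
The xml and sxml expectations above can be sketched in a few lines of Python. This illustrates the proposed mapping only and is not oq's implementation. The names `encode`, `add`, and `attr_prefix` are hypothetical, and the sketch follows the examples above in emitting only the content of the root element.

```python
import xml.etree.ElementTree as ET


def add(obj, key, value):
    """Store value under key, collecting repeated keys into a list."""
    if key in obj:
        if not isinstance(obj[key], list):
            obj[key] = [obj[key]]
        obj[key].append(value)
    else:
        obj[key] = value


def encode(elem, attr_prefix="@"):
    """Encode the content of elem; attr_prefix='@' gives xml, '' gives sxml."""
    obj = {}
    for name, value in elem.attrib.items():
        add(obj, attr_prefix + name, value)
    for child in elem:
        add(obj, child.tag, encode(child, attr_prefix))
    if len(elem) == 0:
        # Leaf element: keep its text.
        text = elem.text or ""
        if not obj:
            return text
        if text:
            obj["#text"] = text
    # Elements with children: character data is silently ignored.
    return obj


root = ET.fromstring('<root><x a="1"><a>2</a></x><y b="3">4</y></root>')
print(encode(root))                  # xml style
print(encode(root, attr_prefix=""))  # sxml style
print(encode(ET.fromstring("<root>x<y>z</y></root>")))
```

With an empty prefix, the attribute a and the element a collide and end up in one list, which is the sxml behavior described above.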

@Blacksmoke16
Copy link
Owner

Blacksmoke16 commented Dec 28, 2019

@nichtich Just to confirm this before I get too far in: MicroXML is simply an alternative way to represent XML in JSON, correct? This seems to be the case based on your examples in #7 (comment). However, if the user ran oq -i mxml -o mxml, the output should be the same as for oq -i xml -o xml, since the XML representation is the same for both formats, correct?

EDIT: Currently my implementation works like (with the input from your example)

oq -i mxml .
[
  "root",
  {},
  [
    [
      "x",
      {
        "a": "1"
      },
      [
        [
          "a",
          {},
          [
            "2"
          ]
        ]
      ]
    ],
    [
      "y",
      {
        "b": "3"
      },
      [
        "4"
      ]
    ]
  ]
]
oq -i mxml -o mxml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <x a="1">
    <a>2</a>
  </x>
  <y b="3">4</y>
</root>

I'm assuming it wouldn't be expected that you could read in an XML string as xml and output it as mxml, since the internal JSON representations of the two are different.
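
For reference, the output direction (MicroXML JSON model back to XML) can be sketched as follows in Python. This is an illustration under assumptions, not the code in oq: `from_microxml` is a hypothetical name, and the serialization comes from the standard `xml.etree.ElementTree` module, so it is not pretty-printed and has no XML declaration.

```python
import xml.etree.ElementTree as ET


def from_microxml(node):
    """Build an ElementTree element from [name, attributes, children]."""
    name, attrs, children = node
    elem = ET.Element(name, attrs)
    last = None  # most recently appended child element
    for child in children:
        if isinstance(child, str):
            if last is None:
                elem.text = (elem.text or "") + child  # text before any child
            else:
                last.tail = (last.tail or "") + child  # text after a child
        else:
            last = from_microxml(child)
            elem.append(last)
    return elem


model = ["root", {}, [
    ["x", {"a": "1"}, [["a", {}, ["2"]]]],
    ["y", {"b": "3"}, ["4"]],
]]
xml = ET.tostring(from_microxml(model), encoding="unicode")
print(xml)
```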

@Blacksmoke16 Blacksmoke16 removed this from the 1.1.0 milestone May 1, 2020