option to join multiple CDATA sections into one when parsing #599

schulzer · 2023-11-30T14:15:20Z

This is similar to 'Comments split Text into multiple Nodes #546'. Consider a complicated document with lots of escaping:

<node> some&amp;data ]]> more&amp;data </node>

This could be rewritten in a somewhat simplified, but erroneous, form:

<node> <![CDATA[some&data ]]> more&data]]> </node>

and then corrected to:

<node> <![CDATA[some&data ]]]><![CDATA[]> more&data]]> </node>
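
The correction above works because the terminator "]]>" never appears whole inside a single CDATA section. A minimal sketch of that splitting step follows, assuming a hypothetical helper named wrap_in_cdata (it is not part of pugixml). It splits after "]]" rather than after the first "]" as in the example above; both positions keep the terminator out of the section content.

```cpp
#include <string>

// Hypothetical helper (not a pugixml API): wraps arbitrary text in CDATA and
// splits every "]]>" across two sections so the terminator never occurs
// inside section content.
std::string wrap_in_cdata(const std::string& text)
{
    const std::string terminator = "]]>";
    // Emit "]]" as content, close the section, then reopen with ">" as content.
    const std::string split = "]]]]><![CDATA[>";

    std::string out = "<![CDATA[";
    std::string::size_type pos = 0;
    for (;;)
    {
        const std::string::size_type hit = text.find(terminator, pos);
        if (hit == std::string::npos)
        {
            // No further terminator: copy the remaining text unchanged.
            out.append(text, pos, std::string::npos);
            break;
        }
        out.append(text, pos, hit - pos);
        out += split;
        pos = hit + terminator.size();
    }
    out += "]]>";
    return out;
}
```

Calling wrap_in_cdata("some&data ]]> more&data") produces two adjacent CDATA sections, which is exactly the situation where a parser ends up with more than one child node for a single logical value.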

In either case (the original and the final form), the user would like to call doc.child("node").child_value() and access the stored data without complicated logic that iterates over all children and concatenates their values. This matters in particular because the node could be called 'url' and could arrive in several different encodings, each of which currently requires different code to retrieve the full URL.

As a user, I would like an option to retrieve all of these flavours cleanly de-escaped through a single access point, because from a logical, high-level standpoint only a single value (the URL) is encoded. From a low-level standpoint I do see why the parts are split as they are and stored in multiple child nodes.

If this is an acceptable change, I could offer to submit a pull request.

zeux · 2023-12-15T19:36:16Z

What if there's a mix of PCDATA and CDATA content?

schulzer · 2023-12-18T14:59:52Z

Just for clarification, you mean something like

<node> <![CDATA[some data]]> more data </node>

and not the case

<node> <![CDATA[some data]]> <child ... /> <![CDATA[more data]]> </node>

because the latter case would be the same as in #546 (where I assume the nodes are not merged because they cannot be, although anything before and after the child would be).

The former case is a bit tricky, yes. Here I would consider a CDATA section to be just another kind of escaping, comparable to & encoding, and therefore not actually important to the user. As usual, it would follow the general whitespace rules (and their modification through parse options), e.g.

<node><![CDATA[some data]]></node> 
    yields "some data"
<node> <![CDATA[some data]]> </node> 
    yields " some data " (or "some data", as before, when 'parse_trim_pcdata' is set)
<node><![CDATA[some data]]>more data</node> 
    yields "some datamore data"
<node> <![CDATA[some data]]> more data </node> 
    yields " some data more data "
<node> <![CDATA[some data ]]>more data </node> 
    yields " some data more data "
<node> <![CDATA[some data ]]> more data </node>
    yields " some data  more data " with two spaces in between
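
The rule behind these examples can be modelled as plain concatenation of adjacent text segments, whether they come from PCDATA or CDATA. The following is only a sketch of the proposed behaviour, not pugixml's parser or API: ChildKind, Child and joined_value are hypothetical names, and stopping at the first element child is an assumption based on the #546 comparison above.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a parsed child node; illustrative types only.
enum class ChildKind { Pcdata, Cdata, Element };

struct Child
{
    ChildKind kind;
    std::string value; // text for Pcdata/Cdata, unused for Element
};

// Concatenates the leading run of text children (PCDATA and CDATA alike)
// and stops at the first element child, which ends the mergeable run.
std::string joined_value(const std::vector<Child>& children)
{
    std::string result;
    for (const Child& c : children)
    {
        if (c.kind == ChildKind::Element)
            break;
        result += c.value;
    }
    return result;
}
```

With the children " " (PCDATA), "some data " (CDATA) and " more data " (PCDATA), the function returns the text with two spaces in the middle, as in the last example. Whitespace trimming is left out of the sketch on purpose; under the proposal it would be applied by the existing parse options.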

schulzer added the enhancement label Nov 30, 2023
