Softer HTML table processing? #11464
-
There are a lot of corner cases in that code base, and I'm not particularly keen to change it anytime soon. With that said, if you're willing to contribute a PR that doesn't break backwards compatibility and includes tests and documentation, we'd be happy to take it.
-
Alright, I'll see what I can do. Not sure about the tests though: which cases should be covered to check backwards compatibility?
-
Description
I recently dug a bit deeper into Quarto's HTML table processing to understand what exactly is supported and what isn't. It seems that (part of) it is implemented in `parsehtml.lua`. The function `handle_raw_html_as_table` contains a comment which isn't wrong but also not completely correct:

It is true that Pandoc's table model does not distinguish between `Cell`s which are `td` and which are `th`. However, Pandoc doesn't completely ignore the difference in the input either.

Within a `tbody`,

* if one or more of the uppermost rows contain only `th` elements, they are assigned to the `TableBody` property `head` ("intermediate head") instead of remaining in the property `body` ("table body rows") with the other rows, and
* if one or more of the leftmost columns contain only `th` elements, this is translated into the `TableBody` property `row_head_columns` ("number of columns taken up by the row head of each row").

Basically, Pandoc imposes a stricter semantic model, where
`th` elements cannot be used arbitrarily within a `tbody`, but can only be used to indicate "intermediate head" rows as well as "row head" or stub columns. If a `Table` element is written to HTML, cells in `TableHead`, `TableBody.head` and within the `TableBody.row_head_columns` leftmost columns are written as `th` and all others as `td`. That means that if an HTML table conforms to Pandoc's stricter semantic model, the distinction between `td` and `th` is actually preserved in HTML output.

To my knowledge, this behavior is not documented anywhere; I was pointed to `row_head_columns` and the above is the result of my experiments. Maybe @jgm can confirm?

Why is this important for Quarto's HTML table processing? I believe the current processing is too radical.
It is very useful to be able to include tables in HTML format in Markdown documents and have them parsed by Pandoc, such that they are output to all formats. However, the replacement of all `th` elements by `td data-quarto-table-cell-role="th"` prevents Pandoc from detecting the described semantic structure, which at least potentially degrades the structure of tables in output formats other than HTML.

It may make sense to want to preserve the distinction between `td` and `th` more generally, but wouldn't it be enough to replace `th` elements by `th data-quarto-table-cell-role="th"`?

Moreover, the linked-to `gt` issue on `th` elements for accessibility probably could have been solved without this special processing, because Pandoc does preserve `th` elements in stub columns.

Finally, the "HTML postprocessor" mentioned in the comment doesn't seem to work as intended; at least I get `td data-quarto-table-cell-role="th"` in Quarto's HTML output.

My feature request is to modify Quarto's HTML table processing such that it does preserve Pandoc's semantic table structure.
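
To make the two `tbody` rules described above concrete, here is a minimal Python sketch. It is not Pandoc's reader code and not Quarto's Lua implementation; the function name `classify_tbody` and the row representation (a list of tag names per row) are hypothetical, made up for illustration. It also ignores `colspan`/`rowspan` and assumes that stub columns are counted only over the rows that remain after the intermediate head rows have been split off, which matches my reading of the description but is not confirmed.

```python
from typing import List, Tuple

# Hypothetical representation: each row is a list of cell tag names,
# either "th" or "td". Spans (colspan/rowspan) are ignored in this sketch.
Row = List[str]


def classify_tbody(rows: List[Row]) -> Tuple[int, int]:
    """Return (head_rows, row_head_columns) for a single tbody.

    head_rows:        number of uppermost rows consisting only of th cells
                      (the "intermediate head").
    row_head_columns: number of leftmost columns that are th in every
                      remaining body row (the stub columns).
    """
    # Rule 1: count the uppermost rows that contain only th elements.
    head_rows = 0
    for row in rows:
        if row and all(cell == "th" for cell in row):
            head_rows += 1
        else:
            break

    body = rows[head_rows:]
    if not body:
        # Assumption: with no body rows left there are no stub columns.
        return head_rows, 0

    # Rule 2: count the leftmost columns that contain only th elements.
    row_head_columns = 0
    width = min(len(row) for row in body)
    for col in range(width):
        if all(row[col] == "th" for row in body):
            row_head_columns += 1
        else:
            break

    return head_rows, row_head_columns


if __name__ == "__main__":
    conforming = [
        ["th", "th", "th"],  # intermediate head row
        ["th", "td", "td"],  # stub column + data
        ["th", "td", "td"],
    ]
    print(classify_tbody(conforming))

    arbitrary = [
        ["td", "th"],  # a th that is neither in a head row nor in a stub column
        ["th", "td"],
    ]
    print(classify_tbody(arbitrary))
```

In the second table no `th` falls into an intermediate head row or a stub column, so under the model sketched here those cells would have no place in `TableBody.head` or `row_head_columns`. Once every `th` has been rewritten to `td` before Pandoc sees the table, the first table ends up in the same situation.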