Softer HTML table processing? #11464

allefeld · 2024-11-16T19:43:48Z

Description

I recently dug a bit deeper into Quarto's HTML table processing to understand what exactly is supported and what isn't. It seems that (part of) it is implemented in parsehtml.lua. The function handle_raw_html_as_table contains a comment which isn't wrong, but isn't completely correct either:

        -- Pandoc's HTML-table -> AST-table processing does not faithfully respect
        -- `th` vs `td` elements. This causes some complex tables to be parsed incorrectly,
        -- and changes which elements are `th` and which are `td`.

        -- For quarto, this change is not acceptable because `td` and `th` have
        -- accessibility impacts (see https://github.com/rstudio/gt/issues/678 for a concrete
        -- request from a screen-reader user).
        --
        -- To preserve td and th, we replace `th` elements in the input with 
        -- `td data-quarto-table-cell-role="th"`. 
        -- 
        -- Then, in our HTML postprocessor,
        -- we replace th elements with td (since pandoc chooses to set some of its table
        -- elements as th, even if the original table requested not to), and replace those 
        -- annotated td elements with th elements.
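
For readers who want to see the round trip this comment describes, here is a minimal sketch in Python (not Quarto's Lua code). The function names `preprocess` and `postprocess` are hypothetical, and the regexes assume simple, non-nested table cells; it only illustrates the intended mechanism, not what Quarto actually does.

```python
import re

ROLE_ATTR = 'data-quarto-table-cell-role="th"'

def preprocess(html: str) -> str:
    """Before Pandoc sees the table: th -> td with a marker attribute."""
    # The lookahead keeps <thead> untouched.
    html = re.sub(r"<th(?=[\s>])", f"<td {ROLE_ATTR}", html)
    return html.replace("</th>", "</td>")

def postprocess(html: str) -> str:
    """After Pandoc wrote HTML: undo Pandoc's own th choices, restore marked cells."""
    # Step 1: every th that Pandoc chose becomes a td.
    html = re.sub(r"<th(?=[\s>])", "<td", html)
    html = html.replace("</th>", "</td>")
    # Step 2: every td carrying the marker becomes a th again (marker dropped).
    marked = re.compile(
        r'<td(?P<pre>[^>]*?)\s' + re.escape(ROLE_ATTR) + r'(?P<post>[^>]*)>(?P<body>.*?)</td>',
        re.DOTALL,
    )
    return marked.sub(r"<th\g<pre>\g<post>>\g<body></th>", html)

table = "<tbody><tr><td>1</td><th>x</th></tr></tbody>"
print(postprocess(preprocess(table)) == table)  # True
```

In this sketch the th in the second column survives only because of the marker attribute; between the two steps, Pandoc never sees a th at all.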

It is true that Pandoc's table model does not distinguish between Cells that are td and Cells that are th. However, Pandoc doesn't completely ignore the difference in the input either.

Within a tbody,

- if one or more of the uppermost rows contain only th elements, they are assigned to the TableBody property head ("intermediate head") instead of remaining in the property body ("table body rows") with the other rows, and
- if one or more of the leftmost columns contain only th elements, this is translated into the TableBody property row_head_columns ("number of columns taken up by the row head of each row").

Basically, Pandoc imposes a stricter semantic model, where th elements cannot be used arbitrarily within a tbody, but can only be used to indicate "intermediate head" rows as well as "row head" or stub columns. If a Table element is written to HTML, cells in TableHead, TableBody.head and within the TableBody.row_head_columns leftmost columns are written as th and all others as td. That means that if an HTML table conforms to Pandoc's stricter semantic model, the distinction between td and th is actually preserved in HTML output.
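
The rule described above can be modelled in a few lines of Python. This is only a sketch of the observed behaviour, not Pandoc's implementation: it treats one tbody as a plain grid of tag names, ignores colspan and rowspan, and assumes that the row head columns are counted over the rows left after the intermediate head. The names `infer_structure`, `write_tags` and `round_trip` are made up for illustration.

```python
from typing import List, Tuple

Grid = List[List[str]]  # one tbody; each cell is the tag name "th" or "td"

def infer_structure(rows: Grid) -> Tuple[int, int]:
    """Return (intermediate head rows, row_head_columns) for one tbody."""
    head_rows = 0
    for row in rows:
        if all(tag == "th" for tag in row):
            head_rows += 1  # uppermost rows made only of th
        else:
            break
    body = rows[head_rows:]
    row_head_columns = 0
    if body:
        width = min(len(row) for row in body)
        for col in range(width):
            if all(row[col] == "th" for row in body):
                row_head_columns += 1  # leftmost columns made only of th
            else:
                break
    return head_rows, row_head_columns

def write_tags(n_rows: int, n_cols: int, head_rows: int, row_head_columns: int) -> Grid:
    """Tags an HTML writer following the model would emit for this tbody."""
    return [
        ["th" if (i < head_rows or j < row_head_columns) else "td" for j in range(n_cols)]
        for i in range(n_rows)
    ]

def round_trip(rows: Grid) -> Grid:
    head_rows, row_head_columns = infer_structure(rows)
    n_cols = len(rows[0]) if rows else 0
    return write_tags(len(rows), n_cols, head_rows, row_head_columns)

conforming = [["th", "th", "th"], ["th", "td", "td"], ["th", "td", "td"]]
print(infer_structure(conforming))           # (1, 1)
print(round_trip(conforming) == conforming)  # True
```

A tbody with a th in an arbitrary position, for example `[["td", "td"], ["td", "th"]]`, does not survive this model: both counts are 0 and every cell comes back as td.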

To my knowledge, this behavior is not documented anywhere; I was pointed to row_head_columns and the above is the result of my experiments. Maybe @jgm can confirm?

Why is this important for Quarto's HTML table processing? Because I believe the current approach is too radical.

It is very useful to be able to include tables in HTML format in Markdown documents and have them parsed by Pandoc, such that they are output to all formats. However, the replacement of all th elements by td data-quarto-table-cell-role="th" prevents Pandoc from detecting the described semantic structure, which at least potentially degrades the structure of tables in output formats other than HTML.

It may make sense to want to preserve the distinction between td and th more generally, but wouldn't it be enough to replace th elements with th data-quarto-table-cell-role="th"?
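
To make the difference concrete, here is a small Python sketch of the two replacement strategies. The function names are hypothetical and the regex is a simplification (Quarto's real code is Lua and may work differently); the point is only what a th-aware parser gets to see afterwards.

```python
import re

ROLE_ATTR = 'data-quarto-table-cell-role="th"'
TH_OPEN = re.compile(r"<th(?=[\s>])")  # matches <th> and <th ...>, but not <thead>

def annotate_as_td(html: str) -> str:
    """Replacement as described in the parsehtml.lua comment: th becomes an annotated td."""
    return TH_OPEN.sub(f"<td {ROLE_ATTR}", html).replace("</th>", "</td>")

def annotate_keep_th(html: str) -> str:
    """Suggested alternative: keep th and only add the annotation."""
    return TH_OPEN.sub(f"<th {ROLE_ATTR}", html)

def count_th(html: str) -> int:
    """Number of th cells a parser would still find."""
    return len(TH_OPEN.findall(html))

row = "<tbody><tr><th>Group</th><td>1</td></tr></tbody>"
print(count_th(annotate_as_td(row)))    # 0
print(count_th(annotate_keep_th(row)))  # 1
```

With the second variant the stub column is still visible as th, so the row_head_columns detection described above would still have something to work with.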

Moreover, the linked-to gt issue on th elements for accessibility probably could have been solved without this special processing, because Pandoc does preserve th elements in stub columns.

Finally, the "HTML postprocessor" mentioned in the comment doesn't seem to work as intended; at least I get td data-quarto-table-cell-role="th" in Quarto's HTML output.

My feature request is to modify Quarto's HTML table processing such that it does preserve Pandoc's semantic table structure.

cscheid · 2024-11-18T15:28:37Z

Maintainer

> My feature request is to modify Quarto's HTML table processing such that it does preserve Pandoc's semantic table structure.

There are a lot of corner cases in that code base, and I'm not particularly keen to change it anytime soon. With that said, if you're willing to contribute a PR that doesn't break backwards compatibility and includes tests and documentation, we'd be happy to take it.

allefeld · 2024-11-22T17:19:23Z

Author

Alright, I'll see what I can do.

Not sure about the tests though: Which cases should be covered to check backwards compatibility?
I'm guessing tables created by gt and great_tables?

cscheid Nov 22, 2024
Maintainer

I'd start with ensuring that our test suite passes. We have regression tests for that kind of behavior, although they're likely a bit scattered throughout tests/docs/smoke-all/202{2,3,4}.
