Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Markdown.AST #1196

Closed

Conversation

eksperimental
Copy link
Contributor

  • Introduce ExDoc.Markdown.AST and t:Markdown.AST.html/0.
  • Replaces ExDoc.Formatter.HTML.ast_to_html/1 with ExDoc.Markdown.AST.to_html/2 which is a port of Earmark.Transform
  • It uses a more recent git version of Earkmark since due to a bug in the library.
  • It adds Floki to deal with HTML code in Markdown docs.

Closes #1168, #1189
Supersedes #1190

@eksperimental eksperimental force-pushed the markdown_ast branch 2 times, most recently from 414359c to 401ac0f Compare June 26, 2020 02:13
@josevalim
Copy link
Member

Thank you @eksperimental, this looks amazing but I am afraid it is nudging us in the wrong direction. Earmark is already capable of parsing HTML, so why do we need to reimplement it here and rely on Floki?

In particular, I would expect markdown |> Earmark.to_html and markdown |> Earmark.as_ast |> simple_ast_to_html to be enough. If it isn't, I would say Earmark.as_ast is not functioning correctly and we should push those fixes upstream.

@eksperimental
Copy link
Contributor Author

eksperimental commented Jun 26, 2020

The use of Floki was a last resort. Actually this PR was done when I added more HTML tests to realize that we were still not covering them.
Earkmarks has two main limitations:

  1. Currently it does not parse HTML recursively. Parse HTML recursively pragdave/earmark#358
  2. HTML needs to be written in a certain way. Document how HTML parsing works  pragdave/earmark#357, https://github.com/pragdave/earmark/pull/360/files

Boths PRs stemmed from reports I have made upstream by working in this PR.
I have reported many others, but still I don't think all cases will be covered unless they use something like Floki.
Floki is exclusively used in exactly one line, in one case, when verbatim: true, and I'm in the talks to see if can get a copy of the original HTML code in the Earmark AST pragdave/earmark#356 (comment)

@josevalim
Copy link
Member

Right, so I would rather push those concerns to Earmark rather than in ExDoc. It is not our job to solve them.

Perhaps one solution you can adopt in Earmark is to have a copy of floki_mochi_html, properly renamed to earmark_mochi_html. It doesn't bring a whole dependency and it will import only the parts that you need (which is effectively the parser).

@eksperimental
Copy link
Contributor Author

@RobertDober What do you think can be done about this?

@RobertDober
Copy link
Contributor

RobertDober commented Jun 26, 2020

@eksperimental has had a tough time discovering how basic and also broken HTML parsing is in Earmark. Now I believe that I can come up with a pretty good HTML parser in 1.5 but I admit that

  • Floki already does this and probably better than Earmark will ever
  • personally I am totally confused how HTML should be parsed in Markdown and therefor try to orient myself on GFM but that is a high bar
  • I really liked @eksperimental's idea to provide the pure HTML string in the AST, (it would be a great safety net, I mean Earmark is not really professionally maintained, I am very much aware of my shortcomings) and that I could push into 1.4.6 I guess
  • I'll have a look at https://github.com/philss/floki/blob/master/src/floki_mochi_html.erl and come back to you on that

All that said I am working hard on the 1.4.6 release with the new AST format and most of the gross errors fixed (ETA Sunday 😨 ) do we agree that this is the first priority? If not I can rearrange my time.

Please note too that pragdave/earmark#358 is downwards incompatible and therefore scheduled for 1.5 (ETA????)

@josevalim
Copy link
Member

My understanding is that the issues described here are "old issues", so I would keep your current priority.

@eksperimental if we close this, we should still merge #1190, right?

@eksperimental
Copy link
Contributor Author

My understanding is that the issues described here are "old issues", so I would keep your current priority.

@eksperimental if we close this, we should still merge #1190, right?

no, never. That PR will render the Epub unusable whenever a < or > is not escaped

@wojtekmach
Copy link
Member

wojtekmach commented Jul 17, 2020

My understanding is in this issue we're discussing and trying to tackle:

  1. Breaking change in 0.22: HTML entity support #1168 (note: it's already fixed)
  2. Rendering regression in 0.22 #1189
  3. other improvements to html parsing (via floki/floki_mochi/etc)

does that sound about right? @eksperimental could we just tackle 2. for now in this PR and leave 3. for the future?

@eksperimental
Copy link
Contributor Author

eksperimental commented Jul 17, 2020

We could just leave out the Floki call which is one or two lines, and leave what Earmark gives as is, and merge this PR.

@josevalim
Copy link
Member

Right, it is not our job to parse html and adding markdown specific logic is going to be counter intuitive to direction we are taking ExDoc which is to support multiple input formats. :)

@eksperimental
Copy link
Contributor Author

eksperimental commented Jul 17, 2020

Right, it is not our job to parse html and adding markdown specific logic is going to be counter intuitive to direction we are taking ExDoc which is to support multiple input formats. :)

I know, but we have no other options, and we still don't know whether this will be fixed in Earkmark or not

@josevalim
Copy link
Member

If we have no other options, then we will stick with no other options. We have lived without those so far and I really don't want to be responsible for handling this complexity. :)

- Introduce ExDoc.Markdown.AST and t:Markdown.AST.html/0.
- Replaces ExDoc.Formatter.HTML.ast_to_html/1 with ExDoc.Markdown.AST.to_html/2 which is a port of Earmark.Transform
- It uses a more recent git version of Earkmark since due to a bug in the library.
- It adds Floki to deal with HTML code in Markdown docs.

Closes elixir-lang#1168, elixir-lang#1189
Supersedes elixir-lang#1190
@eksperimental
Copy link
Contributor Author

I have rebased to master, meaning now we work with EarmarkParser. I have added some temporary fixes until EarkmarkParser always works with 4-element tuples in the AST.
Floki has been removed and the comments that are not passing due to the limitation in nested HTML code have been commented out.

@eksperimental
Copy link
Contributor Author

eksperimental commented Jul 17, 2020

If we have no other options, then we will stick with no other options. We have lived without those so far and I really don't want to be responsible for handling this complexity. :)

For the record, this limitation recently came in with the introduction of AST if I'm not mistaken.

I think the easiest solution would be to store the original HTML in the metadata, as I have suggested to @RobertDober and we can inject it back in on our side.

@josevalim
Copy link
Member

But why do we need the original HTML? Doesn’t markdown say we should parse any markdown inside the HTML too? So returning the original HTML could override the nested parsing.

@josevalim
Copy link
Member

Btw, thanks for the clarification! :D

@eksperimental
Copy link
Contributor Author

eksperimental commented Jul 17, 2020

But why do we need the original HTML? Doesn’t markdown say we should parse any markdown inside the HTML too? So returning the original HTML could override the nested parsing.

Earmark is returning a crippled version of the original HTML or something like that (I can't recall exactly). As of when I wrote this PR we couldn't fix this limitation, as we were not able to reconstruct the HTML that we passed to it. I will look later if something changed in EarmarkParser

@eksperimental
Copy link
Contributor Author

Sorry, the crippled version it was a bug that has been fixed. We are getting the HTML string. The issue is that we cannot escape the text within the HTML elements. This PR and also the current version of ExDoc escape the special characters of the whole string

@RobertDober
Copy link
Contributor

RobertDober commented Jul 17, 2020

@josevalim Almost no Markdown parser parses content inside HTML https://babelmark.github.io/?text=%3Cdiv%3E+%0A++-+a%0A++-+b%0A%3C%2Fdiv%3E , nor does EaramrkParser

I try to orient Earmark as much as GFM as possible, GFM does not parse Markdown inside HTML.

@josevalim
Copy link
Member

Ah, apologies for the confusion then. So storing the original HTML should work, if that’s what we want.

@eksperimental
Copy link
Contributor Author

Sorry, since Earkmark was parsing the first level of the HTML I thought the intention was to build the whole tree.
What we need is to escape the content of those html tags, but not the tags themselves.

@josevalim
Copy link
Member

Why do we need to escape them? Maybe they shouldn’t be unescaped in the first place?

@RobertDober
Copy link
Contributor

RobertDober commented Jul 18, 2020

@eksperimental parsing the whole tree is planned for 1.5, scheduled for 2020-08-30

and to be clear

parsing the whole tree, leaving all text nodes untransformed using Floki as standard

iex(3)> Floki.parse ("<div> a & b < </div>")         
{"div", [], [" a & b < "]}

@eksperimental
Copy link
Contributor Author

Regardless the parsing of the HTML code which is just two lines and I will remove them later today., I would like to know what is your opinion on this PR

@josevalim
Copy link
Member

I think it is too early for us to formalize the AST. We also don't really care about Markdown AST per se as we will likely introduce our own so it works with both Markdown and the Erlang docs AST. So I would rather focus on fixing bugs instead of trying to formalize something that will definitely change.

@eksperimental eksperimental deleted the markdown_ast branch September 3, 2020 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Breaking change in 0.22: HTML entity support
4 participants