-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Markdown parser updates #18
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
Let me know if I missed something regarding the regex, but thanks for tightening it up rather than hitting it with the ol' \w\W
.
It shouldn't break in pure Python, I did test it in the console to make sure it was matching as expected and had no issues.
Maybe escape the single quote too rather than remove it? It is a valid URL character after all, and not that unusual since it's often used to enclose values in parameters
|
Break may have been the wrong word. What's your expected input? >>> import re
>>> def _parse_links(input_str: str) -> str:
... input_str = re.sub(
... r"\[([\w\W]+?)\]\(([\w\W]+?)\)",
... r'<a href="\2">\1</a>',
... input_str,
... flags=re.MULTILINE,
... )
... return re.sub(
... r"(?<!href=\")(https?:\/\/[-a-zA-Z0-9._~:/?#@!$&()*+,;=%]+')",
... r'<a href="\1">\1</a>',
... input_str,
... flags=re.MULTILINE,
... )
...
>>> test_str = """
... https://example.com
... [example](https://example.com)
... https://example.com'
... """
>>> print(_parse_links(test_str))
https://example.com
<a href="https://example.com">example</a>
<a href="https://example.com'">https://example.com'</a> Did you intend to only support links that were suffixed with |
That... was an oversight between converting the expression enclosure from single to double-quotes in the browser editor 😅 |
Anything else holding this and #20 ? |
No blockers at a glance -- sorry for the slow review, I've just been incredibly busy with irl. I'll block out some time this weekend to take a closer look and/or merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow review -- I left some comments.
* Marker matches on both sides * Opener and closer is not escaped * First character inside is not a space
Should have addressed all notes 🤞 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long back-and-forth, I appreciate the iteration. A set of tests would be useful so that we can capture all the edge cases and expected behaviour.
Going to merge this as-is, thanks so much for the contribution!
|
||
|
||
def _parse_strong_emphasis(input_str: str) -> str: | ||
return "\n".join( | ||
re.sub( | ||
r"(?:\*\*|\_\_)([\w]+?[\w\W]+?[\w]+?)(?:\*\*|\_\_)", | ||
r"(?<!\\)\*\*(\w.*?)(?<!\\)\*\*|(?<!\\)__(\w.*?)(?<!\\)__", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I'm a fan of also handling the escaped character.
I think you're handling the first character being a space, but not the last?
>>> text = """
... **asdf**
... ** asdf**
... **asdf **
... ** asdf **
... """
>>> print(_parse_strong_emphasis(text))
<strong>asdf</strong>
** asdf**
<strong>asdf </strong>
** asdf **
It's fine though, I'm not going to block on this (especially since the original impl was flawed as it only allowed \w
which ignores symbols as valid characters).
I couldn't figure an elegant way to filter trailing spaces as well. While another \w at the end might handle it, it would also make it impossible to format single characters like **A** letter or _1_ digit.
Turns out a space with the {0} quantifier doesn't work as a negative 🙁
|
* Parse standalone URLs too * move single-quote back into the set * Tighten URL characters for CFM * Improve headers formatting * Simplify expressions * Update example * Parse URLs regardless of casing * open links in new tabs * Put atx-style back into docstring * Absorb more spaces and optional closing on headers * Refine bold and italics expressions by ensuring: * Marker matches on both sides * Opener and closer is not escaped * First character inside is not a space
Parsed after the markdown format since it has the added benefit of avoiding its closing brackets at the end of the URL