Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown parser updates #18

Merged
merged 11 commits into from
Jun 5, 2023
Merged

Markdown parser updates #18

merged 11 commits into from
Jun 5, 2023

Conversation

BrutuZ
Copy link
Contributor

@BrutuZ BrutuZ commented Apr 28, 2023

Parsed after the markdown format since it has the added benefit of avoiding its closing brackets at the end of the URL

Copy link
Collaborator

@funkyhippo funkyhippo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Let me know if I missed something regarding the regex, but thanks for tightening it up rather than hitting it with the ol' \w\W.

proxy/source/markdown_parser.py Outdated Show resolved Hide resolved
@BrutuZ
Copy link
Contributor Author

BrutuZ commented Apr 29, 2023 via email

@funkyhippo
Copy link
Collaborator

Break may have been the wrong word. What's your expected input?

>>> import re
>>> def _parse_links(input_str: str) -> str:
...     input_str = re.sub(
...         r"\[([\w\W]+?)\]\(([\w\W]+?)\)",
...         r'<a href="\2">\1</a>',
...         input_str,
...         flags=re.MULTILINE,
...     )
...     return re.sub(
...         r"(?<!href=\")(https?:\/\/[-a-zA-Z0-9._~:/?#@!$&()*+,;=%]+')",
...         r'<a href="\1">\1</a>',
...         input_str,
...         flags=re.MULTILINE,
...     )
...
>>> test_str = """
... https://example.com
... [example](https://example.com)
... https://example.com'
... """
>>> print(_parse_links(test_str))

https://example.com
<a href="https://example.com">example</a>
<a href="https://example.com'">https://example.com'</a>

Did you intend to only support links that were suffixed with '? It's outside the character set.

@BrutuZ
Copy link
Contributor Author

BrutuZ commented Apr 29, 2023

Did you intend to only support links that were suffixed with '? It's outside the character set.

That... was an oversight between converting the expression enclosure from single to double-quotes in the browser editor 😅

@BrutuZ
Copy link
Contributor Author

BrutuZ commented May 2, 2023

Added some other minor changes, doesn't seem like I broke anything 😅
image

@BrutuZ BrutuZ requested a review from funkyhippo May 2, 2023 02:45
@BrutuZ BrutuZ changed the title Parse standalone URLs too Markdown parser updates May 14, 2023
@BrutuZ
Copy link
Contributor Author

BrutuZ commented Jun 1, 2023

Anything else holding this and #20 ?
I have the changes from both PRs (and a few more) running on a forked instance with no issues so far

@funkyhippo
Copy link
Collaborator

No blockers at a glance -- sorry for the slow review, I've just been incredibly busy with irl.

I'll block out some time this weekend to take a closer look and/or merge.

Copy link
Collaborator

@funkyhippo funkyhippo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review -- I left some comments.

proxy/source/markdown_parser.py Outdated Show resolved Hide resolved
proxy/source/markdown_parser.py Outdated Show resolved Hide resolved
proxy/source/markdown_parser.py Outdated Show resolved Hide resolved
proxy/source/markdown_parser.py Outdated Show resolved Hide resolved
BrutuZ added 3 commits June 4, 2023 22:19
 * Marker matches on both sides
 * Opener and closer is not escaped
  * First character inside is not a space
@BrutuZ
Copy link
Contributor Author

BrutuZ commented Jun 5, 2023

Should have addressed all notes 🤞

Copy link
Collaborator

@funkyhippo funkyhippo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long back-and-forth, I appreciate the iteration. A set of tests would be useful so that we can capture all the edge cases and expected behaviour.

Going to merge this as-is, thanks so much for the contribution!



def _parse_strong_emphasis(input_str: str) -> str:
return "\n".join(
re.sub(
r"(?:\*\*|\_\_)([\w]+?[\w\W]+?[\w]+?)(?:\*\*|\_\_)",
r"(?<!\\)\*\*(\w.*?)(?<!\\)\*\*|(?<!\\)__(\w.*?)(?<!\\)__",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'm a fan of also handling the escaped character.

I think you're handling the first character being a space, but not the last?

>>> text = """
... **asdf**
... ** asdf**
... **asdf **
... ** asdf **
... """
>>> print(_parse_strong_emphasis(text))

<strong>asdf</strong>
** asdf**
<strong>asdf </strong>
** asdf **

It's fine though, I'm not going to block on this (especially since the original impl was flawed as it only allowed \w which ignores symbols as valid characters).

@funkyhippo funkyhippo merged commit 3dcabbe into subject-f:develop Jun 5, 2023
@BrutuZ
Copy link
Contributor Author

BrutuZ commented Jun 5, 2023 via email

@BrutuZ BrutuZ deleted the patch-1 branch June 6, 2023 00:20
Einlion pushed a commit that referenced this pull request Jan 14, 2024
* Parse standalone URLs too

* move single-quote back into the set

* Tighten URL characters for CFM

* Improve headers formatting

* Simplify expressions

* Update example

* Parse URLs regardless of casing

* open links in new tabs

* Put atx-style back into docstring

* Absorb more spaces and optional closing on headers

* Refine bold and italics expressions by ensuring:
 * Marker matches on both sides
 * Opener and closer is not escaped
  * First character inside is not a space
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants