non-boundary html tags are being mixed with sibling boundary tags #35

Open
hugonteifeh opened this issue Aug 31, 2021 · 1 comment

hugonteifeh commented Aug 31, 2021

Hey!

Context:

const options = {
    newline_boundaries: true,
    html_boundaries: true,
    html_boundaries_tags: [
        'br',
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'ul',
        'div',
        'figcaption',
    ],
    sanitize: true,
    preserve_whitespace: true,
}

const html = `<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>`

// Assumes the library is installed under its npm name, `sbd`.
const tokenizer = require('sbd')

console.log(tokenizer.sentences(html, options))

Expected Result:

[ 'a span here', 'This is a a very cool title.' ]

Actual Result:

[ 'a span hereThis is a a very cool title.' ]

I realise that <span> is not listed as a boundary tag, but in my opinion its content still shouldn't leak into the text of a sibling boundary tag such as <h1>.
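
One way to get the expected result is to split the markup at boundary tags before any tag stripping or sentence detection happens. The following is a minimal sketch of that idea, not the library's implementation: `splitOnBoundaryTags` is a hypothetical helper, and its regex-based tag matching is deliberately naive (it does not handle comments, scripts, or `>` inside attribute values).

```javascript
// Hypothetical helper, not part of the library: cut the markup at boundary
// tags first, then drop every remaining tag.
function splitOnBoundaryTags(html, boundaryTags) {
    // Matches opening, closing, and self-closing forms of the listed tags.
    // The \b stops 'p' from also matching <pre> or <param>.
    const boundary = new RegExp(`</?(?:${boundaryTags.join('|')})\\b[^>]*>`, 'gi')

    return html
        .replace(boundary, '\n')           // boundary tags become line breaks
        .replace(/<[^>]*>/g, '')           // all other tags are simply dropped
        .split('\n')
        .map((chunk) => chunk.trim())
        .filter((chunk) => chunk.length > 0)
}

const sample = '<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>'

console.log(splitOnBoundaryTags(sample, ['br', 'p', 'h1', 'div']))
// [ 'a span here', 'This is a a very cool title.' ]
```

Each resulting chunk could then be passed to the tokenizer separately, so that text from a non-boundary sibling never ends up in the same string as the heading.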

hugonteifeh changed the title from "non-boundary html tags that are not defined are being mixed with sibling boundary elements" to "non-boundary html tags are being mixed with sibling boundary tags" on Aug 31, 2021

Tessmore (Owner) commented Sep 15, 2021

Hey @hugonteifeh, I must admit that I never intended this library to work with HTML. The sanitize-html dependency was added to strip the HTML, in the hope that boundaries could then be detected in the remaining text.

I'm not 100% sure what the solution is. If we replace tags (<p> and <span>) with spaces, the text will still end up in the same sentence, though at least it will be more readable. Maybe tags like em and span need special handling. But it feels like an edge case with titles, as you wouldn't normally encounter:

<span>hello</span><em>world</em>

helloworld

(would you expect spacing here? or a sentence break?)
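
To make the trade-off concrete, here is a small sketch comparing the two stripping strategies mentioned above. `stripTags` is a hypothetical helper written for illustration, not the sanitize-html API or the library's own sanitizer.

```javascript
// Hypothetical helper for illustration only: replace every tag with the
// given string, then collapse runs of whitespace.
function stripTags(html, replacement) {
    return html
        .replace(/<[^>]*>/g, replacement)
        .replace(/\s+/g, ' ')
        .trim()
}

const inline = '<span>hello</span><em>world</em>'
console.log(stripTags(inline, ''))   // helloworld
console.log(stripTags(inline, ' '))  // hello world

const title = '<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>'
console.log(stripTags(title, ' '))   // a span here This is a a very cool title.
```

Replacing tags with a space makes the output readable, but the title case still comes out as one string with no terminal punctuation after "here", so a punctuation-based splitter has nothing to break on. Getting a sentence break there would need the boundary tags to be treated differently from inline tags.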
