// Assumes the library is installed under its npm name `sbd`
const tokenizer = require('sbd')

const options = {
  newline_boundaries: true,
  html_boundaries: true,
  html_boundaries_tags: [
    'br',
    'p',
    'h1',
    'h2',
    'h3',
    'h4',
    'h5',
    'h6',
    'ul',
    'div',
    'figcaption',
  ],
  sanitize: true,
  preserve_whitespace: true,
}
const html = `<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>`
console.log(tokenizer.sentences(html, options))
Expected Result:
[ 'a span here', 'This is a a very cool title.' ]
Actual Result:
[ 'a span hereThis is a a very cool title.' ]
I do realise that <span> is not marked as a boundary HTML tag, but in my opinion that shouldn't let its content leak into the text of its sibling boundary tags.
hugonteifeh changed the title from "non-boundary html tags that are not defined are being mixed with sibling boundary elements" to "non-boundary html tags are being mixed with sibling boundary tags" on Aug 31, 2021
Hey @hugonteifeh, I must admit that I never intended this library to work with HTML. The sanitize-html dependency was added to strip the HTML, in the hope that boundaries could then be detected.
I'm not 100% sure on a solution for this. If we replace tags (<p> and <span>) with spaces, the text will still end up in the same sentence. At least it will be more readable... maybe tags like em and span need special handling. But yeah, it feels like an edge case with titles, as you wouldn't encounter:
<span>hello</span><em>world</em>
helloworld
(would you expect spacing here? or a sentence break?)
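One possible workaround, sketched here as a pre-processing step rather than as a change to the library: turn both the opening and the closing form of every boundary tag into a line break before the remaining tags are stripped, so text in a neighbouring non-boundary tag cannot run into it. The helper name `splitOnBoundaryTags` is hypothetical, the regular expression is a simplification (it does not handle `>` inside attribute values), and the code does no sentence detection of its own.

```javascript
// Hypothetical helper, not part of the library: splits HTML into text chunks
// at the edges of the listed boundary tags.
function splitOnBoundaryTags(html, boundaryTags) {
  const names = boundaryTags.join('|')
  // Matches opening, closing and self-closing forms of the listed tags,
  // e.g. <h1>, <h1 class="x">, </h1>, <br>, <br/>.
  const boundaryRe = new RegExp(`<\\/?(?:${names})(?:\\s[^>]*)?\\/?>`, 'gi')
  return html
    .replace(boundaryRe, '\n') // boundary tags become line breaks
    .replace(/<[^>]*>/g, '') // remaining (non-boundary) tags are dropped
    .split('\n')
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
}

const sampleHtml = `<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>`
const chunks = splitOnBoundaryTags(sampleHtml, ['br', 'p', 'h1', 'div'])
console.log(chunks)
// [ 'a span here', 'This is a a very cool title.' ]
```

Each chunk could then be handed to the tokenizer separately. Two adjacent non-boundary tags, as in the `<span>hello</span><em>world</em>` case above, still come out as `helloworld`, so the question of spacing versus a sentence break there stays open.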