[Bug]: splitter.split() `ValueError: substring not found` for specific character combination #3403

TimBMK · 2024-02-05T12:14:34Z

Describe the bug

I have the following text string producing the ValueError when calling splitter.split() on it:

"RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \n🌻Grünes #KlimaKonjunkturPrgramm\u2029\n🗄️ Aufbewahrung der Akten der #NSU-Untersuchungs-\u2028ausschüsse\u2029\n🧑‍💻#Digitalisierung in Schulen \n\nMehr Infos & Livestream \n📺 https://t.co/UU3mfRQ77t https://t.co/PYCBL7LUYh"

After some testing (removing emojis, etc.), I traced the error to the very specific string "s-\u2028ausschüsse". When this combination is passed to splitter.split(), it raises the ValueError.

To Reproduce

from flair.splitter import SegtokSentenceSplitter

splitter = SegtokSentenceSplitter()

# the first half of the string works:
splitter.split("RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \nGrünes #KlimaKonjunkturPrgramm\u2029\n Aufbewahrung der Akten der #NSU-Untersuchung")

# this produces the error:
splitter.split("s-\u2028ausschüsse")

# other combinations of this string are fine:
splitter.split("s-\u2028")
splitter.split("\u2028ausschüsse")

Expected behavior

I'm not sure why this specific string causes the error. It's easy enough to remove it in this one instance, but since I'm processing very large amounts of text, it is practically impossible to anticipate other problematic strings beforehand.

Logs and Stack traces

ValueError: substring not found

Screenshots

No response

Additional Context

No response

Environment

Versions:

Flair

0.13.1

Pytorch

2.2.0+cu121

Transformers

4.37.2

GPU

False


helpmefindaname · 2024-02-09T14:23:16Z

hi @TimBMK
Taking the original tweet as reference, I assume that the characters \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) are there only for display reasons, have no semantic meaning, and should therefore be ignored in NLP.

You can test my fix on #3404

TimBMK · 2024-02-09T14:29:06Z

Awesome, thanks for the quick fix! Yes, I agree, they can absolutely be ignored, and my workaround was to simply drop them before running the pipeline. My concern was more that the very specific combination of characters (precisely "s-\u2028ausschüsse", while "s-\u2028" and "\u2028ausschüsse" were fine) broke the splitter. I'm not sure whether this points to a larger, underlying problem, as separators by themselves do not seem to break it. One way or the other, simply dropping the (semantically meaningless) separators should do the trick!

helpmefindaname · 2024-02-09T14:40:23Z

The algorithm works fine if such symbols are at the start or end of a token, but breaks if one is in the middle of a token.

In the example, the sentence is s-\u2028ausschüsse, while the SegtokTokenizer removes that symbol and returns ['s-ausschüsse'] as tokens. The ValueError then occurs because s-ausschüsse is not a substring of s-\u2028ausschüsse.
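To illustrate the mechanism without installing anything, here is a minimal sketch. It is not Flair's actual code: `locate_tokens` is a hypothetical helper that recovers token offsets with `str.index`, and the tokenizer is imitated by simply deleting the separator from the text.

```python
def locate_tokens(text, tokens):
    """Find each token's start offset in text, scanning left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        # str.index raises ValueError if the token does not occur in text
        start = text.index(token, cursor)
        offsets.append(start)
        cursor = start + len(token)
    return offsets


original = "s-\u2028ausschüsse"
# Assumption: stand-in for a tokenizer that drops the separator inside a token.
tokens = [original.replace("\u2028", "")]

try:
    locate_tokens(original, tokens)
    error = None
except ValueError as exc:
    error = str(exc)

# Separator at the end or start of the text: the token is still a substring.
ok_end = locate_tokens("s-\u2028", ["s-"])
ok_start = locate_tokens("\u2028ausschüsse", ["ausschüsse"])
```

In this sketch the lookup only fails for the first case, which matches the behaviour reported above: the token produced after removing the separator no longer appears verbatim in the original string.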

TimBMK · 2024-02-09T14:45:42Z

Ah that makes sense. Thanks for the explanation

TimBMK · 2024-02-29T15:26:38Z

I've found a similar problem. The string "\r" also seems to cause a ValueError: substring not found. Removing it beforehand fixes the problem.
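For anyone needing a stopgap, a minimal preprocessing sketch that drops the characters reported in this thread before the text reaches the splitter. The helper name `drop_problem_chars` and the character set are my own assumptions, based only on the cases mentioned here; other characters may need the same treatment.

```python
# Characters reported in this thread as triggering the error (assumed list).
PROBLEM_CHARS = {ord("\u2028"): None, ord("\u2029"): None, ord("\r"): None}


def drop_problem_chars(text: str) -> str:
    """Remove the listed characters; all other characters are kept as-is."""
    return text.translate(PROBLEM_CHARS)


cleaned = drop_problem_chars("s-\u2028ausschüsse\r")
```

The cleaned string can then be passed to splitter.split() as before. Note that this deletes the characters rather than replacing them with a space, so words separated only by one of them will be joined.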

TimBMK added the "bug" label Feb 5, 2024

helpmefindaname mentioned this issue Feb 9, 2024

ignore separator symbols #3404

Merged

alanakbik closed this as completed in #3404 Mar 6, 2024
