[Bug]: splitter.split() ValueError: substring not found for specific character combination #3403

Closed
TimBMK opened this issue Feb 5, 2024 · 5 comments · Fixed by #3404
Labels
bug Something isn't working

Comments

@TimBMK

TimBMK commented Feb 5, 2024

Describe the bug

I have the following text string producing the ValueError when calling splitter.split() on it:

"RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \n🌻Grünes #KlimaKonjunkturPrgramm\u2029\n🗄️ Aufbewahrung der Akten der #NSU-Untersuchungs-\u2028ausschüsse\u2029\n🧑‍💻#Digitalisierung in Schulen \n\nMehr Infos & Livestream \n📺 https://t.co/UU3mfRQ77t https://t.co/PYCBL7LUYh"

After some testing (removing emojis, etc.), I could trace the error to the very specific string "s-\u2028ausschüsse". When this specific combination is passed to splitter.split(), it fails with the ValueError.

To Reproduce

from flair.splitter import SegtokSentenceSplitter

splitter = SegtokSentenceSplitter()

# the first half of the string works:
splitter.split("RT @gruenethl: #GRÜNundWichtig im Juli-#PlenumTH\n\nAuf dem Programm: \nGrünes #KlimaKonjunkturPrgramm\u2029\n Aufbewahrung der Akten der #NSU-Untersuchung")

# this produces the error:
splitter.split("s-\u2028ausschüsse")

# other combinations of this string are fine:
splitter.split("s-\u2028")
splitter.split("\u2028ausschüsse")

Expected behavior

I'm not sure why this specific string causes the error. It's easy enough to remove it in this one instance, but since I'm processing very large amounts of text, it is practically impossible to anticipate other problematic strings beforehand.

Logs and Stack traces

ValueError: substring not found

Screenshots

No response

Additional Context

No response

Environment

Versions:

Flair

0.13.1

Pytorch

2.2.0+cu121

Transformers

4.37.2

GPU

False

@TimBMK TimBMK added the bug Something isn't working label Feb 5, 2024
@helpmefindaname
Collaborator

hi @TimBMK
I take the original tweet as reference:
[image: original tweet]

and assume that the characters \u2028 LINE SEPARATOR and \u2029 PARAGRAPH SEPARATOR are there only for display reasons but have no semantic meaning and should therefore be ignored in NLP.

You can test my fix on #3404

@TimBMK
Author

TimBMK commented Feb 9, 2024

Awesome, thanks for the quick fix! Yes, I agree, they can absolutely be ignored, and my workaround was to simply drop them before running the pipeline. My concern was more that the very specific combination of characters (precisely "s-\u2028ausschüsse", while "s-\u2028" and "\u2028ausschüsse" were fine) broke the splitter. I'm not sure if this may point to a larger, underlying problem, as separators by themselves do not seem to break it. One way or the other, simply dropping the (semantically meaningless) separators should do the trick!

@helpmefindaname
Collaborator

The algorithm works fine if such symbols are at the start or end of a token, but breaks if one is in the middle of a token.

In the example, the sentence is s-\u2028ausschüsse, while the SegtokTokenizer removes that symbol and returns ['s-ausschüsse'] as tokens. The index lookup then fails with the ValueError because s-ausschüsse is not a substring of s-\u2028ausschüsse.
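
The failure mode described above can be reproduced without flair. The following is only a minimal sketch of the general idea, not flair's actual code: locate_tokens is a hypothetical helper that finds each token in the original text with str.index, and the token lists passed to it are assumptions based on the behaviour described in this thread.

```python
def locate_tokens(text, tokens):
    """Return the start offset of each token in text, scanning left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        # str.index raises ValueError if the token does not occur in text
        start = text.index(token, cursor)
        offsets.append(start)
        cursor = start + len(token)
    return offsets


original = "s-\u2028ausschüsse"

# Assumption: the tokenizer drops U+2028 and returns one merged token.
merged_tokens = [original.replace("\u2028", "")]

try:
    locate_tokens(original, merged_tokens)
except ValueError as err:
    # The merged token is not a substring of the original text.
    print("lookup failed:", err)

# With the separator at the edge of the text, the (assumed) tokens
# still occur verbatim in the original, so the lookup succeeds.
print(locate_tokens("s-\u2028", ["s-"]))
print(locate_tokens("\u2028ausschüsse", ["ausschüsse"]))
```

This would explain why "s-\u2028" and "\u2028ausschüsse" pass while the combined string fails: only a separator inside a token makes the token differ from every substring of the original text.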

@TimBMK
Author

TimBMK commented Feb 9, 2024

Ah that makes sense. Thanks for the explanation

@TimBMK
Author

TimBMK commented Feb 29, 2024

I've found a similar problem. The string "\r" also seems to cause a ValueError: substring not found. Removing it beforehand fixes the problem.
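
For anyone needing the same workaround, here is a minimal sketch of dropping the characters reported in this thread before the text reaches the splitter. The function name drop_problem_chars and the character set are assumptions for illustration; other characters may cause the same error, and whether "\r" is covered by the fix in #3404 is not established here.

```python
# Characters reported in this thread as triggering the error.
# Mapping a code point to None makes str.translate delete it.
PROBLEM_CHARS = {
    0x2028: None,  # LINE SEPARATOR
    0x2029: None,  # PARAGRAPH SEPARATOR
    0x000D: None,  # CARRIAGE RETURN ("\r")
}


def drop_problem_chars(text):
    """Return text with the listed characters removed."""
    return text.translate(PROBLEM_CHARS)


cleaned = drop_problem_chars("s-\u2028ausschüsse")
print(repr(cleaned))
```

The cleaned string can then be passed to splitter.split(). Note that any character offsets computed afterwards refer to the cleaned text, not to the original.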

2 participants