[Bug]: splitter.split() ValueError: substring not found for specific character combination #3403
Comments
Awesome, thanks for the quick fix! Yes, I agree, they can absolutely be ignored, and my workaround was to simply drop them before running the pipeline. My concern was more that one very specific combination of characters (precisely "s-\u2028ausschüsse", while "s-\u2028" and "\u2028ausschüsse" were fine) broke the splitter. I'm not sure whether this points to a larger, underlying problem, as separators in themselves do not seem to break it. Either way, simply dropping the (semantically meaningless) separators should do the trick!
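For anyone landing here with the same error, the workaround above can be sketched roughly as follows. This is only a minimal sketch: `drop_line_separators` is a hypothetical helper (not part of Flair), and the set of characters to strip (U+2028 and U+2029) is an assumption that may need extending for other inputs.

```python
# Hypothetical helper, not part of Flair: removes Unicode line and
# paragraph separators before the text is handed to the splitter.
# Assumed set of characters; extend it if other separators cause trouble.
_SEPARATORS = {0x2028: None, 0x2029: None}


def drop_line_separators(text: str) -> str:
    """Return `text` with U+2028 and U+2029 removed."""
    return text.translate(_SEPARATORS)


cleaned = drop_line_separators("s-\u2028ausschüsse")
print(repr(cleaned))
```

The cleaned string would then be passed to `splitter.split()` in place of the raw text. Dropping the character (rather than replacing it with a space) follows the workaround as described; replacing it with a space may be preferable if the separator stood between two words.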
The algorithm works fine if such symbols are at the start or end of a token, but breaks if one is in the middle of a token. In the example, the sentence is
Ah, that makes sense. Thanks for the explanation.
I've found a similar problem. The string "\r" equally seems to cause a
Describe the bug
I have the following text string producing the ValueError when calling splitter.split() on it:
After some testing (removing emojis etc.), I could trace the error to the very specific string "s-\u2028ausschüsse". When this specific combination gets passed to splitter.split(), it fails for some reason.
To Reproduce
Expected behavior
I'm not sure why this specific string causes the error. It's easy enough to remove it in this one instance, but since I'm processing very large amounts of text, it is practically impossible to anticipate other problematic strings beforehand.
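Since problematic strings cannot be predicted up front, one option when processing large batches is to catch the error per document and log the offending input for later inspection. The following is only a sketch under stated assumptions: `split_all` is a hypothetical helper, and `split_fn` stands for whatever callable does the splitting (presumably `splitter.split`); the stand-in `fake_split` below merely imitates the failure and is not Flair code.

```python
from typing import Any, Callable, List, Optional, Tuple


def split_all(
    texts: List[str], split_fn: Callable[[str], Any]
) -> Tuple[List[Optional[Any]], List[Tuple[int, str, str]]]:
    """Apply `split_fn` to every text, collecting ValueErrors instead of crashing.

    Returns (results, failures); a failed text gets None in `results` and an
    (index, text, error message) entry in `failures`.
    """
    results: List[Optional[Any]] = []
    failures: List[Tuple[int, str, str]] = []
    for i, text in enumerate(texts):
        try:
            results.append(split_fn(text))
        except ValueError as exc:
            failures.append((i, text, str(exc)))
            results.append(None)
    return results, failures


# Stand-in for the real splitter, used only to make the sketch runnable.
def fake_split(text: str) -> List[str]:
    if "\u2028" in text:
        raise ValueError("substring not found")
    return text.split()


results, failures = split_all(["a b", "s-\u2028ausschüsse"], fake_split)
for index, text, message in failures:
    print(index, repr(text), message)
```

The collected `failures` can then be inspected to find which character combinations trip the splitter, instead of losing the whole run to a single input.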
Logs and Stack traces
Screenshots
No response
Additional Context
No response
Environment
Versions:
Flair: 0.13.1
Pytorch: 2.2.0+cu121
Transformers: 4.37.2
GPU: False