Request: Symbols that can represent multiple letters #73
Comments
Allowing one character to map to multiple characters is intractable with Obscenity's current design, and I do not think it is something we would like to support--we would need to test patterns against all possible transformed strings instead of just one, potentially degrading performance significantly. (Consider the input text

The correct way to solve this issue is to either adjust the patterns (that is, add a pattern that matches directly on the text

With these considerations in mind, I am inclined to close this specific request as wontfix, but I think the intent of your issue is actually already tracked in #46--so, to be clear, I do hope that eventually Obscenity's detection quality can be improved to catch the cases you mention, just not in the manner you propose. Does that sound reasonable to you?
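To make the performance concern above concrete, here is a minimal sketch of how the number of candidate strings grows once a single character may stand for several letters. The helper names (`transformOneToOne`, `expandOneToMany`) and the choice of five vowels are assumptions made for illustration only; they are not part of Obscenity.

```typescript
// Hypothetical helpers for illustration; none of these names come from Obscenity.

function lookup<T>(table: Record<string, T>, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// One-to-one mapping: each character has at most one replacement,
// so exactly one transformed string has to be matched.
function transformOneToOne(text: string, table: Record<string, string>): string {
  let out = "";
  for (const ch of text.split("")) {
    out += lookup(table, ch) ?? ch;
  }
  return out;
}

// One-to-many mapping: each ambiguous character multiplies the number of
// candidate strings by the number of letters it could stand for.
function expandOneToMany(text: string, table: Record<string, string[]>): string[] {
  let candidates: string[] = [""];
  for (const ch of text.split("")) {
    const options = lookup(table, ch) ?? [ch];
    const next: string[] = [];
    for (const prefix of candidates) {
      for (const option of options) {
        next.push(prefix + option);
      }
    }
    candidates = next;
  }
  return candidates;
}

// Assumption: "*" may stand for any of five vowels.
const wildcard: Record<string, string[]> = { "*": ["a", "e", "i", "o", "u"] };

console.log(transformOneToOne("h3llo", { "3": "e" }));  // hello
console.log(expandOneToMany("a*b", wildcard).length);   // 5
console.log(expandOneToMany("a**b*", wildcard).length); // 125
```

Under these assumptions, every additional wildcard multiplies the work by five, and each candidate would still have to be tested against every pattern.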
For some context on why I have not yet fixed #46, the code dealing with transformations and matching is some of the nastier code in this package, in part due to its age--I would do things differently now compared to 3 years ago--and in part due to the complexity of mapping match positions in the transformed text back to the original text in a Unicode-aware way. (Working in both Unicode code points and UTF-16 code units depending on context makes this even nastier.) Consequently, for some time, this code was on rather shaky ground (as you observed in #71), and I was very reluctant to adjust it for fear of breaking it further. Recently, however, after your previous report, I took the time to rework some of the code and fuzz tested it in 25bd1db, and I am now much more confident that things are as they should be. Addressing #46 should, I think, be considerably more straightforward after this, and it's possible we can do it in the next release.
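As a small illustration of why mixing code points and UTF-16 code units complicates position mapping, the sketch below converts an index counted in code points into the index JavaScript string methods expect. The helper `codePointIndexToUtf16Index` is hypothetical and is not taken from Obscenity's source.

```typescript
// Hypothetical helper; not taken from Obscenity's source.
// Converts an index counted in Unicode code points into the equivalent
// index counted in UTF-16 code units for the same string.
function codePointIndexToUtf16Index(text: string, codePointIndex: number): number {
  let utf16Index = 0;
  let remaining = codePointIndex;
  while (remaining > 0 && utf16Index < text.length) {
    const unit = text.charCodeAt(utf16Index);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    const nextUnit = utf16Index + 1 < text.length ? text.charCodeAt(utf16Index + 1) : 0;
    const isLowSurrogate = nextUnit >= 0xdc00 && nextUnit <= 0xdfff;
    // A valid surrogate pair is one code point but two UTF-16 code units.
    utf16Index += isHighSurrogate && isLowSurrogate ? 2 : 1;
    remaining--;
  }
  return utf16Index;
}

const sample = "a😀b";
console.log(sample.length);                         // 4 (UTF-16 code units)
console.log(codePointIndexToUtf16Index(sample, 2)); // 3 (position of "b")
```

A match reported at code point index 2 therefore has to be sliced at UTF-16 index 3 in the original string; using the code point index directly would cut the emoji in half.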
I did read #46, but did not necessarily tie it to the use case presented here. I'll add asterisks to my word inputs, since that will work for the moment as per your suggestion. Thanks a lot for your hard work!!
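The workaround of listing the asterisk variants directly can be sketched with plain `RegExp` as a stand-in. This does not use Obscenity's own pattern syntax; `escapeRegExp`, `buildLiteralMatcher`, and `findMatches` are hypothetical helpers, and the key point is only that `*` is matched as a literal character, so no one-to-many transformation is needed.

```typescript
// Stand-in using plain RegExp; Obscenity's own pattern syntax is not shown here.

// Escapes regex metacharacters so that "*" is matched literally.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds one case-insensitive expression matching any of the literal variants.
function buildLiteralMatcher(variants: string[]): RegExp {
  return new RegExp(variants.map(escapeRegExp).join("|"), "gi");
}

// Returns the start (inclusive) and end (exclusive) offsets of every match.
function findMatches(text: string, matcher: RegExp): { start: number; end: number }[] {
  const results: { start: number; end: number }[] = [];
  matcher.lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = matcher.exec(text)) !== null) {
    results.push({ start: match.index, end: match.index + match[0].length });
    if (match[0].length === 0) {
      matcher.lastIndex++; // guard against zero-length matches looping forever
    }
  }
  return results;
}

const literalMatcher = buildLiteralMatcher(["sh*t", "f*ck"]);
console.log(findMatches("oh sh*t, what the F*CK", literalMatcher));
```

Each variant costs one extra pattern, but the input text is still matched only once.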
For ease of tracking, I'm going to close this in favor of #46, which I just renamed to better reflect the current state of that issue. As discussed above, the suggestion ultimately presented there is a directly actionable way of solving the same problem in your original issue. Thanks!
Description
I've tried a few combinations of EnglishTransformers, but I haven't been able to correctly censor words like sh*t or f*ck. In both cases, the words should be censored; however, in the first word, * represents an i, and in the second, * represents a u. Is there a way to create a new transformer for multiple letters/regex?
Solution
I do not know how this can be implemented. Looking at the L33tspeak transformer, I can see there's a map per character:
However, I don't know how it would work for multiple characters where for example, we could have