Request: Symbols that can represent multiple letters #73

Closed
1 task done
rion18 opened this issue Aug 5, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

rion18 commented Aug 5, 2024

Description

I've tried a few combinations of EnglishTransformers, but I haven't been able to correctly censor words like sh*t or f*ck. Both words should be censored; however, in the first word * represents an i and in the second it represents a u. Is there a way to create a new transformer that maps one symbol to multiple letters, or to a regex?

Solution

I do not know how this can be implemented. Looking at the L33tspeak transformer, I can see there's a map per character:

	['a', '@4'],
	['c', '('],
	['e', '3'],
	['i', '1|'],
	['o', '0'],
	['s', '$'],

However, I don't know how it would work when one symbol stands for multiple characters. For example, we could have:

	['*', 'any_letter_or_vowel_etc.'],

Code of Conduct

  • I agree to follow this project's Code of Conduct.
@rion18 rion18 added the enhancement New feature or request label Aug 5, 2024
Owner

jo3-l commented Aug 6, 2024

Allowing one character to map to multiple characters is intractable with Obscenity's current design, and I do not think it is something we would like to support--we would need to test patterns against all possible transformed strings instead of just one, potentially degrading performance significantly. (Consider the input text **** with * -> any of aeiou, for instance: there are 5^4 = 625 possible transformed strings. With adversarial inputs this could be disastrous.)
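To make the blow-up concrete, here is a minimal sketch of what "testing all possible transformed strings" would involve. `expandWildcards` is a hypothetical helper written for illustration only; it is not part of Obscenity and does not reflect how the library is implemented.

```typescript
// Hypothetical helper (not an Obscenity API): replaces every '*' in `text`
// with each candidate letter and returns all resulting strings.
function expandWildcards(text: string, candidates: string[]): string[] {
  let results: string[] = [""];
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    // A '*' branches into every candidate; any other character passes through.
    const options = ch === "*" ? candidates : [ch];
    const next: string[] = [];
    for (const prefix of results) {
      for (const option of options) {
        next.push(prefix + option);
      }
    }
    results = next;
  }
  return results;
}

const vowels = ["a", "e", "i", "o", "u"];
const singleStar = expandWildcards("sh*t", vowels); // 5 strings to test
const fourStars = expandWildcards("****", vowels); // 5^4 = 625 strings to test
```

Each additional `*` multiplies the number of strings by the number of candidate letters, so the work grows exponentially with the number of wildcards in the input.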

The correct way to solve this issue is to either adjust the patterns (that is, add a pattern that matches directly on the text sh*t), or to strip out the * with a transformer. Some previous versions of Obscenity actually correctly identified a match in the input f*ck using the second approach (see the skipNonAlphabetic transformer, disabled by default due to #46.) It has always been my goal to eventually add the skipNonAlphabetic transformer back after fixing that issue, but I have not gotten to it yet.
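The two approaches can be sketched in plain TypeScript as follows. This is only an illustration of the idea: the pattern lists and function names are hypothetical, the matching is naive substring search, and none of it uses Obscenity's actual dataset, pattern syntax, or transformer API.

```typescript
// Hypothetical stand-ins for a real pattern list (not Obscenity's dataset).
const directPatterns = ["sh*t", "f*ck"]; // approach 1: masked spelling, matched literally
const strippedPatterns = ["sht", "fck"]; // approach 2: spelling left after symbols are removed

// Approach 1: look for the masked spelling as a literal substring.
function matchesDirectly(text: string): boolean {
  const lowered = text.toLowerCase();
  return directPatterns.some((p) => lowered.indexOf(p) !== -1);
}

// Approach 2: drop everything that is not an ASCII letter, then match.
// (ASCII-only is a simplification chosen for this sketch.)
function stripNonAlphabetic(text: string): string {
  return text.replace(/[^a-zA-Z]/g, "");
}

function matchesAfterStripping(text: string): boolean {
  const stripped = stripNonAlphabetic(text).toLowerCase();
  return strippedPatterns.some((p) => stripped.indexOf(p) !== -1);
}
```

In this sketch, `matchesDirectly("what the F*ck")` and `matchesAfterStripping("s.h.*.t")` both return `true`. One visible cost of the stripping approach as written here is that spaces are removed too, so `matchesAfterStripping("half ck")` also returns `true`, because the stripped text `halfck` contains `fck`.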

With these considerations in mind, I am inclined to close this specific request as wontfix, but I think the intent of your issue is actually already tracked in #46--so, to be clear, I do hope that eventually Obscenity's detection quality can be improved to catch the cases you mention, just not in the manner you propose. Does that sound reasonable to you?

Owner

jo3-l commented Aug 6, 2024

For some context on why I have not yet fixed #46: the code dealing with transformations and matching is some of the nastier code in this package, in part due to its age (I would have done things differently now compared to 3 years ago) and in part due to the complexity of mapping match positions in the transformed text back to the original text in a Unicode-aware way. (Working in both Unicode code points and UTF-16 code units depending on context makes this even nastier.) Consequently, for some time, this code was on rather shaky ground (as you observed in #71), and I was very reluctant to adjust it for fear of breaking it more.
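As a rough illustration of the position-mapping problem, the sketch below strips non-letters while remembering where each kept character sat in the original string, then maps a match in the transformed text back to a UTF-16 range in the original. All names here are hypothetical, the letter test is ASCII-only for simplicity, and this is not Obscenity's implementation.

```typescript
interface KeptChar {
  char: string; // the kept character (always one UTF-16 code unit in this sketch)
  start: number; // UTF-16 offset of the character in the original text
  end: number; // UTF-16 offset just past the character (exclusive)
}

// Number of UTF-16 code units used by the code point starting at `index`:
// 2 for a valid surrogate pair, otherwise 1.
function codePointLength(text: string, index: number): number {
  const first = text.charCodeAt(index);
  if (first >= 0xd800 && first <= 0xdbff && index + 1 < text.length) {
    const second = text.charCodeAt(index + 1);
    if (second >= 0xdc00 && second <= 0xdfff) return 2;
  }
  return 1;
}

// Walk the original text one code point at a time and keep only ASCII letters.
function skipNonLetters(original: string): KeptChar[] {
  const kept: KeptChar[] = [];
  let offset = 0;
  while (offset < original.length) {
    const end = offset + codePointLength(original, offset);
    const char = original.slice(offset, end);
    if (/^[a-zA-Z]$/.test(char)) kept.push({ char, start: offset, end });
    offset = end;
  }
  return kept;
}

// Find `word` in the stripped text and return its [start, end) range
// in UTF-16 code units of the ORIGINAL text, or undefined if absent.
function findInOriginal(original: string, word: string): [number, number] | undefined {
  if (word.length === 0) return undefined;
  const kept = skipNonLetters(original);
  // Every kept char is one code unit, so indices into `transformed`
  // line up with indices into `kept`.
  const transformed = kept.map((k) => k.char.toLowerCase()).join("");
  const at = transformed.indexOf(word);
  if (at === -1) return undefined;
  return [kept[at].start, kept[at + word.length - 1].end];
}

const original = "💩 f*ck!";
const range = findInOriginal(original, "fck"); // [3, 7]
```

Here the emoji occupies two UTF-16 code units but is a single code point, so the `f` is at code unit offset 3 while being the third code point. `original.slice(3, 7)` recovers `f*ck`, including the skipped `*`.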

Recently, however, after your previous report I took the time to rework some of the code and fuzz tested it in 25bd1db and am now much more confident that things are as they should be. Addressing #46 should, I think, be considerably more straightforward after this, and it's possible we can do it in the next release.

Author

rion18 commented Aug 6, 2024

I did read #46, but not necessarily tied it to the use case presented here. I'll add asterisks in my word inputs since that will work for the moment as per your suggestion.

Thanks a lot for your hard work!!

Owner

jo3-l commented Aug 6, 2024

> I did read #46, but not necessarily tied it to the use case presented here.

That's fair. The title of #46 is a little misleading at the moment; it's more of a tracking issue to get the skipNonAlphabetic transformer re-enabled by default since the original problem there was fixed.

Owner

jo3-l commented Sep 1, 2024

For ease of tracking, I'm going to close this in favor of #46, which I just renamed to better reflect the current state of that issue. As discussed above, the suggestion ultimately presented there is a directly actionable way of solving the same problem in your original issue. Thanks!

@jo3-l jo3-l closed this as completed Sep 1, 2024