Request: Symbols that can represent multiple letters #73

rion18 · 2024-08-05T23:14:58Z

Description

I've tried a few combinations of EnglishTransformers, but I haven't been able to correctly censor words like sh*t or f*ck. Both words should be censored; however, in the first word * represents an i, and in the second it represents a u. Is there a way to create a new transformer that maps one symbol to multiple letters (or to a regex)?

Solution

I do not know how this can be implemented. Looking at the L33tspeak transformer, I can see there's a map per character:

	['a', '@4'],
	['c', '('],
	['e', '3'],
	['i', '1|'],
	['o', '0'],
	['s', '$'],

However, I don't know how it would work for a symbol that can stand for multiple characters where, for example, we could have:

	['*', 'any_letter_or_vowel_etc.'],
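For reference, a map in this shape is many-to-one: several symbols can resolve to the same letter, but each symbol resolves to exactly one letter, so every input has exactly one transformed string. The following is a minimal standalone sketch of that idea. It assumes each entry reads as [letter, symbols read as that letter]; the names `leetEntries`, `symbolToLetter`, and `resolveLeet` are hypothetical, and this is not Obscenity's actual implementation.

```typescript
// Hypothetical, standalone sketch; not Obscenity's actual implementation.
// Assumption: each entry is [letter, symbols that should be read as that letter].
const leetEntries: [string, string][] = [
	['a', '@4'],
	['c', '('],
	['e', '3'],
	['i', '1|'],
	['o', '0'],
	['s', '$'],
];

// Invert the entries into a symbol -> letter lookup table.
const symbolToLetter = new Map<string, string>();
for (const [letter, symbols] of leetEntries) {
	for (const symbol of symbols) symbolToLetter.set(symbol, letter);
}

// Replace every known symbol with its letter; other characters pass through unchanged.
function resolveLeet(text: string): string {
	let out = '';
	for (const ch of text) out += symbolToLetter.get(ch) ?? ch;
	return out;
}

console.log(resolveLeet('$h1t')); // "shit"
console.log(resolveLeet('sh*t')); // "sh*t" (no single letter is assigned to '*')
```

A lookup like this has no room for a symbol such as * that could stand for any of several letters, which is the gap this request describes.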

Code of Conduct

I agree to follow this project's Code of Conduct.

jo3-l · 2024-08-06T01:04:24Z

Allowing one character to map to multiple characters is intractable with Obscenity's current design, and I do not think it is something we would like to support--we would need to test patterns against all possible transformed strings instead of just one, potentially degrading performance significantly. (Consider the input text **** with * -> any of aeiou, for instance: there are 5^4 = 625 possible transformed strings. With adversarial inputs this could be disastrous.)
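To make the blow-up concrete, here is a small sketch that enumerates every candidate string when * may stand for any vowel. The function name `expandWildcards` and the constant `WILDCARD_LETTERS` are hypothetical and only illustrate the arithmetic above; nothing like this exists in Obscenity.

```typescript
// Hypothetical sketch: expand every '*' into each vowel to enumerate all candidate strings.
const WILDCARD_LETTERS = ['a', 'e', 'i', 'o', 'u'];

function expandWildcards(text: string): string[] {
	let candidates = [''];
	for (const ch of text) {
		// A '*' contributes five options; any other character contributes itself.
		const options = ch === '*' ? WILDCARD_LETTERS : [ch];
		const next: string[] = [];
		for (const prefix of candidates) {
			for (const option of options) next.push(prefix + option);
		}
		candidates = next;
	}
	return candidates;
}

console.log(expandWildcards('sh*t').length); // 5
console.log(expandWildcards('****').length); // 625
```

Each additional * multiplies the number of candidates by five, and every candidate would have to be tested against every pattern.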

The correct way to solve this issue is either to adjust the patterns (that is, add a pattern that matches directly on the text sh*t) or to strip out the * with a transformer. Some previous versions of Obscenity actually correctly identified a match in the input f*ck using the second approach (see the skipNonAlphabetic transformer, disabled by default due to #46). It has always been my goal to eventually add the skipNonAlphabetic transformer back after fixing that issue, but I have not gotten to it yet.
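The two approaches can be sketched in isolation as follows. This is a standalone illustration using plain substring checks, not Obscenity's API: the names `literalPatterns`, `matchesLiteral`, `stripNonLetters`, `strippedPatterns`, and `matchesAfterStrip` are all hypothetical, and the pattern lists are assumptions chosen for the two words in this issue.

```typescript
// Hypothetical, standalone sketch; not Obscenity's API.

// Approach 1: list the censored spelling itself as a literal pattern.
const literalPatterns = ['sh*t', 'f*ck'];

function matchesLiteral(text: string): boolean {
	const lowered = text.toLowerCase();
	return literalPatterns.some((p) => lowered.indexOf(p) !== -1);
}

// Approach 2: strip everything that is not an ASCII letter, then match
// patterns written with the hidden letter left out.
function stripNonLetters(text: string): string {
	return text.replace(/[^a-zA-Z]/g, '').toLowerCase();
}

const strippedPatterns = ['sht', 'fck'];

function matchesAfterStrip(text: string): boolean {
	const stripped = stripNonLetters(text);
	return strippedPatterns.some((p) => stripped.indexOf(p) !== -1);
}

console.log(matchesLiteral('what the F*ck')); // true
console.log(matchesAfterStrip('f*ck')); // true ("f*ck" becomes "fck")
console.log(matchesAfterStrip('hello')); // false
```

Either way, each input yields a single string to test, so the cost stays linear in the input rather than growing with the number of symbols. One caveat of this particular sketch is that it also strips spaces, so a match can span word boundaries ("fish tank" becomes "fishtank", which contains "sht").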

With these considerations in mind, I am inclined to close this specific request as wontfix, but I think the intent of your issue is actually already tracked in #46--so, to be clear, I do hope that eventually Obscenity's detection quality can be improved to catch the cases you mention, just not in the manner you propose. Does that sound reasonable to you?

jo3-l · 2024-08-06T01:38:51Z

For some context on why I have not yet fixed #46, the code dealing with transformations and matching is some of the more nasty code in this package, in part due to its age--I would have done things differently now compared to 3 years ago--and in part due to the complexity in mapping match positions in the transformed text back to the original text in a Unicode-aware way. (Working in both Unicode code points and UTF-16 code units depending on context makes this even nastier.) Consequently, for some time, this code was on rather shaky ground (as you observed in #71), and I was very reluctant to adjust it for fear of breaking it more.
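As a rough illustration of the position-mapping problem, the sketch below strips non-letters while recording, for each kept character, its UTF-16 offset in the original input, so that a match found in the transformed text can be mapped back. It is a simplified assumption of how such bookkeeping might look; `StrippedText`, `stripWithOffsets`, and `toOriginalRange` are hypothetical names, not Obscenity's internals.

```typescript
// Hypothetical sketch; not Obscenity's actual implementation.
interface StrippedText {
	text: string; // transformed text: ASCII letters only, lowercased
	originalOffsets: number[]; // originalOffsets[i] is the UTF-16 offset in the input of text[i]
}

function stripWithOffsets(input: string): StrippedText {
	let text = '';
	const originalOffsets: number[] = [];
	let offset = 0; // current position in the input, in UTF-16 code units
	for (const ch of input) {
		if (/^[a-zA-Z]$/.test(ch)) {
			text += ch.toLowerCase();
			originalOffsets.push(offset);
		}
		// Advance by the character's width in code units (2 for a surrogate pair).
		offset += ch.length;
	}
	return { text, originalOffsets };
}

// Map a non-empty match [start, end) in the transformed text back to a
// [start, end) range of UTF-16 offsets in the original input.
function toOriginalRange(stripped: StrippedText, start: number, end: number): [number, number] {
	// Kept characters are ASCII letters, so each occupies exactly one code unit.
	return [stripped.originalOffsets[start], stripped.originalOffsets[end - 1] + 1];
}

// "f", an emoji written as a surrogate pair (two code units), "*", "c", "k".
const sample = stripWithOffsets('f\uD83D\uDE00*ck');
console.log(sample.text); // "fck"
console.log(sample.originalOffsets); // [0, 4, 5]
console.log(toOriginalRange(sample, 0, 3)); // [0, 6]
```

In the sample, "c" is the fourth code point of the input but sits at UTF-16 offset 4, which is the kind of mismatch between code points and code units mentioned above.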

Recently, however, after your previous report I took the time to rework some of the code and fuzz tested it in 25bd1db and am now much more confident that things are as they should be. Addressing #46 should, I think, be considerably more straightforward after this, and it's possible we can do it in the next release.

rion18 · 2024-08-06T02:04:54Z

I did read #46, but not necessarily tied it to the use case presented here. I'll add asterisks in my word inputs since that will work for the moment as per your suggestion.

Thanks a lot for your hard work!!

jo3-l · 2024-08-06T02:08:49Z

> I did read #46, but not necessarily tied it to the use case presented here.

That's fair. The title of #46 is a little misleading at the moment; it's more of a tracking issue to get the skipNonAlphabetic transformer re-enabled by default since the original problem there was fixed.

jo3-l · 2024-09-01T01:28:06Z

For ease of tracking, I'm going to close this in favor of #46, which I just renamed to better reflect the current state of that issue. As discussed above, the suggestion ultimately presented there is a directly actionable way of solving the same problem in your original issue. Thanks!

rion18 added the enhancement label Aug 5, 2024

jo3-l closed this as completed Sep 1, 2024
