
reenable non-alphabetic stripping by default #46

Open
3 of 5 tasks
HatScripts opened this issue Jan 5, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@HatScripts
Contributor

HatScripts commented Jan 5, 2024

Expected behavior

Inputting all of the words in EnglishProfaneWord, I expected all of them to be censored.

Actual behavior

As you can see on this CodePen, not all of the words get censored.

Minimal reproducible example

import {
  RegExpMatcher,
  TextCensor,
  englishDataset,
  englishRecommendedTransformers,
  keepStartCensorStrategy,
  keepEndCensorStrategy,
  asteriskCensorStrategy
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers
})

const strategy = keepStartCensorStrategy(keepEndCensorStrategy(asteriskCensorStrategy()))
const censor = new TextCensor().setStrategy(strategy)

const words = 'abbo abeed africoon anal anus arabush arse ass bastard bestiality bitch blowjob boob boonga buttplug chingchong chink cock cuck cum cunt deepthroat dick dildo doggystyle double penetration ejaculate fag felch fellatio finger bang fisting fuck gangbang handjob hentai hooker incest jerk off jizz lubejob masturbate nigger orgasm orgy penis porn pussy rape retard scat semen sex slut tit tranny vagina whore'

const matches = matcher.getAllMatches(words)
console.log(censor.applyTo(words, matches))

Steps to reproduce

  1. View console
  2. Observe that not all words are censored

Additional context

Here is a less minimal CodePen with an input and output textarea: https://codepen.io/HatScripts/pen/NWJxEKW

Node.js version

N/A

Obscenity version

v0.1.4

Priority

  • Low
  • Medium
  • High

Terms

  • I agree to follow the project's Code of Conduct.
  • I have searched existing issues for similar reports.
@HatScripts HatScripts added the bug Something isn't working label Jan 5, 2024
@jo3-l
Owner

jo3-l commented Jan 5, 2024

Hi, thanks for the report. The problem here is caused by a very unfortunate interaction of two features, as follows. (Here, I refer to the specific example of the blacklisted term anal for purposes of explanation only--the issue generalizes.)

  1. The primary pattern for anal has a word boundary at the start, so as to not erroneously match words such as banal.
    • This shouldn't technically be an issue since your sample text contains anal on a separate line... but,
  2. By default, the skip-non-alphabetic transformer strips out non-alphabetic characters from the text before matching occurs. This allows a pattern like fuck to match f u c k. Here, however, this transformation has the unintended side effect of removing the newline, leaving ...africoonanal....
  3. Now the pattern for anal no longer matches, because anal does not appear as a separate word in the text after the transformation is applied.
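The interaction above can be sketched in a few lines of plain JavaScript. (This is a simulation of the behavior described, not Obscenity's actual internals.)

```javascript
// Simulation of the feature interaction described above.
const text = 'africoon\nanal';

// Step 2: stripping non-alphabetic characters also removes the newline
// that separated the two words.
const transformed = text.replace(/[^a-z]/gi, '');
console.log(transformed); // 'africoonanal'

// Step 3: a pattern for 'anal' with a leading word boundary matches the
// original text but not the transformed text, where 'anal' is now
// preceded by a letter.
console.log(/\banal/.test(text));        // true
console.log(/\banal/.test(transformed)); // false
```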

There are a couple of ways we could go about addressing this:

a. Remove the skip-non-alphabetic transformer entirely; it is the root cause of an existing issue (see #23) and probably more. The downside is that we will no longer match f u c k given the pattern fuck. This seems unfortunate.
b. Rework the matching logic so it checks word boundaries against the text before transformations are applied.

Option B is evidently the more 'correct' fix but it requires some rearchitecting of the internals of obscenity. This is long overdue--I wrote much of this library ~4 years ago and looking back, there's a lot of overengineering and things I would do differently now. Unfortunately, I'm currently very swamped with exams & university applications and do not anticipate having the requisite time until the end of the month or February at the earliest.

So that leaves Option A. This fix is something you can do in your own code: provide a custom set of transformers that does not include the skip-non-alphabetic routine. There will be some undesirable degradation in match quality, as mentioned previously, so it is up to you to decide whether you still wish to use Obscenity despite this.
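To make the tradeoff concrete, here is a plain-regex simulation (not Obscenity's API) of what the skip-non-alphabetic step buys you:

```javascript
// The tradeoff of Option A, simulated with a bare regex.
const strip = (s) => s.replace(/[^a-z]/gi, '');

// Without stripping, the spaced-out form slips through...
console.log(/fuck/.test('f u c k'));        // false
// ...with stripping, it is caught.
console.log(/fuck/.test(strip('f u c k'))); // true
```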

Thanks again for the report and reproduction, and apologies for not being able to address this more promptly.

jo3-l added a commit that referenced this issue Jan 5, 2024
For #23, #46.

BREAKING CHANGE: Using the default English preset, Obscenity will no longer strip non-alphabetic characters from the input text before matching.

This addresses a class of egregious false negatives in previous versions (see #23), but introduces a regression where cases such as 'f u c k' (with the space) will no longer be detected by default. We expect to provide a more comprehensive fix in the next minor release.

If desired, it remains possible to revert to the previous behavior by providing a custom set of transformers to the matcher.
@jo3-l
Owner

jo3-l commented Jan 5, 2024

I thought about this issue some more and released v0.2.0 with Option A (disabling the problematic transformation by default). As outlined above, this avoids the false negatives seen in your report at the cost of regressing in cases such as f u c k. I still hope to provide a more complete fix that avoids the regression in a future version, but wanted to get a release out immediately since this seems a rather critical problem.

@HatScripts
Contributor Author

HatScripts commented Jan 5, 2024

Thanks for the detailed response. I tried running the code again and unfortunately there are still some words not being censored, and I'm not sure why.

a**o a***d a******n a**l a**s a*****h a**e a*s b*****d b*****lity b***h b*****b b**b b****a buttplug c********g c***k c**k c**k c*m c**t d********t d**k d***o d*******le d************ation e****late f*g f***h f****tio finger bang f****ng f**k g******g h*****b h****i h****r i****t jerk off jizz l*****b m********e n****r o****m o**y p***s p**n p***y r**e r****d s**t s***n s*x s**t s**t t*t tranny v****a w***e

As seen above, the following words are still failing to be censored:

  • buttplug
  • finger bang (fingerbang works)
  • jerk off (jerkoff doesn't work)
  • jizz
  • tranny

@HatScripts
Contributor Author

HatScripts commented Jan 5, 2024

I just realised the reason those 4 aren't working is because of the collapseDuplicatesTransformer.

  • buttplug → butplug
  • jerkoff → jerkof
  • jizz → jiz
  • tranny → trany

This can be fixed by either

  • Updating the default customThresholds for the collapseDuplicatesTransformer
  • Updating the patterns for the 4 aforementioned words, e.g. as a regex but+plug or tran+y. Not sure what that would look like with your pattern syntax; perhaps but[t]plug?
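To illustrate, here is a plain-JavaScript sketch of collapsing with a threshold of 1 (a simulation, not the actual collapseDuplicatesTransformer implementation):

```javascript
// Every run of a repeated character is collapsed to a single occurrence,
// so patterns containing legitimate double letters stop matching.
const collapse = (s) => s.replace(/(.)\1+/g, '$1');

console.log(collapse('buttplug')); // 'butplug' -- the pattern 'buttplug' no longer matches
console.log(collapse('jerkoff'));  // 'jerkof'
console.log(collapse('jizz'));     // 'jiz'
console.log(collapse('tranny'));   // 'trany'
```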

@HatScripts
Contributor Author

HatScripts commented Jan 5, 2024

Is it possible to make the skip-non-alphabetic transformer simply not skip newlines (\n)? Or would this bug still persist regardless?

Skipping non-alphabetic characters is a very nice feature to have, and as you say, without it we suffer "some undesirable degradation in match quality".

There's also the issue of multi-word phrases such as double penetration, finger bang, jerk off, etc. With the skip-non-alphabetic transformer disabled, finger bang, for instance, will no longer be detected, as its pattern is lacking the space. double penetration on the other hand, is detected, but doublepenetration is not. In my opinion, a space ( ) in the pattern should be inferred to mean "an optional space"; either a space or no character. If necessary, an underscore (_) could be used to signify a mandatory space.
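The "optional space" semantics suggested above can be sketched with an ordinary regex (hypothetical; Obscenity's pattern syntax differs):

```javascript
// ' ?' matches either a single space or nothing, so both the spaced and
// joined forms of the phrase are detected.
const fingerBang = /finger ?bang/;
console.log(fingerBang.test('finger bang')); // true
console.log(fingerBang.test('fingerbang'));  // true
```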

@jo3-l
Owner

jo3-l commented Jan 5, 2024

Is it possible to make the skip-non-alphabetic transformer simply not skip newlines (\n)? Or would this bug still persist regardless?

It's possible to do this, but we would still run into issues with not matching on n anal (or similar). I still believe the correct fix is to consider word boundaries with respect to the original text, but as originally stated this is non-trivial with the current implementation. So I'm inclined to leave this as is for the time being until I get around to a refactor, but am willing to reconsider if you feel strongly.

Re: the issues w/ buttplug, jerkoff, jizz, tranny, updating the thresholds is probably the correct fix for the current design, but I'd need to take a closer look. (Currently changing the thresholds is a rather cumbersome task because we need to take care not to introduce any new false negatives. Suppose, e.g., that we set the threshold of u to 2. Then fuuck would no longer match the pattern fuck, and so we need to add another pattern fuuck to compensate. I suspect this problem may arise if we naively update the thresholds for the problematic letters. The refactor I have in mind should avoid this issue, FWIW.)
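The fuuck pitfall can be sketched in plain JavaScript (again a simulation of the thresholds, not the actual transformer):

```javascript
// With a threshold of 2 for 'u', only runs of three or more are collapsed,
// so 'fuuck' survives unchanged and the pattern 'fuck' misses it.
const collapseU2 = (s) => s.replace(/u{3,}/g, 'uu');
console.log(collapseU2('fuuck'));              // 'fuuck' (unchanged)
console.log(/fuck/.test(collapseU2('fuuck'))); // false

// With a threshold of 1, the run collapses and the match is restored.
const collapseU1 = (s) => s.replace(/u+/g, 'u');
console.log(/fuck/.test(collapseU1('fuuck'))); // true
```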

@jo3-l jo3-l changed the title bug: Certain words not being censored reenable non-alphabetic stripping by default Sep 1, 2024
@jo3-l jo3-l added enhancement New feature or request and removed bug Something isn't working labels Sep 1, 2024
@GreenArrow18

What are the options to handle cases like f u c k or f.u.c.k?
