[Feature Request]: Content Filter Layer for Input and Output of LLMs #15
Comments
Could you please provide an update on this? @shreypandey
We have not yet started working on it. @Lekhanrao
Needs to be converted to the C4GT issue template so it can be picked up by C4GT folks.
I can try to work on this if you don't mind. Could you point me to the resources that will help me understand and solve this issue? Thanks
@yashpal2104, Thank you. You can first touch base with @KaranrajM (Karan) and @DevvStrange (Arun). We can take it forward from there. I have assigned this ticket to you now.
Hi @yashpal2104, the task here is to avoid passing unwanted user information to the LLMs and to keep harmful content from the LLM from reaching users. You can use this link to learn more about guardrails: https://avidml.org/blog/llm-guardrails-4/. Here are some resources to get you started:
Thanks for the resources, I will get on it after checking them out.
@yashpal2104, is there any update on this, please?
Hey, I did try it but was not able to come to a definitive answer; I'm sorry for the delay. If you want, you can assign it to someone else. If I find a solution, I will surely update you here.
Assigned the ticket to @DevvStrange.
@DevvStrange is looking at different models and will be able to come to some conclusion by 21st Oct.
We have explored and tested the models below:
Based on testing, both models perform more or less the same. The omni model is easy to plug in since it is available as an API, and its dataset was recently updated, improving performance over the previous text-moderation model. For more details: https://platform.openai.com/docs/guides/moderation
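For reference, calling the hosted moderation endpoint is only a few lines with the openai Python SDK. A minimal sketch, assuming openai>=1.x and OPENAI_API_KEY set in the environment (check the linked docs for the exact response fields):

```python
# Minimal sketch of an OpenAI moderation check (assumes openai>=1.x and
# OPENAI_API_KEY set in the environment; see the linked docs for details).
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="Some user-submitted text to screen before it reaches the LLM.",
)

result = response.results[0]
print("flagged:", result.flagged)          # True if any category was triggered
print("categories:", result.categories)    # per-category booleans (hate, violence, ...)
print("scores:", result.category_scores)   # per-category confidence scores
```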
@DevvStrange, could you please share what the next steps here are?
Implementation is already done; test cases are yet to be written, and this will be closed before the end of this week.
Is your feature request related to a problem? Please describe.
Summary
I am writing to propose the addition of a content filter layer for both the input and output stages of the project's large language models (LLMs). This enhancement aims to improve the robustness and safety of our interactions by preventing inappropriate or harmful content from being processed or generated by the LLMs.
Background
As our project grows in popularity and usage, the variety of inputs the LLMs have to process will inevitably increase. While the LLMs are designed to understand and generate human-like responses, there is a risk of encountering or producing content that may be offensive, biased, or otherwise inappropriate. Implementing a content filter layer can significantly mitigate these risks, ensuring a more positive and safe experience for all users.
Describe the solution you'd like
Moderation Model from OpenAI
Before Anything Goes Through (Input Filter): Let’s use OpenAI's text-moderation-latest model (or any other variant) right off the bat to check what people are submitting. It’s pretty good at catching hate speech, anything too adult, or just plain violent content. If something sketchy pops up, we can either tell the user we can’t process it or clean it up a bit if possible.
Before Anything Goes Out (Output Filter): Once our LLM comes up with a response, let’s run it through the same filter. We want to make sure everything we send back is appropriate and doesn’t rub anyone the wrong way.
If Something’s Not Right: If we catch any problematic words or ideas in the output, we can tweak the response to fix it or notify the user (a rough sketch of this flow follows after the link below).
https://platform.openai.com/docs/models/moderation
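To make the flow concrete, here is a minimal sketch of how the two gates could wrap an LLM call. It assumes openai>=1.x; generate_llm_response and the fallback message are hypothetical placeholders, not existing project code:

```python
# Sketch of the proposed input/output guardrail flow (assumes openai>=1.x).
from openai import OpenAI

client = OpenAI()
FALLBACK = "Sorry, I can't help with that request."


def is_flagged(text: str) -> bool:
    """Return True if the moderation model flags the text."""
    response = client.moderations.create(model="omni-moderation-latest", input=text)
    return response.results[0].flagged


def generate_llm_response(prompt: str) -> str:
    """Hypothetical placeholder for the project's actual LLM call."""
    return "LLM response to: " + prompt


def guarded_reply(user_input: str) -> str:
    # Input filter: refuse before the prompt ever reaches the LLM.
    if is_flagged(user_input):
        return FALLBACK
    llm_output = generate_llm_response(user_input)
    # Output filter: never return a flagged completion to the user.
    if is_flagged(llm_output):
        return FALLBACK
    return llm_output
```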
Goal:
Add a moderation-based content filter on both the input to and the output from the LLMs, as described in the solution above.
Expected Outcome:
The content filter should process both the inputs to and the outputs from the LLMs, eliminating any harmful content given to or produced by them.
Acceptance criteria
The function should remove any harmful content from the text given to it.
Implementation details:
Create a function filter inside the class content_filter in the file jb-manager-bot/jb_manager_bot/content_filter/__init__.py, which takes in a string and returns a string (a rough sketch follows below). You can use the OptionParser function for reference.
You can use any moderation models.
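For reference, a minimal sketch of what the requested function could look like, using OpenAI's moderation endpoint. The class and function names follow the wording above, the __init__.py path is taken from the description, and the replacement text for blocked content is an assumption rather than project policy:

```python
# jb_manager_bot/content_filter/__init__.py -- sketch only (assumes openai>=1.x).
from openai import OpenAI

BLOCKED_MESSAGE = "[This content was removed because it was flagged by the moderation model.]"


class content_filter:
    """Content filter applied to both LLM inputs and outputs."""

    def __init__(self, model: str = "omni-moderation-latest"):
        self.client = OpenAI()
        self.model = model

    def filter(self, text: str) -> str:
        """Take a string and return a string, replacing flagged content."""
        response = self.client.moderations.create(model=self.model, input=text)
        if response.results[0].flagged:
            return BLOCKED_MESSAGE
        return text
```

Callers would then apply the same content_filter().filter(...) call to the inbound prompt and to the outbound completion; unit tests can assert that flagged strings come back replaced and benign strings come back unchanged.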
Mockups/Wireframes:
NOT APPLICABLE
Tech skills needed:
Python
Data science
Complexity
Medium
Category:
Backend
Additional context
No response