
[Feature Request]: Content Filter Layer for Input and Output of LLMs #15

Open
kanak8278 opened this issue Apr 10, 2024 · 14 comments
Labels: C4GT Coding (Coding related), C4GT Community (Assignable to C4GT community developers), enhancement (New feature or request)


kanak8278 commented Apr 10, 2024

Is your feature request related to a problem? Please describe.

Summary
I am writing to propose the addition of a content filter layer for both the input and output stages of the project's large language models (LLMs). This enhancement aims to improve the robustness and safety of our interactions by preventing inappropriate or harmful content from being processed or generated by the LLMs.

Background
As our project grows in popularity and usage, the variety of inputs the LLMs have to process will inevitably increase. While the LLMs are designed to understand and generate human-like responses, there is a risk of encountering or producing content that may be offensive, biased, or otherwise inappropriate. Implementing a content filter layer can significantly mitigate these risks, ensuring a more positive and safe experience for all users.

Describe the solution you'd like

Moderation Model from OpenAI
Before Anything Goes Through (Input Filter): Let’s use OpenAI's text-moderation-latest model (or any other variant) right off the bat to check what people are submitting. It’s pretty good at catching hate speech, anything too adult, or just plain violent content. If something sketchy pops up, we can either tell them we can’t process it or clean it up a bit if possible.

Before Anything Goes Out (Output Filter): Once our LLM comes up with a response, let’s run it through the same filter. We want to make sure everything we send back is cool and doesn’t rub anyone the wrong way.

If Something’s Not Right: If we catch any no-no words or ideas in the output, we can tweak the response to fix it up or notify the user.

https://platform.openai.com/docs/models/moderation

Goal:
Moderate both the user input and the LLM output with the moderation model, exactly as described in the solution above.

Expected Outcome:
The content filter should process both the inputs to and the outputs from the LLMs, eliminating any harmful content submitted to or produced by them.

Acceptance criteria
The function should remove any harmful content from the text it is given.

Implementation details:
Create a function `filter` inside the class `content_filter` in the file jb-manager-bot/jb_manager_bot/content_filter/__init__.py, which takes in a string and returns a string. You can use the `OptionParser` function for reference; a sketch of the expected shape follows below.
You can use any moderation model.

Mockups/Wireframes:
NOT APPLICABLE

Tech skills needed:
Python
Data science

Complexity
Medium

Category:
Backend

Additional context

No response

kanak8278 added the enhancement (New feature or request) label on Apr 10, 2024
@Lekhanrao (Collaborator)

Could you please provide an update on this? @shreypandey

@kanak8278 (Collaborator, Author)

We have not yet started working on it. @Lekhanrao

shreypandey removed their assignment on Jul 31, 2024
@KaranrajM (Contributor)

This needs to be converted to the C4GT issue template so it can be picked up by C4GT folks.

Lekhanrao added the C4GT Community (Assignable to C4GT community developers) and C4GT Coding (Coding related) labels on Aug 14, 2024
@yashpal2104

I can try to work on this if you don't mind. Could you point me to the resources that will help me understand and solve this issue? Thanks.

@Lekhanrao (Collaborator)

@yashpal2104, thank you. You can first touch base with @KaranrajM (Karan) and @DevvStrange (Arun). We can take it forward from there. I have assigned this ticket to you now.

@KaranrajM (Contributor)

Hi @yashpal2104, the task here is to keep unwanted user input from reaching the LLMs and to keep harmful content from the LLMs from reaching users. You can use this link https://avidml.org/blog/llm-guardrails-4/ to learn more about guardrails. Here are some resources to get you started:

  1. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter
  2. https://deveshsurve.medium.com/beginners-guide-to-llms-build-a-content-moderation-filter-and-learn-advanced-prompting-with-free-87f3bad7c0af

@yashpal2104

Thanks for the resources; I'll get on it after checking them out.

@Lekhanrao (Collaborator)

@yashpal2104, is there any update on this, please?

@yashpal2104

Hey, I did try it but wasn't able to reach a definitive solution; I'm sorry for the delay. If you want, you can assign it to someone else. If I find a solution, I will surely update you here.

@Lekhanrao (Collaborator)

Assigned ticket to @DevvStrange

@Lekhanrao (Collaborator)

@DevvStrange is looking at different models and expects to reach a conclusion by 21st Oct.

@DevvStrange (Contributor)

We have explored and tested the models below:

  • OpenAI/omni-moderation-latest - OpenAI's model [free to use]
  • KoalaAI/Text-Moderation - open-source moderation model

Based on testing, both models perform more or less the same. The omni model is easy to plug in since it's available as an API, and its dataset was recently updated, which improved performance over the previous text-moderation model. For more details - https://platform.openai.com/docs/guides/moderation

@Lekhanrao (Collaborator)

@DevvStrange, could you please share what the next steps are here?

@DevvStrange (Contributor)

The implementation is already done; test cases are yet to be written, and the issue will be closed before the end of this week.
