[Feature Request]: Content Filter Layer for Input and Output of LLMs #15
Comments
Could you please provide an update on this? @shreypandey
We have not yet started working on it. @Lekhanrao
Needs to be converted to the C4GT issue template so it can be picked up by C4GT folks.
I can try to work on this if you don't mind. Could you point me to the resources that will help me understand and solve this issue? Thanks
@yashpal2104, Thank you. You can first touch base with @KaranrajM (Karan) and @DevvStrange (Arun). We can take it forward from there. I have assigned this ticket to you now.
Hi @yashpal2104, the task here is to avoid passing unwanted user information to the LLMs and to keep harmful content from the LLM from reaching users. You can use this link to learn more about guardrails: https://avidml.org/blog/llm-guardrails-4/. Here are some resources to get you started:
Thanks for the resources, I will get on it after checking them out.
@yashpal2104, is there any update on this, please?
Hey, I did try it but was not able to come to a definitive answer; I'm sorry for the delay. If you want, you can assign it to someone else. If I find a solution, I will surely update you here.
Assigned the ticket to @DevvStrange.
@DevvStrange is looking at different models and will be able to come to some conclusion by 21st Oct.
We have explored and tested the models below:
Based on testing, both models perform more or less the same. The omni model is easy to plug in since it is available as an API, and its dataset was recently updated, improving performance over the previous text-moderation model. For more details: https://platform.openai.com/docs/guides/moderation
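For reference, calling the hosted moderation endpoint is only a few lines with the openai Python SDK. A minimal sketch, assuming openai>=1.x and OPENAI_API_KEY set in the environment (check the linked docs for the exact response fields):

```python
# Minimal sketch of an OpenAI moderation check (assumes openai>=1.x and
# OPENAI_API_KEY set in the environment; see the linked docs for details).
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="Some user-submitted text to screen before it reaches the LLM.",
)

result = response.results[0]
print("flagged:", result.flagged)          # True if any category was triggered
print("categories:", result.categories)    # per-category booleans (hate, violence, ...)
print("scores:", result.category_scores)   # per-category confidence scores
```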
@DevvStrange, could you please share what the next steps here are?
Implementation is already done; test cases are yet to be written, and this will be closed before the end of this week.
Is your feature request related to a problem? Please describe.
Summary
I am writing to propose the addition of a content filter layer for both the input and output stages of the project's large language models (LLMs). This enhancement aims to improve the robustness and safety of our interactions by preventing inappropriate or harmful content from being processed or generated by the LLMs.
Background
As our project grows in popularity and usage, the variety of inputs the LLMs have to process will inevitably increase. While the LLMs are designed to understand and generate human-like responses, there is a risk of encountering or producing content that may be offensive, biased, or otherwise inappropriate. Implementing a content filter layer can significantly mitigate these risks, ensuring a more positive and safe experience for all users.
Describe the solution you'd like
Moderation Model from OpenAI
Before Anything Goes Through (Input Filter): Let’s use OpenAI's text-moderation-latest model (or any other variant) right off the bat to check what people are submitting. It’s pretty good at catching hate speech, anything too adult, or just plain violent content. If something sketchy pops up, we can either tell the user we can’t process it or clean it up a bit if possible.
Before Anything Goes Out (Output Filter): Once our LLM comes up with a response, let’s run it through the same filter. We want to make sure everything we send back is appropriate and doesn’t rub anyone the wrong way.
If Something’s Not Right: If we catch any problematic words or ideas in the output, we can tweak the response to fix it or notify the user (a rough sketch of this flow follows after the link below).
https://platform.openai.com/docs/models/moderation
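To make the flow concrete, here is a minimal sketch of how the two gates could wrap an LLM call. It assumes openai>=1.x; generate_llm_response and the fallback message are hypothetical placeholders, not existing project code:

```python
# Sketch of the proposed input/output guardrail flow (assumes openai>=1.x).
from openai import OpenAI

client = OpenAI()
FALLBACK = "Sorry, I can't help with that request."


def is_flagged(text: str) -> bool:
    """Return True if the moderation model flags the text."""
    response = client.moderations.create(model="omni-moderation-latest", input=text)
    return response.results[0].flagged


def generate_llm_response(prompt: str) -> str:
    """Hypothetical placeholder for the project's actual LLM call."""
    return "LLM response to: " + prompt


def guarded_reply(user_input: str) -> str:
    # Input filter: refuse before the prompt ever reaches the LLM.
    if is_flagged(user_input):
        return FALLBACK
    llm_output = generate_llm_response(user_input)
    # Output filter: never return a flagged completion to the user.
    if is_flagged(llm_output):
        return FALLBACK
    return llm_output
```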
Goal:
Add a moderation-based content filter on both the input to and the output from the LLMs, as described in the solution above.
Expected Outcome:
The content filter should process both the inputs to and the outputs from the LLMs, eliminating any harmful content given to or produced by them.
Acceptance criteria
The function should remove any harmful content from the text given to it.
Implementation details:
Create a function filter inside the class content_filter in the file jb-manager-bot/jb_manager_bot/content_filter/__init__.py, which takes in a string and returns a string (a rough sketch follows below). You can use the OptionParser function for reference.
You can use any moderation models.
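For reference, a minimal sketch of what the requested function could look like, using OpenAI's moderation endpoint. The class and function names follow the wording above, the __init__.py path is taken from the description, and the replacement text for blocked content is an assumption rather than project policy:

```python
# jb_manager_bot/content_filter/__init__.py -- sketch only (assumes openai>=1.x).
from openai import OpenAI

BLOCKED_MESSAGE = "[This content was removed because it was flagged by the moderation model.]"


class content_filter:
    """Content filter applied to both LLM inputs and outputs."""

    def __init__(self, model: str = "omni-moderation-latest"):
        self.client = OpenAI()
        self.model = model

    def filter(self, text: str) -> str:
        """Take a string and return a string, replacing flagged content."""
        response = self.client.moderations.create(model=self.model, input=text)
        if response.results[0].flagged:
            return BLOCKED_MESSAGE
        return text
```

Callers would then apply the same content_filter().filter(...) call to the inbound prompt and to the outbound completion; unit tests can assert that flagged strings come back replaced and benign strings come back unchanged.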
Mockups/Wireframes:
NOT APPLICABLE
Tech skills needed:
Python
Data science
Complexity
Medium
Category:
Backend
Additional context
No response