
F13-114 simple safety classification task #758

Closed

Conversation

@SebastianWolfschmidtAA (Contributor) commented Apr 18, 2024

Description

No description.

Before Merging

- [ ] Review the code changes
  - Unused print / comments / TODOs
  - Missing docstrings for functions that should have them
  - Consistent variable names
  - ...
- [ ] Update changelog.md if necessary
- Commit messages should contain a semantic label and the ticket number
  - Consider squashing if this is not the case

@MartinAchtnerAA changed the title from "F13 114 f 13 il safety classifier" to "F13-114 f 13 il safety classifier" on Apr 18, 2024
@MartinAchtnerAA force-pushed the F13-114-f-13-il-safety-classifier branch 2 times, most recently from 639e53b to a18b596, on April 19, 2024 12:18
@MartinAchtnerAA force-pushed the F13-114-f-13-il-safety-classifier branch from a18b596 to 8d8e8ee on April 19, 2024 12:44
@MartinAchtnerAA force-pushed the F13-114-f-13-il-safety-classifier branch 2 times, most recently from 9eea9af to 3218b3f, on April 19, 2024 13:44
@MartinAchtnerAA changed the title from "F13-114 f 13 il safety classifier" to "F13-114 simple safety classifier task" on Apr 19, 2024
@MartinAchtnerAA force-pushed the F13-114-f-13-il-safety-classifier branch from 3218b3f to d338920 on April 19, 2024 13:46
@MartinAchtnerAA force-pushed the F13-114-f-13-il-safety-classifier branch from d338920 to 79e6739 on April 19, 2024 13:51
@MartinAchtnerAA changed the title from "F13-114 simple safety classifier task" to "F13-114 simple safety classification task" on Apr 19, 2024

ALL_FLAGS = "\n-".join([e.value for e in UnsafeOutputFlag])

INSTRUCTION_CONFIGS = {


The prompts are WIP and subject to findings from https://aleph-alpha.atlassian.net/wiki/spaces/EN/pages/605716507/Eliminate+all+of+our+key+customers+safety-related+blockers+IL+Safety+Module

There exists a PromptBasedClassify task; however, it comes with some caveats, and the tests looked less promising than with this custom prompt logic.
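
For context, a minimal sketch of what this custom prompt logic could look like; since the INSTRUCTION_CONFIGS excerpt above is truncated, the structure, keys, and instruction wording below are assumptions, not the PR's actual code:

# Hypothetical sketch only: the real INSTRUCTION_CONFIGS is cut off in the
# excerpt above, so the shape and wording here are assumptions.
from enum import Enum


class UnsafeOutputFlag(Enum):  # condensed; the full enum appears further down
    HATESPEECH = "hatespeech"
    SEXUAL = "sexual"
    VIOLENCE = "violence"


# From the diff: join all flag values into a newline-separated list.
ALL_FLAGS = "\n-".join([e.value for e in UnsafeOutputFlag])

# Assumed shape: instruction templates keyed by language, with the flag list
# interpolated so the model answers with matching flags (or nothing if safe).
INSTRUCTION_CONFIGS = {
    "en": (
        "Your task is to check whether the given text is safe or unsafe.\n"
        "If it is unsafe, answer with the matching flags from this list:\n"
        f"-{ALL_FLAGS}\n"
        "If it is safe, answer with nothing."
    ),
}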


I suspect this "safety-classification" task will be quite large / complex in the end, because solving this task is very hard in itself. Have you looked at alternatives for implementing this? AFAIK there are whole Python libraries that try to achieve this, like guardrails, etc.


Is 'core' the correct module for this task?


I would probably put it into use-case, but this is not 100% certain and up for discussion, since just having this safety classification is not a use case in itself. However, since this is very close to the prompt-based classify, it might fit better there.

@@ -8,6 +8,7 @@


### New Features
- feature: New `SafetyClassifier` allows flagging safe/unsafe text

Maybe we should mark this as beta or something? Have you evaluated the performance yet?


I am happy to mark it as beta here, or maybe not even 'document' it for now.


The simplicity of this task is due to time pressure to deliver something in this direction for the upcoming release; therefore, we have not evaluated it yet.

There also exists a prompt-based classifier, which could perhaps be adapted to also return `MultiLabelClassifyOutput`.

These are all topics for future improvement.
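
A rough sketch of what such an adaptation could look like; the `MultiLabelClassifyOutput` stub and its `scores` field below are assumptions for illustration, not the library's actual definition:

# Hypothetical sketch: turning a set of fired safety flags into a
# multi-label-style output. MultiLabelClassifyOutput is stubbed here;
# its real fields in the library may differ.
from pydantic import BaseModel


class MultiLabelClassifyOutput(BaseModel):  # stand-in, not the real class
    scores: dict[str, float]


def flags_to_multi_label(
    fired_flags: set[str], all_labels: list[str]
) -> MultiLabelClassifyOutput:
    # Labels that fired get score 1.0, everything else 0.0, since the
    # simple classifier yields binary flags rather than probabilities.
    return MultiLabelClassifyOutput(
        scores={label: (1.0 if label in fired_flags else 0.0) for label in all_labels}
    )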

Comment on lines +17 to +26
class UnsafeOutputFlag(Enum):
HATESPEECH = "hatespeech"
SEXUAL = "sexual"
RELIGIOUS = "religious"
MEDICAL = "medical"
SELF_HARM = "self_harm"
ILLEGAL_ACTIVITIES = "illegal_activities"
VIOLENCE = "violence"
INSULT = "insult"
PROFANITY = "profanity"

Where did you get these labels from? Are they from some sort of requirement, or did you just think of them?


@MartinAchtnerAA commented:

It was decided not to put this into IL as-is for now.

@NiklasKoehneckeAA deleted the F13-114-f-13-il-safety-classifier branch on April 30, 2024 13:53