feat(weave): Add initial suite of scorers, refactor weave/flow #2662

Merged on Oct 28, 2024 (163 commits)

@morganmcg1 (Member) commented Oct 10, 2024

Scorers

Create opinionated scorers for users looking for off-the-shelf evaluators for common LLM issues.

TODO

  • Restructure scorers
  • Create initial set of scorers
  • Write tests for all new scorers
  • Add google.generativeai
  • Write documentation
  • Cleanup, remove any stray TODOs, comments, prints, commented-out code

New Scorers

This PR introduces several scorers for evaluating different aspects of model outputs. These scorers are designed to work with the existing Scorer base class and can be easily integrated into the evaluation pipeline. The LLM-based scorers support multiple LLM providers (OpenAI, Anthropic, Google Generative AI and Mistral) through a unified interface, and use the instructor library for a consistent API and structured outputs.

LLMScorer: A base class for LLM-based scorers.
InstructorLLMScorer: An LLM scorer for instructor-powered LLM clients.

  1. HallucinationScorer: Given a model output and an input, checks the output for hallucinations
  2. SummarizationScorer: Grades an output summary and also returns a measure of the entity-density of the summary
  3. EmbeddingScorer: Computes cosine similarity between embeddings of model output and target.
  4. OpenAIModerationScorer: Uses OpenAI's moderation API to check if the model output is safe.
  5. JSONScorer: Validates if the model output is a valid JSON string.
  6. XMLScorer: Checks if the model output is a valid XML string.
  7. PydanticScorer: Checks if the model output is valid for a given Pydantic model.
  8. ContextEntityRecallScorer: Estimates context recall (from the RAGAS library)
  9. ContextRelevancyScorer: Evaluates the relevancy of the provided context (from the RAGAS library)
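
As a rough illustration of what the non-LLM checks in this list amount to, here is a minimal standard-library sketch of JSON validity, XML validity, and cosine similarity. The function names (`is_valid_json`, `is_valid_xml`, `cosine_similarity`) are hypothetical and are not the classes added in this PR; the actual scorers may differ in return types and edge-case handling.

```python
import json
import math
import xml.etree.ElementTree as ET


def is_valid_json(output: str) -> bool:
    """Return True if `output` parses as JSON (hypothetical helper)."""
    try:
        json.loads(output)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return False
    return True


def is_valid_xml(output: str) -> bool:
    """Return True if `output` parses as a well-formed XML document."""
    try:
        ET.fromstring(output)
    except (TypeError, ET.ParseError):
        return False
    return True


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

In an embedding-based scorer, the two vectors would come from an embedding model applied to the model output and the target; that step is omitted here because it depends on the provider.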

User-facing API changes

  • outputs can now be used as a parameter in the score function as an alternative to model_outputs
  • added column_map as an optional attribute on Scorer, giving more flexibility when scorer parameter names differ from dataset column names
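
To make the two changes above concrete, here is a minimal sketch of how an evaluation loop might assemble the arguments for a score function. It assumes that column_map maps scorer parameter names to dataset column names; that direction, the helper `build_score_kwargs`, and the example `exact_match` function are all hypothetical and not taken from this PR.

```python
import inspect


def build_score_kwargs(score_fn, row, model_result, column_map=None):
    """Build keyword arguments for `score_fn` from a dataset row.

    Assumption: `column_map` is {scorer_param_name: dataset_column_name}.
    """
    column_map = column_map or {}
    kwargs = {}
    for name, param in inspect.signature(score_fn).parameters.items():
        if name in ("outputs", "model_outputs"):
            # Either spelling receives the model's result.
            kwargs[name] = model_result
            continue
        column = column_map.get(name, name)
        if column in row:
            kwargs[name] = row[column]
        elif param.default is inspect.Parameter.empty:
            raise KeyError(f"no dataset column for scorer parameter {name!r}")
    return kwargs


def exact_match(outputs, target):
    """Toy score function using the newer `outputs` parameter name."""
    return {"correct": outputs == target}


row = {"question": "2 + 2?", "expected_answer": "4"}
kwargs = build_score_kwargs(
    exact_match, row, "4", column_map={"target": "expected_answer"}
)
result = exact_match(**kwargs)
```

Without the column_map entry, the lookup for `target` would fail because the dataset column is named `expected_answer`; the mapping lets the scorer keep its own parameter names.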

Structural repo changes

  • created weave/flow/scorers and moved most core scoring functionality from weave/flow/scorer.py to weave/flow/scorers/base_scorer.py

    • weave/flow/scorer.py is kept around for now for backward compatibility
  • created the weave/scorers directory for high-level imports, so that scorers are easier to import. The user shouldn't have to know about flow:

    • from: from weave.flow.scorers.json_scorer import ValidJSONScorer
    • to: from weave.scorers import ValidJSONScorer

Documentation

[Nine screenshots of the new scorer documentation, captured 2024-10-16]

@morganmcg1 requested a review from a team as a code owner October 10, 2024 14:18
@tcapelle changed the title from "add inital scorere, refactor" to "add inital scorers, refactor" Oct 10, 2024
@morganmcg1 marked this pull request as draft October 10, 2024 19:27
@scottire merged commit f79fbcc into master Oct 28, 2024
218 checks passed
@scottire deleted the add_more_scorers branch October 28, 2024 18:29
@github-actions bot locked and limited conversation to collaborators Oct 28, 2024
5 participants