From c79f649c88bf69715f5bbc7c742865a19722e6b0 Mon Sep 17 00:00:00 2001
From: J2-D2-3PO <188380414+J2-D2-3PO@users.noreply.github.com>
Date: Fri, 13 Dec 2024 12:35:11 -0700
Subject: [PATCH] Add missing predefined scorers, cleanup overall
---
docs/docs/guides/core-types/evaluations.md | 41 ++-
docs/docs/guides/evaluation/custom-scorers.md | 15 +-
.../guides/evaluation/predefined-scorers.md | 262 +++++++++++++++++-
docs/docs/guides/evaluation/scorers.md | 100 ++++++-
4 files changed, 368 insertions(+), 50 deletions(-)
diff --git a/docs/docs/guides/core-types/evaluations.md b/docs/docs/guides/core-types/evaluations.md
index 217ad0cd3c4..39fd1a514b6 100644
--- a/docs/docs/guides/core-types/evaluations.md
+++ b/docs/docs/guides/core-types/evaluations.md
@@ -3,11 +3,11 @@ import TabItem from '@theme/TabItem';
# Evaluations
-To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your applications behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.
+To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your application's behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.
-In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with a overview of your application's performance in a rich UI to summarizing individual outputs and scores.
+In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with an overview of your application's performance in a rich UI that summarizes individual outputs and scores.
-This page describes the steps required to [create an evaluation](#create-an-evaluation), a [Python example](#python-example), and provides additional [usage notes and tips](#usage-notes-and-tips).
+This page describes the steps required to [create an evaluation](#create-an-evaluation). A complete [Python example](#python-example) and additional [usage notes and tips](#usage-notes-and-tips) are also included.
![Evals hero](../../../static/img/evals-hero.png)
@@ -21,20 +21,24 @@ To create an evaluation in Weave, follow these steps:
### Define an evaluation dataset
-First, create a test dataset that will be used to evaluate your application. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:
+First, create a test dataset to evaluate your application against. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:
1. Define a [Dataset](/guides/core-types/datasets).
-2. Define a list of dictionaries with a collection of examples to be evaluated.
+2. Define a list of dictionaries with a collection of examples to be evaluated. For example:
+ ```python
+ examples = [
+ {"question": "What is the capital of France?", "expected": "Paris"},
+ {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
+ {"question": "What is the square root of 64?", "expected": "8"},
+ ]
+ ```
+
Next, [define scoring functions](#define-scoring-functions).
### Define scoring functions
-Next, create a list of _scorers_. Scorers are functions used to score each example. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. The scorer will only use the necessary keys by using a dictionary key based on the argument name.
-
-:::tip
-Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
-:::
+Next, create a list of scoring functions, also known as _scorers_. Scorers are used to score the performance of an AI system against the evaluation dataset. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples; each argument name is matched to the key of the same name in the dataset example, so the scorer only receives the keys it needs.
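+For example, a minimal Python scorer might look like the following sketch (the `expected` key is hypothetical and assumed to be present in your dataset examples):
+
+```python
+import weave
+
+@weave.op()
+def match_scorer(expected: str, model_output: dict) -> dict:
+    # `expected` is taken from the matching key in the dataset example;
+    # `model_output` receives the AI system's output.
+    return {"match": expected == model_output.get("generated_text")}
+```
+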
The options available depend on whether you are using Typescript or Python:
@@ -42,6 +46,10 @@ The options available depend on whether you are using Typescript or Python:
There are three types of scorers available for Python:
+ :::tip
+ [Predefined scorers](../evaluation/predefined-scorers.md) are available for many common use cases. Before creating a custom scorer, check if one of the predefined scorers can address your use case.
+ :::
+
1. [Predefined scorer](../evaluation/predefined-scorers.md): Pre-built scorers designed for common use cases.
2. [Function-based scorers](../evaluation/custom-scorers#function-based-scorers): Simple Python functions decorated with `@weave.op`.
3. [Class-based scorers](../evaluation/custom-scorers#class-based-scorers): Python classes that inherit from `weave.Scorer` for more complex evaluations.
@@ -67,10 +75,14 @@ The options available depend on whether you are using Typescript or Python:
- Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for Typescript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you must use Python.
+ Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for Typescript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you will need to use Python.
+:::tip
+Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
+:::
+
Next, [define an evaluation target](#define-an-evaluation-target).
### Define an evaluation target
@@ -79,7 +91,8 @@ Once your test dataset and scoring functions are defined, you can define the tar
#### Evaluate a `Model`
-To evaluate a `Model`, call `evaluate` on using an `Evaluation`. `Models` are used when you have attributes that you want to experiment with and capture in Weave.
+`Models` are used when you have attributes that you want to experiment with and capture in Weave.
+To evaluate a `Model`, call the `evaluate` method on an `Evaluation`.
The following example runs `predict()` on each example and scores the output with each scoring function defined in the `scorers` list using the `examples` dataset.
@@ -92,7 +105,7 @@ class MyModel(Model):
@weave.op()
def predict(self, question: str):
- # here's where you would add your LLM call and return the output
+ # Here's where you would add your LLM call and return the output
return {'generated_text': 'Hello, ' + self.prompt}
model = MyModel(prompt='World')
@@ -176,7 +189,7 @@ evaluation = Evaluation(
)
```
-You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the Evaluation object name. In the UI, you will see the display name if set. Otherwise, the Evaluation object name will be used.
+You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the `Evaluation` object name. In the UI, you will see the display name if set. Otherwise, the `Evaluation` object name will be used.
```python
evaluation = Evaluation(
diff --git a/docs/docs/guides/evaluation/custom-scorers.md b/docs/docs/guides/evaluation/custom-scorers.md
index bea0b192eff..2505b292a7e 100644
--- a/docs/docs/guides/evaluation/custom-scorers.md
+++ b/docs/docs/guides/evaluation/custom-scorers.md
@@ -1,9 +1,9 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-# Custom Scorers
+# Custom scorers
-In Weave, you can create your own custom scorers. The Scorers can either be class-based or function-based.
+In Weave, you can create your own custom [scorers](../evaluation/scorers.md). The scorers can either be class-based or function-based.
:::tip
If you're using Python, there are various predefined scorers available for common use cases. For more information, see [Select a predefined scorer](../evaluation/predefined-scorers.md#select-a-predefined-scorer) on the [Predefined scorers page](../evaluation/predefined-scorers.md).
@@ -13,9 +13,9 @@ If you're using Python, there are various predefined scorers available for comm
Choosing the right type of custom scorer depends on your evaluation needs:
-- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. Function-based scorers are available in both Python and Typescript.
+- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. **Function-based scorers are available in both Python and Typescript.**
-- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. Class-based scorers are only available in Python.
+- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. **Class-based scorers are only available in Python.**
## Function-based scorers
@@ -31,7 +31,7 @@ Function-based scorers are available in both Python and Typescript.
- Is decorated with `@weave.op`.
- Returns a dictionary.
- ### Example
+ #### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:
```python
@@ -59,7 +59,7 @@ Function-based scorers are available in both Python and Typescript.
Optionally, the function can accept a `datasetRow`.
- ### Example
+ #### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:
```typescript
@@ -81,7 +81,7 @@ Function-based scorers are available in both Python and Typescript.
-## Class-based Scorers
+## Class-based scorers
:::note
This feature is not available in Typescript. All usage instructions and code examples in this section are for Python.
@@ -109,7 +109,6 @@ from weave import Scorer
llm_client = OpenAI()
-#highlight-next-line
class SummarizationScorer(Scorer):
model_id: str = "gpt-4o"
system_prompt: str = "Evaluate whether the summary is good."
diff --git a/docs/docs/guides/evaluation/predefined-scorers.md b/docs/docs/guides/evaluation/predefined-scorers.md
index 8ddb9e65faa..47cd64aeae2 100644
--- a/docs/docs/guides/evaluation/predefined-scorers.md
+++ b/docs/docs/guides/evaluation/predefined-scorers.md
@@ -2,13 +2,19 @@
This page provides an overview of Weave's predefined scorers, which support evaluation of AI models across various dimensions like hallucination detection, summarization quality, and content moderation. Predefined scorers currently support OpenAI, Anthropic, Google and MistralAI clients.
-:::note
-This feature is not available in Typescript. All usage instructions and code examples on this page are for Python. To create scorers in Typescript, see [function-based scorers](../evaluation/custom-scorers.md#function-based-scorers).
+:::important
+Predefined scorers are not available in Typescript. All usage instructions and code examples on this page are for Python. To create scorers in Typescript, see [function-based scorers](../evaluation/custom-scorers.md#function-based-scorers).
:::
+To get started with predefined scorers, complete the following steps:
+
+1. Learn how to [work with scorers](../../guides/evaluation/scorers.md#working-with-scorers).
+2. Complete the [prerequisites](#prerequisites).
+3. [Select the right predefined scorer](#select-a-predefined-scorer) for your use case.
+
## Prerequisites
-Weave's predefined scorers require additional dependencies:
+Predefined scorers require additional dependencies:
```bash
pip install weave[scorers]
@@ -18,24 +24,44 @@ pip install weave[scorers]
When deciding which predefined scorer to use, consider the type of evaluation your AI system requires:
-- [HallucinationFreeScorer](#hallucinationfreescorer):
+- [HallucinationFreeScorer](#hallucinationfreescorer):
Use if you need to identify whether your AI model generates hallucinations in its output.
-- [SummarizationScorer](#summarizationscorer):
+- [SummarizationScorer](#summarizationscorer):
Use if you need to evaluate the quality of summaries generated by your model. This scorer checks both the "information density" and the overall quality of the summary compared to the original text.
-- [OpenAIModerationScorer](#openaimoderationscorer):
- Use if you need to detect inappropriate content such as hate speech, violence, or explicit material in your model's output. This scoreruses OpenAI's Moderation API.
+- [OpenAIModerationScorer](#openaimoderationscorer):
+ Use if you need to detect inappropriate content such as hate speech, violence, or explicit material in your model's output. This scorer uses OpenAI's Moderation API.
- [EmbeddingSimilarityScorer](#embeddingsimilarityscorer):
Use if you need to measure how similar your AI's output is to a reference text. This scorer calculates the cosine similarity between embeddings, making it useful for tasks requiring output fidelity.
-- [ValidJSONScorer](#validjsonscorer):
+- [ValidJSONScorer](#validjsonscorer):
Use if you need to verify that your AI model produces valid JSON output. Essential for ensuring structured data compliance.
-- [ValidXMLScorer](#validxmlscorer):
+- [ValidXMLScorer](#validxmlscorer):
Use if you need to check whether your AI system generates valid XML. This is particularly useful for scenarios involving XML-based data exchange or configuration.
-- [PydanticScorer](#pydanticscorer):
+- [PydanticScorer](#pydanticscorer):
Use if you need to validate the AI system's output against a predefined Pydantic schema to ensure adherence to a specific data structure or format.
-- [ContextEntityRecallScorer](#contextentityrecallscorer):
+- [ContextEntityRecallScorer](#contextentityrecallscorer):
Use if you need to assess whether your AI system accurately recalls key entities from the input context. Ideal for retrieval-augmented generation (RAG) systems.
-- [ContextRelevancyScorer](#contextrelevancyscorer):
+- [ContextRelevancyScorer](#contextrelevancyscorer):
Use if you need to evaluate whether the provided context is relevant to the generated output. This scorer is especially valuable for ensuring that your model leverages relevant context effectively in RAG systems.
+- [FaithfulnessScorer](#faithfulnessscorer):
+ Use if you need to verify that your AI model's output remains faithful to the provided input and context, ensuring factual consistency.
+- [BiasScorer](#biasscorer):
+ Use if you need to detect biased or stereotypical content in your AI system's output. Ideal for reducing harmful biases in generated text.
+- [ToxicityScorer](#toxicityscorer):
+ Use if you need to identify toxic or harmful content in your AI system's output, including hate speech or threats.
+- [RelevanceScorer](#relevancescorer):
+ Use if you need to measure whether the AI system's output is relevant to the input and context provided.
+- [CoherenceScorer](#coherencescorer):
+ Use if you need to evaluate the coherence and logical structure of the AI system's output.
+- [RobustnessScorer](#robustnessscorer):
+ Use if you need to assess the robustness of your AI system by evaluating the consistency of its output across input variations.
+- [PerplexityScorer](#perplexityscorer):
+ Use if you need to evaluate the perplexity of your AI system's output to measure language fluency and predictability.
+- [AccuracyScorer](#accuracyscorer):
+ Use if you need to measure the accuracy of your AI system's predictions against ground truth labels for classification tasks.
+- [RougeScorer](#rougescorer):
+ Use if you need to evaluate the quality of summaries generated by your model by comparing them to reference texts using ROUGE metrics.
+- [BLEUScorer](#bleuscorer):
+ Use if you need to evaluate the quality of translations or paraphrased outputs by comparing them to reference texts using BLEU metrics.
Before you use one of the predefined scorers, ensure you meet the [prerequisites](#prerequisites).
@@ -415,7 +441,7 @@ See the [`ContextRelevancyScorer` example](#example-5).
The `ContextRelevancyScorer` is based on the [RAGAS](https://github.com/explodinggradients/ragas) evaluation library
:::
-The `ContextRelevancyScorer` evaluates the relevancy of the provided context to the AI system's output. It helps tp determine if the context used is appropriate for generating the output. The works by using an LLM to rate the relevancy of the context to the output. The rating scale is from `0` to `1`, with `0` being least relevant and `1` being most relevant. `ContextRelevancyScorer` then returns a dictionary with the `relevancy_score`. For further usage information, see [Usage notes](#usage-notes-5) and the [example](#example-4).
+The `ContextRelevancyScorer` evaluates the relevancy of the provided context to the AI system's output. It helps determine whether the context used is appropriate for generating the output. It works by using an LLM to rate the relevancy of the context to the output. The rating scale is from `0` to `1`, with `0` being least relevant and `1` being most relevant. `ContextRelevancyScorer` then returns a dictionary with the `relevancy_score`. For further usage information, see [Usage notes](#usage-notes-5) and the [example](#example-4).
```python
@@ -497,3 +523,213 @@ results = asyncio.run(evaluation.evaluate(model))
print(results)
# {'ContextEntityRecallScorer': {'recall': {'mean': 0.3333333333333333}}, 'ContextRelevancyScorer': {'relevancy_score': {'mean': 0.5}}, 'model_latency': {'mean': 9.393692016601562e-05}}
```
+
+### `FaithfulnessScorer`
+
+The `FaithfulnessScorer` evaluates whether the AI system's output is consistent with the input query and context.
+
+#### Example
+
+The following example shows how to use `FaithfulnessScorer` to evaluate the faithfulness of an AI system's output:
+
+```python
+from weave.scorers import FaithfulnessScorer
+
+faithfulness_scorer = FaithfulnessScorer(model_name_or_path="models/faithfulness_scorer")
+
+result = faithfulness_scorer.score(
+ query="What is the capital of Antarctica?",
+ context="People in Antarctica love the penguins.",
+ output="The capital of Antarctica is Penguin City."
+)
+print(f"Output is not faithful: {result['flagged']}")
+```
+
+#### Usage notes
+- Requires both `query` and `context` for evaluation.
+- The `model_name_or_path` parameter points to the pre-trained model weights.
+
+### `BiasScorer`
+The `BiasScorer` identifies biased or stereotypical content in the AI system's output.
+
+#### Example
+The following example demonstrates the use of `BiasScorer` to detect bias:
+
+```python
+from weave.scorers import BiasScorer
+
+bias_scorer = BiasScorer(model_name_or_path="models/bias_scorer")
+
+result = bias_scorer.score("Men are terrible at cleaning.")
+print(f"The input is biased: {result['flagged']}")
+```
+
+#### Usage notes
+- Adjust thresholds to control the sensitivity of bias detection.
+- Requires a `model_name_or_path` to specify the pre-trained model.
+
+### `ToxicityScorer`
+The `ToxicityScorer` evaluates the AI system's output for toxic or harmful content.
+
+#### Example
+The following example shows how to use `ToxicityScorer` to flag toxic content:
+
+```python
+from weave.scorers import ToxicityScorer
+
+toxicity_scorer = ToxicityScorer(model_name_or_path="models/toxicity_scorer")
+
+result = toxicity_scorer.score("People from Ireland are the worst.")
+print(f"Input is toxic: {result['flagged']}")
+```
+
+#### Usage notes
+- Scores for individual toxicity categories and a cumulative total are provided.
+- Customize thresholds for stricter or more lenient detection.
+
+### `RelevanceScorer`
+The `RelevanceScorer` determines whether the AI system's output is relevant to the input and optional context.
+
+#### Example
+The following example demonstrates the use of `RelevanceScorer` to evaluate relevance:
+
+```python
+from weave.scorers import RelevanceScorer
+
+relevance_scorer = RelevanceScorer(model_name_or_path="models/relevance_scorer")
+
+result = relevance_scorer.score(
+ input="What is the capital of Antarctica?",
+ context="Antarctica has the happiest penguins.",
+ output="The savannah has the biggest lions."
+)
+print(f"Output is relevant: {result['is_relevant']}")
+```
+
+#### Usage notes
+- The scorer provides a binary `is_relevant` result and a numerical relevance score.
+- `context` is optional, but can improve accuracy.
+
+### `CoherenceScorer`
+The `CoherenceScorer` evaluates whether the AI system's output is coherent and logically structured.
+
+#### Example
+The following example demonstrates the use of `CoherenceScorer` to evaluate output coherence:
+
+```python
+from weave.scorers import CoherenceScorer
+
+coherence_scorer = CoherenceScorer(model_name_or_path="models/coherence_scorer")
+
+result = coherence_scorer.score(
+ input="What is the capital of Antarctica?",
+ output="but why not monkey up day"
+)
+print(f"Output is coherent: {result['is_coherent']}")
+```
+
+#### Usage notes
+- Designed to identify nonsensical or irrelevant outputs. Outputs that fail coherence checks are flagged as incoherent.
+- Define specific coherence parameters to handle task-specific output structures.
+
+### `RobustnessScorer`
+The `RobustnessScorer` measures the consistency of the AI system's output across variations of the same input.
+
+#### Example
+The following example shows how to use `RobustnessScorer` to test output robustness:
+
+```python
+from weave.scorers import RobustnessScorer
+
+robustness_scorer = RobustnessScorer(use_exact_match=False, return_interpretation=True)
+
+outputs = [
+ "James Watt improved the steam engine in 1769, making it efficient enough for industrial use.",
+ "In 1769, James Watt modified the steam engine to achieve better industrial efficiency.",
+]
+
+result = robustness_scorer.score(output=outputs)
+print(result)
+```
+
+#### Usage notes
+- Supports both exact match and semantic evaluation modes.
+- Useful for testing model stability under input perturbations.
+
+### `PerplexityScorer`
+The `PerplexityScorer` evaluates the _perplexity_ of the AI system's output, which is a measure of how well the model predicts the sequence.
+
+#### Example
+The following example shows how to compute perplexity using `PerplexityScorer`:
+
+```python
+from weave.scorers import HuggingFacePerplexityScorer
+
+perplexity_scorer = HuggingFacePerplexityScorer()
+
+result = perplexity_scorer.score(output="This is a sample output text.")
+print(f"Perplexity score: {result['perplexity']}")
+```
+
+#### Usage notes
+- Lower perplexity scores indicate higher confidence in the output.
+- Useful for assessing the fluency of generated text.
+
+### `AccuracyScorer`
+The `AccuracyScorer` calculates the accuracy of the AI system's predictions against ground truth labels.
+
+#### Example
+The following example demonstrates how to calculate accuracy using `AccuracyScorer`:
+
+```python
+from weave.scorers import AccuracyScorer
+
+accuracy_scorer = AccuracyScorer(task="binary")
+
+ground_truths = [0, 0, 1]
+outputs = [1, 0, 1]
+
+result = accuracy_scorer.score(ground_truth=ground_truths, output=outputs)
+print(result)
+```
+
+#### Usage notes
+- Supports binary and multiclass classification tasks.
+- Requires task-specific configurations.
+
+### `RougeScorer`
+The `RougeScorer` evaluates the quality of the AI system's summaries by comparing them to reference texts using [ROUGE metrics](https://en.wikipedia.org/wiki/ROUGE_(metric)).
+
+#### Example
+The following example shows how to use `RougeScorer` for summary evaluation:
+
+```python
+from weave.scorers import RougeScorer
+
+rouge_scorer = RougeScorer()
+
+result = rouge_scorer.score(
+ ground_truth="The cat sat on the mat.",
+ output="The cat is sitting on the mat."
+)
+print(result)
+```
+
+### `BLEUScorer`
+The `BLEUScorer` evaluates the quality of translations or generated text by comparing them to reference texts using [BLEU metrics](https://en.wikipedia.org/wiki/BLEU).
+
+#### Example
+The following example demonstrates how to use `BLEUScorer`:
+
+```python
+from weave.scorers import BLEUScorer
+
+bleu_scorer = BLEUScorer()
+
+result = bleu_scorer.score(
+ ground_truths=["The watermelon seeds will be excreted."],
+ output="The watermelon seeds pass through your digestive system."
+)
+print(result)
+```
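+
+As with the other predefined scorers on this page, these scorers can be passed to a `weave.Evaluation`. The following is a minimal sketch, assuming an `examples` dataset and a `model` are already defined (as in the earlier examples on this page) and that the dataset provides the columns the scorer's `score` method expects (for example, `ground_truths`):
+
+```python
+import asyncio
+
+import weave
+from weave.scorers import BLEUScorer
+
+evaluation = weave.Evaluation(
+    dataset=examples,
+    scorers=[BLEUScorer()],
+)
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+```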
diff --git a/docs/docs/guides/evaluation/scorers.md b/docs/docs/guides/evaluation/scorers.md
index b673011350b..5babbfb68e2 100644
--- a/docs/docs/guides/evaluation/scorers.md
+++ b/docs/docs/guides/evaluation/scorers.md
@@ -3,35 +3,103 @@ import TabItem from '@theme/TabItem';
# Using Scorers in Evaluation Workflows
-In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary of results. Scorers can use your input data as reference if needed and can also output extra information, such as explanations or reasonings from the evaluation.
+In Weave, _scorers_ are used to evaluate AI outputs and return evaluation metrics. A scorer takes the AI's output, analyzes it, and returns a dictionary of results. Scorers can use your input data as reference if needed and can also output extra information, such as explanations or reasonings from the evaluation.
+
+## Types of scorers
+
+The types of scorers available depend on whether you are using Python or Typescript.
- Scorers are passed to a `weave.Evaluation` object during evaluation. There are three types of Scorers available for Python:
+ Scorers are passed to a `weave.Evaluation` object during evaluation. There are three types of scorers available for Python:
+
+ :::tip
+ [Predefined scorers](../evaluation/predefined-scorers.md) are available for many common use cases. Before creating a custom scorer, check if one of the predefined scorers can address your use case.
+ :::
- 1. [Predefined scorer](../evaluation/predefined-scorers.md): Pre-built scorers designed for common use cases.
- 2. [Function-based Scorers](../evaluation/custom-scorers#function-based-scorers): Simple Python functions decorated with `@weave.op`.
- 3. [Class-based Scorers](../evaluation/custom-scorers.md#class-based-scorers): Python classes that inherit from `weave.Scorer` for more complex evaluations.
+ 1. [Predefined scorers](../evaluation/predefined-scorers.md): Pre-built scorers designed for common use cases.
+ 2. [Function-based scorers](../evaluation/custom-scorers#function-based-scorers): Simple Python functions decorated with `@weave.op`.
+ 3. [Class-based scorers](../evaluation/custom-scorers.md#class-based-scorers): Python classes that inherit from `weave.Scorer` for more complex evaluations.
Scorers must return a dictionary and can return multiple metrics, nested metrics and non-numeric values such as text returned from a LLM-evaluator about its reasoning. See the [Custom scorers page](../evaluation/custom-scorers.md) for more information.
- Scorers are special ops passed to a `weave.Evaluation` object during evaluation.
+ Scorers are special `ops` passed to a `weave.Evaluation` object during evaluation.
Only [function-based scorers](../evaluation/custom-scorers.md#function-based-scorers) are available for Typescript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you must use Python.
-## How Scorers Work
+## Working with scorers
+
+The following section provides information on how to:
+
+- [Initialize scorers with local models](#initialize-scorers-with-local-models)
+- [Initialize scorers with hosted models](#initialize-scorers-with-hosted-models)
+- [Run scorers](#run-scorers)
+- [Download model weights from W&B Artifacts](#download-model-weights-from-wb-artifacts)
+- Access [input](#access-input-from-the-dataset-row) and [output](#access-output-from-the-dataset-row) from a dataset row.
+- [Map column names](#map-column-names-with-column_map) if the `score` method's argument names don't match the column names in your dataset.
+- [Access or customize a final summarization from the scorer](#final-summarization-of-the-scorer)
+
+### Initialize scorers with local models
+
+If you are running W&B customized models locally on a CPU or GPU, you must load the model weights from disk to use scorers. In the following example, the model weights for the `HallucinationScorer` are loaded from `model_path`:
+
+```python
+from weave.scorers import HallucinationScorer
+
+hallu_scorer = HallucinationScorer(model_path="path/to/model/weights")
+```
+
+### Initialize scorers with hosted models
+
+If you are calling W&B customized models hosted on your own infrastructure, pass your vLLM endpoint URL to the scorer:
-### Scorer keyword arguments
+```python
+from weave.scorers import HallucinationScorer
+
+hallu_scorer = HallucinationScorer(base_url="http://localhost:8000/v1")
+```
+
+### Run scorers
+
+Running a scorer does not depend on where the underlying model runs. With either of the initialization options above, you can score a set of texts using the same code:
+
+```python
+scores = hallu_scorer.score(
+    query="What is the capital of Antarctica?",
+    context="Penguins love Antarctica.",
+    output="The capital of Antarctica is Quito."
+)
+```
+
+### Download model weights from W&B Artifacts
+
+Model weights are stored in [W&B Artifacts](https://docs.wandb.ai/guides/artifacts/). The Python example below shows how to download the model weights for the `ToxicityScorer`.
+
+```python
+from wandb import Api
+
+api = Api()
+
+model_artifact_path = "weave-assets/weave-scorers/toxicity_scorer:v0"
+
+art = api.artifact(
+ type="model",
+ name=model_artifact_path,
+)
+
+local_model_path = "models/toxicity_scorer"
+art.download(local_model_path)
+```
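+
+Once downloaded, pass the local path to the corresponding scorer. A minimal sketch, reusing `local_model_path` from the example above:
+
+```python
+from weave.scorers import ToxicityScorer
+
+# Load the downloaded weights from the local path
+toxicity_scorer = ToxicityScorer(model_name_or_path=local_model_path)
+```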
- Scorers can access both the output from your AI system and the input data from the dataset row.
- #### Accessing input from the dataset row
+ ### Access input from the dataset row
If you want your scorer to use data from your dataset row, such as a `label` or `target` column, then you can make this available to the scorer by adding a `label` or `target` keyword argument to your scorer definition.
@@ -43,15 +111,15 @@ In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics.
...
```
- #### Accessing output from the dataset row
+ ### Access output from the dataset row
To access the AI system's output, include an `output` parameter in your scorer function's signature.
- When a Weave `Evaluation` is run, the output of the AI system is passed to the `output` parameter. The `Evaluation` also automatically tries to match any additional scorer argument names to your dataset columns. If you are customizing your scorer arguments, or dataset columns is not feasible, you can use [column mapping](#mapping-column-names-with-column_map).
+ When a Weave `Evaluation` is run, the output of the AI system is passed to the `output` parameter. The `Evaluation` also automatically tries to match any additional scorer argument names to your dataset columns. If customizing your scorer arguments or dataset columns is not feasible, you can use [column mapping](#map-column-names-with-column_map).
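+
+For example, the following sketch receives the AI system's output through the `output` parameter and compares it to a hypothetical `expected` dataset column:
+
+```python
+import weave
+
+@weave.op()
+def exact_match(output: str, expected: str) -> dict:
+    # `output` is the AI system's output; `expected` comes from the dataset column of the same name.
+    return {"exact_match": output == expected}
+```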
- #### Mapping column names with `column_map`
+ ### Map column names with `column_map`
- Sometimes, the `score` methods' argument names don't match the column names in your dataset. You can fix this using a `column_map`.
+ Sometimes, the `score` method's argument names don't match the column names in your dataset. You can fix this using a `column_map`.
If you're using a class-based scorer, pass a dictionary to the `column_map` attribute of `Scorer` when you initialise your scorer class. This dictionary maps your `score` method's argument names to the dataset's column names, in the order: `{scorer_keyword_argument: dataset_column_name}`.
@@ -111,7 +179,7 @@ In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics.
);
```
- #### Mapping column names with `columnMapping`
+ ### Map column names with `columnMapping`
:::note
@@ -144,9 +212,11 @@ In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics.
### Final summarization of the scorer
+The following section describes how to output the standard final summarization from a scorer or, alternatively, a custom summarization.
+
- During the evaluation, the scorer will be computed for each row of your dataset. To provide a final score for the evaluation, you can use the provided `auto_summarize` method depending on the returning type of the output. In the `auto_summarize` method:
+ During the evaluation, the score will be computed for each row of your dataset. To provide a final score for the evaluation, you can use the provided `auto_summarize` method, which aggregates results based on the return type of the output. In the `auto_summarize` method:
- Averages are computed for numerical columns
- Count and ratios are computed for boolean columns