Add missing predefined scorers, cleanup overall
J2-D2-3PO committed Dec 13, 2024
1 parent 13d57ce commit c79f649
Showing 4 changed files with 368 additions and 50 deletions.
41 changes: 27 additions & 14 deletions docs/docs/guides/core-types/evaluations.md
@@ -3,11 +3,11 @@ import TabItem from '@theme/TabItem';

# Evaluations

To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your applications behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.
To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your application's behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.

In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with a overview of your application's performance in a rich UI to summarizing individual outputs and scores.
In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with an overview of your application's performance in a rich UI that summarizes individual outputs and scores.

This page describes the steps required to [create an evaluation](#create-an-evaluation), a [Python example](#python-example), and provides additional [usage notes and tips](#usage-notes-and-tips).
This page describes the steps required to [create an evaluation](#create-an-evaluation). A complete [Python example](#python-example) and additional [usage notes and tips](#usage-notes-and-tips) are also included.

![Evals hero](../../../static/img/evals-hero.png)

@@ -21,27 +21,35 @@ To create an evaluation in Weave, follow these steps:

### Define an evaluation dataset

First, create a test dataset that will be used to evaluate your application. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:
First, create a test dataset to evaluate your application against. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:

1. Define a [Dataset](/guides/core-types/datasets).
2. Define a list of dictionaries with a collection of examples to be evaluated.
2. Define a list of dictionaries with a collection of examples to be evaluated. For example:
```python
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]
```


Next, [define scoring functions](#define-scoring-functions).

### Define scoring functions

Next, create a list of _scorers_. Scorers are functions used to score each example. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. The scorer will only use the necessary keys by using a dictionary key based on the argument name.

:::tip
Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
:::
Next, create a list of scoring functions, also known as _scorers_. Scorers are used to score the performance of an AI system against the evaluation dataset. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. A scorer only uses the keys it needs, matching dataset dictionary keys to its argument names.
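
For illustration, a minimal function-based scorer might look like the following sketch. The scorer name and matching logic here are placeholders: the `expected` argument is taken from each dataset example by key, and `model_output` receives your application's output for that example.

```python
import weave

@weave.op()
def match_score(expected: str, model_output: dict) -> dict:
    # `expected` comes from the dataset example; `model_output` is the application's output.
    # Scorers return a dictionary of score values.
    return {"match": expected == model_output.get("generated_text")}
```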

The options available depend on whether you are using TypeScript or Python:

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
There are three types of scorers available for Python:

:::tip
[Predefined scorers](../evaluation/predefined-scorers.md) are available for many common use cases. Before creating a custom scorer, check if one of the predefined scorers can address your use case.
:::

1. [Predefined scorers](../evaluation/predefined-scorers.md): Pre-built scorers designed for common use cases.
2. [Function-based scorers](../evaluation/custom-scorers#function-based-scorers): Simple Python functions decorated with `@weave.op`.
3. [Class-based scorers](../evaluation/custom-scorers#class-based-scorers): Python classes that inherit from `weave.Scorer` for more complex evaluations.
@@ -67,10 +75,14 @@ The options available depend on whether you are using Typescript or Python:

</TabItem>
<TabItem value="typescript" label="TypeScript">
Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for Typescript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you must use Python.
Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for TypeScript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you will need to use Python.
</TabItem>
</Tabs>
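
For example, a predefined scorer is instantiated like any other class and combined with custom scorers in the list passed to an evaluation. The following sketch assumes a `ValidJSONScorer` exported from `weave.scorers`; check the [predefined scorers page](../evaluation/predefined-scorers.md) for the exact classes available in your version.

```python
# Assumed import path and class name; see the predefined scorers page for what is available.
from weave.scorers import ValidJSONScorer

# Predefined scorers are combined with custom scorers (such as a `match_score` function)
# in the list of scorers handed to the evaluation.
scorers = [ValidJSONScorer(), match_score]
```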

:::tip
Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
:::

Next, [define an evaluation target](#define-an-evaluation-target).

### Define an evaluation target
@@ -79,7 +91,8 @@ Once your test dataset and scoring functions are defined, you can define the tar

#### Evaluate a `Model`

To evaluate a `Model`, call `evaluate` on using an `Evaluation`. `Models` are used when you have attributes that you want to experiment with and capture in Weave.
`Models` are used when you have attributes that you want to experiment with and capture in Weave.
To evaluate a `Model`, use the `evaluate` method from `Evaluation`.

The following example runs `predict()` on each example and scores the output with each scoring function defined in the `scorers` list using the `examples` dataset.

@@ -92,7 +105,7 @@ class MyModel(Model):

    @weave.op()
    def predict(self, question: str):
        # here's where you would add your LLM call and return the output
        # Here's where you would add your LLM call and return the output
        return {'generated_text': 'Hello, ' + self.prompt}

model = MyModel(prompt='World')
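
# A rough sketch of the remaining steps, assuming the `examples` dataset and a scorer
# such as `match_score` defined earlier, with `Evaluation` imported from weave and
# `import asyncio` at the top of the script. `evaluate` is asynchronous, so it is
# driven with asyncio; the summary of scores is printed when the run completes.
evaluation = Evaluation(dataset=examples, scorers=[match_score])
print(asyncio.run(evaluation.evaluate(model)))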
@@ -176,7 +189,7 @@ evaluation = Evaluation(
)
```

You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the Evaluation object name. In the UI, you will see the display name if set. Otherwise, the Evaluation object name will be used.
You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the `Evaluation` object name. In the UI, you will see the display name if set. Otherwise, the `Evaluation` object name will be used.

```python
evaluation = Evaluation(
15 changes: 7 additions & 8 deletions docs/docs/guides/evaluation/custom-scorers.md
@@ -1,9 +1,9 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Custom Scorers
# Custom scorers

In Weave, you can create your own custom scorers. The Scorers can either be class-based or function-based.
In Weave, you can create your own custom [scorers](../evaluation/scorers.md). The scorers can either be class-based or function-based.

:::tip
If you're using Python, there are various predefined scorers available for common use cases. For more information, see [Select a predefined scorer](../evaluation/predefined-scorers.md#select-a-predefined-scorer) on the [Predefined scorers page](../evaluation/predefined-scorers.md).
@@ -13,9 +13,9 @@ If you're using Python, there are various predefined scorers available for comm

Choosing the right type of custom scorer depends on your evaluation needs:

- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. Function-based scorers are available in both Python and Typescript.
- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. **Function-based scorers are available in both Python and TypeScript.**

- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. Class-based scorers are only available in Python.
- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. **Class-based scorers are only available in Python.**

## Function-based scorers

@@ -31,7 +31,7 @@ Function-based scorers are available in both Python and Typescript.
- Is decorated with `@weave.op`.
- Returns a dictionary.

### Example
#### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:

```python
@@ -59,7 +59,7 @@ Function-based scorers are available in both Python and Typescript.

Optionally, the function can accept a `datasetRow`.

### Example
#### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:

```typescript
@@ -81,7 +81,7 @@ Function-based scorers are available in both Python and Typescript.
</Tabs>


## Class-based Scorers
## Class-based scorers

:::note
This feature is not available in TypeScript. All usage instructions and code examples in this section are for Python.
@@ -109,7 +109,6 @@ from weave import Scorer

llm_client = OpenAI()

#highlight-next-line
class SummarizationScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "Evaluate whether the summary is good."
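
    # A rough sketch of how the scorer's judging method might look; the prompt handling
    # and the score key below are illustrative assumptions, not the guide's exact code.
    @weave.op()
    def score(self, model_output: str) -> dict:
        response = llm_client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": model_output},
            ],
        )
        return {"summary_quality": response.choices[0].message.content}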
