Refactor Evals and Scoring section
J2-D2-3PO committed Nov 23, 2024
1 parent a76414b commit 1a00368
Showing 5 changed files with 638 additions and 622 deletions.
41 changes: 26 additions & 15 deletions docs/docs/guides/core-types/evaluations.md
@@ -1,6 +1,8 @@
# Evaluations

Evaluation-driven development helps you reliably iterate on an application. The `Evaluation` class is designed to assess the performance of a `Model` on a given `Dataset` or set of examples using scoring functions.
To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your application's behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.

In a Weave Evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result is an overview of your application's performance, presented in a rich UI that summarizes individual outputs and scores.

![Evals hero](../../../static/img/evals-hero.png)

@@ -38,23 +40,30 @@ weave.init('intro-example')
asyncio.run(evaluation.evaluate(function_to_evaluate))
```
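
Since the diff collapses most of the snippet above, here is a hedged, self-contained sketch of what such a quickstart might look like. The dataset rows, the `match_score` scorer, and the body of `function_to_evaluate` are illustrative placeholders, not part of the original example.

```python
import asyncio
import weave

# Illustrative examples; scorers and the evaluated function pick out keys by argument name.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
]

@weave.op
def function_to_evaluate(question: str) -> dict:
    # Stand-in for your real application logic (e.g. an LLM call).
    return {"generated_text": "Paris"}

@weave.op
def match_score(expected: str, model_output: dict) -> dict:
    # Compare the expected answer against the application's output.
    return {"match": expected == model_output["generated_text"]}

evaluation = weave.Evaluation(dataset=examples, scorers=[match_score])

weave.init("intro-example")
asyncio.run(evaluation.evaluate(function_to_evaluate))
```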

## Create an Evaluation
This page describes how to get started with evaluations.
## Create an evaluation

To create an evaluation in Weave, follow these steps:

1. [Create an evaluation dataset](#create-an-evaluation-dataset)
2. [Define scoring functions](#define-scoring-functions)
3. [Define a Model to evaluate](#define-a-model-to-evaluate)

To systematically improve your application, it's helpful to test your changes against a consistent dataset of potential inputs so that you catch regressions and can inspect your app's behaviour under different conditions. Using the `Evaluation` class, you can be sure you're comparing apples-to-apples by keeping track of all of the details that you're experimenting and evaluating with.
### Create an evaluation dataset

Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.
First, create a test dataset that will be used to evaluate your application. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset, both sketched below:

### Define an evaluation dataset
1. Define a [Dataset](/guides/core-types/datasets).
2. Define a list of dictionaries with a collection of examples to be evaluated.
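
A rough sketch of the two options (the field names here are illustrative, not required by Weave):

```python
import weave

# Option 1: a Weave Dataset object
dataset = weave.Dataset(
    name="eval-examples",
    rows=[
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "What is 2 + 2?", "expected": "4"},
    ],
)

# Option 2: a plain list of dictionaries
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]
```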

First, define a [Dataset](/guides/core-types/datasets) or list of dictionaries with a collection of examples to be evaluated. These examples are often failure cases that you want to test for; they are similar to unit tests in Test-Driven Development (TDD).
### Define scoring functions

### Defining scoring functions
Next, create a list of _Scorers_. Scorers are functions used to score each example. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. A Scorer only uses the keys it needs, matching each of its argument names to the corresponding dictionary key in the example.

Then, create a list of scoring functions. These are used to score each example. Each function should have a `model_output` and optionally, other inputs from your examples, and return a dictionary with the scores.
When defining Scorers, you can either use one of the many predefined scorers available in Weave, or create your own custom Scorer.

Scoring functions need to have a `model_output` keyword argument, but the other arguments are user defined and are taken from the dataset examples. It will only take the necessary keys by using a dictionary key based on the argument name.
#### Scorer example

This will take `expected` from the dictionary for scoring.
In the following example, the `match_score1()` Scorer will take `expected` from the dictionary for scoring.

```python
import weave
@@ -73,15 +82,17 @@ def match_score1(expected: str, model_output: dict) -> dict:
    return {'match': expected == model_output['generated_text']}
```
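
To make the key-matching behaviour concrete, here is a small hedged sketch (the row fields and both scorers are hypothetical): a row can carry more keys than any single scorer needs, and each scorer receives only the keys named in its signature.

```python
import weave

examples = [
    # A row can carry more keys than any single scorer uses.
    {"question": "What is the capital of France?", "expected": "Paris", "difficulty": "easy"},
]

@weave.op
def match_score(expected: str, model_output: dict) -> dict:
    # Receives only 'expected' (plus the model output); 'question' and 'difficulty' are ignored.
    return {"match": expected == model_output["generated_text"]}

@weave.op
def difficulty_aware_score(difficulty: str, expected: str, model_output: dict) -> dict:
    # A second scorer can ask for a different subset of keys from the same row.
    weight = 0.5 if difficulty == "easy" else 1.0
    return {"weighted_match": weight * float(expected == model_output["generated_text"])}
```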

### Optional: Define a custom `Scorer` class
#### Optional: Define a custom `Scorer` class

In some applications we want to create custom `Scorer` classes - where for example a standardized `LLMJudge` class should be created with specific parameters (e.g. chat model, prompt), specific scoring of each row, and specific calculation of an aggregate score.
For some applications, you may want to create custom `Scorer` classes. For example, a standardized `LLMJudge` class should be created with specific parameters (e.g. chat model, prompt), scoring of each row, and calculation of an aggregate score. For more information about creating custom Scorers, see [Create your own Scorers](../evaluation/custom-scorers.md).

See the tutorial on defining a `Scorer` class in the next chapter on [Model-Based Evaluation of RAG applications](/tutorial-rag#optional-defining-a-scorer-class) for more information.
> For an end-to-end tutorial that involves defining a custom `Scorer` class, see [Model-Based Evaluation of RAG applications](/tutorial-rag#optional-defining-a-scorer-class).
### Define a Model to evaluate

To evaluate a `Model`, call `evaluate` on it using an `Evaluation`. `Models` are used when you have attributes that you want to experiment with and capture in weave.
Once your test dataset and Scorers are defined, you can begin the evaluation. To evaluate a `Model`, call `evaluate` on it using an `Evaluation`. `Models` are used when you have attributes that you want to experiment with and capture in Weave.

The following example runs `predict()` on each example in the `examples` dataset and scores the output with each scoring function defined in the `scorers` list.

```python
from weave import Model, Evaluation
@@ -104,7 +115,7 @@ weave.init('intro-example') # begin tracking results with weave
asyncio.run(evaluation.evaluate(model))
```
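
Since the diff collapses most of this snippet, here is a hedged sketch of the general shape such an example might take; the class name, its `prompt` attribute, and the hard-coded `predict` output are illustrative.

```python
import asyncio
import weave
from weave import Model, Evaluation

class MyModel(Model):
    # Attributes you want to experiment with are captured as part of the Model.
    prompt: str

    @weave.op
    async def predict(self, question: str) -> dict:
        # Stand-in for a real LLM call that would use self.prompt and the question.
        return {"generated_text": "Paris"}

@weave.op
def match_score1(expected: str, model_output: dict) -> dict:
    return {"match": expected == model_output["generated_text"]}

examples = [{"question": "What is the capital of France?", "expected": "Paris"}]
model = MyModel(prompt="Answer the question concisely.")
evaluation = Evaluation(dataset=examples, scorers=[match_score1])

weave.init("intro-example")  # begin tracking results with weave
asyncio.run(evaluation.evaluate(model))
```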

This will run `predict` on each example and score the output with each scoring function.


#### Custom Naming

117 changes: 117 additions & 0 deletions docs/docs/guides/evaluation/custom-scorers.md
@@ -0,0 +1,117 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Create your own Scorers

In Weave, you can create your own custom Scorers. Scorers can be either class-based or function-based. For more information about Scorers, see the Scorers documentation.

### Function-based Scorers

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
These are functions decorated with `@weave.op` that return a dictionary. They're great for simple evaluations like:

```python
import weave

@weave.op
def evaluate_uppercase(text: str) -> dict:
    return {"text_is_uppercase": text.isupper()}

my_eval = weave.Evaluation(
    dataset=[{"text": "HELLO WORLD"}],
    scorers=[evaluate_uppercase]
)
```

When the evaluation is run, `evaluate_uppercase` checks if the text is all uppercase.
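
To run this evaluation you still need a target to evaluate; a minimal, hypothetical op continuing from the snippet above could look like this:

```python
import asyncio
import weave

@weave.op
def shout(text: str) -> str:
    # Trivial stand-in application: return the dataset text uppercased.
    return text.upper()

weave.init("uppercase-eval")  # project name is illustrative
asyncio.run(my_eval.evaluate(shout))  # 'my_eval' is defined in the snippet above
```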

</TabItem>
<TabItem value="typescript" label="TypeScript">
These are functions wrapped with `weave.op` that accept an object with `modelOutput` and optionally `datasetRow`. They're great for simple evaluations like:
```typescript
import * as weave from 'weave'

const evaluateUppercase = weave.op(
    ({modelOutput}) => modelOutput.toUpperCase() === modelOutput,
    {name: 'textIsUppercase'}
);

const myEval = new weave.Evaluation({
    dataset: [{text: 'HELLO WORLD'}],
    scorers: [evaluateUppercase],
})
```

</TabItem>
</Tabs>

### Class-based Scorers

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
For more advanced evaluations, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM-evaluators, or make multiple function calls, you can use the `Scorer` class.

**Requirements:**

1. Inherit from `weave.Scorer`.
2. Define a `score` method decorated with `@weave.op`.
3. The `score` method must return a dictionary.

Example:

```python
import weave
from openai import OpenAI
from weave import Scorer

llm_client = OpenAI()

#highlight-next-line
class SummarizationScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "Evaluate whether the summary is good."

    @weave.op
    def some_complicated_preprocessing(self, text: str) -> str:
        processed_text = "Original text: \n" + text + "\n"
        return processed_text

    @weave.op
    def call_llm(self, summary: str, processed_text: str) -> dict:
        res = llm_client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": (
                    "Analyse how good the summary is compared to the original text.\n"
                    f"Summary: {summary}\n{processed_text}"
                )},
            ],
        )
        return {"summary_quality": res.choices[0].message.content}

    @weave.op
    def score(self, output: str, text: str) -> dict:
        """Score the summary quality.

        Args:
            output: The summary generated by an AI system
            text: The original text being summarized
        """
        processed_text = self.some_complicated_preprocessing(text)
        eval_result = self.call_llm(summary=output, processed_text=processed_text)
        return eval_result

summarization_scorer = SummarizationScorer()
evaluation = weave.Evaluation(
    dataset=[{"text": "The quick brown fox jumps over the lazy dog."}],
    scorers=[summarization_scorer])
```

This class evaluates how good a summary is by comparing it to the original text.
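
As a hedged usage sketch, the evaluation above could then be run against an op or `Model` that produces the `output` argument; the `summarizer` op below is a hypothetical stand-in.

```python
import asyncio
import weave

@weave.op
def summarizer(text: str) -> str:
    # Hypothetical stand-in for a real summarization model.
    return text.split(".")[0] + "."

weave.init("summarization-eval")  # project name is illustrative
asyncio.run(evaluation.evaluate(summarizer))
```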

</TabItem>
<TabItem value="typescript" label="TypeScript">
```plaintext
This feature is not available in TypeScript yet. Stay tuned!
```
</TabItem>
</Tabs>
