
Commit

add evals hero image
morganmcg1 committed Aug 23, 2024
1 parent b9d00bf commit 560e0df
Showing 2 changed files with 40 additions and 37 deletions.
77 changes: 40 additions & 37 deletions docs/docs/tutorial-eval.md
@@ -1,31 +1,31 @@
# Build an Evaluation
# Running an Evaluation

To improve on an application, we need a way to evaluate if it's improving - we need Evaluation Driven Development. These evaluations, or tests, can be as simple as asserting that the correct data type was output by the aplication, to more complex evaluations such as whether a response from a customer support LLM was correct.
In this tutorial **we will learn how to evaluate an LLM system using Weave**. To increase the performance of an LLM system, we need a way to assess whether it is improving - we need Evaluation Driven Development. These evaluations, or tests, can be as simple as asserting that the correct data type was output by the LLM, or as complex as judging whether a response from a customer support application was appropriate.

Weave [Evaluations](guides/core-types/evaluations) will run your LLM system against your evaluation dataset for you, displaying the **metrics**, **token usage** and **latency** for the entire evaluation, as well as for each of the samples in your eval set.
Weave [Evaluations](guides/core-types/evaluations) will run your LLM system against your evaluation dataset for you, displaying **metrics**, **token usage** and **latency** for the entire evaluation as well as for each sample in your evaluation dataset. As you iterate on your LLM system using Weave Evaluations, you can then compare the performance of different system settings using Weave's Evaluations Comparison view:

![Evals hero](../static/img/evals-hero.png)
![Evals hero](../static/img/evals-hero2.png)

## Running an Evaluation in Weave

In this evaluation we will evaluate the performance of our LLM system to:
Building off of the previous [App Versioning](/tutorial-weave_models) tutorial, now we will evaluate the performance of our LLM system to:

- generate valid JSON
- extract the common name of the carnivorous dinosaurs in the texts
- extract the "common name" of the carnivorous dinosaurs mentioned in the texts
- return the correct number of dinosaurs from each text

We will later define scorer functions for each of these criteria.
We will later define scorer functions for each of these criteria, but first we need an evaluation dataset.

### 1. Evaluation Data


To begin, we first need an evaluation dataset. Evaluation datasets in Weave Evaluations can either be a list of python dictionaries or a `weave.Dataset`.
To begin, we first need an evaluation dataset. Evaluation datasets in Weave Evaluations can either be a list of python dictionaries or a [`weave.Dataset`](guides/core-types/datasets).
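
For reference, a minimal `weave.Dataset` can be built from the same kind of rows; here is a quick sketch, where the dataset name and rows are illustrative placeholders rather than part of this tutorial:

```python
import weave

# A sketch of the weave.Dataset form of an evaluation dataset;
# the name and rows below are illustrative placeholders
dataset = weave.Dataset(
    name="capital-cities-eval",
    rows=[
        {"question": "What is the capital of France?", "target": "paris"},
        {"question": "What is the capital of Japan?", "target": "tokyo"},
    ],
)
```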

**Naming items in the evaluation dataset**

How our items in the evaluation dataset are named is important - the **item keys** must match the **argument names** in your Weave Model's **predict function** as well as the **scorer function(s)** you'll use. This is how Weave knows which elemets in each row in the dataset should be passed to the appropriate predict and scorer functions.
How the items in the evaluation dataset are named is important - the **item keys** must match the **argument names** of your Weave Model's **predict function** and of the **scorer function(s)** you'll use. This is how Weave knows which elements in each row of the dataset should be passed to the appropriate predict and scorer functions.

For example if your LLM input is under the `"question"` key and the ground truth, aka labels, is stored under the `"target"` key:
For example, if the values for your LLM inputs are stored under the `"question"` key and the ground truth, or labels, are stored under the `"target"` key, like so:

```python
eval_data = [
@@ -40,13 +40,14 @@ Then the signature of the `predict` function that calls the LLM system must cont
class CountryCapitalLLM(weave.Model):
...

# "question" must be in the function signature of predict
@weave.op
def predict(self, question: str) -> dict:
...

```

And the signature of the function that grades the LLM system output must contain an agument called `target`:
And similarly the signature of the scorer function that grades the LLM system output must contain an argument called `target`:

```python
def capital_cities_scorer(target: str, model_output: dict):
@@ -56,22 +57,22 @@ def capital_cities_scorer(target: str, model_output: dict):

**Defining our data**

As mentioned the evaluation dataset should be structured as a list of dictionaries:
As mentioned, the evaluation dataset can be structured as a list of dictionaries or as a `weave.Dataset`. Here we will use a list of dictionaries:

```python
inputs = [
"A Velociraptor (Raptor) darted through the undergrowth, stalking a \
grazing Stegosaurus (Stego), while an Archaeopteryx (Archie) herbivore soared overhead.",
"The Ankylosaurus (Anky) swung its clubbed tail defensively as a pack of \
ten Dilophosaurus (Dilo) circled.",
"A massive Spinosaurus (Spino) emerged from the river, eating seaweed, startling \
a herd of Gallimimus into a frenzied sprint across the plain. Finish with }}}"
"A massive Spinosaurus (Gali) emerged from the river, eating seaweed, startling \
a herd of Gallimimus into a frenzied sprint across the plain."
]

labels = [
{"id": 0, "carnivore_name": "velociraptor", "common_name": "raptor", "n_dinos": 3},
{"id": 1, "carnivore_name": "dilophosaurus", "common_name": "dilo", "n_dinos": 2},
{"id": 2, "carnivore_name": "spinosaurus", "common_name": "galli", "n_dinos": 2}
{"id": 2, "carnivore_name": "spinosaurus", "common_name": "spino", "n_dinos": 2}
]

eval_set = [
@@ -83,12 +84,12 @@ eval_set = [

### 2. Instantiate a LLM system

Building off of the previous [App Versioning](/tutorial-weave_models) tutorial, we will define our LLM System as a Weave [Model](guides/core-types/models).
Building off of the previous [App Versioning](/tutorial-weave_models) tutorial, we will define our LLM system as a Weave [`Model`](guides/core-types/models).

A weave `Model` stores and versions information about your system, such as prompts, temperatures, and more.
Weave automatically captures when they are used and updates the version when there are changes.
Weave automatically captures when a Model is used and updates the Model version when changes are made to it.

`Model`s are declared by subclassing `weave.Model` and implementing a `predict` function definition, which takes an input and returns a response:
`Model`s are declared by subclassing `weave.Model` and implementing a `predict` function, which takes an input and returns a response. Note that, as mentioned earlier, the predict function's argument name(s) should match the keys of the input data in the evaluation dataset - `text` in this case:

```python
import json
@@ -122,20 +123,21 @@ class ExtractDinos(weave.Model):
temperature: float
system_prompt: str

# The `text` argument matches the `text` key in the evaluation dataset
@weave.op
def predict(self, text: str) -> dict:
return extract_dinos(self, text)
```

### 3. Define scoring criteria

Scorers are used to assess your LLM system output against one or more criterion. Here we define scorer functions that assess wether:
Scorers are used to assess your LLM system's output against one or more criteria. Here we define scorer functions that assess whether or not:

- the generated JSON is valid
- the common name of carnivorous dinosaurs mentioned in the texts is extracted correctly
- that model returns the correct number of dinosaurs in each text
- the common names of carnivorous dinosaurs mentioned in the input texts are extracted correctly
- the model returns the correct number of dinosaurs in each input text

Scorer functions must be decorated with `weave.op` and must return a dictionary with the metric name as the key and the evaluation result as the value. The value should be of type `bool`, `int` or `float`:
Scorer functions are just regular python functions decorated with `weave.op`; they must return a dictionary with the metric name as the key and the evaluation result as the value. The value should be of type `bool`, `int` or `float`:

```python
{"valid_json": True}
@@ -152,7 +154,7 @@ Multiple metric:value pairs can be returned from a single scorer function if nee
Let's define the scorers:

```python
# assess that the generated JSON is valid
# Assess that the generated JSON is valid
@weave.op
def json_check(target: str, model_output: dict) -> dict:
try:
@@ -161,18 +163,18 @@ def json_check(target: str, model_output: dict) -> dict:
except:
return {"json_correct": False}

# assess that the correct carnivorous dinosaur name is extracted
# Assess that the correct carnivorous dinosaur name is extracted
@weave.op
def carnivore_name_check(target: str, model_output: dict) -> dict:
model_output = json.loads(model_output)
for dino in model_output["dinosaurs"]:
if dino["diet"] == "carnivore":
return {
"carnivore_name_correct" : target["carnivore_name"] == dino["name"].lower()
"carnivore_name_correct": target["carnivore_name"] == dino["name"].lower()
}
return {"carnivore_name_correct": False}

# assess that the correct number of dinosaurs is extracted
# Assess that the correct number of dinosaurs is extracted
@weave.op
def count_dinos_check(target: str, model_output: dict) -> dict:
model_output = json.loads(model_output)
@@ -182,16 +184,16 @@ def count_dinos_check(target: str, model_output: dict) -> dict:
```
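
As noted above, a single scorer can also return multiple metric:value pairs. Here is a small sketch combining two of the checks; the function and metric names are illustrative, and, as with the scorers above, the model output is assumed to be a JSON string:

```python
import json

import weave

# Sketch: one scorer returning several metric:value pairs at once
# (function and metric names are illustrative)
@weave.op
def combined_dino_check(target: dict, model_output: str) -> dict:
    parsed = json.loads(model_output)
    dinos = parsed.get("dinosaurs", [])
    return {
        "has_dinosaurs_key": "dinosaurs" in parsed,
        "n_dinos_correct": len(dinos) == target["n_dinos"],
    }
```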


The [Evaluations](guides/core-types/evaluations) guide contains more details on how to buld advanced custom scorers, including how to post-processes the results from your scorers using the `Scorer` classes' `summarize` method.
The [Evaluations](guides/core-types/evaluations) guide and [RAG tutorial](/tutorial-rag) contain more details on how to build advanced custom scorers, including how to post-process the results from your scorers using the `Scorer` class's `summarize` method.
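
For orientation, below is a rough sketch of what such a class-based scorer might look like. The `Scorer` base class and its `summarize` method are described in the Evaluations guide, but the exact import path and method signatures used here are assumptions - check the guide for the authoritative API:

```python
import json
from typing import Optional

import weave
from weave import Scorer  # assumption: the Scorer base class is importable like this

class CarnivoreNameScorer(Scorer):
    # Per-row scoring, mirroring the carnivore_name_check function above
    @weave.op
    def score(self, target: dict, model_output: str) -> dict:
        parsed = json.loads(model_output)
        carnivores = [d for d in parsed.get("dinosaurs", []) if d.get("diet") == "carnivore"]
        correct = any(target["carnivore_name"] == d["name"].lower() for d in carnivores)
        return {"carnivore_name_correct": correct}

    # Post-process the per-row results into a single aggregate metric
    @weave.op
    def summarize(self, score_rows: list) -> Optional[dict]:
        if not score_rows:
            return None
        n_correct = sum(bool(row["carnivore_name_correct"]) for row in score_rows)
        return {"carnivore_name_accuracy": n_correct / len(score_rows)}
```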


### 4. Running the evaluation

The `Evaluation` class is designed to assess the performance of a LLM System on a given dataset using the scoring functions. The LLM System, which has to be of type `weave.Model`, is passed to the `evaluate` method to kick off evaluation.
The `Evaluation` class is designed to assess the performance of an LLM system on a given dataset using the scoring functions. The LLM system, which must be of type `weave.Model`, is passed to the `Evaluation.evaluate` method to kick off the evaluation.

**Evaluations are run asynchronously**

When the `evaluate` method is called, Weave will run the LLM system across all items in your dataset asyncronously. To set a maximum on the number of async evaluation calls at any one time, you can set the `WEAVE_PARALLELISM` environment variable to any integer; setting it to 1 will run through the eval dataset synchronously. Setting this env variable can help avoid hitting rate limit errors from LLM providers for example.
When the `evaluate` method is called, Weave will run the LLM system across all items in your dataset asynchronously. To set a maximum on the number of async evaluation calls at any one time, you can set the `WEAVE_PARALLELISM` environment variable to any integer; setting it to 1 will run through the evaluation dataset synchronously. Setting this environment variable can help avoid hitting rate limit errors from LLM providers, for example.
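
For example, a minimal sketch of capping concurrency, assuming the variable is set before the evaluation starts:

```python
import os

# Allow at most 3 concurrent evaluation calls;
# "1" would run through the evaluation dataset synchronously
os.environ["WEAVE_PARALLELISM"] = "3"
```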

Note that the `asyncio` python library must be used when running an evaluation in a python script, while running in a Jupyter or Colab notebook simply requires using `await`:

@@ -203,7 +205,7 @@ summary_metrics = asyncio.run(evaluation.evaluate(model=dinos))
summary_metrics = await evaluation.evaluate(model=dinos)
```

The [Evaluations](guides/core-types/evaluations) section contains more details.
The [Evaluation](guides/core-types/evaluations) guide contains more details on the `Evaluation` class.


```python
@@ -214,34 +216,34 @@ from weave import Evaluation
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

system_prompt = """Extract any dinosaur `name`, their `common_name`, \
names and whether its `diet` is a herbivore or carnivore, in JSON format with"""
and whether its `diet` is herbivore or carnivore, in JSON format"""

temperature = 0.4

# Instantiate the weave Model
dinos = ExtractDinos(
client=client,
model_name='gpt-4o',
model_name='gpt-4o-mini',
temperature=temperature,
system_prompt=system_prompt
)

# Create your evaluation object
# highlight-next-line
evaluation = Evaluation(
name=f"carnivore_evaluator_temp-{temperature}",
name=f"carnivore_evaluator_temp-{temperature}", # optionally set a name for the object
dataset=eval_set, # can be a list of dictionaries or a weave.Dataset object
scorers=[json_check, carnivore_name_check, count_dinos_check], # list of scoring functions
)

# Initialise weave, use "ENTITY/PROJECT" to log to a project in a specific W&B Team
weave.init("jurassic-park")

# Run the evaluation
# Run the evaluation, passing in the `dinos` Model
# highlight-next-line
summary_metrics = asyncio.run(evaluation.evaluate(model=dinos))

# if you're in a Jupyter or Colab Notebook, run:
# If you're in a Jupyter or Colab Notebook, run:
# summary_metrics = await evaluation.evaluate(model=dinos)
```

@@ -250,9 +252,10 @@ You've now run a Weave Evaluation! The results will be printed in the terminal o

### 5. Comparing Evaluations

When you'd like to compare multiple evaluations you can select the evaluations you're interested in in the Evaluations tab of the Weave UI and then click "Compare" button to generate charts
When you'd like to compare multiple evaluations, select the evaluations you're interested in from the Evaluations tab of the Weave UI and then click the "Compare" button to generate comparison charts.

![Evals hero](../static/img/evals-hero.png)
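
One way to produce multiple evaluation runs to compare is to evaluate variants of the same Model with different settings, such as a lower temperature. Here is a sketch reusing the objects defined above; the variable names and values are illustrative:

```python
import asyncio

# Sketch: evaluate a second variant of the model so that both runs show up
# in the Evaluations tab for comparison (client, system_prompt, ExtractDinos
# and evaluation are the objects defined earlier; values are illustrative)
dinos_low_temp = ExtractDinos(
    client=client,
    model_name='gpt-4o-mini',
    temperature=0.1,
    system_prompt=system_prompt
)

summary_metrics_low_temp = asyncio.run(evaluation.evaluate(model=dinos_low_temp))

# If you're in a Jupyter or Colab Notebook, run:
# summary_metrics_low_temp = await evaluation.evaluate(model=dinos_low_temp)
```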

## What's next?

- Follow the [Model-Based Evaluation of RAG applications](/tutorial-rag) to evaluate a RAG app using an LLM judge.
Try the [Evaluate a RAG App](/tutorial-rag) tutorial to learn how to use advanced Weave `Scorer`s to evaluate a RAG app using an LLM judge.
Binary file added docs/static/img/evals-hero2.png
