Add missing predefined scorers, cleanup overall
J2-D2-3PO committed Dec 13, 2024
1 parent 13d57ce commit c79f649
Showing 4 changed files with 368 additions and 50 deletions.
41 changes: 27 additions & 14 deletions docs/docs/guides/core-types/evaluations.md
@@ -3,11 +3,11 @@ import TabItem from '@theme/TabItem';

# Evaluations

To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your applications behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.
To systematically improve your LLM application, it's helpful to test your changes against a consistent dataset of potential inputs so that you can catch regressions and inspect your application's behaviour under different conditions. In Weave, the `Evaluation` class is designed to assess the performance of a `Model` on a test dataset.

In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with a overview of your application's performance in a rich UI to summarizing individual outputs and scores.
In a Weave evaluation, a set of examples is passed through your application, and the output is scored according to multiple scoring functions. The result provides you with an overview of your application's performance in a rich UI that summarizes individual outputs and scores.

This page describes the steps required to [create an evaluation](#create-an-evaluation), a [Python example](#python-example), and provides additional [usage notes and tips](#usage-notes-and-tips).
This page describes the steps required to [create an evaluation](#create-an-evaluation). A complete [Python example](#python-example) and additional [usage notes and tips](#usage-notes-and-tips) are also included.

![Evals hero](../../../static/img/evals-hero.png)

@@ -21,27 +21,35 @@ To create an evaluation in Weave, follow these steps:

### Define an evaluation dataset

First, create a test dataset that will be used to evaluate your application. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:
First, create a test dataset to evaluate your application against. Generally, the dataset should include failure cases that you want to test for, similar to software unit tests in Test-Driven Development (TDD). You have two options to create a dataset:

1. Define a [Dataset](/guides/core-types/datasets).
2. Define a list of dictionaries with a collection of examples to be evaluated.
2. Define a list of dictionaries with a collection of examples to be evaluated. For example:
```python
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]
```


Next, [define scoring functions](#define-scoring-functions).

### Define scoring functions

Next, create a list of _scorers_. Scorers are functions used to score each example. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. The scorer will only use the necessary keys by using a dictionary key based on the argument name.

:::tip
Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
:::
Next, create a list of scoring functions, also known as _scorers_. Scorers are used to score the performance of an AI system against the evaluation dataset. Scorers must have a `model_output` keyword argument. Other arguments are user defined and are taken from the dataset examples. A scorer only uses the keys it needs, matching dataset dictionary keys to its argument names.
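
For illustration, a minimal function-based scorer might look like the following sketch. The scorer name and matching logic here are placeholders: the `expected` argument is taken from each dataset example by key, and `model_output` receives your application's output for that example.

```python
import weave

@weave.op()
def match_score(expected: str, model_output: dict) -> dict:
    # `expected` comes from the dataset example; `model_output` is the application's output.
    # Scorers return a dictionary of score values.
    return {"match": expected == model_output.get("generated_text")}
```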

The options available depend on whether you are using TypeScript or Python:

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
There are three types of scorers available for Python:

:::tip
[Predefined scorers](../evaluation/predefined-scorers.md) are available for many common use cases. Before creating a custom scorer, check if one of the predefined scorers can address your use case.
:::

1. [Predefined scorers](../evaluation/predefined-scorers.md): Pre-built scorers designed for common use cases.
2. [Function-based scorers](../evaluation/custom-scorers#function-based-scorers): Simple Python functions decorated with `@weave.op`.
3. [Class-based scorers](../evaluation/custom-scorers#class-based-scorers): Python classes that inherit from `weave.Scorer` for more complex evaluations.
@@ -67,10 +75,14 @@ The options available depend on whether you are using Typescript or Python:

</TabItem>
<TabItem value="typescript" label="TypeScript">
Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for Typescript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you must use Python.
Only [function-based scorers](../evaluation/custom-scorers#function-based-scorers) are available for TypeScript. For [class-based](../evaluation/custom-scorers.md#class-based-scorers) and [predefined scorers](../evaluation/predefined-scorers.md), you will need to use Python.
</TabItem>
</Tabs>
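
For example, a predefined scorer is instantiated like any other class and combined with custom scorers in the list passed to an evaluation. The following sketch assumes a `ValidJSONScorer` exported from `weave.scorers`; check the [predefined scorers page](../evaluation/predefined-scorers.md) for the exact classes available in your version.

```python
# Assumed import path and class name; see the predefined scorers page for what is available.
from weave.scorers import ValidJSONScorer

# Predefined scorers are combined with custom scorers (such as a `match_score` function)
# in the list of scorers handed to the evaluation.
scorers = [ValidJSONScorer(), match_score]
```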

:::tip
Learn more about [how scorers work in evaluations and how to use them](../evaluation/scorers.md).
:::

Next, [define an evaluation target](#define-an-evaluation-target).

### Define an evaluation target
@@ -79,7 +91,8 @@ Once your test dataset and scoring functions are defined, you can define the tar

#### Evaluate a `Model`

To evaluate a `Model`, call `evaluate` on using an `Evaluation`. `Models` are used when you have attributes that you want to experiment with and capture in Weave.
`Models` are used when you have attributes that you want to experiment with and capture in Weave.
To evaluate a `Model`, use the `evaluate` method from `Evaluation`.

The following example runs `predict()` on each example and scores the output with each scoring function defined in the `scorers` list using the `examples` dataset.

@@ -92,7 +105,7 @@ class MyModel(Model):

    @weave.op()
    def predict(self, question: str):
        # here's where you would add your LLM call and return the output
        # Here's where you would add your LLM call and return the output
        return {'generated_text': 'Hello, ' + self.prompt}

model = MyModel(prompt='World')
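
# A rough sketch of the remaining steps, assuming the `examples` dataset and a scorer
# such as `match_score` defined earlier, with `Evaluation` imported from weave and
# `import asyncio` at the top of the script. `evaluate` is asynchronous, so it is
# driven with asyncio; the summary of scores is printed when the run completes.
evaluation = Evaluation(dataset=examples, scorers=[match_score])
print(asyncio.run(evaluation.evaluate(model)))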
@@ -176,7 +189,7 @@ evaluation = Evaluation(
)
```

You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the Evaluation object name. In the UI, you will see the display name if set. Otherwise, the Evaluation object name will be used.
You can also change the name of individual evaluations by setting the `display_name` key of the `__weave` dictionary. Using the `__weave` dictionary sets the call display name which is distinct from the `Evaluation` object name. In the UI, you will see the display name if set. Otherwise, the `Evaluation` object name will be used.

```python
evaluation = Evaluation(
15 changes: 7 additions & 8 deletions docs/docs/guides/evaluation/custom-scorers.md
@@ -1,9 +1,9 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Custom Scorers
# Custom scorers

In Weave, you can create your own custom scorers. The Scorers can either be class-based or function-based.
In Weave, you can create your own custom [scorers](../evaluation/scorers.md). The scorers can either be class-based or function-based.

:::tip
If you're using Python, there are various predefined scorers available for common use cases. For more information, see [Select a predefined scorer](../evaluation/predefined-scorers.md#select-a-predefined-scorer) on the [Predefined scorers page](../evaluation/predefined-scorers.md).
@@ -13,9 +13,9 @@ If you're using Python, there are various predefined scorers available for comm

Choosing the right type of custom scorer depends on your evaluation needs:

- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. Function-based scorers are available in both Python and Typescript.
- [Function-based scorers](#function-based-scorers): Use if your evaluation logic is simple and can be implemented in a single function. Examples include checking if text is uppercase, validating a specific condition, and applying straightforward transformations. **Function-based scorers are available in both Python and TypeScript.**

- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. Class-based scorers are only available in Python.
- [Class-based scorers](#class-based-scorers): Use if your evaluation requires advanced logic, maintaining metadata, or multiple steps. Examples include keeping track of additional scorer metadata, trying different prompts for your LLM-evaluators, and making multiple function calls. **Class-based scorers are only available in Python.**

## Function-based scorers

@@ -31,7 +31,7 @@ Function-based scorers are available in both Python and Typescript.
- Is decorated with `@weave.op`.
- Returns a dictionary.

### Example
#### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:

```python
@@ -59,7 +59,7 @@ Function-based scorers are available in both Python and Typescript.

Optionally, the function can accept a `datasetRow`.

### Example
#### Example
The following example shows `evaluate_uppercase`, which checks if the text is uppercase:

```typescript
@@ -81,7 +81,7 @@ Function-based scorers are available in both Python and Typescript.
</Tabs>


## Class-based Scorers
## Class-based scorers

:::note
This feature is not available in TypeScript. All usage instructions and code examples in this section are for Python.
@@ -109,7 +109,6 @@ from weave import Scorer

llm_client = OpenAI()

#highlight-next-line
class SummarizationScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "Evaluate whether the summary is good."
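
    # A rough sketch of how the scorer's judging method might look; the prompt handling
    # and the score key below are illustrative assumptions, not the guide's exact code.
    @weave.op()
    def score(self, model_output: str) -> dict:
        response = llm_client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": model_output},
            ],
        )
        return {"summary_quality": response.choices[0].message.content}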
