import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Create your own Scorers

In Weave, you can create your own custom Scorers. Scorers can be either class-based or function-based. For more information about Scorers, see the Scorers overview.

### Function-based Scorers

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
These are functions decorated with `@weave.op` that return a dictionary. They're great for simple evaluations like:

```python
import weave

@weave.op
def evaluate_uppercase(text: str) -> dict:
    return {"text_is_uppercase": text.isupper()}

my_eval = weave.Evaluation(
    dataset=[{"text": "HELLO WORLD"}],
    scorers=[evaluate_uppercase]
)
```

When the evaluation is run, `evaluate_uppercase` checks if the text is all uppercase.
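
To run the evaluation, pass a model to `Evaluation.evaluate`. Here is a minimal sketch, continuing from the snippet above; the project name and the echo model are placeholders for illustration, not part of the original example:

```python
import asyncio
import weave

weave.init("uppercase-demo")  # hypothetical project name

# A trivial "model" op so the evaluation has something to call;
# a real model would generate its own output.
@weave.op
def model(text: str) -> str:
    return text

asyncio.run(my_eval.evaluate(model))
```

Scorer arguments are matched to dataset columns by name, so `evaluate_uppercase` receives each row's `text`.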

</TabItem>
<TabItem value="typescript" label="TypeScript">
These are functions wrapped with `weave.op` that accept an object with `modelOutput` and optionally `datasetRow`. They're great for simple evaluations like:
```typescript
import * as weave from 'weave'

const evaluateUppercase = weave.op(
    ({modelOutput}) => modelOutput.toUpperCase() === modelOutput,
    {name: 'textIsUppercase'}
);

const myEval = new weave.Evaluation({
    dataset: [{text: 'HELLO WORLD'}],
    scorers: [evaluateUppercase],
})
```
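
As in Python, you kick off the evaluation with `evaluate`, passing a model op. A minimal sketch, continuing from the snippet above; the project name and the echo model are placeholders:

```typescript
const main = async () => {
    await weave.init('uppercase-demo')  // hypothetical project name

    // A trivial model op; it receives the dataset row and returns an
    // output for the scorer to check.
    const model = weave.op(({datasetRow}) => datasetRow.text, {name: 'echoModel'})

    const results = await myEval.evaluate({model})
    console.log(results)
}
main()
```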

</TabItem>
</Tabs>

### Class-based Scorers

<Tabs groupId="programming-language">
<TabItem value="python" label="Python" default>
For more advanced evaluations, you can use the `Scorer` class, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM evaluators, or make multiple function calls.

**Requirements:**

1. Inherit from `weave.Scorer`.
2. Define a `score` method decorated with `@weave.op`.
3. The `score` method must return a dictionary.

Example:

```python
import weave
from openai import OpenAI
from weave import Scorer

llm_client = OpenAI()

#highlight-next-line
class SummarizationScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "Evaluate whether the summary is good."

    @weave.op
    def some_complicated_preprocessing(self, text: str) -> str:
        processed_text = "Original text: \n" + text + "\n"
        return processed_text

    @weave.op
    def call_llm(self, summary: str, processed_text: str) -> dict:
        res = llm_client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": (
                    "Analyse how good the summary is compared to the original text.\n"
                    f"Summary: {summary}\n{processed_text}"
                )},
            ],
        )
        return {"summary_quality": res.choices[0].message.content}

    @weave.op
    def score(self, output: str, text: str) -> dict:
        """Score the summary quality.

        Args:
            output: The summary generated by an AI system
            text: The original text being summarized
        """
        processed_text = self.some_complicated_preprocessing(text)
        return self.call_llm(summary=output, processed_text=processed_text)

summarization_scorer = SummarizationScorer()
evaluation = weave.Evaluation(
    dataset=[{"text": "The quick brown fox jumps over the lazy dog."}],
    scorers=[summarization_scorer],
)
```

This class evaluates how good a summary is by comparing it to the original text.
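
Running it works the same way as with function-based scorers: pass a model to `evaluate`. A minimal sketch, continuing from the snippet above; the project name and the stub summarizer are placeholders (a real model would call an LLM):

```python
import asyncio
import weave

weave.init("summarization-demo")  # hypothetical project name

# A stub summarization "model": its output and the dataset's `text`
# column are matched to the `score` method's arguments by name.
@weave.op
def summarization_model(text: str) -> str:
    return text.split(".")[0] + "."  # naive "summary": the first sentence

asyncio.run(evaluation.evaluate(summarization_model))
```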

</TabItem>
<TabItem value="typescript" label="TypeScript">
```plaintext
This feature is not available in TypeScript yet. Stay tuned!
```
</TabItem>
</Tabs>