. The cookbook also includes an advanced example of a Real Time Audio API based assistant integrated with Weave.
diff --git a/docs/docs/guides/core-types/prompts.md b/docs/docs/guides/core-types/prompts.md
new file mode 100644
index 00000000000..9a2d50ecf2b
--- /dev/null
+++ b/docs/docs/guides/core-types/prompts.md
@@ -0,0 +1,373 @@
+# Prompts
+
+Creating, evaluating, and refining prompts is a core activity for AI engineers.
+Small changes to a prompt can have big impacts on your application's behavior.
+Weave lets you create prompts, save and retrieve them, and evolve them over time.
+Some of the benefits of Weave's prompt management system are:
+
+- Unopinionated core, with a batteries-included option for rapid development
+- Versioning that shows you how a prompt has evolved over time
+- The ability to update a prompt in production without redeploying your application
+- The ability to evaluate a prompt against many inputs to measure performance
+
+## Getting started
+
+If you want complete control over how a Prompt is constructed, you can subclass `weave.Prompt`, `weave.StringPrompt`, or `weave.MessagesPrompt` and implement the corresponding `format` method. When you publish one of these objects with `weave.publish`, it will appear in your Weave project on the "Prompts" page.
+
+```python
+class Prompt(Object):
+ def format(self, **kwargs: Any) -> Any:
+ ...
+
+class StringPrompt(Prompt):
+ def format(self, **kwargs: Any) -> str:
+ ...
+
+class MessagesPrompt(Prompt):
+ def format(self, **kwargs: Any) -> list:
+ ...
+```
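+
+For example, a minimal custom string prompt might look like this (a sketch; the class name and placeholder are illustrative):
+
+```python
+import weave
+
+class GreetingPrompt(weave.StringPrompt):
+    def format(self, **kwargs) -> str:
+        # Fill the placeholder with whatever keyword arguments are passed in
+        return "Hello, {name}! Please introduce yourself.".format(**kwargs)
+
+weave.init("intro-example")
+weave.publish(GreetingPrompt(), name="greeting-prompt")
+```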
+
+Weave also includes a "batteries-included" class called `EasyPrompt` that can be simpler to start with, especially if you are working with APIs that are similar to OpenAI. This document highlights the features you get with EasyPrompt.
+
+## Constructing prompts
+
+You can think of the EasyPrompt object as a list of messages with associated roles, optional
+placeholder variables, and an optional model configuration.
+But constructing a prompt can be as simple as providing a single string:
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("What's 23 * 42?")
+assert prompt[0] == {"role": "user", "content": "What's 23 * 42?"}
+```
+
+For terseness, the weave library aliases the `EasyPrompt` class to `P`.
+
+```python
+from weave import P
+p = P("What's 23 * 42?")
+```
+
+It is common for a prompt to consist of multiple messages. Each message has an associated `role`.
+If the role is omitted, it defaults to `"user"`.
+
+**Some common roles**
+
+| Role | Description |
+| --------- | -------------------------------------------------------------------------------------------------------------------- |
+| system | System prompts provide high level instructions and can be used to set the behavior, knowledge, or persona of the AI. |
+| user | Represents input from a human user. (This is the default role.) |
+| assistant | Represents the AI's generated replies. Can be used for historical completions or to show examples. |
+
+For convenience, you can prefix a message string with one of these known roles:
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("system: Talk like a pirate")
+assert prompt[0] == {"role": "system", "content": "Talk like a pirate"}
+
+# An explicit role parameter takes precedence
+prompt = weave.EasyPrompt("system: Talk like a pirate", role="user")
+assert prompt[0] == {"role": "user", "content": "system: Talk like a pirate"}
+
+```
+
+Messages can be appended to a prompt one-by-one:
+
+```python
+import weave
+
+prompt = weave.EasyPrompt()
+prompt.append("You are an expert travel consultant.", role="system")
+prompt.append("Give me five ideas for top kid-friendly attractions in New Zealand.")
+```
+
+Or you can append multiple messages at once, either with the `append` method or with the `Prompt`
+constructor, which is convenient for constructing a prompt from existing messages.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt()
+prompt.append([
+ {"role": "system", "content": "You are an expert travel consultant."},
+ "Give me five ideas for top kid-friendly attractions in New Zealand."
+])
+
+# Same
+prompt = weave.EasyPrompt([
+ {"role": "system", "content": "You are an expert travel consultant."},
+ "Give me five ideas for top kid-friendly attractions in New Zealand."
+])
+```
+
+The Prompt class is designed to be easily inserted into existing code.
+For example, you can quickly wrap it around all of the arguments to the
+OpenAI chat completion `create` call including its messages and model
+configuration. If you don't wrap the inputs, Weave's integration would still
+track all of the call's inputs, but it would not extract them as a separate
+versioned object. Having a separate Prompt object allows you to version
+the prompt, easily filter calls by that version, etc.
+
+```python
+from weave import init, P
+from openai import OpenAI
+client = OpenAI()
+
+# Must specify a target project, otherwise the Weave code is a no-op
+# highlight-next-line
+init("intro-example")
+
+# highlight-next-line
+response = client.chat.completions.create(P(
+ model="gpt-4o-mini",
+ messages=[
+ {"role": "user", "content": "What's 23 * 42?"}
+ ],
+ temperature=0.7,
+ max_tokens=64,
+ top_p=1
+# highlight-next-line
+))
+```
+
+:::note
+Why this works: Weave's OpenAI integration wraps the OpenAI `create` method to make it a Weave Op.
+When the Op is executed, the Prompt object in the input will get saved and associated with the Call.
+However, it will be replaced with the structure the `create` method expects for the execution of the
+underlying function.
+:::
+
+## Parameterizing prompts
+
+When specifying a prompt, you can include placeholders for values you want to fill in later. These placeholders are called "Parameters".
+Parameters are indicated with curly braces. Here's a simple example:
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("What's {A} + {B}?")
+```
+
+You specify values for all of the parameters, or "bind" them, when you [use the prompt](#using-prompts).
+
+The `require` method of Prompt allows you to associate parameters with restrictions that will be checked at bind time to detect programming errors.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("What's {A} + 42?")
+prompt.require("A", type="int", min=0, max=100)
+
+prompt = weave.EasyPrompt("system: You are a {profession}")
+prompt.require("profession", oneof=('pirate', 'cartoon mouse', 'hungry dragon'), default='pirate')
+```
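+
+A quick sketch of how a violated requirement might surface at bind time (the exact exception type is an assumption here):
+
+```python
+# "ninja" is not in the allowed set declared with require(...)
+try:
+    prompt(profession="ninja")
+except Exception as e:
+    print("Parameter validation failed:", e)
+```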
+
+## Using prompts
+
+You use a Prompt by converting it into a list of messages where all template placeholders have been filled in. You can bind a prompt to parameter values with the `bind` method or by simply calling it as a function. Here's an example where the prompt has zero parameters.
+
+```python
+import weave
+prompt = weave.EasyPrompt("What's 23 * 42?")
+assert prompt() == prompt.bind() == [
+ {"role": "user", "content": "What's 23 * 42?"}
+]
+```
+
+If a prompt has parameters, you would specify values for them when you use the prompt.
+Parameter values can be passed in as a dictionary or as keyword arguments.
+
+```python
+import weave
+prompt = weave.EasyPrompt("What's {A} + {B}?")
+assert prompt(A=5, B="10") == prompt({"A": 5, "B": "10"})
+```
+
+If any parameters are missing, they will be left unsubstituted in the output.
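+
+For example (a sketch; the exact passthrough formatting is an assumption):
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("What's {A} + {B}?")
+# Only A is bound, so {B} should remain as a literal placeholder
+print(prompt(A=5))
+# e.g. [{"role": "user", "content": "What's 5 + {B}?"}]
+```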
+
+Here's a complete example of using a prompt with OpenAI. This example also uses [Weave's OpenAI integration](../integrations/openai.md) to automatically log the prompt and response.
+
+```python
+import weave
+from openai import OpenAI
+client = OpenAI()
+
+weave.init("intro-example")
+prompt = weave.EasyPrompt()
+prompt.append("You will be provided with a tweet, and your task is to classify its sentiment as positive, neutral, or negative.", role="system")
+prompt.append("I love {this_thing}!")
+
+response = client.chat.completions.create(
+ model="gpt-4o-mini",
+ messages=prompt(this_thing="Weave"),
+ temperature=0.7,
+ max_tokens=64,
+ top_p=1
+)
+```
+
+## Publishing to server
+
+Prompts are a type of [Weave object](../tracking/objects.md) and use the same methods for publishing to the Weave server.
+You must specify a destination project name with `weave.init` before you can publish a prompt.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt()
+prompt.append("What's 23 * 42?")
+
+weave.init("intro-example") # Use entity/project format if not targeting your default entity
+weave.publish(prompt, name="calculation-prompt")
+```
+
+Weave will automatically determine if the object has changed and only publish a new version if it has.
+You can also specify a name or description for the Prompt as part of its constructor.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt(
+ "What's 23 * 42?",
+ name="calculation-prompt",
+ description="A prompt for calculating the product of two numbers.",
+)
+
+weave.init("intro-example")
+weave.publish(prompt)
+```
+
+## Retrieving from server
+
+Prompts are a type of [Weave object](../tracking/objects.md) and use the same methods for retrieval from the Weave server.
+You must specify a source project name with `weave.init` before you can retrieve a prompt.
+
+```python
+import weave
+
+weave.init("intro-example")
+prompt = weave.ref("calculation-prompt").get()
+```
+
+By default, the latest version of the prompt is returned. You can make this explicit or select a specific version by providing its version id.
+
+```python
+import weave
+
+weave.init("intro-example")
+prompt = weave.ref("calculation-prompt:latest").get()
+# ":", for example:
+prompt = weave.ref("calculation-prompt:QSLzr96CTzFwLWgFFi3EuawCI4oODz4Uax98SxIY79E").get()
+```
+
+It is also possible to retrieve a Prompt without calling `init` if you pass a fully qualified URI to `weave.ref`.
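+
+For example (a sketch; the entity, project, and ref format shown are illustrative):
+
+```python
+import weave
+
+# Fully qualified refs include the entity and project, so no weave.init is needed
+prompt = weave.ref("weave:///your-entity/intro-example/object/calculation-prompt:latest").get()
+```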
+
+## Loading and saving from files
+
+Prompts can be saved to files and loaded from files. This can be convenient if you want your Prompt to be versioned through
+a mechanism other than Weave such as git, or as a fallback if Weave is not available.
+
+To save a prompt to a file, you can use the `dump_file` method.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt("What's 23 * 42?")
+prompt.dump_file("~/prompt.json")
+```
+
+and load it again later with `Prompt.load_file`.
+
+```python
+import weave
+
+prompt = weave.EasyPrompt.load_file("~/prompt.json")
+```
+
+You can also use the lower level `dump` and `Prompt.load` methods for custom (de)serialization.
+
+## Evaluating prompts
+
+The [Parameter feature of prompts](#parameterizing-prompts) can be used to execute or evaluate variations of a prompt.
+
+You can bind each row of a [Dataset](./datasets.md) to generate N variations of a prompt.
+
+```python
+import weave
+
+# Create a dataset
+dataset = weave.Dataset(name='countries', rows=[
+ {'id': '0', 'country': "Argentina"},
+ {'id': '1', 'country': "Belize"},
+ {'id': '2', 'country': "Canada"},
+ {'id': '3', 'country': "New Zealand"},
+])
+
+prompt = weave.EasyPrompt(name='travel_agent')
+prompt.append("You are an expert travel consultant.", role="system")
+prompt.append("Tell me the capital of {country} and about five kid-friendly attractions there.")
+
+
+prompts = prompt.bind_rows(dataset)
+assert prompts[2][1]["content"] == "Tell me the capital of Canada and about five kid-friendly attractions there."
+```
+
+You can extend this into an [Evaluation](./evaluations.md):
+
+```python
+import asyncio
+
+import openai
+import weave
+
+weave.init("intro-example")
+
+# Create a dataset
+dataset = weave.Dataset(name='countries', rows=[
+ {'id': '0', 'country': "Argentina", 'capital': "Buenos Aires"},
+ {'id': '1', 'country': "Belize", 'capital': "Belmopan"},
+ {'id': '2', 'country': "Canada", 'capital': "Ottawa"},
+ {'id': '3', 'country': "New Zealand", 'capital': "Wellington"},
+])
+
+# Create a prompt
+prompt = weave.EasyPrompt(name='travel_agent')
+prompt.append("You are an expert travel consultant.", role="system")
+prompt.append("Tell me the capital of {country} and about five kid-friendly attractions there.")
+
+# Create a model, combining a prompt with model configuration
+class TravelAgentModel(weave.Model):
+
+ model_name: str
+ prompt: weave.EasyPrompt
+
+ @weave.op
+    async def predict(self, country: str) -> str:
+ client = openai.AsyncClient()
+
+ response = await client.chat.completions.create(
+ model=self.model_name,
+ messages=self.prompt(country=country),
+ )
+ result = response.choices[0].message.content
+ if result is None:
+ raise ValueError("No response from model")
+ return result
+
+# Define and run the evaluation
+@weave.op
+def mentions_capital_scorer(capital: str, output: str) -> dict:
+    return {'correct': capital in output}
+
+model = TravelAgentModel(model_name="gpt-4o-mini", prompt=prompt)
+evaluation = weave.Evaluation(
+ dataset=dataset,
+ scorers=[mentions_capital_scorer],
+)
+asyncio.run(evaluation.evaluate(model))
+
+```
diff --git a/docs/docs/guides/evaluation/scorers.md b/docs/docs/guides/evaluation/scorers.md
new file mode 100644
index 00000000000..ce7ea3b86c1
--- /dev/null
+++ b/docs/docs/guides/evaluation/scorers.md
@@ -0,0 +1,670 @@
+# Evaluation Metrics
+
+## Evaluations in Weave
+In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary of results. Scorers can use your input data as reference if needed and can also output extra information, such as explanations or reasonings from the evaluation.
+
+Scorers are passed to a `weave.Evaluation` object during evaluation. There are two types of Scorers in weave:
+
+1. **Function-based Scorers:** Simple Python functions decorated with `@weave.op`.
+2. **Class-based Scorers:** Python classes that inherit from `weave.Scorer` for more complex evaluations.
+
+Scorers must return a dictionary and can return multiple metrics, nested metrics, and non-numeric values such as text returned from an LLM-evaluator about its reasoning.
+
+## Create your own Scorers
+### Function-based Scorers
+These are functions decorated with `@weave.op` that return a dictionary. They're great for simple evaluations like:
+
+```python
+import weave
+
+@weave.op
+def evaluate_uppercase(text: str) -> dict:
+ return {"text_is_uppercase": text.isupper()}
+
+my_eval = weave.Evaluation(
+ dataset=[{"text": "HELLO WORLD"}],
+ scorers=[evaluate_uppercase]
+)
+```
+
+When the evaluation is run, `evaluate_uppercase` checks if the text is all uppercase.
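+
+To actually run the evaluation, pass a model (any `@weave.op` function or `weave.Model`) to `evaluate`. A minimal sketch, where the toy model and project name are illustrative:
+
+```python
+import asyncio
+import weave
+
+weave.init("intro-example")
+
+@weave.op
+def shout(text: str) -> str:
+    # Toy "model": its input comes from the dataset's "text" column
+    return text.upper()
+
+asyncio.run(my_eval.evaluate(shout))
+```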
+
+### Class-based Scorers
+For more advanced evaluations, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM-evaluators, or make multiple function calls, you can use the `Scorer` class.
+
+**Requirements:**
+1. Inherit from `weave.Scorer`.
+2. Define a `score` method decorated with `@weave.op`.
+3. The `score` method must return a dictionary.
+
+Example:
+
+
+```python
+import weave
+from openai import OpenAI
+from weave import Scorer
+
+llm_client = OpenAI()
+
+#highlight-next-line
+class SummarizationScorer(Scorer):
+ model_id: str = "gpt-4o"
+ system_prompt: str = "Evaluate whether the summary is good."
+
+ @weave.op
+ def some_complicated_preprocessing(self, text: str) -> str:
+ processed_text = "Original text: \n" + text + "\n"
+ return processed_text
+
+ @weave.op
+    def call_llm(self, summary: str, processed_text: str) -> dict:
+        res = llm_client.chat.completions.create(
+            model=self.model_id,
+            messages=[
+                {"role": "system", "content": self.system_prompt},
+                {"role": "user", "content": (
+                    "Analyse how good the summary is compared to the original text.\n"
+                    f"Summary: {summary}\n{processed_text}"
+                )}])
+        return {"summary_quality": res}
+
+ @weave.op
+ def score(self, output: str, text: str) -> dict:
+ """Score the summary quality.
+
+ Args:
+ output: The summary generated by an AI system
+ text: The original text being summarized
+ """
+ processed_text = self.some_complicated_preprocessing(text)
+ eval_result = self.call_llm(summary=output, processed_text=processed_text)
+ return {"summary_quality": eval_result}
+
+summarization_scorer = SummarizationScorer()
+
+evaluation = weave.Evaluation(
+    dataset=[{"text": "The quick brown fox jumps over the lazy dog."}],
+    scorers=[summarization_scorer])
+```
+This class evaluates how good a summary is by comparing it to the original text.
+
+## How Scorers Work
+### Scorer Keyword Arguments
+Scorers can access both the output from your AI system and the input data from the dataset row.
+
+- **Input:** If you would like your scorer to use data from your dataset row, such as a "label" or "target" column, then you can make this available to the scorer by adding a `label` or `target` keyword argument to your scorer definition.
+
+For example, if you want to use a column called "label" from your dataset, your scorer function (or `score` class method) would have a parameter list like this:
+
+```python
+@weave.op
+def my_custom_scorer(output: str, label: int) -> dict:
+ ...
+```
+
+When a weave `Evaluation` is run, the output of the AI system is passed to the `output` parameter. The `Evaluation` also automatically tries to match any additional scorer argument names to your dataset columns. If customizing your scorer arguments or dataset columns is not feasible, you can use column mapping - see below for more.
+
+- **Output:** Include an `output` parameter in your scorer function's signature to access the AI system's output.
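+
+Putting both together, a sketch (the column names and toy model here are illustrative):
+
+```python
+import weave
+
+# The dataset has a "label" column; the Evaluation passes it to the scorer's
+# `label` argument and passes the model's result to `output`.
+dataset = [{"text": "2 + 2", "label": 4}]
+
+@weave.op
+def answer(text: str) -> int:
+    return 4  # toy model
+
+@weave.op
+def exact_match(output: int, label: int) -> dict:
+    return {"correct": output == label}
+
+evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
+# asyncio.run(evaluation.evaluate(answer))
+```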
+
+
+### Mapping Column Names with column_map
+Sometimes, the `score` method's argument names don't match the column names in your dataset. You can fix this using a `column_map`.
+
+If you're using a class-based scorer, pass a dictionary to the `column_map` attribute of `Scorer` when you initialise your scorer class. This dictionary maps your `score` method's argument names to the dataset's column names, in the order: `{scorer_keyword_argument: dataset_column_name}`.
+
+Example:
+
+```python
+import weave
+from weave import Scorer
+
+# A dataset with news articles to be summarised
+dataset = [
+ {"news_article": "The news today was great...", "date": "2030-04-20", "source": "Bright Sky Network"},
+ ...
+]
+
+# Scorer class
+class SummarizationScorer(Scorer):
+
+ @weave.op
+    def score(self, output: str, text: str) -> dict:
+        """
+        output: output summary from an LLM summarization system
+        text: the text being summarised
+        """
+ ... # evaluate the quality of the summary
+
+# create a scorer with a column mapping the `text` argument to the `news_article` data column
+scorer = SummarizationScorer(column_map={"text" : "news_article"})
+```
+
+Now, the `text` argument in the `score` method will receive data from the `news_article` dataset column.
+
+**Notes:**
+- Another equivalent option to map your columns is to subclass the `Scorer` and overload the `score` method mapping the columns explicitly.
+
+```python
+import weave
+from weave import Scorer
+
+class MySummarizationScorer(SummarizationScorer):
+
+ @weave.op
+    def score(self, output: str, news_article: str) -> dict:
+ # overload the score method and map columns manually
+ return super().score(output=output, text=news_article)
+```
+
+### Final summarization of the scorer
+
+During evaluation, the scorer is computed for each row of your dataset. To produce a final score for the whole evaluation, Weave applies an `auto_summarize` function based on the type of each returned value:
+
+- the average is computed for numerical columns
+- the count and fraction for boolean columns
+- other column types are ignored
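+
+For instance, a boolean metric is summarized into a count and fraction (an illustrative output shape, matching the result dictionaries shown later on this page):
+
+```python
+# Illustrative: per-row scores from a boolean metric...
+score_rows = [{"match": True}, {"match": True}, {"match": True}, {"match": False}]
+# ...are auto-summarized into counts and fractions, e.g.:
+# {"match": {"true_count": 3, "true_fraction": 0.75}}
+```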
+
+You can override the `summarize` method on the `Scorer` class and provide your own way of computing the final scores. The `summarize` function expects:
+
+- A single parameter `score_rows`: This is a list of dictionaries, where each dictionary contains the scores returned by the `score` method for a single row of your dataset.
+- It should return a dictionary containing the summarized scores.
+
+**Why is this useful?**
+
+Overriding `summarize` is useful when you need to see the scores for all rows before deciding on the final value for the dataset.
+
+```python
+import weave
+from weave import Scorer
+
+class MyBinaryScorer(Scorer):
+    """
+    Returns True if the full output matches the target, False if not.
+    """
+
+    @weave.op
+    def score(self, output, target) -> dict:
+        return {"match": output == target}
+
+ def summarize(self, score_rows: list) -> dict:
+ full_match = all(row["match"] for row in score_rows)
+ return {"full_match": full_match}
+```
+> In this example, the default `auto_summarize` would have returned the count and proportion of True.
+
+If you want to learn more, check the implementation of [CorrectnessLLMJudge](/tutorial-rag#optional-defining-a-scorer-class).
+
+## Predefined Scorers
+
+**Installation**
+
+To use Weave's predefined scorers you need to install some additional dependencies:
+
+```bash
+pip install weave[scorers]
+```
+
+**LLM-evaluators**
+
+The pre-defined scorers that use LLMs support the OpenAI, Anthropic, Google GenerativeAI, and MistralAI clients. They also use `weave`'s `InstructorLLMScorer` class, so you'll need to install the [`instructor`](https://github.com/instructor-ai/instructor) Python package to use them. You can get all necessary dependencies with `pip install "weave[scorers]"`.
+
+### `HallucinationFreeScorer`
+
+This scorer checks if your AI system's output includes any hallucinations based on the input data.
+
+```python
+from weave.scorers import HallucinationFreeScorer
+
+llm_client = ... # initialize your LLM client here
+
+scorer = HallucinationFreeScorer(
+ client=llm_client,
+    model_id="gpt-4o"
+)
+```
+
+**Customization:**
+- Customize the `system_prompt` and `user_prompt` attributes of the scorer to define what "hallucination" means for you.
+
+**Notes:**
+- The `score` method expects an input column named `context`. If your dataset uses a different name, use the `column_map` attribute to map `context` to the dataset column.
+
+Here is an example in the context of an evaluation:
+
+```python
+import asyncio
+from openai import OpenAI
+import weave
+from weave.scorers import HallucinationFreeScorer
+
+# Initialize clients and scorers
+llm_client = OpenAI()
+hallucination_scorer = HallucinationFreeScorer(
+ client=llm_client,
+ model_id="gpt-4o",
+ column_map={"context": "input", "output": "other_col"}
+)
+
+# Create dataset
+dataset = [
+ {"input": "John likes various types of cheese."},
+ {"input": "Pepe likes various types of cheese."},
+]
+
+@weave.op
+def model(input: str) -> str:
+ return "The person's favorite cheese is cheddar."
+
+# Run evaluation
+evaluation = weave.Evaluation(
+ dataset=dataset,
+ scorers=[hallucination_scorer],
+)
+result = asyncio.run(evaluation.evaluate(model))
+print(result)
+# {'HallucinationFreeScorer': {'has_hallucination': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': 1.4395725727081299}}
+```
+---
+
+### `SummarizationScorer`
+
+Use an LLM to compare a summary to the original text and evaluate the quality of the summary.
+
+```python
+from weave.scorers import SummarizationScorer
+
+llm_client = ... # initialize your LLM client here
+
+scorer = SummarizationScorer(
+ client=llm_client,
+    model_id="gpt-4o"
+)
+```
+
+**How It Works:**
+
+This scorer evaluates summaries in two ways:
+
+1. **Entity Density:** Checks the ratio of unique entities (like names, places, or things) mentioned in the summary to the total word count in the summary in order to estimate the "information density" of the summary. Uses an LLM to extract the entities. Similar to how entity density is used in the Chain of Density paper, https://arxiv.org/abs/2309.04269
+
+2. **Quality Grading:** Uses an LLM-evaluator to grade the summary as `poor`, `ok`, or `excellent`. These grades are converted to scores (0.0 for poor, 0.5 for ok, and 1.0 for excellent) so you can calculate averages.
+
+**Customization:**
+- Adjust `summarization_evaluation_system_prompt` and `summarization_evaluation_prompt` to define what makes a good summary.
+
+**Notes:**
+- This scorer uses the `InstructorLLMScorer` class.
+- The `score` method expects the original text that was summarized to be present in the `input` column of the dataset. Use the `column_map` class attribute to map `input` to the correct dataset column if needed.
+
+
+Here is an example of using the `SummarizationScorer` in the context of an evaluation:
+
+```python
+import asyncio
+from openai import OpenAI
+import weave
+from weave.scorers import SummarizationScorer
+
+class SummarizationModel(weave.Model):
+ @weave.op()
+ async def predict(self, input: str) -> str:
+ return "This is a summary of the input text."
+
+# Initialize clients and scorers
+llm_client = OpenAI()
+model = SummarizationModel()
+summarization_scorer = SummarizationScorer(
+ client=llm_client,
+ model_id="gpt-4o",
+)
+# Create dataset
+dataset = [
+ {"input": "The quick brown fox jumps over the lazy dog."},
+ {"input": "Artificial Intelligence is revolutionizing various industries."}
+]
+
+# Run evaluation
+evaluation = weave.Evaluation(dataset=dataset, scorers=[summarization_scorer])
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+# {'SummarizationScorer': {'is_entity_dense': {'true_count': 0, 'true_fraction': 0.0}, 'summarization_eval_score': {'mean': 0.0}, 'entity_density': {'mean': 0.0}}, 'model_latency': {'mean': 6.210803985595703e-05}}
+```
+
+---
+
+### `OpenAIModerationScorer`
+
+The `OpenAIModerationScorer` uses OpenAI's Moderation API to check if the AI system's output contains disallowed content, such as hate speech or explicit material.
+
+```python
+from weave.scorers import OpenAIModerationScorer
+from openai import OpenAI
+
+oai_client = OpenAI(api_key=...) # initialize your LLM client here
+
+scorer = OpenAIModerationScorer(
+ client=oai_client,
+    model_id="text-moderation-latest"
+)
+```
+
+**How It Works:**
+
+- Sends the AI's output to the OpenAI Moderation endpoint and returns a dictionary indicating whether the content is flagged and details about the categories involved.
+
+**Notes:**
+- Requires the `openai` Python package.
+- The client must be an instance of OpenAI's `OpenAI` or `AsyncOpenAI` client.
+
+
+Here is an example in the context of an evaluation:
+```python
+import asyncio
+from openai import OpenAI
+import weave
+from weave.scorers import OpenAIModerationScorer
+
+class MyModel(weave.Model):
+ @weave.op
+ async def predict(self, input: str) -> str:
+ return input
+
+# Initialize clients and scorers
+client = OpenAI()
+model = MyModel()
+moderation_scorer = OpenAIModerationScorer(client=client)
+
+# Create dataset
+dataset = [
+ {"input": "I love puppies and kittens!"},
+ {"input": "I hate everyone and want to hurt them."}
+]
+
+# Run evaluation
+evaluation = weave.Evaluation(dataset=dataset, scorers=[moderation_scorer])
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+# {'OpenAIModerationScorer': {'flagged': {'true_count': 1, 'true_fraction': 0.5}, 'categories': {'violence': {'true_count': 1, 'true_fraction': 1.0}}}, 'model_latency': {'mean': 9.500980377197266e-05}}
+```
+
+---
+
+### `EmbeddingSimilarityScorer`
+
+The `EmbeddingSimilarityScorer` computes the cosine similarity between the embeddings of the AI system's output and a target text from your dataset. It's useful for measuring how similar the AI's output is to a reference text.
+
+```python
+from weave.scorers import EmbeddingSimilarityScorer
+
+llm_client = ... # initialize your LLM client here
+
+similarity_scorer = EmbeddingSimilarityScorer(
+    client=llm_client,
+    column_map={"target": "reference_text"}, # map the scorer's `target` argument to your dataset column
+    threshold=0.4 # the cosine similarity threshold to use
+)
+```
+
+**Parameters:**
+
+- `target`: This scorer expects a `target` column in your dataset; it calculates the cosine similarity between the embeddings of the `target` column and the AI system output. If your dataset doesn't contain a column called `target`, you can use the scorer's `column_map` attribute to map `target` to the appropriate column name in your dataset. See the Column Mapping section for more.
+- `threshold` (float): The minimum cosine similarity score between the embedding of the AI system output and the embedding of the `target` above which the two samples are considered "similar" (defaults to `0.5`). `threshold` can range from -1 to 1:
+ - 1 indicates identical direction.
+ - 0 indicates orthogonal vectors.
+ - -1 indicates opposite direction.
+
+The right cosine similarity threshold can vary quite a lot depending on your use case; we advise exploring different thresholds.
+
+
+Here is an example of using the `EmbeddingSimilarityScorer` in the context of an evaluation:
+
+```python
+import asyncio
+from openai import OpenAI
+import weave
+from weave.scorers import EmbeddingSimilarityScorer
+
+# Initialize clients and scorers
+client = OpenAI()
+similarity_scorer = EmbeddingSimilarityScorer(
+ client=client,
+ threshold=0.7,
+ column_map={"target": "reference"}
+)
+
+# Create dataset
+dataset = [
+ {
+        "input": "His name is John",
+ "reference": "John likes various types of cheese.",
+ },
+ {
+        "input": "His name is Pepe.",
+ "reference": "Pepe likes various types of cheese.",
+ },
+]
+
+# Define model
+@weave.op
+def model(input: str) -> str:
+ return "John likes various types of cheese."
+
+# Run evaluation
+evaluation = weave.Evaluation(
+ dataset=dataset,
+ scorers=[similarity_scorer],
+)
+result = asyncio.run(evaluation.evaluate(model))
+print(result)
+# {'EmbeddingSimilarityScorer': {'is_similar': {'true_count': 1, 'true_fraction': 0.5}, 'similarity_score': {'mean': 0.8448514031462045}}, 'model_latency': {'mean': 0.45862746238708496}}
+```
+
+---
+
+### `ValidJSONScorer`
+
+The `ValidJSONScorer` checks whether the AI system's output is valid JSON. This scorer is useful when you expect the output to be in JSON format and need to verify its validity.
+
+```python
+from weave.scorers import ValidJSONScorer
+
+json_scorer = ValidJSONScorer()
+```
+
+Here is an example of using the `ValidJSONScorer` in the context of an evaluation:
+
+```python
+import asyncio
+import weave
+from weave.scorers import ValidJSONScorer
+
+class JSONModel(weave.Model):
+ @weave.op()
+ async def predict(self, input: str) -> str:
+ # This is a placeholder.
+ # In a real scenario, this would generate JSON.
+ return '{"key": "value"}'
+
+model = JSONModel()
+json_scorer = ValidJSONScorer()
+
+dataset = [
+ {"input": "Generate a JSON object with a key and value"},
+ {"input": "Create an invalid JSON"}
+]
+
+evaluation = weave.Evaluation(dataset=dataset, scorers=[json_scorer])
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+# {'ValidJSONScorer': {'json_valid': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': 8.58306884765625e-05}}
+```
+
+
+---
+
+### `ValidXMLScorer`
+
+The `ValidXMLScorer` checks whether the AI system's output is valid XML. This is useful when expecting XML-formatted outputs.
+
+```python
+from weave.scorers import ValidXMLScorer
+
+xml_scorer = ValidXMLScorer()
+```
+
+
+Here is an example of using the `ValidXMLScorer` in the context of an evaluation:
+
+```python
+import asyncio
+import weave
+from weave.scorers import ValidXMLScorer
+
+class XMLModel(weave.Model):
+ @weave.op()
+ async def predict(self, input: str) -> str:
+ # This is a placeholder. In a real scenario, this would generate XML.
+        return '<root>value</root>'
+
+model = XMLModel()
+xml_scorer = ValidXMLScorer()
+
+dataset = [
+ {"input": "Generate a valid XML with a root element"},
+ {"input": "Create an invalid XML"}
+]
+
+evaluation = weave.Evaluation(dataset=dataset, scorers=[xml_scorer])
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+# {'ValidXMLScorer': {'xml_valid': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': 8.20159912109375e-05}}
+```
+
+---
+
+### `PydanticScorer`
+
+The `PydanticScorer` validates the AI system's output against a Pydantic model to ensure it adheres to a specified schema or data structure.
+
+```python
+from weave.scorers import PydanticScorer
+from pydantic import BaseModel
+
+class FinancialReport(BaseModel):
+ revenue: int
+ year: str
+
+pydantic_scorer = PydanticScorer(model=FinancialReport)
+```
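+
+Here is a sketch of using it in an evaluation (the toy model, its JSON output, and the project name are illustrative):
+
+```python
+import asyncio
+import weave
+from pydantic import BaseModel
+from weave.scorers import PydanticScorer
+
+class FinancialReport(BaseModel):
+    revenue: int
+    year: str
+
+pydantic_scorer = PydanticScorer(model=FinancialReport)
+
+@weave.op
+def model(input: str) -> str:
+    # Toy model that returns a JSON string matching the schema
+    return '{"revenue": 100000, "year": "2024"}'
+
+dataset = [{"input": "Generate a financial report"}]
+
+weave.init("intro-example")
+evaluation = weave.Evaluation(dataset=dataset, scorers=[pydantic_scorer])
+asyncio.run(evaluation.evaluate(model))
+```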
+
+---
+
+### RAGAS - `ContextEntityRecallScorer`
+
+The `ContextEntityRecallScorer` estimates context recall by extracting entities from both the AI system's output and the provided context, then computing the recall score. It is based on the [RAGAS](https://github.com/explodinggradients/ragas) evaluation library.
+
+```python
+from weave.scorers import ContextEntityRecallScorer
+
+llm_client = ... # initialize your LLM client here
+
+entity_recall_scorer = ContextEntityRecallScorer(
+    client=llm_client,
+    model_id="your-model-id"
+)
+```
+
+**How It Works:**
+
+- Uses an LLM to extract unique entities from the output and context and calculates recall.
+- **Recall** indicates the proportion of important entities from the context that are captured in the output, helping to assess the model's effectiveness in retrieving relevant information.
+- Returns a dictionary with the recall score.
+
+**Notes:**
+
+- Expects a `context` column in your dataset; use `column_map` to map `context` to another dataset column if needed.
+
+---
+
+### RAGAS - `ContextRelevancyScorer`
+
+The `ContextRelevancyScorer` evaluates the relevancy of the provided context to the AI system's output. It helps determine if the context used is appropriate for generating the output. It is based on the [RAGAS](https://github.com/explodinggradients/ragas) evaluation library.
+
+```python
+from weave.scorers import ContextRelevancyScorer
+
+llm_client = ... # initialize your LLM client here
+
+relevancy_scorer = ContextRelevancyScorer(
+    client=llm_client,
+    model_id="your-model-id"
+)
+```
+
+**How It Works:**
+
+- Uses an LLM to rate the relevancy of the context to the output on a scale from 0 to 1.
+- Returns a dictionary with the `relevancy_score`.
+
+**Notes:**
+
+- Expects a `context` column in your dataset; use `column_map` to map `context` to another dataset column if needed.
+- Customize the `relevancy_prompt` to define how relevancy is assessed.
+
+
+Here is an example of using `ContextEntityRecallScorer` and `ContextRelevancyScorer` in the context of an evaluation:
+
+```python
+import asyncio
+from textwrap import dedent
+from openai import OpenAI
+import weave
+from weave.scorers import ContextEntityRecallScorer, ContextRelevancyScorer
+
+class RAGModel(weave.Model):
+ @weave.op()
+ async def predict(self, question: str) -> str:
+ "Retrieve relevant context"
+ return "Paris is the capital of France."
+
+
+model = RAGModel()
+
+# Define prompts
+relevancy_prompt: str = dedent("""
+ Given the following question and context, rate the relevancy of the context to the question on a scale from 0 to 1.
+
+ Question: {question}
+ Context: {context}
+ Relevancy Score (0-1):
+ """)
+
+# Initialize clients and scorers
+llm_client = OpenAI()
+entity_recall_scorer = ContextEntityRecallScorer(
+    client=llm_client,
+ model_id="gpt-4o",
+)
+
+relevancy_scorer = ContextRelevancyScorer(
+ client=llm_client,
+ model_id="gpt-4o",
+ relevancy_prompt=relevancy_prompt
+)
+
+# Create dataset
+dataset = [
+ {
+ "question": "What is the capital of France?",
+ "context": "Paris is the capital city of France."
+ },
+ {
+ "question": "Who wrote Romeo and Juliet?",
+ "context": "William Shakespeare wrote many famous plays."
+ }
+]
+
+# Run evaluation
+evaluation = weave.Evaluation(
+ dataset=dataset,
+ scorers=[entity_recall_scorer, relevancy_scorer]
+)
+results = asyncio.run(evaluation.evaluate(model))
+print(results)
+# {'ContextEntityRecallScorer': {'recall': {'mean': 0.3333333333333333}}, 'ContextRelevancyScorer': {'relevancy_score': {'mean': 0.5}}, 'model_latency': {'mean': 9.393692016601562e-05}}
+```
+
diff --git a/docs/docs/guides/integrations/langchain.md b/docs/docs/guides/integrations/langchain.md
index b382e793e70..4487a85dfd4 100644
--- a/docs/docs/guides/integrations/langchain.md
+++ b/docs/docs/guides/integrations/langchain.md
@@ -196,7 +196,7 @@ Evaluations help you measure the performance of your models. By using the [`weav
```python
-from weave.flow.scorer import MultiTaskBinaryClassificationF1
+from weave.scorers import MultiTaskBinaryClassificationF1
sentences = [
"There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
diff --git a/docs/docs/guides/tracking/costs.md b/docs/docs/guides/tracking/costs.md
index 8bcddeb2e0c..bedca15aa17 100644
--- a/docs/docs/guides/tracking/costs.md
+++ b/docs/docs/guides/tracking/costs.md
@@ -1,9 +1,5 @@
# Costs
-:::info
-Custom costs are accessible via Python and REST queries. UI uptake is under development and expected to be complete by middle of October 2024
-:::
-
## Adding a custom cost
You can add a custom cost by using the [`add_cost`](/reference/python-sdk/weave/trace/weave.trace.weave_client#method-add_cost) method.
diff --git a/docs/docs/media/multi-agent-structured-output/0.png b/docs/docs/media/multi-agent-structured-output/0.png
new file mode 100644
index 00000000000..a49c7d93219
Binary files /dev/null and b/docs/docs/media/multi-agent-structured-output/0.png differ
diff --git a/docs/docs/media/multi-agent-structured-output/1.png b/docs/docs/media/multi-agent-structured-output/1.png
new file mode 100644
index 00000000000..4aaea4187f7
Binary files /dev/null and b/docs/docs/media/multi-agent-structured-output/1.png differ
diff --git a/docs/docs/media/multi-agent-structured-output/2.png b/docs/docs/media/multi-agent-structured-output/2.png
new file mode 100644
index 00000000000..96007934546
Binary files /dev/null and b/docs/docs/media/multi-agent-structured-output/2.png differ
diff --git a/docs/docs/media/multi-agent-structured-output/3.png b/docs/docs/media/multi-agent-structured-output/3.png
new file mode 100644
index 00000000000..269f70399e1
Binary files /dev/null and b/docs/docs/media/multi-agent-structured-output/3.png differ
diff --git a/docs/docs/reference/gen_notebooks/audio_with_weave.md b/docs/docs/reference/gen_notebooks/audio_with_weave.md
new file mode 100644
index 00000000000..a8c6b45efc1
--- /dev/null
+++ b/docs/docs/reference/gen_notebooks/audio_with_weave.md
@@ -0,0 +1,1198 @@
+---
+title: Log Audio With Weave
+---
+
+
+:::tip[This is a notebook]
+
+
+ );
+
+ return (
+
+ {/* setting the width to the width of the screen minus the sidebar width because of overflow: 'hidden' properties in SimplePageLayout causing issues */}
+