diff --git a/docs/docs/cookbooks/dspy_prompt_optimization.md b/docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md similarity index 77% rename from docs/docs/cookbooks/dspy_prompt_optimization.md rename to docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md index 363389f5fb35..af8e46a13fca 100644 --- a/docs/docs/cookbooks/dspy_prompt_optimization.md +++ b/docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md @@ -1,14 +1,24 @@ --- +title: Excelling at BIG-Bench Hard tasks Using DSPy and Weave hide_table_of_contents: true --- -# Optimizing LLM Workflows Using DSPy and Weave -[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/weave/blob/master/docs/docs/cookbooks/notebooks/dspy_prompt_optimization.ipynb) +:::tip[This is a notebook] + +
+Open in Colab
+
+View in Github
+ +::: + + + +# Optimizing LLM Workflows Using DSPy and Weave The [BIG-bench (Beyond the Imitation Game Benchmark)](https://github.com/google/BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities consisting of more than 200 tasks. The [BIG-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard) is a suite of 23 most challenging BIG-Bench tasks that can be quite difficult to be solved using the current generation of language models. -This tutorial demonstrates how we can improve the performance of our LLM workflow implemented on the **causal judgement task** from the BIG-bench Hard benchmark and evaluate our prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) for implementing our LLM workflow and optimizing our prompting strategy. We will also use [Weave](../introduction.md) to track our LLM workflow and evaluate our prompting strategies. +This tutorial demonstrates how we can improve the performance of our LLM workflow implemented on the **causal judgement task** from the BIG-bench Hard benchmark and evaluate our prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) for implementing our LLM workflow and optimizing our prompting strategy. We will also use [Weave](../docs/introduction.md) to track our LLM workflow and evaluate our prompting strategies. ## Installing the Dependencies @@ -18,12 +28,14 @@ We need the following libraries for this tutorial: - [Weave](../introduction.md) to track our LLM workflow and evaluate our prompting strategies. - [datasets](https://huggingface.co/docs/datasets/index) to access the Big-Bench Hard dataset from HuggingFace Hub. + ```python !pip install -qU dspy-ai weave datasets ``` Since we'll be using [OpenAI API](https://openai.com/index/openai-api/) as the LLM Vendor, we will also need an OpenAI API key. You can [sign up](https://platform.openai.com/signup) on the OpenAI platform to get your own API key. + ```python import os from getpass import getpass @@ -34,7 +46,8 @@ os.environ["OPENAI_API_KEY"] = api_key ## Enable Tracking using Weave -Weave is currently integrated with DSPy, and including [`weave.init`](../api-reference/python/weave.md#function-init) at the start of our code lets us automatically trace our DSPy functions which can be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../guides/integrations/dspy.md) to learn more. +Weave is currently integrated with DSPy, and including [`weave.init`](../docs/reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our DSPy functions which can be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../docs/guides/integrations/dspy.md) to learn more. + ```python import weave @@ -42,7 +55,8 @@ import weave weave.init(project_name="dspy-bigbench-hard") ``` -In this tutorial, we use a metadata class inherited from [`weave.Model`](../guides/core-types/models.md) to manage our metadata. +In this tutorial, we use a metadata class inherited from [`weave.Model`](../docs/guides/core-types/models.md) to manage our metadata. 
+ ```python class Metadata(weave.Model): @@ -58,13 +72,14 @@ class Metadata(weave.Model): metadata = Metadata() ``` -| ![](./assets/dspy_prompt_optimization/metadata.gif) | +| ![](../static/img/dspy_prompt_optimiztion/metadata.gif) | |---| | The `Metadata` objects are automatically versioned and traced when functions consuming them are traced | ## Load the BIG-Bench Hard Dataset -We will load this dataset from HuggingFace Hub, split into training and validation sets, and [publish](./../guides/core-types/datasets.md) them on Weave, this will let us version the datasets, and also use [`weave.Evaluation`](./../guides/core-types/evaluations.md) to evaluate our prompting strategy. +We will load this dataset from HuggingFace Hub, split into training and validation sets, and [publish](../docs/guides/core-types/datasets.md) them on Weave, this will let us version the datasets, and also use [`weave.Evaluation`](../docs/guides/core-types/evaluations.md) to evaluate our prompting strategy. + ```python import dspy @@ -85,7 +100,7 @@ def get_dataset(metadata: Metadata): dspy_train_examples = [dspy.Example(row).with_inputs("question") for row in train_rows] dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows] - # publish the datasets to the Weave, this will let us version the data and use for evaluation + # publish the datasets to the Weave, this would let us version the data and use for evaluation weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_train", rows=train_rows)) weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_val", rows=val_rows)) @@ -95,7 +110,7 @@ def get_dataset(metadata: Metadata): dspy_train_examples, dspy_val_examples = get_dataset(metadata) ``` -| ![](./assets/dspy_prompt_optimization/datasets.gif) | +| ![](../static/img/dspy_prompt_optimiztion/datasets.gif) | |---| | The datasets, once published, can be explored in the Weave UI | @@ -105,10 +120,10 @@ dspy_train_examples, dspy_val_examples = get_dataset(metadata) We will use the [`dspy.OpenAI`](https://dspy-docs.vercel.app/api/language_model_clients/OpenAI) abstraction to make LLM calls to [GPT3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo). + ```python system_prompt = """ -You are an expert in the field of causal reasoning. -You are to analyze the a given question carefully and answer in `Yes` or `No`. +You are an expert in the field of causal reasoning. You are to analyze the a given question carefully and answer in `Yes` or `No`. You should also provide a detailed explanation justifying your answer. """ @@ -120,6 +135,7 @@ dspy.settings.configure(lm=llm) A [signature](https://dspy-docs.vercel.app/docs/building-blocks/signatures) is a declarative specification of input/output behavior of a [DSPy module](https://dspy-docs.vercel.app/docs/building-blocks/modules) which are task-adaptive components—akin to neural network layers—that abstract any particular text transformation. + ```python from pydantic import BaseModel, Field @@ -151,6 +167,7 @@ class CausalReasoningModule(dspy.Module): Let's test our LLM workflow, i.e., the `CausalReasoningModule` on an example from the causal reasoning subset of Big-Bench Hard. 
+ ```python import rich @@ -160,16 +177,17 @@ prediction = baseline_module(dspy_train_examples[0]["question"]) rich.print(prediction) ``` -| ![](./assets/dspy_prompt_optimization/dspy_module_trace.gif) | +| ![](../static/img/dspy_prompt_optimiztion/dspy_module_trace.gif) | |---| | Here's how you can explore the traces of the `CausalReasoningModule` in the Weave UI | ## Evaluating our DSPy Program -Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](./../guides/core-types/evaluations.md) on a simple metric that matches the predicted answer with the ground truth. Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores. +Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](../docs/guides/core-types/evaluations.md) on a simple metric that matches the predicted answer with the ground truth. Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores. First, we need to create a simple weave evaluation scoring function that tells whether the answer from the baseline module's output is the same as the ground truth answer or not. Scoring functions need to have a `model_output` keyword argument, but the other arguments are user defined and are taken from the dataset examples. It will only take the necessary keys by using a dictionary key based on the argument name. + ```python @weave.op() def weave_evaluation_scorer(answer: str, model_output: Output) -> dict: @@ -180,6 +198,7 @@ def weave_evaluation_scorer(answer: str, model_output: Output) -> dict: Next, we can simply define the evaluation and run it. + ```python validation_dataset = weave.ref( f"bigbenchhard_{metadata.big_bench_hard_task}_val:v0" @@ -211,6 +230,7 @@ Running the evaluation causal reasoning dataset will cost approximately $0.24 in Now, that we have a baseline DSPy program, let us try to improve its performance for causal reasoning using a [DSPy teleprompter](https://dspy-docs.vercel.app/docs/building-blocks/optimizers) that can tune the parameters of a DSPy program to maximize the specified metrics. In this tutorial, we use the [BootstrapFewShot](https://dspy-docs.vercel.app/api/category/optimizers) teleprompter. + ```python from dspy.teleprompt import BootstrapFewShot @@ -238,12 +258,13 @@ optimized_module = get_optimized_program(baseline_module, metadata) Running the evaluation causal reasoning dataset will cost approximately $0.04 in OpenAI credits. ::: -| ![](./assets/dspy_prompt_optimization/dspy_compile.png) | +| ![](../static/img/dspy_prompt_optimiztion/dspy_compile.png) | |---| | You can explore the traces of the optimization process in the Weave UI. | Now that we have our optimized program (the optimized prompting strategy), let's evaluate it once again on our validation set and compare it with our baseline DSPy program. 
+ ```python evaluation = weave.Evaluation( name="optimized_causal_reasoning_module", @@ -254,6 +275,8 @@ evaluation = weave.Evaluation( await evaluation.evaluate(optimized_module.forward) ``` -| ![](./assets/dspy_prompt_optimization/eval_comparison.gif) | +| ![](../static/img/dspy_prompt_optimiztion/eval_comparison.gif) | |---| | Comparing the evalution of the baseline program with the optimized one shows that the optimized program answers the causal reasoning questions with siginificantly more accuracy. | + + diff --git a/docs/docs/cookbooks/notebooks/dspy_prompt_optimization.ipynb b/docs/notebooks/dspy_prompt_optimization.ipynb similarity index 83% rename from docs/docs/cookbooks/notebooks/dspy_prompt_optimization.ipynb rename to docs/notebooks/dspy_prompt_optimization.ipynb index 8df15cf87977..976c60094a45 100644 --- a/docs/docs/cookbooks/notebooks/dspy_prompt_optimization.ipynb +++ b/docs/notebooks/dspy_prompt_optimization.ipynb @@ -4,13 +4,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Excelling at BIG-Bench Hard tasks Using DSPy and Weave\n", + "\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/weave/blob/master/docs/docs/cookbooks/notebooks/dspy_prompt_optimization.ipynb)\n", + "# Optimizing LLM Workflows Using DSPy and Weave\n", "\n", "The [BIG-bench (Beyond the Imitation Game Benchmark)](https://github.com/google/BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities consisting of more than 200 tasks. The [BIG-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard) is a suite of 23 most challenging BIG-Bench tasks that can be quite difficult to be solved using the current generation of language models.\n", "\n", - "This tutorial demonstrates how we can improve the performance of our LLM workflow implemented on the **causal judgement task** from the BIG-bench Hard benchmark and evaluate our prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) for implementing our LLM workflow and optimizing our prompting strategy. We would also use [Weave](../../introduction.md) to track our LLM workflow and evaluate our prompting strategies." + "This tutorial demonstrates how we can improve the performance of our LLM workflow implemented on the **causal judgement task** from the BIG-bench Hard benchmark and evaluate our prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) for implementing our LLM workflow and optimizing our prompting strategy. We will also use [Weave](../docs/introduction.md) to track our LLM workflow and evaluate our prompting strategies." ] }, { @@ -22,7 +27,7 @@ "We need the following libraries for this tutorial:\n", "\n", "- [DSPy](https://dspy-docs.vercel.app/) for building the LLM workflow and optimizing it.\n", - "- [Weave](../../introduction.md) to track our LLM workflow and evaluate our prompting strategies.\n", + "- [Weave](../introduction.md) to track our LLM workflow and evaluate our prompting strategies.\n", "- [datasets](https://huggingface.co/docs/datasets/index) to access the Big-Bench Hard dataset from HuggingFace Hub." ] }, @@ -61,7 +66,7 @@ "source": [ "## Enable Tracking using Weave\n", "\n", - "Weave is currently integrated with DSPy, and including [`weave.init`](../../api-reference/python/weave.md#function-init) at the start of our code lets us automatically trace our DSPy functions which can be explored in the Weave UI. 
Check out the [Weave integration docs for DSPy](../../guides/integrations/dspy.md) to learn more." + "Weave is currently integrated with DSPy, and including [`weave.init`](../docs/reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our DSPy functions which can be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../docs/guides/integrations/dspy.md) to learn more." ] }, { @@ -79,7 +84,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this tutorial, we use a metadata class inherited from [`weave.Model`](../../guides/core-types/models.md) to manage our metadata." + "In this tutorial, we use a metadata class inherited from [`weave.Model`](../docs/guides/core-types/models.md) to manage our metadata." ] }, { @@ -105,7 +110,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "| ![](../assets/dspy_prompt_optimization/metadata.gif) |\n", + "| ![](../static/img/dspy_prompt_optimiztion/metadata.gif) |\n", "|---|\n", "| The `Metadata` objects are automatically versioned and traced when functions consuming them are traced |" ] @@ -116,7 +121,7 @@ "source": [ "## Load the BIG-Bench Hard Dataset\n", "\n", - "We're gonna load this dataset from HuggingFace Hub, split into training and validation sets, and [publish](../../guides/core-types/datasets.md) them on Weave, this would let us version the datasets, and also use [`weave.Evaluation`](../../guides/core-types/evaluations.md) to evaluate our prompting strategy." + "We will load this dataset from HuggingFace Hub, split into training and validation sets, and [publish](../docs/guides/core-types/datasets.md) them on Weave, this will let us version the datasets, and also use [`weave.Evaluation`](../docs/guides/core-types/evaluations.md) to evaluate our prompting strategy." ] }, { @@ -157,7 +162,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "| ![](../assets/dspy_prompt_optimization/datasets.gif) |\n", + "| ![](../static/img/dspy_prompt_optimiztion/datasets.gif) |\n", "|---|\n", "| The datasets, once published, can be explored in the Weave UI |" ] @@ -170,7 +175,7 @@ "\n", "[DSPy](https://dspy-docs.vercel.app) is a framework that pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs) where a compiler automatically generates optimized LM invocation strategies and prompts from a program.\n", "\n", - "We're gonna use the [`dspy.OpenAI`](https://dspy-docs.vercel.app/api/language_model_clients/OpenAI) abstraction to make LLM calls to [GPT3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)." + "We will use the [`dspy.OpenAI`](https://dspy-docs.vercel.app/api/language_model_clients/OpenAI) abstraction to make LLM calls to [GPT3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)." ] }, { @@ -256,7 +261,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "| ![](../assets/dspy_prompt_optimization/dspy_module_trace.gif) |\n", + "| ![](../static/img/dspy_prompt_optimiztion/dspy_module_trace.gif) |\n", "|---|\n", "| Here's how you can explore the traces of the `CausalReasoningModule` in the Weave UI |" ] @@ -267,7 +272,7 @@ "source": [ "## Evaluating our DSPy Program\n", "\n", - "Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](./../guides/core-types/evaluations.md) on a simple metric that matches the predicted answer with the ground truth. 
Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.\n", + "Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](../docs/guides/core-types/evaluations.md) on a simple metric that matches the predicted answer with the ground truth. Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.\n", "\n", "First, we need to create a simple weave evaluation scoring function that tells whether the answer from the baseline module's output is the same as the ground truth answer or not. Scoring functions need to have a `model_output` keyword argument, but the other arguments are user defined and are taken from the dataset examples. It will only take the necessary keys by using a dictionary key based on the argument name." ] @@ -315,12 +320,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ + ":::note\n", "If you're running from a python script, you can use the following code to run the evaluation:\n", "\n", "```python\n", "import asyncio\n", "asyncio.run(evaluation.evaluate(baseline_module.forward))\n", - "```" + "```\n", + ":::\n", + "\n", + ":::warning\n", + "Running the evaluation causal reasoning dataset will cost approximately $0.24 in OpenAI credits.\n", + ":::" ] }, { @@ -364,15 +375,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "| ![](../assets/dspy_prompt_optimization/dspy_compile.png) |\n", + ":::warning\n", + "Running the evaluation causal reasoning dataset will cost approximately $0.04 in OpenAI credits.\n", + ":::\n", + "\n", + "| ![](../static/img/dspy_prompt_optimiztion/dspy_compile.png) |\n", "|---|\n", - "| You can explore the traces of the optimization process in the Weave UI. |" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "| You can explore the traces of the optimization process in the Weave UI. |\n", + "\n", "Now that we have our optimized program (the optimized prompting strategy), let's evaluate it once again on our validation set and compare it with our baseline DSPy program." ] }, @@ -395,10 +405,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "| ![](../assets/dspy_prompt_optimization/eval_comparison.gif) |\n", + "| ![](../static/img/dspy_prompt_optimiztion/eval_comparison.gif) |\n", "|---|\n", "| Comparing the evalution of the baseline program with the optimized one shows that the optimized program answers the causal reasoning questions with siginificantly more accuracy. 
|" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { diff --git a/docs/sidebars.ts b/docs/sidebars.ts index 4a89af42108a..787974ae378a 100644 --- a/docs/sidebars.ts +++ b/docs/sidebars.ts @@ -38,14 +38,6 @@ const sidebars: SidebarsConfig = { }, ], }, - { - type: "category", - label: "Weave Cookbooks", - collapsed: false, - items: [ - "cookbooks/dspy_prompt_optimization", - ], - }, { label: "Product Walkthrough", ...CATEGORY_SECTION_HEADER_MIXIN, diff --git a/docs/docs/cookbooks/assets/dspy_prompt_optimization/datasets.gif b/docs/static/img/dspy_prompt_optimiztion/datasets.gif similarity index 100% rename from docs/docs/cookbooks/assets/dspy_prompt_optimization/datasets.gif rename to docs/static/img/dspy_prompt_optimiztion/datasets.gif diff --git a/docs/docs/cookbooks/assets/dspy_prompt_optimization/dspy_compile.png b/docs/static/img/dspy_prompt_optimiztion/dspy_compile.png similarity index 100% rename from docs/docs/cookbooks/assets/dspy_prompt_optimization/dspy_compile.png rename to docs/static/img/dspy_prompt_optimiztion/dspy_compile.png diff --git a/docs/docs/cookbooks/assets/dspy_prompt_optimization/dspy_module_trace.gif b/docs/static/img/dspy_prompt_optimiztion/dspy_module_trace.gif similarity index 100% rename from docs/docs/cookbooks/assets/dspy_prompt_optimization/dspy_module_trace.gif rename to docs/static/img/dspy_prompt_optimiztion/dspy_module_trace.gif diff --git a/docs/docs/cookbooks/assets/dspy_prompt_optimization/eval_comparison.gif b/docs/static/img/dspy_prompt_optimiztion/eval_comparison.gif similarity index 100% rename from docs/docs/cookbooks/assets/dspy_prompt_optimization/eval_comparison.gif rename to docs/static/img/dspy_prompt_optimiztion/eval_comparison.gif diff --git a/docs/docs/cookbooks/assets/dspy_prompt_optimization/metadata.gif b/docs/static/img/dspy_prompt_optimiztion/metadata.gif similarity index 100% rename from docs/docs/cookbooks/assets/dspy_prompt_optimization/metadata.gif rename to docs/static/img/dspy_prompt_optimiztion/metadata.gif