diff --git a/docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md b/docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md new file mode 100644 index 00000000000..40e1c4e384d --- /dev/null +++ b/docs/docs/reference/gen_notebooks/dspy_prompt_optimization.md @@ -0,0 +1,290 @@ +--- +title: Optimizing LLM Workflows Using DSPy and Weave +--- + + +:::tip[This is a notebook] + +
Open In Colab
Open in Colab
+ +
View in Github
View in Github
+
+
+:::
+
+
+
+# Optimizing LLM Workflows Using DSPy and Weave
+
+The [BIG-bench (Beyond the Imitation Game Benchmark)](https://github.com/google/BIG-bench) is a collaborative benchmark of more than 200 tasks intended to probe large language models and extrapolate their future capabilities. [BIG-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard) is a suite of the 23 most challenging BIG-Bench tasks, which remain difficult for the current generation of language models.
+
+This tutorial demonstrates how to improve the performance of an LLM workflow on the **causal judgement task** from BIG-Bench Hard and how to evaluate different prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) to implement the LLM workflow and optimize the prompting strategy, and [Weave](../../introduction.md) to track the workflow and evaluate the prompting strategies.
+
+## Installing the Dependencies
+
+We need the following libraries for this tutorial:
+
+- [DSPy](https://dspy-docs.vercel.app/) for building the LLM workflow and optimizing it.
+- [Weave](../../introduction.md) to track our LLM workflow and evaluate our prompting strategies.
+- [datasets](https://huggingface.co/docs/datasets/index) to access the BIG-Bench Hard dataset from the HuggingFace Hub.
+
+
+```python
+!pip install -qU dspy-ai weave datasets
+```
+
+Since we'll be using the [OpenAI API](https://openai.com/index/openai-api/) as the LLM vendor, we will also need an OpenAI API key. You can [sign up](https://platform.openai.com/signup) on the OpenAI platform to get your own API key.
+
+
+```python
+import os
+from getpass import getpass
+
+api_key = getpass("Enter your OpenAI API key: ")
+os.environ["OPENAI_API_KEY"] = api_key
+```
+
+## Enable Tracking using Weave
+
+Weave is currently integrated with DSPy, and including [`weave.init`](../../reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our DSPy functions, which can then be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../../guides/integrations/dspy.md) to learn more.
+
+
+```python
+import weave
+
+weave.init(project_name="dspy-bigbench-hard")
+```
+
+In this tutorial, we use a metadata class inherited from [`weave.Model`](../../guides/core-types/models.md) to manage our metadata.
+
+
+```python
+class Metadata(weave.Model):
+    dataset_address: str = "maveriq/bigbenchhard"
+    big_bench_hard_task: str = "causal_judgement"
+    num_train_examples: int = 50
+    openai_model: str = "gpt-3.5-turbo"
+    openai_max_tokens: int = 2048
+    max_bootstrapped_demos: int = 8
+    max_labeled_demos: int = 8
+
+
+metadata = Metadata()
+```
+
+| ![](../static/img/dspy_prompt_optimiztion/metadata.gif) |
+|---|
+| The `Metadata` objects are automatically versioned and traced when functions consuming them are traced |
+
+## Load the BIG-Bench Hard Dataset
+
+We will load this dataset from the HuggingFace Hub, split it into training and validation sets, and [publish](../../guides/core-types/datasets.md) them on Weave. This lets us version the datasets and use [`weave.Evaluation`](../../guides/core-types/evaluations.md) to evaluate our prompting strategy.
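+
+In the function that follows, each row is wrapped in a `dspy.Example`, and calling `.with_inputs("question")` marks `question` as the input field so that the remaining field (`answer`) is treated as a label. Here is a minimal, standalone sketch of that idea (the question and answer below are made up for illustration):
+
+```python
+import dspy
+
+# A hypothetical example object: `question` is the input, `answer` is the label.
+example = dspy.Example(
+    question="Did turning on the sprinkler cause the grass to be wet?",
+    answer="Yes",
+).with_inputs("question")
+
+# `inputs()` returns a view of the example containing only the input fields.
+print(example.inputs())
+```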
+
+
+```python
+import dspy
+from datasets import load_dataset
+
+
+@weave.op()
+def get_dataset(metadata: Metadata):
+    # load the subset of BIG-Bench Hard corresponding to the task from the HuggingFace Hub
+    dataset = load_dataset(metadata.dataset_address, metadata.big_bench_hard_task)[
+        "train"
+    ]
+
+    # create the training and validation datasets
+    rows = [{"question": data["input"], "answer": data["target"]} for data in dataset]
+    train_rows = rows[0 : metadata.num_train_examples]
+    val_rows = rows[metadata.num_train_examples :]
+
+    # create the training and validation examples consisting of `dspy.Example` objects
+    dspy_train_examples = [
+        dspy.Example(row).with_inputs("question") for row in train_rows
+    ]
+    dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows]
+
+    # publish the datasets to Weave; this lets us version the data and use it for evaluation
+    weave.publish(
+        weave.Dataset(
+            name=f"bigbenchhard_{metadata.big_bench_hard_task}_train", rows=train_rows
+        )
+    )
+    weave.publish(
+        weave.Dataset(
+            name=f"bigbenchhard_{metadata.big_bench_hard_task}_val", rows=val_rows
+        )
+    )
+
+    return dspy_train_examples, dspy_val_examples
+
+
+dspy_train_examples, dspy_val_examples = get_dataset(metadata)
+```
+
+| ![](../static/img/dspy_prompt_optimiztion/datasets.gif) |
+|---|
+| The datasets, once published, can be explored in the Weave UI |
+
+## The DSPy Program
+
+[DSPy](https://dspy-docs.vercel.app) is a framework that pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs), where a compiler automatically generates optimized LM invocation strategies and prompts from a program.
+
+We will use the [`dspy.OpenAI`](https://dspy-docs.vercel.app/api/language_model_clients/OpenAI) abstraction to make LLM calls to [GPT-3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo).
+
+
+```python
+system_prompt = """
+You are an expert in the field of causal reasoning. You are to analyze a given question carefully and answer in `Yes` or `No`.
+You should also provide a detailed explanation justifying your answer.
+"""
+
+llm = dspy.OpenAI(model="gpt-3.5-turbo", system_prompt=system_prompt)
+dspy.settings.configure(lm=llm)
+```
+
+### Writing the Causal Reasoning Signature
+
+A [signature](https://dspy-docs.vercel.app/docs/building-blocks/signatures) is a declarative specification of the input/output behavior of a [DSPy module](https://dspy-docs.vercel.app/docs/building-blocks/modules). Modules are task-adaptive components, akin to neural network layers, that abstract any particular text transformation. The simplest signatures are plain strings such as `"question -> answer"` (a minimal sketch follows); in this tutorial, we instead define a typed signature using Pydantic models.
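+
+To make the signature idea concrete, here is a minimal sketch using DSPy's string shorthand (the question below is made up for illustration, and running it issues a call to the configured LLM):
+
+```python
+# A string signature: DSPy infers an input field `question` and an output
+# field `answer` from the shorthand "question -> answer".
+simple_predictor = dspy.Predict("question -> answer")
+
+prediction = simple_predictor(
+    question="Did the short circuit cause the warehouse fire?"
+)
+print(prediction.answer)
+```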
+
+
+```python
+from pydantic import BaseModel, Field
+
+
+class Input(BaseModel):
+    query: str = Field(description="The question to be answered")
+
+
+class Output(BaseModel):
+    answer: str = Field(description="The answer for the question")
+    confidence: float = Field(
+        ge=0, le=1, description="The confidence score for the answer"
+    )
+    explanation: str = Field(description="The explanation for the answer")
+
+
+class QuestionAnswerSignature(dspy.Signature):
+    input: Input = dspy.InputField()
+    output: Output = dspy.OutputField()
+
+
+class CausalReasoningModule(dspy.Module):
+    def __init__(self):
+        super().__init__()
+        self.prog = dspy.TypedPredictor(QuestionAnswerSignature)
+
+    @weave.op()
+    def forward(self, question) -> dict:
+        return self.prog(input=Input(query=question)).output.dict()
+```
+
+Let's test our LLM workflow, i.e., the `CausalReasoningModule`, on an example from the causal reasoning subset of BIG-Bench Hard.
+
+
+```python
+import rich
+
+baseline_module = CausalReasoningModule()
+
+prediction = baseline_module(dspy_train_examples[0]["question"])
+rich.print(prediction)
+```
+
+| ![](../static/img/dspy_prompt_optimiztion/dspy_module_trace.gif) |
+|---|
+| Here's how you can explore the traces of the `CausalReasoningModule` in the Weave UI |
+
+## Evaluating our DSPy Program
+
+Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](../../guides/core-types/evaluations.md) with a simple metric that matches the predicted answer against the ground truth. Weave will take each example, pass it through your application, and score the output with multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.
+
+First, we create a simple Weave evaluation scoring function that checks whether the answer in the baseline module's output matches the ground-truth answer. Scoring functions need to have a `model_output` keyword argument; the other arguments are user-defined and are taken from the dataset examples, matched by argument name.
+
+
+```python
+@weave.op()
+def weave_evaluation_scorer(answer: str, model_output: dict) -> dict:
+    return {"match": int(answer.lower() == model_output["answer"].lower())}
+```
+
+Next, we can simply define the evaluation and run it.
+
+
+```python
+validation_dataset = weave.ref(
+    f"bigbenchhard_{metadata.big_bench_hard_task}_val:v0"
+).get()
+
+evaluation = weave.Evaluation(
+    name="baseline_causal_reasoning_module",
+    dataset=validation_dataset,
+    scorers=[weave_evaluation_scorer],
+)
+
+await evaluation.evaluate(baseline_module.forward)
+```
+
+:::note
+If you're running from a Python script, you can use the following code to run the evaluation:
+
+```python
+import asyncio
+asyncio.run(evaluation.evaluate(baseline_module.forward))
+```
+:::
+
+:::warning
+Running the evaluation on the causal reasoning dataset will cost approximately $0.24 in OpenAI credits.
+:::
+
+## Optimizing our DSPy Program
+
+Now that we have a baseline DSPy program, let us try to improve its performance on causal reasoning using a [DSPy teleprompter](https://dspy-docs.vercel.app/docs/building-blocks/optimizers), which can tune the parameters of a DSPy program to maximize a specified metric. In this tutorial, we use the [BootstrapFewShot](https://dspy-docs.vercel.app/api/category/optimizers) teleprompter.
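+
+Before running the optimizer, you can optionally inspect the prompt and completion of the most recent call that DSPy sent to the LLM. `dspy.OpenAI` keeps a history of calls, so this is a quick sanity check of what the baseline program is doing under the hood:
+
+```python
+# Print the prompt and completion of the last LLM call made through DSPy.
+llm.inspect_history(n=1)
+```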
+
+
+```python
+from dspy.teleprompt import BootstrapFewShot
+
+
+@weave.op()
+def get_optimized_program(model: dspy.Module, metadata: Metadata) -> dspy.Module:
+    @weave.op()
+    def dspy_evaluation_metric(true, prediction, trace=None):
+        return prediction["answer"].lower() == true.answer.lower()
+
+    teleprompter = BootstrapFewShot(
+        metric=dspy_evaluation_metric,
+        max_bootstrapped_demos=metadata.max_bootstrapped_demos,
+        max_labeled_demos=metadata.max_labeled_demos,
+    )
+    return teleprompter.compile(model, trainset=dspy_train_examples)
+
+
+optimized_module = get_optimized_program(baseline_module, metadata)
+```
+
+:::warning
+Running this optimization will cost approximately $0.04 in OpenAI credits.
+:::
+
+| ![](../static/img/dspy_prompt_optimiztion/dspy_compile.png) |
+|---|
+| You can explore the traces of the optimization process in the Weave UI. |
+
+Now that we have our optimized program (the optimized prompting strategy), let's evaluate it once again on our validation set and compare it with our baseline DSPy program.
+
+
+```python
+evaluation = weave.Evaluation(
+    name="optimized_causal_reasoning_module",
+    dataset=validation_dataset,
+    scorers=[weave_evaluation_scorer],
+)
+
+await evaluation.evaluate(optimized_module.forward)
+```
+
+| ![](../static/img/dspy_prompt_optimiztion/eval_comparison.gif) |
+|---|
+| Comparing the evaluation of the baseline program with the optimized one shows that the optimized program answers the causal reasoning questions with significantly more accuracy. |
+
+
diff --git a/docs/docs/reference/gen_notebooks/parse_arxiv_papers.md b/docs/docs/reference/gen_notebooks/parse_arxiv_papers.md
new file mode 100644
index 00000000000..ac31843e80f
--- /dev/null
+++ b/docs/docs/reference/gen_notebooks/parse_arxiv_papers.md
@@ -0,0 +1,297 @@
+
+
+:::tip[This is a notebook]
+
Open In Colab
Open in Colab
+ +
View in Github
View in Github
+
+
+:::
+
+
+# Extracting Structured Data from Documents using Instructor and Weave
+
+LLMs are widely used in downstream applications, which necessitates that their outputs be structured in a consistent manner. This often requires LLM-powered applications to parse unstructured documents (such as PDF files) and extract specific information structured according to a specific schema.
+
+In this tutorial, you will learn how to extract specific information from machine learning papers (such as key findings, novel methodologies, research directions, etc.). We will use [Instructor](https://python.useinstructor.com/) to get structured output from an OpenAI [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model in the form of [Pydantic objects](https://docs.pydantic.dev/latest/concepts/models/). We will also use [Weave](../../introduction.md) to track and evaluate our LLM workflow.
+
+## Installing the Dependencies
+
+We need the following libraries for this tutorial:
+
+- [Instructor](https://python.useinstructor.com/) to easily get structured output from LLMs.
+- [OpenAI](https://openai.com/index/openai-api/) as our LLM vendor.
+- [Weave](../../introduction.md) to track our LLM workflow and evaluate our prompting strategies.
+
+
+```python
+!pip install -qU pymupdf4llm instructor openai weave wget
+```
+
+Since we'll be using the [OpenAI API](https://openai.com/index/openai-api/) as the LLM vendor, we will also need an OpenAI API key. You can [sign up](https://platform.openai.com/signup) on the OpenAI platform to get your own API key.
+
+
+```python
+import os
+from getpass import getpass
+
+api_key = getpass("Enter your OpenAI API key: ")
+os.environ["OPENAI_API_KEY"] = api_key
+
+# limit Weave's evaluation parallelism so rows are processed one at a time
+os.environ["WEAVE_PARALLELISM"] = "1"
+```
+
+## Enable Tracking using Weave
+
+Weave is currently integrated with OpenAI, and including [`weave.init`](../../reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our OpenAI chat completions, which can then be explored in the Weave UI. Check out the [Weave integration docs for OpenAI](../../guides/integrations/openai.md) to learn more.
+
+
+```python
+import weave
+
+weave.init(project_name="arxiv-data-extraction")
+```
+
+## Structured Data Extraction Workflow
+
+In order to extract the required structured data from a machine learning paper using GPT-4o and instructor, let's first define our schema as a [Pydantic model](https://docs.pydantic.dev/latest/concepts/models/) outlining the exact information that we need from a paper.
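+
+Since instructor works by passing the JSON schema of the Pydantic model to the LLM, each field can optionally carry a `description` via `pydantic.Field` to further steer the extraction. Here is a small, hypothetical sketch of that pattern (the `AuthorInfo` model is made up for illustration; the actual schema used in this tutorial follows below and relies on field names and comments instead):
+
+```python
+from pydantic import BaseModel, Field
+
+
+class AuthorInfo(BaseModel):
+    # The `description` is included in the JSON schema that instructor sends to
+    # the model, which helps steer what gets extracted for each field.
+    name: str = Field(description="Full name of the first author")
+    affiliation: str = Field(description="Institution of the first author")
+```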
+
+
+```python
+from typing import List, Optional
+
+from pydantic import BaseModel
+
+
+class Finding(BaseModel):
+    finding_name: str
+    explanation: str
+
+
+class Method(BaseModel):
+    method_name: str
+    explanation: str
+    citation: Optional[str]
+
+
+class Evaluation(BaseModel):
+    metric: str
+    benchmark: str
+    value: float
+    observation: str
+
+
+class PaperInfo(BaseModel):
+    main_findings: List[Finding]  # The main findings of the paper
+    novel_methods: List[Method]  # The novel methods proposed in the paper
+    existing_methods: List[Method]  # The existing methods used in the paper
+    machine_learning_techniques: List[
+        Method
+    ]  # The machine learning techniques used in the paper
+    metrics: List[Evaluation]  # The evaluation metrics used in the paper
+    github_repository: (
+        str  # The link to the GitHub repository of the paper (if there is any)
+    )
+    hardware: str  # The hardware or accelerator setup used in the paper
+    further_research: List[
+        str
+    ]  # The further research directions suggested in the paper
+```
+
+Next, we write a detailed system prompt that serves as a set of instructions providing context and guidelines to help the model perform the required task.
+
+First, we ask the model to play the role of a "helpful assistant to a machine learning researcher who is reading a paper from arXiv", thus establishing the basic context of the task. Next, we describe all the information it needs to extract from the paper, in accordance with the `PaperInfo` schema.
+
+
+```python
+system_prompt = """
+You are a helpful assistant to a machine learning researcher who is reading a paper from arXiv.
+You are to extract the following information from the paper:
+
+- a list of the main findings from the paper and their corresponding detailed explanations
+- the list of names of the different novel methods proposed in the paper and their corresponding detailed explanations
+- the list of names of the different existing methods used in the paper, their corresponding detailed explanations, and
+  their citations
+- the list of machine learning techniques used in the paper, such as architectures, optimizers, schedulers, etc., their
+  corresponding detailed explanations, and their citations
+- the list of evaluation metrics used in the paper, the benchmark datasets used, the values of the metrics, and their
+  corresponding detailed observation in the paper
+- the link to the GitHub repository of the paper if there is any
+- the hardware or accelerators used to perform the experiments in the paper if any
+- a list of possible further research directions that the paper suggests
+"""
+```
+
+:::note
+You can also check out OpenAI's [Prompt engineering guide](https://platform.openai.com/docs/guides/prompt-engineering) for more details on writing good prompts for models like GPT-4o.
+:::
+
+Next, we patch the OpenAI client to return structured outputs.
+
+
+```python
+import instructor
+from openai import OpenAI
+
+openai_client = OpenAI()
+structured_client = instructor.from_openai(openai_client)
+```
+
+Finally, we write our LLM execution workflow as a [Weave Model](../../guides/core-types/models.md), combining the configuration associated with the workflow and the code that defines how the model operates into a single object that is tracked and versioned by Weave.
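+
+As an optional sanity check before defining the full model, you can try the patched client on a trivial schema. The snippet below is a hypothetical sketch (the `Headline` model and the prompt are made up for illustration, and running it makes a single OpenAI call):
+
+```python
+from pydantic import BaseModel
+
+
+class Headline(BaseModel):
+    headline: str
+
+
+# The patched client accepts a `response_model` and returns a validated
+# Pydantic object instead of a raw chat completion.
+response = structured_client.chat.completions.create(
+    model="gpt-4o",
+    response_model=Headline,
+    messages=[
+        {"role": "user", "content": "Write a one-line headline about transformers."}
+    ],
+)
+print(response.headline)
+```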
+
+
+```python
+from io import BytesIO
+
+import pymupdf
+import pymupdf4llm
+import requests
+
+
+class ArxivModel(weave.Model):
+    model: str
+    system_prompt: str
+    max_retries: int = 5  # instructor re-asks the model up to this many times if validation fails
+    seed: int = 42
+
+    @weave.op()
+    def get_markdown_from_arxiv(self, url):
+        response = requests.get(url)
+        with pymupdf.open(stream=BytesIO(response.content), filetype="pdf") as doc:
+            return pymupdf4llm.to_markdown(doc)
+
+    @weave.op()
+    def predict(self, url_pdf: str) -> dict:
+        md_text = self.get_markdown_from_arxiv(url_pdf)
+        return structured_client.chat.completions.create(
+            model=self.model,
+            response_model=PaperInfo,
+            max_retries=self.max_retries,
+            seed=self.seed,
+            messages=[
+                {"role": "system", "content": self.system_prompt},
+                {"role": "user", "content": md_text},
+            ],
+        ).model_dump()
+```
+
+
+```python
+import rich
+
+arxiv_parser_model = ArxivModel(model="gpt-4o", system_prompt=system_prompt)
+
+result = arxiv_parser_model.predict(url_pdf="http://arxiv.org/pdf/1711.06288v2.pdf")
+rich.print(result)
+```
+
+:::warning
+Executing this LLM workflow will cost approximately $0.05-$0.25 in OpenAI credits, depending on the number of attempts instructor needs to make to get the output in the desired format (the maximum is set to 5).
+:::
+
+| ![](../../../static/img/instructor_arxiv_paper_parser/arxiv_trace.gif) |
+|---|
+| Here's how you can explore the traces of the `ArxivModel` in the Weave UI |
+
+## Evaluating the Prompting Workflow
+
+Let us now evaluate how accurately our LLM workflow is able to extract the methods from the paper using [Weave Evaluation](../../guides/core-types/evaluations.md). For this, we will write a simple scoring function that compares the list of novel methods, existing methods, and ML techniques predicted by the prompting workflow against a ground-truth list of methods associated with the paper to compute an accuracy score.
+
+
+```python
+@weave.op()
+def arxiv_method_score(
+    method: List[dict], model_output: Optional[dict]
+) -> dict[str, float]:
+    if model_output is None:
+        return {"method_prediction_accuracy": 0.0}
+    predicted_methods = (
+        model_output["novel_methods"]
+        + model_output["existing_methods"]
+        + model_output["machine_learning_techniques"]
+    )
+    # guard against division by zero when the model predicts no methods at all
+    if len(predicted_methods) == 0:
+        return {"method_prediction_accuracy": 0.0}
+    num_correct_methods = 0
+    for gt_method in method:
+        for predicted_method in predicted_methods:
+            predicted_method = (
+                f"{predicted_method['method_name']}\n{predicted_method['explanation']}"
+            )
+            if (
+                gt_method["name"].lower() in predicted_method.lower()
+                or gt_method["full_name"].lower() in predicted_method.lower()
+            ):
+                num_correct_methods += 1
+    return {"method_prediction_accuracy": num_correct_methods / len(predicted_methods)}
+```
+
+For this tutorial, we will use a dataset of more than 6000 machine learning research papers and their corresponding metadata created using the [paperswithcode client](https://paperswithcode-client.readthedocs.io/en/latest/) (check [this gist](https://gist.github.com/soumik12345/996c2ea538f6ff5b3747078ba557ece4) for reference). The dataset is stored as a [Weave Dataset](../../guides/core-types/datasets.md), which you can explore [here](https://wandb.ai/geekyrakshit/arxiv-data-extraction/weave/objects/cv-papers/versions/7wICKJjt3YyqL3ssICHi08v3swAGSUtD7TF4PVRJ0yc).
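+
+Based on the scorer above, each row of the evaluation dataset is expected to provide a `url_pdf` field (passed to `ArxivModel.predict`) and a `method` list whose entries carry `name` and `full_name` keys. A hypothetical row might look like the following sketch (the values are made up for illustration):
+
+```python
+example_row = {
+    "url_pdf": "http://arxiv.org/pdf/1234.56789v1.pdf",  # hypothetical paper URL
+    "method": [
+        {"name": "ResNet", "full_name": "Residual Network"},
+        {"name": "SGD", "full_name": "Stochastic Gradient Descent"},
+    ],
+}
+```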
+
+
+```python
+WEAVE_DATASET_REFERENCE = "weave:///geekyrakshit/arxiv-data-extraction/object/cv-papers:7wICKJjt3YyqL3ssICHi08v3swAGSUtD7TF4PVRJ0yc"
+eval_dataset = weave.ref(WEAVE_DATASET_REFERENCE).get()
+
+rich.print(f"{len(eval_dataset.rows)=}")
+```
+
+Now, we can evaluate our LLM workflow using [Weave Evaluations](../../guides/core-types/evaluations.md), which will take each example, pass it through your application, and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.
+
+
+```python
+evaluation = weave.Evaluation(
+    name="baseline_workflow_evaluation",
+    dataset=eval_dataset.rows[:5],
+    scorers=[arxiv_method_score],
+)
+await evaluation.evaluate(arxiv_parser_model)
+```
+
+:::warning
+Running the evaluation on 5 examples from the evaluation dataset will cost approximately $0.25-$1.25 in OpenAI credits, depending on the number of attempts instructor needs to make for each example to get the output in the desired format (the maximum is set to 5).
+:::
+
+## Improving the LLM Workflow
+
+Let us try to improve the LLM workflow by adding some more instructions to our system prompt. We will provide the model with a set of rules, which act as clues guiding the model to look for specific types of information in the document.
+
+
+```python
+system_prompt += """
+Here are some rules to follow:
+1. When looking for the main findings in the paper, you should look for the abstract.
+2. When looking for the explanations for the main findings, you should look for the introduction and methods section of
+   the paper.
+3. When looking for the list of existing methods used in the paper, first look at the citations, and then try explaining
+   how they were used in the paper.
+4. When looking for the list of machine learning methods used in the paper, first look at the citations, and then try
+   explaining how they were used in the paper.
+5. When looking for the evaluation metrics used in the paper, first look at the results section of the paper, and then
+   try explaining the observations made from the results. Pay special attention to the tables to find the metrics,
+   their values, the corresponding benchmark, and the observation associated with the result.
+6. If there are no GitHub repositories associated with the paper, simply return "None".
+7. When looking for hardware and accelerators, pay special attention to the quantity of each type of hardware and
+   accelerator. If there are no hardware or accelerators used in the paper, simply return "None".
+8. When looking for further research directions, look for the conclusion section of the paper.
+"""
+
+improved_arxiv_parser_model = ArxivModel(model="gpt-4o", system_prompt=system_prompt)
+```
+
+We will now evaluate this improved workflow and check whether the accuracy has increased.
+
+
+```python
+evaluation = weave.Evaluation(
+    name="improved_workflow_evaluation",
+    dataset=eval_dataset.rows[:5],
+    scorers=[arxiv_method_score],
+)
+await evaluation.evaluate(improved_arxiv_parser_model)
+```
+
+:::warning
+Running the evaluation on 5 examples from the evaluation dataset will cost approximately $0.25-$1.25 in OpenAI credits, depending on the number of attempts instructor needs to make for each example to get the output in the desired format (the maximum is set to 5).
+:::
+
+| ![](../../../static/img/instructor_arxiv_paper_parser/arxiv_eval.gif) |
+|---|
+| Here's how you can explore and compare the evaluation traces in the Weave UI |
diff --git a/docs/notebooks/parse_arxiv_papers.ipynb b/docs/notebooks/parse_arxiv_papers.ipynb
index e7192abcf44..cd726166c51 100644
--- a/docs/notebooks/parse_arxiv_papers.ipynb
+++ b/docs/notebooks/parse_arxiv_papers.ipynb
@@ -70,7 +70,7 @@
    "source": [
     "## Enable Tracking using Weave\n",
     "\n",
-    "Weave is currently integrated with DSPy, and including [`weave.init`](../docs/reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our DSPy functions which can be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../docs/guides/integrations/dspy.md) to learn more."
+    "Weave is currently integrated with OpenAI, and including [`weave.init`](../docs/reference/python-sdk/weave/index.md) at the start of our code lets us automatically trace our OpenAI chat completions which can be explored in the Weave UI. Check out the [Weave integration docs for OpenAI](../docs/guides/integrations/openai.md) to learn more."
    ]
   },
   {