feat(PHS-880): enable benchmark executions
maxhammeralephalpha committed Nov 27, 2024
1 parent 467a79e commit 98e2876
Showing 18 changed files with 955 additions and 114 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -2,7 +2,12 @@
## Unreleased

### Features
...
- Introduce `Benchmark` and `StudioBenchmark` (see the usage sketch below)
- Add `how_to_execute_a_benchmark.ipynb` to how-tos
- Add `studio.ipynb` to notebooks to show how one can debug a `Task` with Studio
- Introduce `BenchmarkRepository` and `StudioBenchmarkRepository`
- Add `create_project` bool to `StudioClient.__init__()` to enable users to automatically create their Studio projects
- Add a progress bar to the `Runner` to make it possible to track the progress of a `Run`
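
  A minimal usage sketch of the new benchmark workflow, mirroring the `evaluate_with_studio.ipynb` notebook added in this commit; the project name, dataset ID, and run names below are illustrative placeholders:

  ```python
  from intelligence_layer.connectors.studio.studio import StudioClient
  from intelligence_layer.evaluation.benchmark.studio_benchmark import (
      StudioBenchmarkRepository,
  )
  from intelligence_layer.examples import PromptBasedClassify
  from intelligence_layer.examples.classify.classify import (
      SingleLabelClassifyAggregationLogic,
      SingleLabelClassifyEvaluationLogic,
  )

  # create_project=True lets the client create the Studio project if it does not exist yet
  studio_client = StudioClient(project="My Project", create_project=True)
  studio_benchmark_repository = StudioBenchmarkRepository(studio_client)

  # "existing-dataset-id" stands in for the ID of a dataset already uploaded to Studio
  benchmark = studio_benchmark_repository.create_benchmark(
      "existing-dataset-id",
      SingleLabelClassifyEvaluationLogic(),
      SingleLabelClassifyAggregationLogic(),
      "My Classify Benchmark",  # benchmark names need to be unique
  )
  benchmark.execute(PromptBasedClassify(), "My first benchmark run")
  ```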

### Fixes
...
10 changes: 8 additions & 2 deletions README.md
@@ -139,6 +139,7 @@ To use an **on-premises setup**, set the `CLIENT_URL` variable to your host URL.
| 11 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/documentation/performance_tips.ipynb) |
| 12 | Deployment | Shows how to deploy a Task in a minimal FastAPI app. | [fastapi_tutorial.ipynb](./src/documentation/fastapi_tutorial.ipynb) |
| 13 | Issue Classification | Deploy a Task in Kubernetes to classify Jira issues | [Found in adjacent repository](https://github.com/Aleph-Alpha/IL-Classification-Journey) |
| 14 | Evaluate with Studio | Shows how to evaluate your `Task` using Studio | [evaluate_with_studio.ipynb](./src/documentation/evaluate_with_studio.ipynb) |

## How-Tos

@@ -150,18 +151,23 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
| [...define a task](./src/documentation/how_tos/how_to_define_a_task.ipynb) | How to come up with a new task and formulate it |
| [...implement a task](./src/documentation/how_tos/how_to_implement_a_task.ipynb) | Implement a formulated task and make it run with the Intelligence Layer |
| [...debug and log a task](./src/documentation/how_tos/how_to_log_and_debug_a_task.ipynb) | Tools for logging and debugging in tasks |
| [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| **Analysis Pipeline** | |
| [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic |
| [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb) | Create a dataset used for running a task |
| [...run a task on a dataset](./src/documentation/how_tos/how_to_run_a_task_on_a_dataset.ipynb) | Run a task on a whole dataset instead of single examples |
| [...resume a run after a crash](./src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb) | Resume a run after a crash or exception occurred |
| [...evaluate multiple runs](./src/documentation/how_tos/how_to_evaluate_runs.ipynb) | Evaluate (multiple) runs in a single evaluation |
| [...aggregate multiple evaluations](./src/documentation/how_tos/how_to_aggregate_evaluations.ipynb) | Aggregate (multiple) evaluations in a single aggregation |
| [...retrieve data for analysis](./src/documentation/how_tos/how_to_retrieve_data_for_analysis.ipynb) | Retrieve experiment data in multiple different ways |
| [...implement a custom human evaluation](./src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb) | Necessary steps to create an evaluation with humans as judges via Argilla |
| [...implement elo evaluations](./src/documentation/how_tos/how_to_implement_elo_evaluations.ipynb) | Evaluate runs and create an Elo ranking for them |
| [...implement incremental evaluation](./src/documentation/how_tos/how_to_implement_incremental_evaluation.ipynb) | Implement and run an incremental evaluation |
| **Studio** | |
| [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| [...upload existing datasets](./src/documentation/how_tos/studio/how_to_upload_existing_datasets_to_studio.ipynb) | Upload Datasets to Studio |
| [...execute a benchmark](./src/documentation/how_tos/studio/how_to_execute_a_benchmark.ipynb) | Execute a benchmark |

# Models

315 changes: 315 additions & 0 deletions src/documentation/evaluate_with_studio.ipynb
@@ -0,0 +1,315 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"from intelligence_layer.connectors.studio.studio import StudioClient\n",
"from intelligence_layer.core import TextChunk\n",
"from intelligence_layer.core.model import Llama3InstructModel\n",
"from intelligence_layer.evaluation.benchmark.studio_benchmark import (\n",
" StudioBenchmarkRepository,\n",
")\n",
"from intelligence_layer.evaluation.dataset.domain import Example\n",
"from intelligence_layer.evaluation.dataset.studio_dataset_repository import (\n",
" StudioDatasetRepository,\n",
")\n",
"from intelligence_layer.examples import (\n",
" ClassifyInput,\n",
" PromptBasedClassify,\n",
")\n",
"from intelligence_layer.examples.classify.classify import (\n",
" SingleLabelClassifyAggregationLogic,\n",
" SingleLabelClassifyEvaluationLogic,\n",
")\n",
"from intelligence_layer.examples.classify.prompt_based_classify_with_definitions import (\n",
" LabelWithDefinition,\n",
" PromptBasedClassifyWithDefinitions,\n",
")\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate with Studio\n",
"\n",
"This notebook shows how you can debug a `Task` using Studio. This notebook focuses on the `PromptBasedClassify` for demonstration purposes.\n",
"\n",
"First, we need to instantiate the `StudioClient`. We can either pass an existing project or let the `StudioClient` create it by setting the `create_project` flag to `True.`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"studio_client = StudioClient(project=\"Classify with Studio\", create_project=True)\n",
"studio_dataset_repository = StudioDatasetRepository(studio_client)\n",
"studio_benchmark_repository = StudioBenchmarkRepository(studio_client)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will create our evaluation dataset from some pre-defined dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with Path(\"data/classify_examples.json\").open() as json_data:\n",
" data = json.load(json_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to transform our dataset into the required format. \n",
"Therefore, let's check out what it looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This isn't quite yet the format we need, therefore we translate it into the interface of our `Example`.\n",
"\n",
"This is the target structure:\n",
"\n",
"``` python\n",
"class Example(BaseModel, Generic[Input, ExpectedOutput]):\n",
" input: Input\n",
" expected_output: ExpectedOutput\n",
" id: Optional[str] = Field(default_factory=lambda: str(uuid4()))\n",
" metadata: Optional[SerializableDict]\n",
"```\n",
"\n",
"We want the `input` in each `Example` to contain the input of an actual task.\n",
"The `expected_output` shall correspond to anything we wish to compare our generated output to (i.e., the expected label in our case)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_labels = list(set(item[\"label\"] for item in data))\n",
"dataset = studio_dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(item[\"message\"]), labels=all_labels),\n",
" expected_output=item[\"label\"],\n",
" )\n",
" for item in data\n",
" ],\n",
" dataset_name=\"Single Label Classify Dataset\",\n",
")\n",
"print(f\"Dataset ID: {dataset.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also automatically uploads the created dataset to you **Studio** instance.\n",
"We can inspect the dataset and the individual examples in **Studio** under **Evaluate/Datasets**. Do not forget to select the correct project!\n",
"\n",
"After we have checked our `Dataset`, we can create our first `Benchmark`. To this end, we need the `EvaluationLogic` and the `AggregationLogic` of our Classify use-case. After creating the `Benchmark`, make sure to copy the ID of the `Benchmark` into the `get_benchmark` method, so you don't have to create the `Benchmark` again every time you run the evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"import string\n",
"\n",
"evaluation_logic = SingleLabelClassifyEvaluationLogic()\n",
"aggregation_logic = SingleLabelClassifyAggregationLogic()\n",
"\n",
"rand_str = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(16))\n",
"\n",
"benchmark = studio_benchmark_repository.create_benchmark(\n",
" dataset.id,\n",
" evaluation_logic,\n",
" aggregation_logic,\n",
" f\"Single Label Classify Benchmark {rand_str}\", # Benchmark names need to be unique, therefore we add a random string to the name\n",
")\n",
"print(f\"Benchmark ID: {benchmark.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## If you want to run the notebook multiple times use this block instead of the above one and past your benchmark ID\n",
"# evaluation_logic = SingleLabelClassifyEvaluationLogic()\n",
"# aggregation_logic = SingleLabelClassifyAggregationLogic()\n",
"#\n",
"# benchmark = studio_benchmark_repository.get_benchmark(\n",
"# *Your Benchmark ID here* ,\n",
"# evaluation_logic,\n",
"# aggregation_logic,\n",
"# )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this, we are ready to `execute` our first `Benchmark`. We pass it a meaningful name and execute it. After about two minutes we can take a look at the results in **Studio** in the **Evaluate/Benchmarks** section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"benchmark.execute(PromptBasedClassify(), \"Classify v0.0 with Luminous\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to improve our results and run this again using a `Llama` model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"benchmark.execute(\n",
" PromptBasedClassify(model=Llama3InstructModel(\"llama-3.1-8b-instruct\")),\n",
" \"Classify v0.1 with Llama\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For further comparisons we also `execute` the `PromptBasedClassifyWithDefinitions` task on the same `Benchmark`. This is possible because both `Task` have the exact same input and output format and can thus be compared to each other."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels_with_definitions = [\n",
" LabelWithDefinition(\n",
" name=\"Finance\",\n",
" definition=\"Handles reimbursements, salary payments, and financial planning.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Sales\",\n",
" definition=\"Manages client inquiries, builds relationships, and drives revenue.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Communications\",\n",
" definition=\"Oversees media inquiries, partnerships, and public documentation.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Research\",\n",
" definition=\"Collaborates on innovative projects and explores market applications.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"IT Support\",\n",
" definition=\"Provides technical assistance for devices and platform access issues.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Human Resources\",\n",
" definition=\"Manages onboarding, leave requests, and career development.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Product\",\n",
" definition=\"Addresses customer issues, ensures compliance, and demonstrates product use.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Customer\",\n",
" definition=\"Schedules meetings and ensures customer needs are effectively met.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Security\",\n",
" definition=\"Maintains physical and digital safety, including badge and certificate issues.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Marketing\",\n",
" definition=\"Manages strategic initiatives and promotes the company's offerings.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"CEO Office\",\n",
" definition=\"Handles executive engagements and key stakeholder follow-ups.\",\n",
" ),\n",
"]\n",
"\n",
"classify_with_definitions = PromptBasedClassifyWithDefinitions(\n",
" labels_with_definitions=labels_with_definitions,\n",
" model=Llama3InstructModel(\"llama-3.1-8b-instruct\"),\n",
")\n",
"benchmark.execute(classify_with_definitions, \"Classify v1.0 with definitions and Llama\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intelligence-layer-ZqHLMTHE-py3.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}