feat(PHS-880): enable benchmark executions
maxhammeralephalpha committed Nov 27, 2024
1 parent 467a79e commit 98e2876
Showing 18 changed files with 955 additions and 114 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -2,7 +2,12 @@
## Unreleased

### Features
...
- Introduce `Benchmark` and `StudioBenchmark` (see the usage sketch below)
- Add `how_to_execute_a_benchmark.ipynb` to how-tos
- Add `studio.ipynb` to notebooks to show how one can debug a `Task` with Studio
- Introduce `BenchmarkRepository` and `StudioBenchmarkRepository`
- Add `create_project` bool to `StudioClient.__init__()` to enable users to automatically create their Studio projects
- Add a progress bar to the `Runner` to make it possible to track the progress of a `Run`
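
  A minimal usage sketch of the new benchmark workflow, mirroring the `evaluate_with_studio.ipynb` notebook added in this commit; the project name, dataset ID, and run names below are illustrative placeholders:

  ```python
  from intelligence_layer.connectors.studio.studio import StudioClient
  from intelligence_layer.evaluation.benchmark.studio_benchmark import (
      StudioBenchmarkRepository,
  )
  from intelligence_layer.examples import PromptBasedClassify
  from intelligence_layer.examples.classify.classify import (
      SingleLabelClassifyAggregationLogic,
      SingleLabelClassifyEvaluationLogic,
  )

  # create_project=True lets the client create the Studio project if it does not exist yet
  studio_client = StudioClient(project="My Project", create_project=True)
  studio_benchmark_repository = StudioBenchmarkRepository(studio_client)

  # "existing-dataset-id" stands in for the ID of a dataset already uploaded to Studio
  benchmark = studio_benchmark_repository.create_benchmark(
      "existing-dataset-id",
      SingleLabelClassifyEvaluationLogic(),
      SingleLabelClassifyAggregationLogic(),
      "My Classify Benchmark",  # benchmark names need to be unique
  )
  benchmark.execute(PromptBasedClassify(), "My first benchmark run")
  ```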

### Fixes
...
10 changes: 8 additions & 2 deletions README.md
@@ -139,6 +139,7 @@ To use an **on-premises setup**, set the `CLIENT_URL` variable to your host URL.
| 11 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/documentation/performance_tips.ipynb) |
| 12 | Deployment | Shows how to deploy a Task in a minimal FastAPI app. | [fastapi_tutorial.ipynb](./src/documentation/fastapi_tutorial.ipynb) |
| 13 | Issue Classification | Deploy a Task in Kubernetes to classify Jira issues | [Found in adjacent repository](https://github.com/Aleph-Alpha/IL-Classification-Journey) |
| 14 | Evaluate with Studio | Shows how to evaluate your `Task` using Studio | [evaluate_with_studio.ipynb](./src/documentation/evaluate_with_studio.ipynb) |

## How-Tos

@@ -150,18 +151,23 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
| [...define a task](./src/documentation/how_tos/how_to_define_a_task.ipynb) | How to come up with a new task and formulate it |
| [...implement a task](./src/documentation/how_tos/how_to_implement_a_task.ipynb) | Implement a formulated task and make it run with the Intelligence Layer |
| [...debug and log a task](./src/documentation/how_tos/how_to_log_and_debug_a_task.ipynb) | Tools for logging and debugging in tasks |
| [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| **Analysis Pipeline** | |
| [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic |
| [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb) | Create a dataset used for running a task |
| [...run a task on a dataset](./src/documentation/how_tos/how_to_run_a_task_on_a_dataset.ipynb) | Run a task on a whole dataset instead of single examples |
| [...resume a run after a crash](./src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb) | Resume a run after a crash or exception occurred |
| [...evaluate multiple runs](./src/documentation/how_tos/how_to_evaluate_runs.ipynb) | Evaluate (multiple) runs in a single evaluation |
| [...aggregate multiple evaluations](./src/documentation/how_tos/how_to_aggregate_evaluations.ipynb) | Aggregate (multiple) evaluations in a single aggregation |
| [...retrieve data for analysis](./src/documentation/how_tos/how_to_retrieve_data_for_analysis.ipynb) | Retrieve experiment data in multiple different ways |
| [...implement a custom human evaluation](./src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb) | Necessary steps to create an evaluation with humans as judges via Argilla |
| [...implement elo evaluations](./src/documentation/how_tos/how_to_implement_elo_evaluations.ipynb) | Evaluate runs and create an Elo ranking for them |
| [...implement incremental evaluation](./src/documentation/how_tos/how_to_implement_incremental_evaluation.ipynb) | Implement and run an incremental evaluation |
| **Studio** | |
| [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| [...upload existing datasets](./src/documentation/how_tos/studio/how_to_upload_existing_datasets_to_studio.ipynb) | Upload Datasets to Studio |
| [...execute a benchmark](./src/documentation/how_tos/studio/how_to_execute_a_benchmark.ipynb) | Execute a benchmark |

# Models

315 changes: 315 additions & 0 deletions src/documentation/evaluate_with_studio.ipynb
@@ -0,0 +1,315 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"from intelligence_layer.connectors.studio.studio import StudioClient\n",
"from intelligence_layer.core import TextChunk\n",
"from intelligence_layer.core.model import Llama3InstructModel\n",
"from intelligence_layer.evaluation.benchmark.studio_benchmark import (\n",
" StudioBenchmarkRepository,\n",
")\n",
"from intelligence_layer.evaluation.dataset.domain import Example\n",
"from intelligence_layer.evaluation.dataset.studio_dataset_repository import (\n",
" StudioDatasetRepository,\n",
")\n",
"from intelligence_layer.examples import (\n",
" ClassifyInput,\n",
" PromptBasedClassify,\n",
")\n",
"from intelligence_layer.examples.classify.classify import (\n",
" SingleLabelClassifyAggregationLogic,\n",
" SingleLabelClassifyEvaluationLogic,\n",
")\n",
"from intelligence_layer.examples.classify.prompt_based_classify_with_definitions import (\n",
" LabelWithDefinition,\n",
" PromptBasedClassifyWithDefinitions,\n",
")\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate with Studio\n",
"\n",
"This notebook shows how you can debug a `Task` using Studio. This notebook focuses on the `PromptBasedClassify` for demonstration purposes.\n",
"\n",
"First, we need to instantiate the `StudioClient`. We can either pass an existing project or let the `StudioClient` create it by setting the `create_project` flag to `True.`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"studio_client = StudioClient(project=\"Classify with Studio\", create_project=True)\n",
"studio_dataset_repository = StudioDatasetRepository(studio_client)\n",
"studio_benchmark_repository = StudioBenchmarkRepository(studio_client)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will create our evaluation dataset from some pre-defined dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with Path(\"data/classify_examples.json\").open() as json_data:\n",
" data = json.load(json_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to transform our dataset into the required format. \n",
"Therefore, let's check out what it looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This isn't quite yet the format we need, therefore we translate it into the interface of our `Example`.\n",
"\n",
"This is the target structure:\n",
"\n",
"``` python\n",
"class Example(BaseModel, Generic[Input, ExpectedOutput]):\n",
" input: Input\n",
" expected_output: ExpectedOutput\n",
" id: Optional[str] = Field(default_factory=lambda: str(uuid4()))\n",
" metadata: Optional[SerializableDict]\n",
"```\n",
"\n",
"We want the `input` in each `Example` to contain the input of an actual task.\n",
"The `expected_output` shall correspond to anything we wish to compare our generated output to (i.e., the expected label in our case)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_labels = list(set(item[\"label\"] for item in data))\n",
"dataset = studio_dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(item[\"message\"]), labels=all_labels),\n",
" expected_output=item[\"label\"],\n",
" )\n",
" for item in data\n",
" ],\n",
" dataset_name=\"Single Label Classify Dataset\",\n",
")\n",
"print(f\"Dataset ID: {dataset.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also automatically uploads the created dataset to you **Studio** instance.\n",
"We can inspect the dataset and the individual examples in **Studio** under **Evaluate/Datasets**. Do not forget to select the correct project!\n",
"\n",
"After we have checked our `Dataset`, we can create our first `Benchmark`. To this end, we need the `EvaluationLogic` and the `AggregationLogic` of our Classify use-case. After creating the `Benchmark`, make sure to copy the ID of the `Benchmark` into the `get_benchmark` method, so you don't have to create the `Benchmark` again every time you run the evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"import string\n",
"\n",
"evaluation_logic = SingleLabelClassifyEvaluationLogic()\n",
"aggregation_logic = SingleLabelClassifyAggregationLogic()\n",
"\n",
"rand_str = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(16))\n",
"\n",
"benchmark = studio_benchmark_repository.create_benchmark(\n",
" dataset.id,\n",
" evaluation_logic,\n",
" aggregation_logic,\n",
" f\"Single Label Classify Benchmark {rand_str}\", # Benchmark names need to be unique, therefore we add a random string to the name\n",
")\n",
"print(f\"Benchmark ID: {benchmark.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## If you want to run the notebook multiple times use this block instead of the above one and past your benchmark ID\n",
"# evaluation_logic = SingleLabelClassifyEvaluationLogic()\n",
"# aggregation_logic = SingleLabelClassifyAggregationLogic()\n",
"#\n",
"# benchmark = studio_benchmark_repository.get_benchmark(\n",
"# *Your Benchmark ID here* ,\n",
"# evaluation_logic,\n",
"# aggregation_logic,\n",
"# )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this, we are ready to `execute` our first `Benchmark`. We pass it a meaningful name and execute it. After about two minutes we can take a look at the results in **Studio** in the **Evaluate/Benchmarks** section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"benchmark.execute(PromptBasedClassify(), \"Classify v0.0 with Luminous\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to improve our results and run this again using a `Llama` model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"benchmark.execute(\n",
" PromptBasedClassify(model=Llama3InstructModel(\"llama-3.1-8b-instruct\")),\n",
" \"Classify v0.1 with Llama\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For further comparisons we also `execute` the `PromptBasedClassifyWithDefinitions` task on the same `Benchmark`. This is possible because both `Task` have the exact same input and output format and can thus be compared to each other."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels_with_definitions = [\n",
" LabelWithDefinition(\n",
" name=\"Finance\",\n",
" definition=\"Handles reimbursements, salary payments, and financial planning.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Sales\",\n",
" definition=\"Manages client inquiries, builds relationships, and drives revenue.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Communications\",\n",
" definition=\"Oversees media inquiries, partnerships, and public documentation.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Research\",\n",
" definition=\"Collaborates on innovative projects and explores market applications.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"IT Support\",\n",
" definition=\"Provides technical assistance for devices and platform access issues.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Human Resources\",\n",
" definition=\"Manages onboarding, leave requests, and career development.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Product\",\n",
" definition=\"Addresses customer issues, ensures compliance, and demonstrates product use.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Customer\",\n",
" definition=\"Schedules meetings and ensures customer needs are effectively met.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Security\",\n",
" definition=\"Maintains physical and digital safety, including badge and certificate issues.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"Marketing\",\n",
" definition=\"Manages strategic initiatives and promotes the company's offerings.\",\n",
" ),\n",
" LabelWithDefinition(\n",
" name=\"CEO Office\",\n",
" definition=\"Handles executive engagements and key stakeholder follow-ups.\",\n",
" ),\n",
"]\n",
"\n",
"classify_with_definitions = PromptBasedClassifyWithDefinitions(\n",
" labels_with_definitions=labels_with_definitions,\n",
" model=Llama3InstructModel(\"llama-3.1-8b-instruct\"),\n",
")\n",
"benchmark.execute(classify_with_definitions, \"Classify v1.0 with definitions and Llama\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intelligence-layer-ZqHLMTHE-py3.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}