feat(PHS-880): enable benchmark executions
1 parent 467a79e · commit 98e2876
Showing 18 changed files with 955 additions and 114 deletions.
@@ -0,0 +1,315 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"import json\n", | ||
"from pathlib import Path\n", | ||
"\n", | ||
"from dotenv import load_dotenv\n", | ||
"\n", | ||
"from intelligence_layer.connectors.studio.studio import StudioClient\n", | ||
"from intelligence_layer.core import TextChunk\n", | ||
"from intelligence_layer.core.model import Llama3InstructModel\n", | ||
"from intelligence_layer.evaluation.benchmark.studio_benchmark import (\n", | ||
" StudioBenchmarkRepository,\n", | ||
")\n", | ||
"from intelligence_layer.evaluation.dataset.domain import Example\n", | ||
"from intelligence_layer.evaluation.dataset.studio_dataset_repository import (\n", | ||
" StudioDatasetRepository,\n", | ||
")\n", | ||
"from intelligence_layer.examples import (\n", | ||
" ClassifyInput,\n", | ||
" PromptBasedClassify,\n", | ||
")\n", | ||
"from intelligence_layer.examples.classify.classify import (\n", | ||
" SingleLabelClassifyAggregationLogic,\n", | ||
" SingleLabelClassifyEvaluationLogic,\n", | ||
")\n", | ||
"from intelligence_layer.examples.classify.prompt_based_classify_with_definitions import (\n", | ||
" LabelWithDefinition,\n", | ||
" PromptBasedClassifyWithDefinitions,\n", | ||
")\n", | ||
"\n", | ||
"load_dotenv()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Evaluate with Studio\n", | ||
"\n", | ||
"This notebook shows how you can debug a `Task` using Studio. This notebook focuses on the `PromptBasedClassify` for demonstration purposes.\n", | ||
"\n", | ||
"First, we need to instantiate the `StudioClient`. We can either pass an existing project or let the `StudioClient` create it by setting the `create_project` flag to `True.`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"studio_client = StudioClient(project=\"Classify with Studio\", create_project=True)\n", | ||
"studio_dataset_repository = StudioDatasetRepository(studio_client)\n", | ||
"studio_benchmark_repository = StudioBenchmarkRepository(studio_client)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Next, we will create our evaluation dataset from some pre-defined dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"with Path(\"data/classify_examples.json\").open() as json_data:\n", | ||
" data = json.load(json_data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We need to transform our dataset into the required format. \n", | ||
"Therefore, let's check out what it looks like." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"data[0]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This isn't quite yet the format we need, therefore we translate it into the interface of our `Example`.\n", | ||
"\n", | ||
"This is the target structure:\n", | ||
"\n", | ||
"``` python\n", | ||
"class Example(BaseModel, Generic[Input, ExpectedOutput]):\n", | ||
" input: Input\n", | ||
" expected_output: ExpectedOutput\n", | ||
" id: Optional[str] = Field(default_factory=lambda: str(uuid4()))\n", | ||
" metadata: Optional[SerializableDict]\n", | ||
"```\n", | ||
"\n", | ||
"We want the `input` in each `Example` to contain the input of an actual task.\n", | ||
"The `expected_output` shall correspond to anything we wish to compare our generated output to (i.e., the expected label in our case)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"all_labels = list(set(item[\"label\"] for item in data))\n", | ||
"dataset = studio_dataset_repository.create_dataset(\n", | ||
" examples=[\n", | ||
" Example(\n", | ||
" input=ClassifyInput(chunk=TextChunk(item[\"message\"]), labels=all_labels),\n", | ||
" expected_output=item[\"label\"],\n", | ||
" )\n", | ||
" for item in data\n", | ||
" ],\n", | ||
" dataset_name=\"Single Label Classify Dataset\",\n", | ||
")\n", | ||
"print(f\"Dataset ID: {dataset.id}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This also automatically uploads the created dataset to you **Studio** instance.\n", | ||
"We can inspect the dataset and the individual examples in **Studio** under **Evaluate/Datasets**. Do not forget to select the correct project!\n", | ||
"\n", | ||
"After we have checked our `Dataset`, we can create our first `Benchmark`. To this end, we need the `EvaluationLogic` and the `AggregationLogic` of our Classify use-case. After creating the `Benchmark`, make sure to copy the ID of the `Benchmark` into the `get_benchmark` method, so you don't have to create the `Benchmark` again every time you run the evaluation." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import random\n", | ||
"import string\n", | ||
"\n", | ||
"evaluation_logic = SingleLabelClassifyEvaluationLogic()\n", | ||
"aggregation_logic = SingleLabelClassifyAggregationLogic()\n", | ||
"\n", | ||
"rand_str = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(16))\n", | ||
"\n", | ||
"benchmark = studio_benchmark_repository.create_benchmark(\n", | ||
" dataset.id,\n", | ||
" evaluation_logic,\n", | ||
" aggregation_logic,\n", | ||
" f\"Single Label Classify Benchmark {rand_str}\", # Benchmark names need to be unique, therefore we add a random string to the name\n", | ||
")\n", | ||
"print(f\"Benchmark ID: {benchmark.id}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"## If you want to run the notebook multiple times use this block instead of the above one and past your benchmark ID\n", | ||
"# evaluation_logic = SingleLabelClassifyEvaluationLogic()\n", | ||
"# aggregation_logic = SingleLabelClassifyAggregationLogic()\n", | ||
"#\n", | ||
"# benchmark = studio_benchmark_repository.get_benchmark(\n", | ||
"# *Your Benchmark ID here* ,\n", | ||
"# evaluation_logic,\n", | ||
"# aggregation_logic,\n", | ||
"# )" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"With this, we are ready to `execute` our first `Benchmark`. We pass it a meaningful name and execute it. After about two minutes we can take a look at the results in **Studio** in the **Evaluate/Benchmarks** section." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"benchmark.execute(PromptBasedClassify(), \"Classify v0.0 with Luminous\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's try to improve our results and run this again using a `Llama` model." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"benchmark.execute(\n", | ||
" PromptBasedClassify(model=Llama3InstructModel(\"llama-3.1-8b-instruct\")),\n", | ||
" \"Classify v0.1 with Llama\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For further comparisons we also `execute` the `PromptBasedClassifyWithDefinitions` task on the same `Benchmark`. This is possible because both `Task` have the exact same input and output format and can thus be compared to each other." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"labels_with_definitions = [\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Finance\",\n", | ||
" definition=\"Handles reimbursements, salary payments, and financial planning.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Sales\",\n", | ||
" definition=\"Manages client inquiries, builds relationships, and drives revenue.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Communications\",\n", | ||
" definition=\"Oversees media inquiries, partnerships, and public documentation.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Research\",\n", | ||
" definition=\"Collaborates on innovative projects and explores market applications.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"IT Support\",\n", | ||
" definition=\"Provides technical assistance for devices and platform access issues.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Human Resources\",\n", | ||
" definition=\"Manages onboarding, leave requests, and career development.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Product\",\n", | ||
" definition=\"Addresses customer issues, ensures compliance, and demonstrates product use.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Customer\",\n", | ||
" definition=\"Schedules meetings and ensures customer needs are effectively met.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Security\",\n", | ||
" definition=\"Maintains physical and digital safety, including badge and certificate issues.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"Marketing\",\n", | ||
" definition=\"Manages strategic initiatives and promotes the company's offerings.\",\n", | ||
" ),\n", | ||
" LabelWithDefinition(\n", | ||
" name=\"CEO Office\",\n", | ||
" definition=\"Handles executive engagements and key stakeholder follow-ups.\",\n", | ||
" ),\n", | ||
"]\n", | ||
"\n", | ||
"classify_with_definitions = PromptBasedClassifyWithDefinitions(\n", | ||
" labels_with_definitions=labels_with_definitions,\n", | ||
" model=Llama3InstructModel(\"llama-3.1-8b-instruct\"),\n", | ||
")\n", | ||
"benchmark.execute(classify_with_definitions, \"Classify v1.0 with definitions and Llama\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "intelligence-layer-ZqHLMTHE-py3.12", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |