
Commit

feat: Upload dataset from il sdk (#1116)
* feat: Add submit_dataset to studio client

* feat: Add how-to for submitting existing datasets to studio

* feat: Introduce mapper functions to avoid circular dependencies

* refactor: Remove DataClient Dependency

* feat: add warning to StudioDatasetRepository

---------

Co-authored-by: Johannes Wesch <[email protected]>
MerlinKallenbornAA and JohannesWesch authored Nov 4, 2024
1 parent 6642e53 commit 1b851ed
Showing 10 changed files with 370 additions and 262 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,8 @@
- `PostgresInstructionFinetuningDataRepository` to work with data stored in a Postgres database.
- `FileInstructionFinetuningDataRepository` to work with data stored in the local file-system.
- Compute precision, recall and f1-score by class in `SingleLabelClassifyAggregationLogic`
- Add `submit_dataset` function to `StudioClient`
- Add `how_to_upload_existing_datasets_to_studio.ipynb` to how-tos

### Fixes
...
2 changes: 1 addition & 1 deletion README.md
@@ -150,7 +150,7 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
| [...define a task](./src/documentation/how_tos/how_to_define_a_task.ipynb) | How to come up with a new task and formulate it |
| [...implement a task](./src/documentation/how_tos/how_to_implement_a_task.ipynb) | Implement a formulated task and make it run with the Intelligence Layer |
| [...debug and log a task](./src/documentation/how_tos/how_to_log_and_debug_a_task.ipynb) | Tools for logging and debugging in tasks |
- | [...use Studio with traces](./src/documentation/how_tos/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
+ | [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| **Analysis Pipeline** | |
| [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic |
| [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb) | Create a dataset used for running a task |
@@ -70,7 +70,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "intelligence-layer-d3iSWYpm-py3.10",
+ "display_name": "intelligence-layer-aL2cXmJM-py3.11",
"language": "python",
"name": "python3"
},
@@ -90,7 +90,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "intelligence-layer-d3iSWYpm-py3.10",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -0,0 +1,102 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from uuid import uuid4\n",
"\n",
"from documentation.how_tos.example_data import example_data\n",
"from intelligence_layer.connectors import StudioClient\n",
"from intelligence_layer.evaluation.dataset.studio_dataset_repository import (\n",
" StudioDatasetRepository,\n",
")\n",
"\n",
"my_example_data = example_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to upload (existing) datasets to Studio\n",
"<div class=\"alert alert-info\"> \n",
"\n",
"Make sure your account has permissions to use the Studio application.\n",
"\n",
"For an on-prem or local installation, please contact the corresponding team.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"0. Extract `Dataset` and `Examples` from your `DatasetRepository`.\n",
"\n",
"1. Initialize a `StudioClient` with a project.\n",
" - Use an existing project or create a new one with the `StudioClient.create_project` function.\n",
" \n",
"2. Create a `StudioDatasetRepository` and create a new `Dataset` via `StudioDatasetRepository.create_dataset`, which will automatically upload this new `Dataset` to Studio.\n",
"\n",
"### Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 0\n",
"existing_dataset_repo = my_example_data.dataset_repository\n",
"\n",
"existing_dataset = existing_dataset_repo.dataset(dataset_id=my_example_data.dataset.id)\n",
"assert existing_dataset, \"Make sure your dataset still exists.\"\n",
"\n",
"existing_examples = existing_dataset_repo.examples(\n",
" existing_dataset.id, input_type=str, expected_output_type=str\n",
")\n",
"\n",
"# Step 1\n",
"project_name = str(uuid4())\n",
"studio_client = StudioClient(project=project_name)\n",
"my_project = studio_client.create_project(project=project_name)\n",
"\n",
"# Step 2\n",
"studio_dataset_repo = StudioDatasetRepository(studio_client=studio_client)\n",
"\n",
"studio_dataset = studio_dataset_repo.create_dataset(\n",
" examples=existing_examples,\n",
" dataset_name=existing_dataset.name,\n",
" labels=existing_dataset.labels,\n",
" metadata=existing_dataset.metadata,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intelligence-layer-aL2cXmJM-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
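The three notebook steps above can be mimicked end-to-end with a small, self-contained sketch. Everything here is illustrative: `InMemoryDatasetRepository` and the plain `Example` dataclass stand in for the Intelligence Layer's repository and example types, and only the serialized form that a Studio upload would carry is produced, so no Studio instance is needed.

```python
import json
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Example:
    """Stand-in for the SDK's Example type (illustrative only)."""

    input: str
    expected_output: str
    id: str = field(default_factory=lambda: str(uuid4()))


class InMemoryDatasetRepository:
    """Stand-in for an existing DatasetRepository (illustrative only)."""

    def __init__(self) -> None:
        self._datasets: dict[str, list[Example]] = {}

    def create_dataset(self, name: str, examples: list[Example]) -> str:
        dataset_id = str(uuid4())
        self._datasets[dataset_id] = list(examples)
        return dataset_id

    def examples(self, dataset_id: str) -> list[Example]:
        return self._datasets[dataset_id]


# Step 0: extract the examples from the existing repository.
repo = InMemoryDatasetRepository()
dataset_id = repo.create_dataset("demo", [Example("What is 1+1?", "2")])
existing_examples = repo.examples(dataset_id)

# Steps 1 and 2 would initialize a StudioClient and a StudioDatasetRepository;
# here we only build the newline-delimited JSON the upload would contain.
ndjson = "\n".join(
    json.dumps({"input": e.input, "expected_output": e.expected_output, "id": e.id})
    for e in sorted(existing_examples, key=lambda e: e.id)
)
print(ndjson)
```

In the real notebook, `StudioDatasetRepository.create_dataset` performs this serialization and upload in one call.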
@@ -88,7 +88,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.2"
+ "version": "3.11.8"
}
},
"nbformat": 4,
107 changes: 101 additions & 6 deletions src/intelligence_layer/connectors/studio/studio.py
@@ -1,25 +1,71 @@
import json
import os
from collections import defaultdict
from collections.abc import Sequence
from typing import Optional
from collections.abc import Iterable, Sequence
from typing import Generic, Optional, TypeVar
from urllib.parse import urljoin
from uuid import uuid4

import requests
from pydantic import BaseModel
from pydantic import BaseModel, Field
from requests.exceptions import ConnectionError, MissingSchema

from intelligence_layer.connectors.base.json_serializable import (
SerializableDict,
)
from intelligence_layer.core.tracer.tracer import ( # Import to be fixed with PHS-731
ExportedSpan,
ExportedSpanList,
PydanticSerializable,
Tracer,
)

Input = TypeVar("Input", bound=PydanticSerializable)
ExpectedOutput = TypeVar("ExpectedOutput", bound=PydanticSerializable)


class StudioProject(BaseModel):
name: str
description: Optional[str]


class StudioExample(BaseModel, Generic[Input, ExpectedOutput]):
"""Represents an instance of :class:`Example` as sent to Studio.
Attributes:
input: Input for the :class:`Task`. Must be the same type as the input of the task used.
expected_output: The expected output from a given example run.
The evaluator compares the actual output of the run against this.
id: Identifier for the example, defaults to uuid.
metadata: Optional dictionary of custom key-value pairs.
Generics:
Input: Interface to be passed to the :class:`Task` that shall be evaluated.
ExpectedOutput: Output that is expected from the run with the supplied input.
"""

input: Input
expected_output: ExpectedOutput
id: str = Field(default_factory=lambda: str(uuid4()))
metadata: Optional[SerializableDict] = None


class StudioDataset(BaseModel):
"""Represents a :class:`Dataset` linked to multiple examples as sent to Studio.
Attributes:
id: Dataset ID.
name: A short name of the dataset.
labels: Labels for filtering datasets. Defaults to an empty set.
metadata: Additional information about the dataset. Defaults to empty dict.
"""

id: str = Field(default_factory=lambda: str(uuid4()))
name: str
labels: set[str] = set()
metadata: SerializableDict = dict()


class StudioClient:
"""Client for communicating with Studio.
@@ -50,7 +96,6 @@ def __init__(
"'AA_TOKEN' is not set and auth_token is not given as a parameter. Please provide one or the other."
)
self._headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"Authorization": f"Bearer {self._token}",
}
@@ -148,7 +193,7 @@ def submit_trace(self, data: Sequence[ExportedSpan]) -> str:
spans belong to multiple traces.
Args:
- data: Spans to create the trace from. Created by exporting from a `Tracer`.
+ data: :class:`Spans` to create the trace from. Created by exporting from a :class:`Tracer`.
Returns:
The ID of the created trace.
@@ -161,7 +206,7 @@ def submit_from_tracer(self, tracer: Tracer) -> list[str]:
"""Sends all trace data from the Tracer to Studio.
Args:
- tracer: Tracer to extract data from.
+ tracer: :class:`Tracer` to extract data from.
Returns:
List of created trace IDs.
@@ -191,3 +236,53 @@ def _upload_trace(self, trace: ExportedSpanList) -> str:
case _:
response.raise_for_status()
return str(response.json())

def submit_dataset(
self,
dataset: StudioDataset,
examples: Iterable[StudioExample[Input, ExpectedOutput]],
) -> str:
"""Submits a dataset to Studio.
Args:
dataset: :class:`Dataset` to be uploaded
examples: :class:`Examples` of the :class:`Dataset`
Returns:
ID of the created dataset
"""
url = urljoin(self.url, f"/api/projects/{self.project_id}/evaluation/datasets")
source_data_list = [
example.model_dump_json()
for example in sorted(examples, key=lambda x: x.id)
]

source_data_file = "\n".join(source_data_list).encode()

data = {
"name": dataset.name,
"labels": list(dataset.labels) if dataset.labels is not None else [],
"total_datapoints": len(source_data_list),
}

if dataset.metadata:
data["metadata"] = json.dumps(dataset.metadata)

response = requests.post(
url,
files={"source_data": source_data_file},
data=data,
headers=self._headers,
)

self._raise_for_status(response)
return str(response.text)

def _raise_for_status(self, response: requests.Response) -> None:
try:
response.raise_for_status()
except requests.HTTPError as e:
print(
f"The following error was raised during request execution: {e.response.text}"
)
raise e
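As the new `submit_dataset` method shows, examples are sorted by id, serialized to newline-delimited JSON, and posted as a multipart file alongside the dataset's form fields. That packaging can be sketched with the standard library alone; the plain dicts and the `build_dataset_payload` helper name are illustrative stand-ins for the `StudioExample` and `StudioDataset` models, not part of the SDK.

```python
import json


def build_dataset_payload(dataset: dict, examples: list[dict]) -> tuple[bytes, dict]:
    """Sketch of the request body assembled by StudioClient.submit_dataset."""
    # One JSON object per line, sorted by example id for a stable upload order.
    lines = [json.dumps(example) for example in sorted(examples, key=lambda e: e["id"])]
    source_data_file = "\n".join(lines).encode()
    data = {
        "name": dataset["name"],
        "labels": list(dataset.get("labels") or []),
        "total_datapoints": len(lines),
    }
    if dataset.get("metadata"):  # metadata is JSON-encoded only when present
        data["metadata"] = json.dumps(dataset["metadata"])
    return source_data_file, data


examples = [
    {"id": "b", "input": "2+2", "expected_output": "4"},
    {"id": "a", "input": "1+1", "expected_output": "2"},
]
file_bytes, form = build_dataset_payload({"name": "demo", "labels": ["math"]}, examples)
```

`requests.post(url, files={"source_data": file_bytes}, data=form, ...)` would then perform the actual upload. This is also why the diff removes the explicit `Content-Type: application/json` header from `__init__`: for multipart requests, `requests` must set the `Content-Type` (including the boundary) itself.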
