
Commit

feat: Upload dataset from il sdk (#1116)
* feat: Add submit_dataset to studio client

* feat: Add how-to for submitting existing datasets to studio

* feat: Introduce mapper functions to avoid circular dependencies

* refactor: Remove DataClient Dependency

* feat: add warning to StudioDatasetRepository

---------

Co-authored-by: Johannes Wesch <[email protected]>
MerlinKallenbornAA and JohannesWesch authored Nov 4, 2024
1 parent 6642e53 commit 1b851ed
Showing 10 changed files with 370 additions and 262 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,8 @@
- `PostgresInstructionFinetuningDataRepository` to work with data stored in a Postgres database.
- `FileInstructionFinetuningDataRepository` to work with data stored in the local file-system.
- Compute precision, recall and f1-score by class in `SingleLabelClassifyAggregationLogic`
- Add `submit_dataset` function to `StudioClient`
- Add `how_to_upload_existing_datasets_to_studio.ipynb` to how-tos

### Fixes
...
2 changes: 1 addition & 1 deletion README.md
@@ -150,7 +150,7 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
| [...define a task](./src/documentation/how_tos/how_to_define_a_task.ipynb) | How to come up with a new task and formulate it |
| [...implement a task](./src/documentation/how_tos/how_to_implement_a_task.ipynb) | Implement a formulated task and make it run with the Intelligence Layer |
| [...debug and log a task](./src/documentation/how_tos/how_to_log_and_debug_a_task.ipynb) | Tools for logging and debugging in tasks |
- | [...use Studio with traces](./src/documentation/how_tos/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
+ | [...use Studio with traces](./src/documentation/how_tos/studio/how_to_use_studio_with_traces.ipynb) | Submitting Traces to Studio for debugging |
| **Analysis Pipeline** | |
| [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic |
| [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb) | Create a dataset used for running a task |
@@ -70,7 +70,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "intelligence-layer-d3iSWYpm-py3.10",
+ "display_name": "intelligence-layer-aL2cXmJM-py3.11",
"language": "python",
"name": "python3"
},
@@ -90,7 +90,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "intelligence-layer-d3iSWYpm-py3.10",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -0,0 +1,102 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from uuid import uuid4\n",
"\n",
"from documentation.how_tos.example_data import example_data\n",
"from intelligence_layer.connectors import StudioClient\n",
"from intelligence_layer.evaluation.dataset.studio_dataset_repository import (\n",
" StudioDatasetRepository,\n",
")\n",
"\n",
"my_example_data = example_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to upload (existing) datasets to Studio\n",
"<div class=\"alert alert-info\"> \n",
"\n",
"Make sure your account has permissions to use the Studio application.\n",
"\n",
"For an on-prem or local installation, please contact the corresponding team.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"0. Extract `Dataset` and `Examples` from your `DatasetRepository`.\n",
"\n",
"1. Initialize a `StudioClient` with a project.\n",
" - Use an existing project or create a new one with the `StudioClient.create_project` function.\n",
" \n",
"2. Create a `StudioDatasetRepository` and create a new `Dataset` via `StudioDatasetRepository.create_dataset`, which will automatically upload this new `Dataset` to Studio.\n",
"\n",
"### Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 0\n",
"existing_dataset_repo = my_example_data.dataset_repository\n",
"\n",
"existing_dataset = existing_dataset_repo.dataset(dataset_id=my_example_data.dataset.id)\n",
"assert existing_dataset, \"Make sure your dataset still exists.\"\n",
"\n",
"existing_examples = existing_dataset_repo.examples(\n",
" existing_dataset.id, input_type=str, expected_output_type=str\n",
")\n",
"\n",
"# Step 1\n",
"project_name = str(uuid4())\n",
"studio_client = StudioClient(project=project_name)\n",
"my_project = studio_client.create_project(project=project_name)\n",
"\n",
"# Step 2\n",
"studio_dataset_repo = StudioDatasetRepository(studio_client=studio_client)\n",
"\n",
"studio_dataset = studio_dataset_repo.create_dataset(\n",
" examples=existing_examples,\n",
" dataset_name=existing_dataset.name,\n",
" labels=existing_dataset.labels,\n",
" metadata=existing_dataset.metadata,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intelligence-layer-aL2cXmJM-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
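The three notebook steps above can be mimicked end-to-end with a small, self-contained sketch. Everything here is illustrative: `InMemoryDatasetRepository` and the plain `Example` dataclass stand in for the Intelligence Layer's repository and example types, and only the serialized form that a Studio upload would carry is produced, so no Studio instance is needed.

```python
import json
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Example:
    """Stand-in for the SDK's Example type (illustrative only)."""

    input: str
    expected_output: str
    id: str = field(default_factory=lambda: str(uuid4()))


class InMemoryDatasetRepository:
    """Stand-in for an existing DatasetRepository (illustrative only)."""

    def __init__(self) -> None:
        self._datasets: dict[str, list[Example]] = {}

    def create_dataset(self, name: str, examples: list[Example]) -> str:
        dataset_id = str(uuid4())
        self._datasets[dataset_id] = list(examples)
        return dataset_id

    def examples(self, dataset_id: str) -> list[Example]:
        return self._datasets[dataset_id]


# Step 0: extract the examples from the existing repository.
repo = InMemoryDatasetRepository()
dataset_id = repo.create_dataset("demo", [Example("What is 1+1?", "2")])
existing_examples = repo.examples(dataset_id)

# Steps 1 and 2 would initialize a StudioClient and a StudioDatasetRepository;
# here we only build the newline-delimited JSON the upload would contain.
ndjson = "\n".join(
    json.dumps({"input": e.input, "expected_output": e.expected_output, "id": e.id})
    for e in sorted(existing_examples, key=lambda e: e.id)
)
print(ndjson)
```

In the real notebook, `StudioDatasetRepository.create_dataset` performs this serialization and upload in one call.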
@@ -88,7 +88,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.2"
+ "version": "3.11.8"
}
},
"nbformat": 4,
107 changes: 101 additions & 6 deletions src/intelligence_layer/connectors/studio/studio.py
@@ -1,25 +1,71 @@
import json
import os
from collections import defaultdict
from collections.abc import Sequence
from typing import Optional
from collections.abc import Iterable, Sequence
from typing import Generic, Optional, TypeVar
from urllib.parse import urljoin
from uuid import uuid4

import requests
from pydantic import BaseModel
from pydantic import BaseModel, Field
from requests.exceptions import ConnectionError, MissingSchema

from intelligence_layer.connectors.base.json_serializable import (
SerializableDict,
)
from intelligence_layer.core.tracer.tracer import ( # Import to be fixed with PHS-731
ExportedSpan,
ExportedSpanList,
PydanticSerializable,
Tracer,
)

Input = TypeVar("Input", bound=PydanticSerializable)
ExpectedOutput = TypeVar("ExpectedOutput", bound=PydanticSerializable)


class StudioProject(BaseModel):
name: str
description: Optional[str]


class StudioExample(BaseModel, Generic[Input, ExpectedOutput]):
"""Represents an instance of :class:`Example` as sent to Studio.
Attributes:
input: Input for the :class:`Task`. Must be the same type as the input of the task used.
expected_output: The expected output from a given example run.
The evaluator compares the actual output of the run against this.
id: Identifier for the example, defaults to uuid.
metadata: Optional dictionary of custom key-value pairs.
Generics:
Input: Interface to be passed to the :class:`Task` that shall be evaluated.
ExpectedOutput: Output that is expected from the run with the supplied input.
"""

input: Input
expected_output: ExpectedOutput
id: str = Field(default_factory=lambda: str(uuid4()))
metadata: Optional[SerializableDict] = None


class StudioDataset(BaseModel):
"""Represents a :class:`Dataset` linked to multiple examples as sent to Studio.
Attributes:
id: Dataset ID.
name: A short name of the dataset.
labels: Labels for filtering datasets. Defaults to an empty set.
metadata: Additional information about the dataset. Defaults to empty dict.
"""

id: str = Field(default_factory=lambda: str(uuid4()))
name: str
labels: set[str] = set()
metadata: SerializableDict = dict()


class StudioClient:
"""Client for communicating with Studio.
@@ -50,7 +96,6 @@ def __init__(
"'AA_TOKEN' is not set and auth_token is not given as a parameter. Please provide one or the other."
)
self._headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"Authorization": f"Bearer {self._token}",
}
@@ -148,7 +193,7 @@ def submit_trace(self, data: Sequence[ExportedSpan]) -> str:
spans belong to multiple traces.
Args:
- data: Spans to create the trace from. Created by exporting from a `Tracer`.
+ data: :class:`Spans` to create the trace from. Created by exporting from a :class:`Tracer`.
Returns:
The ID of the created trace.
@@ -161,7 +206,7 @@ def submit_from_tracer(self, tracer: Tracer) -> list[str]:
"""Sends all trace data from the Tracer to Studio.
Args:
- tracer: Tracer to extract data from.
+ tracer: :class:`Tracer` to extract data from.
Returns:
List of created trace IDs.
@@ -191,3 +236,53 @@ def _upload_trace(self, trace: ExportedSpanList) -> str:
case _:
response.raise_for_status()
return str(response.json())

def submit_dataset(
self,
dataset: StudioDataset,
examples: Iterable[StudioExample[Input, ExpectedOutput]],
) -> str:
"""Submits a dataset to Studio.
Args:
dataset: :class:`Dataset` to be uploaded
examples: :class:`Examples` of the :class:`Dataset`
Returns:
ID of the created dataset
"""
url = urljoin(self.url, f"/api/projects/{self.project_id}/evaluation/datasets")
source_data_list = [
example.model_dump_json()
for example in sorted(examples, key=lambda x: x.id)
]

source_data_file = "\n".join(source_data_list).encode()

data = {
"name": dataset.name,
"labels": list(dataset.labels) if dataset.labels is not None else [],
"total_datapoints": len(source_data_list),
}

if dataset.metadata:
data["metadata"] = json.dumps(dataset.metadata)

response = requests.post(
url,
files={"source_data": source_data_file},
data=data,
headers=self._headers,
)

self._raise_for_status(response)
return str(response.text)

def _raise_for_status(self, response: requests.Response) -> None:
try:
response.raise_for_status()
except requests.HTTPError as e:
print(
f"The following error was raised during request execution: {e.response.text}"
)
raise e
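As the new `submit_dataset` method shows, examples are sorted by id, serialized to newline-delimited JSON, and posted as a multipart file alongside the dataset's form fields. That packaging can be sketched with the standard library alone; the plain dicts and the `build_dataset_payload` helper name are illustrative stand-ins for the `StudioExample` and `StudioDataset` models, not part of the SDK.

```python
import json


def build_dataset_payload(dataset: dict, examples: list[dict]) -> tuple[bytes, dict]:
    """Sketch of the request body assembled by StudioClient.submit_dataset."""
    # One JSON object per line, sorted by example id for a stable upload order.
    lines = [json.dumps(example) for example in sorted(examples, key=lambda e: e["id"])]
    source_data_file = "\n".join(lines).encode()
    data = {
        "name": dataset["name"],
        "labels": list(dataset.get("labels") or []),
        "total_datapoints": len(lines),
    }
    if dataset.get("metadata"):  # metadata is JSON-encoded only when present
        data["metadata"] = json.dumps(dataset["metadata"])
    return source_data_file, data


examples = [
    {"id": "b", "input": "2+2", "expected_output": "4"},
    {"id": "a", "input": "1+1", "expected_output": "2"},
]
file_bytes, form = build_dataset_payload({"name": "demo", "labels": ["math"]}, examples)
```

`requests.post(url, files={"source_data": file_bytes}, data=form, ...)` would then perform the actual upload. This is also why the diff removes the explicit `Content-Type: application/json` header from `__init__`: for multipart requests, `requests` must set the `Content-Type` (including the boundary) itself.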
