feat: improved EvaluationOverviews and Argilla Integration (#829)
* Create EvaluationLogicBase to fix the type magic for the async case
* Create EvaluatorBase to share common evaluation behavior
* Remove ArgillaEvaluationRepository, as it is no longer needed
* Implement an AsyncEvaluator that serves as the base for the ArgillaEvaluator
* Refactor the Argilla*-dependent classes
* Adjust the tutorials accordingly
Task: IL-298

---------

Co-authored-by: Merlin Kallenborn <[email protected]>
Co-authored-by: Johannes Wesch <[email protected]>
3 people authored May 14, 2024
1 parent e548ed8 commit cb8843c
Showing 32 changed files with 2,096 additions and 1,847 deletions.
26 changes: 24 additions & 2 deletions CHANGELOG.md
@@ -2,11 +2,33 @@

## Unreleased

We did a major revamp of the `ArgillaEvaluator` to separate an `AsyncEvaluator` from the normal evaluation scenario.
This comes with easier-to-understand interfaces, more information in the `EvaluationOverview`, and a simplified aggregation step for Argilla that no longer depends on specific Argilla types.
Check the how-to [here](./src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb) for detailed information.

### Breaking Changes
...

- rename: `AggregatedInstructComparison` to `AggregatedComparison`
- rename: `InstructComparisonArgillaAggregationLogic` to `ComparisonAggregationLogic`
- remove: `ArgillaAggregator` - the regular aggregator now does the job
- remove: `ArgillaEvaluationRepository` - the `ArgillaEvaluator` now uses an `AsyncEvaluationRepository`, which extends the existing `EvaluationRepository` for the human-feedback use case
- `ArgillaEvaluationLogic` now uses `to_record` and `from_record` instead of `do_evaluate`. The signature of `to_record` stays the same. The `Field`s and `Question`s are now defined in the logic instead of being passed to the `ArgillaEvaluationRepository` (see the sketch below)
- `ArgillaEvaluator` now takes the `ArgillaClient` as well as the `workspace_id`. It inherits from the abstract `AsyncEvaluator` and no longer has `evaluate_runs` and `evaluate`. Instead, it has `submit` and `retrieve`.
- `EvaluationOverview` gains the attributes `end_date`, `successful_evaluation_count`, and `failed_evaluation_count`
- rename: `start` is now called `start_date` and is no longer optional
- we refactored the internals of `Evaluator`. This is only relevant if you subclass from it. Most of the typing and data handling has moved to `EvaluatorBase`
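
For migration, a minimal sketch of an evaluation logic written against the new interface. It mirrors the example from the how-to notebook; the import paths for the Argilla connector classes and the body of `to_record` are assumptions, and your own input/output models will differ.

```python
from pydantic import BaseModel

from intelligence_layer.connectors import (  # assumed import path for the Argilla connector classes
    ArgillaEvaluation,
    Field,
    Question,
    RecordData,
)
from intelligence_layer.evaluation import (
    ArgillaEvaluationLogic,
    Example,
    RecordDataSequence,
    SuccessfulExampleOutput,
)


class StoryTaskInput(BaseModel):  # should already be implemented in your task
    topic: str
    targeted_word_count: int


class StoryTaskOutput(BaseModel):  # should already be implemented in your task
    story: str


class FunnyOutputRating(BaseModel):  # the new evaluation result type
    rating: int


class CustomArgillaEvaluationLogic(
    ArgillaEvaluationLogic[StoryTaskInput, StoryTaskOutput, None, FunnyOutputRating]
):
    def __init__(self) -> None:
        # Questions and Fields are now defined in the logic instead of the repository.
        super().__init__(
            questions=[
                Question(
                    name="rating",
                    title="Funniness",
                    description="How funny do you think the joke is? Rate it from 1-5.",
                    options=range(1, 6),
                )
            ],
            fields=[
                Field(name="input", title="Topic"),
                Field(name="output", title="Joke"),
            ],
        )

    def to_record(
        self,
        example: Example[StoryTaskInput, None],
        *output: SuccessfulExampleOutput[StoryTaskOutput],
    ) -> RecordDataSequence:
        # Replaces do_evaluate: turn an example and its run outputs into Argilla records.
        # The content keys must match the Field names above (illustrative mapping).
        return RecordDataSequence(
            records=[
                RecordData(
                    content={"input": example.input.topic, "output": output[0].output.story},
                    example_id=example.id,
                )
            ]
        )

    def from_record(self, argilla_evaluation: ArgillaEvaluation) -> FunnyOutputRating:
        # New: turn a labelled Argilla record back into an evaluation result.
        return FunnyOutputRating(rating=argilla_evaluation.metadata["rating"])
```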


### New Features
...
- Add `ComparisonEvaluation` for the Elo evaluation to abstract from the Argilla record
- Add `AsyncEvaluator` for human-feedback evaluation. `ArgillaEvaluator` inherits from it (see the flow sketch after this list)
  - `.submit` pushes all evaluations to Argilla for labelling
    - Add `PartialEvaluationOverview` to store the submission details.
  - `.retrieve` then collects all labelled records from Argilla and stores them in an `AsyncEvaluationRepository`.
- Add `AsyncEvaluationRepository` to store and retrieve `PartialEvaluationOverview`s. Also add `AsyncFileEvaluationRepository` and `AsyncInMemoryEvaluationRepository`
- Add `EvaluatorBase` and `EvaluationLogicBase` as base classes for both asynchronous and synchronous evaluation.
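
As an illustration, a sketch of the new submit/retrieve flow. The in-memory dataset and run repositories, the run ID, and the positional argument order of the `ArgillaEvaluator` constructor are assumptions based on the how-to notebook; `CustomArgillaEvaluationLogic` refers to the logic sketched under the breaking changes above.

```python
from intelligence_layer.connectors import DefaultArgillaClient  # assumed import path
from intelligence_layer.evaluation import (
    ArgillaEvaluator,
    AsyncInMemoryEvaluationRepository,
    InMemoryDatasetRepository,  # placeholders for your existing repositories
    InMemoryRunRepository,
)

# Reads ARGILLA_API_URL and ARGILLA_API_KEY from the environment by default.
client = DefaultArgillaClient()
workspace_id = client.ensure_workspace_exists("my-workspace-name")

dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
# Use AsyncFileEvaluationRepository instead for persistent results.
evaluation_repository = AsyncInMemoryEvaluationRepository()

evaluator = ArgillaEvaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    description="My evaluation description",
    evaluation_logic=CustomArgillaEvaluationLogic(),
    argilla_client=client,
    workspace_id=workspace_id,
)

# submit pushes the records to Argilla and returns a PartialEvaluationOverview;
# keep its id to retrieve the results later.
partial_evaluation_overview = evaluator.submit("your_run_id_of_interest")

# ... label the records in the Argilla UI ...

# retrieve collects the labelled records and yields a full EvaluationOverview.
evaluation_overview = evaluator.retrieve(partial_evaluation_overview.id)
```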



### Fixes
- Improve the description of using Artifactory tokens for the installation of IL
8 changes: 3 additions & 5 deletions src/documentation/how_tos/how_to_evaluate_runs.ipynb
@@ -8,10 +8,7 @@
"source": [
"from example_data import DummyEvaluationLogic, example_data\n",
"\n",
"from intelligence_layer.evaluation.evaluation.evaluator import Evaluator\n",
"from intelligence_layer.evaluation.evaluation.in_memory_evaluation_repository import (\n",
" InMemoryEvaluationRepository,\n",
")"
"from intelligence_layer.evaluation import Evaluator, InMemoryEvaluationRepository"
]
},
{
@@ -40,6 +37,7 @@
"outputs": [],
"source": [
"# Step 0\n",
"\n",
"my_example_data = example_data()\n",
"print()\n",
"run_ids = [my_example_data.run_overview_1.id, my_example_data.run_overview_2.id]\n",
@@ -82,7 +80,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.8"
}
},
"nbformat": 4,
194 changes: 70 additions & 124 deletions src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb
@@ -6,8 +6,6 @@
"metadata": {},
"outputs": [],
"source": [
"from typing import Iterable\n",
"\n",
"from dotenv import load_dotenv\n",
"from pydantic import BaseModel\n",
"\n",
@@ -19,13 +17,9 @@
" RecordData,\n",
")\n",
"from intelligence_layer.evaluation import (\n",
" AggregationLogic,\n",
" ArgillaAggregator,\n",
" ArgillaEvaluationLogic,\n",
" ArgillaEvaluationRepository,\n",
" AsyncInMemoryEvaluationRepository,\n",
" Example,\n",
" InMemoryAggregationRepository,\n",
" InMemoryEvaluationRepository,\n",
" RecordDataSequence,\n",
" SuccessfulExampleOutput,\n",
")\n",
@@ -40,13 +34,17 @@
"# How to evaluate with human evaluation via Argilla\n",
"1. Initialize an Argilla client with the correct settings for your setup\n",
" - By default, the url and api key are read from the environment variables `ARGILLA_API_URL` and `ARGILLA_API_KEY`\n",
"2. Create `Question`s and `Field`s to structure the data that will be displayed in Argilla\n",
"3. Choose an Argilla workspace and get its ID\n",
"4. Create an `ArgillaEvaluationRepository`\n",
"2. Choose an Argilla workspace and get its ID\n",
"3. Create an `AsyncEvaluationRepository`\n",
"4. Define new output type for the evaluation\n",
"5. Implement an `ArgillaEvaluationLogic`\n",
" 1. Create `Question`s and `Field`s to structure the data that will be displayed in Argilla\n",
" 2. Implement `to_record` to convert the task input into an Argilla record\n",
" 3. Implement `from_record` to convert the record back to an evaluation result\n",
"6. Submit tasks to the Argilla instance by running the `ArgillaEvaluator`\n",
" - Make sure to save the `EvaluationOverview.id`, as it is needed to retrieve the results later\n",
"7. **Use the Argilla web platform to evaluate** "
"7. **Use the Argilla web platform to evaluate** \n",
"8. Collect all labelled evaluations from Argilla\n",
" - Make sure to save the `EvaluationOverview.id`, as it is needed to retrieve the results later"
]
},
{
@@ -62,54 +60,66 @@
"metadata": {},
"outputs": [],
"source": [
"# Step 0\n",
"\n",
"\n",
"class StoryTaskInput(BaseModel): # Should already be implemented in your task\n",
" topic: str\n",
" targeted_word_count: int\n",
"\n",
"\n",
"class StoryTaskOutput(BaseModel): # Should already be implemented in your task\n",
" story: str\n",
"\n",
"\n",
"# Step 1\n",
"\n",
"\n",
"client = DefaultArgillaClient(\n",
" # api_url=\"your url here\", # not necessary if ARGILLA_API_URL is set in environment\n",
" # api_key=\"your api key here\", # not necessary if ARGILLA_API_KEY is set in environment\n",
")\n",
"\n",
"# Step 2\n",
"questions = [\n",
" Question(\n",
" name=\"rating\",\n",
" title=\"Funniness\",\n",
" description=\"How funny do you think is the joke? Rate it from 1-5.\",\n",
" options=range(1, 6),\n",
" )\n",
"]\n",
"fields = [\n",
" Field(name=\"input\", title=\"Topic\"),\n",
" Field(name=\"output\", title=\"Joke\"),\n",
"]\n",
"\n",
"# Step 3\n",
"# Step 2\n",
"workspace_id = client.ensure_workspace_exists(\"my-workspace-name\")\n",
"\n",
"# Step 4\n",
"data_storage = (\n",
" InMemoryEvaluationRepository()\n",
"# Step 3\n",
"evaluation_repository = (\n",
" AsyncInMemoryEvaluationRepository()\n",
") # Use FileEvaluationRepository for persistent results\n",
"evaluation_repository = ArgillaEvaluationRepository(\n",
" data_storage, client, workspace_id, fields, questions\n",
")\n",
"\n",
"\n",
"# Step 5\n",
"class StoryTaskInput(BaseModel): # Should already be implemented in your task\n",
" topic: str\n",
" targeted_word_count: int\n",
"\n",
"\n",
"class StoryTaskOutput(BaseModel): # Should already be implemented in your task\n",
" story: str\n",
"# Step 4\n",
"class FunnyOutputRating(BaseModel):\n",
" rating: int\n",
"\n",
"\n",
"# Step 5\n",
"class CustomArgillaEvaluationLogic(\n",
" ArgillaEvaluationLogic[\n",
" StoryTaskInput, StoryTaskOutput, None\n",
" StoryTaskInput, StoryTaskOutput, None, FunnyOutputRating\n",
" ] # No expected output, therefore \"None\"\n",
"):\n",
" def _to_record(\n",
" # Step 5.1\n",
" def __init__(self):\n",
" super().__init__(\n",
" questions=[\n",
" Question(\n",
" name=\"rating\",\n",
" title=\"Funniness\",\n",
" description=\"How funny do you think is the joke? Rate it from 1-5.\",\n",
" options=range(1, 6),\n",
" )\n",
" ],\n",
" fields=[\n",
" Field(name=\"input\", title=\"Topic\"),\n",
" Field(name=\"output\", title=\"Joke\"),\n",
" ],\n",
" )\n",
"\n",
" # Step 5.2\n",
" def to_record(\n",
" self,\n",
" example: Example[StoryTaskInput, None],\n",
" *output: SuccessfulExampleOutput[StoryTaskOutput],\n",
@@ -128,6 +138,10 @@
" ]\n",
" )\n",
"\n",
" # Step 5.3\n",
" def from_record(self, argilla_evaluation: ArgillaEvaluation) -> FunnyOutputRating:\n",
" return FunnyOutputRating(rating=argilla_evaluation.metadata[\"rating\"])\n",
"\n",
"\n",
"evaluation_logic = CustomArgillaEvaluationLogic()"
]
@@ -145,16 +159,25 @@
"runs_to_evaluate = [\"your_run_id_of_interest\", \"other_run_id_of_interest\"]\n",
"\n",
"evaluator = ArgillaEvaluator(\n",
" ..., evaluation_repository, description=\"My evaluation description\", evaluation_logic=evaluation_logic\n",
" ...,\n",
" evaluation_repository,\n",
" description=\"My evaluation description\",\n",
" evaluation_logic=evaluation_logic,\n",
" argilla_client=client,\n",
" workspace_id=workspace_id,\n",
")\n",
"evaluation_overview = evaluator.evaluate_runs(*runs_to_evaluate)\n",
"print(\"ID to retrieve results later: \", evaluation_overview.id)\n",
"partial_evaluation_overview = evaluator.submit(*runs_to_evaluate)\n",
"print(\"ID to retrieve results later: \", partial_evaluation_overview.id)\n",
"\n",
"# Step 7\n",
"\n",
"####################################\n",
"# Evaluate via the Argilla UI here #\n",
"####################################"
"####################################\n",
"\n",
"# Step 8\n",
"\n",
"evaluation_overview = evaluator.retrieve(partial_evaluation_overview.id)"
]
},
{
@@ -165,83 +188,6 @@
"```python\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to aggregate an Argilla evaluation\n",
"0. Submit tasks to Argilla and perform an evaluation (see [here](#how-to-evaluate-with-human-evaluation-via-argilla)).\n",
"1. Implement an `AggregationLogic` that takes `ArgillaEvaluation`s as input.\n",
"2. Remember the ID of the evaluation and the name of the Argilla workspace that you want to aggregate.\n",
"3. Initialize the `ArgillaEvaluationRepository` and an aggregation repository.\n",
"4. Aggregate the results with an `ArgillaAggregator`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 1\n",
"\n",
"\n",
"class CustomArgillaAggregation(BaseModel):\n",
" avg_funniness: float\n",
"\n",
"\n",
"class CustomArgillaAggregationLogic(\n",
" AggregationLogic[ArgillaEvaluation, CustomArgillaAggregation]\n",
"):\n",
" def aggregate(\n",
" self, evaluations: Iterable[ArgillaEvaluation]\n",
" ) -> CustomArgillaAggregation:\n",
" evaluation_list = list(evaluations)\n",
" total_score = sum(\n",
" evaluation.metadata[\n",
" \"rating\"\n",
" ] # This name is defined by the `Question`s given to the Argilla repository during submission\n",
" for evaluation in evaluation_list\n",
" )\n",
" return CustomArgillaAggregation(\n",
" avg_funniness=total_score / len(evaluation_list)\n",
" )\n",
"\n",
"\n",
"aggregation_logic = CustomArgillaAggregationLogic()\n",
"\n",
"# Step 2 - See the first example for more info\n",
"eval_id = \"my-previous-eval-id\"\n",
"client = DefaultArgillaClient()\n",
"workspace_id = client.ensure_workspace_exists(\"my-workspace-name\")\n",
"\n",
"# Step 3\n",
"evaluation_repository = ArgillaEvaluationRepository(\n",
" InMemoryEvaluationRepository(), client, workspace_id\n",
")\n",
"aggregation_repository = InMemoryAggregationRepository()\n",
"\n",
"# Step 4\n",
"aggregator = ArgillaAggregator(\n",
" evaluation_repository,\n",
" aggregation_repository,\n",
" \"My aggregation description\",\n",
" aggregation_logic,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%script false --no-raise-error\n",
"# we skip this as we do not have a dataset or run in this example\n",
"\n",
"aggregation = aggregator.aggregate_evaluation(eval_id)"
]
}
],
"metadata": {
@@ -260,7 +206,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.8"
}
},
"nbformat": 4,
@@ -12,9 +12,9 @@
"from dotenv import load_dotenv\n",
"from pydantic import BaseModel\n",
"\n",
"from intelligence_layer.evaluation.aggregation.aggregator import AggregationLogic\n",
"from intelligence_layer.evaluation.dataset.domain import Example\n",
"from intelligence_layer.evaluation.evaluation.evaluator import (\n",
"from intelligence_layer.evaluation import (\n",
" AggregationLogic,\n",
" Example,\n",
" SingleOutputEvaluationLogic,\n",
")\n",
"\n",
