feat: improved EvaluationOverviews and Argilla Integration (#829)
* Create EvaluationLogicBase to fix the type magic for the async case
* Create EvaluatorBase to share common evaluation behavior
* Remove ArgillaEvaluationRepository, as it is no longer needed
* Implement an AsyncEvaluator that serves as the base for the ArgillaEvaluator
* Refactor the Argilla*-dependent classes
* Adjust the tutorials accordingly
Task: IL-298

---------

Co-authored-by: Merlin Kallenborn <[email protected]>
Co-authored-by: Johannes Wesch <[email protected]>
3 people authored May 14, 2024
1 parent e548ed8 commit cb8843c
Showing 32 changed files with 2,096 additions and 1,847 deletions.
26 changes: 24 additions & 2 deletions CHANGELOG.md
@@ -2,11 +2,33 @@

## Unreleased

We did a major revamp of the `ArgillaEvaluator` to separate an `AsyncEvaluator` from the normal evaluation scenario.
This comes with easier-to-understand interfaces, more information in the `EvaluationOverview`, and a simplified aggregation step for Argilla that no longer depends on specific Argilla types.
Check the how-to [here](./src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb) for detailed information.

### Breaking Changes
...

- rename: `AggregatedInstructComparison` to `AggregatedComparison`
- rename: `InstructComparisonArgillaAggregationLogic` to `ComparisonAggregationLogic`
- remove: `ArgillaAggregator` - the regular aggregator now does the job
- remove: `ArgillaEvaluationRepository` - the `ArgillaEvaluator` now uses an `AsyncEvaluationRepository`, which extends the existing `EvaluationRepository` for the human-feedback use case
- `ArgillaEvaluationLogic` now uses `to_record` and `from_record` instead of `do_evaluate`. The signature of `to_record` stays the same. The `Field`s and `Question`s are now defined in the logic instead of being passed to the `ArgillaEvaluationRepository` (see the sketch below)
- `ArgillaEvaluator` now takes the `ArgillaClient` as well as the `workspace_id`. It inherits from the abstract `AsyncEvaluator` and no longer has `evaluate_runs` and `evaluate`. Instead, it has `submit` and `retrieve`.
- `EvaluationOverview` gains the attributes `end_date`, `successful_evaluation_count`, and `failed_evaluation_count`
- rename: `start` is now called `start_date` and is no longer optional
- we refactored the internals of `Evaluator`. This is only relevant if you subclass from it. Most of the typing and data handling has moved to `EvaluatorBase`
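
For migration, a minimal sketch of an evaluation logic written against the new interface. It mirrors the example from the how-to notebook; the import paths for the Argilla connector classes and the body of `to_record` are assumptions, and your own input/output models will differ.

```python
from pydantic import BaseModel

from intelligence_layer.connectors import (  # assumed import path for the Argilla connector classes
    ArgillaEvaluation,
    Field,
    Question,
    RecordData,
)
from intelligence_layer.evaluation import (
    ArgillaEvaluationLogic,
    Example,
    RecordDataSequence,
    SuccessfulExampleOutput,
)


class StoryTaskInput(BaseModel):  # should already be implemented in your task
    topic: str
    targeted_word_count: int


class StoryTaskOutput(BaseModel):  # should already be implemented in your task
    story: str


class FunnyOutputRating(BaseModel):  # the new evaluation result type
    rating: int


class CustomArgillaEvaluationLogic(
    ArgillaEvaluationLogic[StoryTaskInput, StoryTaskOutput, None, FunnyOutputRating]
):
    def __init__(self) -> None:
        # Questions and Fields are now defined in the logic instead of the repository.
        super().__init__(
            questions=[
                Question(
                    name="rating",
                    title="Funniness",
                    description="How funny do you think the joke is? Rate it from 1-5.",
                    options=range(1, 6),
                )
            ],
            fields=[
                Field(name="input", title="Topic"),
                Field(name="output", title="Joke"),
            ],
        )

    def to_record(
        self,
        example: Example[StoryTaskInput, None],
        *output: SuccessfulExampleOutput[StoryTaskOutput],
    ) -> RecordDataSequence:
        # Replaces do_evaluate: turn an example and its run outputs into Argilla records.
        # The content keys must match the Field names above (illustrative mapping).
        return RecordDataSequence(
            records=[
                RecordData(
                    content={"input": example.input.topic, "output": output[0].output.story},
                    example_id=example.id,
                )
            ]
        )

    def from_record(self, argilla_evaluation: ArgillaEvaluation) -> FunnyOutputRating:
        # New: turn a labelled Argilla record back into an evaluation result.
        return FunnyOutputRating(rating=argilla_evaluation.metadata["rating"])
```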


### New Features
...
- Add `ComparisonEvaluation` for the Elo evaluation to abstract from the Argilla record
- Add `AsyncEvaluator` for human-feedback evaluation. `ArgillaEvaluator` inherits from it (see the flow sketch after this list)
  - `.submit` pushes all evaluations to Argilla for labelling
    - Add `PartialEvaluationOverview` to store the submission details.
  - `.retrieve` then collects all labelled records from Argilla and stores them in an `AsyncEvaluationRepository`.
- Add `AsyncEvaluationRepository` to store and retrieve `PartialEvaluationOverview`s. Also add `AsyncFileEvaluationRepository` and `AsyncInMemoryEvaluationRepository`
- Add `EvaluatorBase` and `EvaluationLogicBase` as base classes for both asynchronous and synchronous evaluation.
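
As an illustration, a sketch of the new submit/retrieve flow. The in-memory dataset and run repositories, the run ID, and the positional argument order of the `ArgillaEvaluator` constructor are assumptions based on the how-to notebook; `CustomArgillaEvaluationLogic` refers to the logic sketched under the breaking changes above.

```python
from intelligence_layer.connectors import DefaultArgillaClient  # assumed import path
from intelligence_layer.evaluation import (
    ArgillaEvaluator,
    AsyncInMemoryEvaluationRepository,
    InMemoryDatasetRepository,  # placeholders for your existing repositories
    InMemoryRunRepository,
)

# Reads ARGILLA_API_URL and ARGILLA_API_KEY from the environment by default.
client = DefaultArgillaClient()
workspace_id = client.ensure_workspace_exists("my-workspace-name")

dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
# Use AsyncFileEvaluationRepository instead for persistent results.
evaluation_repository = AsyncInMemoryEvaluationRepository()

evaluator = ArgillaEvaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    description="My evaluation description",
    evaluation_logic=CustomArgillaEvaluationLogic(),
    argilla_client=client,
    workspace_id=workspace_id,
)

# submit pushes the records to Argilla and returns a PartialEvaluationOverview;
# keep its id to retrieve the results later.
partial_evaluation_overview = evaluator.submit("your_run_id_of_interest")

# ... label the records in the Argilla UI ...

# retrieve collects the labelled records and yields a full EvaluationOverview.
evaluation_overview = evaluator.retrieve(partial_evaluation_overview.id)
```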



### Fixes
- Improve the description of using Artifactory tokens for the installation of IL
8 changes: 3 additions & 5 deletions src/documentation/how_tos/how_to_evaluate_runs.ipynb
@@ -8,10 +8,7 @@
"source": [
"from example_data import DummyEvaluationLogic, example_data\n",
"\n",
"from intelligence_layer.evaluation.evaluation.evaluator import Evaluator\n",
"from intelligence_layer.evaluation.evaluation.in_memory_evaluation_repository import (\n",
" InMemoryEvaluationRepository,\n",
")"
"from intelligence_layer.evaluation import Evaluator, InMemoryEvaluationRepository"
]
},
{
@@ -40,6 +37,7 @@
"outputs": [],
"source": [
"# Step 0\n",
"\n",
"my_example_data = example_data()\n",
"print()\n",
"run_ids = [my_example_data.run_overview_1.id, my_example_data.run_overview_2.id]\n",
@@ -82,7 +80,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.8"
}
},
"nbformat": 4,
194 changes: 70 additions & 124 deletions src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb
@@ -6,8 +6,6 @@
"metadata": {},
"outputs": [],
"source": [
"from typing import Iterable\n",
"\n",
"from dotenv import load_dotenv\n",
"from pydantic import BaseModel\n",
"\n",
@@ -19,13 +17,9 @@
" RecordData,\n",
")\n",
"from intelligence_layer.evaluation import (\n",
" AggregationLogic,\n",
" ArgillaAggregator,\n",
" ArgillaEvaluationLogic,\n",
" ArgillaEvaluationRepository,\n",
" AsyncInMemoryEvaluationRepository,\n",
" Example,\n",
" InMemoryAggregationRepository,\n",
" InMemoryEvaluationRepository,\n",
" RecordDataSequence,\n",
" SuccessfulExampleOutput,\n",
")\n",
@@ -40,13 +34,17 @@
"# How to evaluate with human evaluation via Argilla\n",
"1. Initialize an Argilla client with the correct settings for your setup\n",
" - By default, the url and api key are read from the environment variables `ARGILLA_API_URL` and `ARGILLA_API_KEY`\n",
"2. Create `Question`s and `Field`s to structure the data that will be displayed in Argilla\n",
"3. Choose an Argilla workspace and get its ID\n",
"4. Create an `ArgillaEvaluationRepository`\n",
"2. Choose an Argilla workspace and get its ID\n",
"3. Create an `AsyncEvaluationRepository`\n",
"4. Define new output type for the evaluation\n",
"5. Implement an `ArgillaEvaluationLogic`\n",
" 1. Create `Question`s and `Field`s to structure the data that will be displayed in Argilla\n",
" 2. Implement `to_record` to convert the task input into an Argilla record\n",
" 3. Implement `from_record` to convert the record back to an evaluation result\n",
"6. Submit tasks to the Argilla instance by running the `ArgillaEvaluator`\n",
" - Make sure to save the `EvaluationOverview.id`, as it is needed to retrieve the results later\n",
"7. **Use the Argilla web platform to evaluate** "
"7. **Use the Argilla web platform to evaluate** \n",
"8. Collect all labelled evaluations from Argilla\n",
" - Make sure to save the `EvaluationOverview.id`, as it is needed to retrieve the results later"
]
},
{
@@ -62,54 +60,66 @@
"metadata": {},
"outputs": [],
"source": [
"# Step 0\n",
"\n",
"\n",
"class StoryTaskInput(BaseModel): # Should already be implemented in your task\n",
" topic: str\n",
" targeted_word_count: int\n",
"\n",
"\n",
"class StoryTaskOutput(BaseModel): # Should already be implemented in your task\n",
" story: str\n",
"\n",
"\n",
"# Step 1\n",
"\n",
"\n",
"client = DefaultArgillaClient(\n",
" # api_url=\"your url here\", # not necessary if ARGILLA_API_URL is set in environment\n",
" # api_key=\"your api key here\", # not necessary if ARGILLA_API_KEY is set in environment\n",
")\n",
"\n",
"# Step 2\n",
"questions = [\n",
" Question(\n",
" name=\"rating\",\n",
" title=\"Funniness\",\n",
" description=\"How funny do you think is the joke? Rate it from 1-5.\",\n",
" options=range(1, 6),\n",
" )\n",
"]\n",
"fields = [\n",
" Field(name=\"input\", title=\"Topic\"),\n",
" Field(name=\"output\", title=\"Joke\"),\n",
"]\n",
"\n",
"# Step 3\n",
"# Step 2\n",
"workspace_id = client.ensure_workspace_exists(\"my-workspace-name\")\n",
"\n",
"# Step 4\n",
"data_storage = (\n",
" InMemoryEvaluationRepository()\n",
"# Step 3\n",
"evaluation_repository = (\n",
" AsyncInMemoryEvaluationRepository()\n",
") # Use FileEvaluationRepository for persistent results\n",
"evaluation_repository = ArgillaEvaluationRepository(\n",
" data_storage, client, workspace_id, fields, questions\n",
")\n",
"\n",
"\n",
"# Step 5\n",
"class StoryTaskInput(BaseModel): # Should already be implemented in your task\n",
" topic: str\n",
" targeted_word_count: int\n",
"\n",
"\n",
"class StoryTaskOutput(BaseModel): # Should already be implemented in your task\n",
" story: str\n",
"# Step 4\n",
"class FunnyOutputRating(BaseModel):\n",
" rating: int\n",
"\n",
"\n",
"# Step 5\n",
"class CustomArgillaEvaluationLogic(\n",
" ArgillaEvaluationLogic[\n",
" StoryTaskInput, StoryTaskOutput, None\n",
" StoryTaskInput, StoryTaskOutput, None, FunnyOutputRating\n",
" ] # No expected output, therefore \"None\"\n",
"):\n",
" def _to_record(\n",
" # Step 5.1\n",
" def __init__(self):\n",
" super().__init__(\n",
" questions=[\n",
" Question(\n",
" name=\"rating\",\n",
" title=\"Funniness\",\n",
" description=\"How funny do you think is the joke? Rate it from 1-5.\",\n",
" options=range(1, 6),\n",
" )\n",
" ],\n",
" fields=[\n",
" Field(name=\"input\", title=\"Topic\"),\n",
" Field(name=\"output\", title=\"Joke\"),\n",
" ],\n",
" )\n",
"\n",
" # Step 5.2\n",
" def to_record(\n",
" self,\n",
" example: Example[StoryTaskInput, None],\n",
" *output: SuccessfulExampleOutput[StoryTaskOutput],\n",
@@ -128,6 +138,10 @@
" ]\n",
" )\n",
"\n",
" # Step 5.3\n",
" def from_record(self, argilla_evaluation: ArgillaEvaluation) -> FunnyOutputRating:\n",
" return FunnyOutputRating(rating=argilla_evaluation.metadata[\"rating\"])\n",
"\n",
"\n",
"evaluation_logic = CustomArgillaEvaluationLogic()"
]
@@ -145,16 +159,25 @@
"runs_to_evaluate = [\"your_run_id_of_interest\", \"other_run_id_of_interest\"]\n",
"\n",
"evaluator = ArgillaEvaluator(\n",
" ..., evaluation_repository, description=\"My evaluation description\", evaluation_logic=evaluation_logic\n",
" ...,\n",
" evaluation_repository,\n",
" description=\"My evaluation description\",\n",
" evaluation_logic=evaluation_logic,\n",
" argilla_client=client,\n",
" workspace_id=workspace_id,\n",
")\n",
"evaluation_overview = evaluator.evaluate_runs(*runs_to_evaluate)\n",
"print(\"ID to retrieve results later: \", evaluation_overview.id)\n",
"partial_evaluation_overview = evaluator.submit(*runs_to_evaluate)\n",
"print(\"ID to retrieve results later: \", partial_evaluation_overview.id)\n",
"\n",
"# Step 7\n",
"\n",
"####################################\n",
"# Evaluate via the Argilla UI here #\n",
"####################################"
"####################################\n",
"\n",
"# Step 8\n",
"\n",
"evaluation_overview = evaluator.retrieve(partial_evaluation_overview.id)"
]
},
{
@@ -165,83 +188,6 @@
"```python\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to aggregate an Argilla evaluation\n",
"0. Submit tasks to Argilla and perform an evaluation (see [here](#how-to-evaluate-with-human-evaluation-via-argilla)).\n",
"1. Implement an `AggregationLogic` that takes `ArgillaEvaluation`s as input.\n",
"2. Remember the ID of the evaluation and the name of the Argilla workspace that you want to aggregate.\n",
"3. Initialize the `ArgillaEvaluationRepository` and an aggregation repository.\n",
"4. Aggregate the results with an `ArgillaAggregator`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 1\n",
"\n",
"\n",
"class CustomArgillaAggregation(BaseModel):\n",
" avg_funniness: float\n",
"\n",
"\n",
"class CustomArgillaAggregationLogic(\n",
" AggregationLogic[ArgillaEvaluation, CustomArgillaAggregation]\n",
"):\n",
" def aggregate(\n",
" self, evaluations: Iterable[ArgillaEvaluation]\n",
" ) -> CustomArgillaAggregation:\n",
" evaluation_list = list(evaluations)\n",
" total_score = sum(\n",
" evaluation.metadata[\n",
" \"rating\"\n",
" ] # This name is defined by the `Question`s given to the Argilla repository during submission\n",
" for evaluation in evaluation_list\n",
" )\n",
" return CustomArgillaAggregation(\n",
" avg_funniness=total_score / len(evaluation_list)\n",
" )\n",
"\n",
"\n",
"aggregation_logic = CustomArgillaAggregationLogic()\n",
"\n",
"# Step 2 - See the first example for more info\n",
"eval_id = \"my-previous-eval-id\"\n",
"client = DefaultArgillaClient()\n",
"workspace_id = client.ensure_workspace_exists(\"my-workspace-name\")\n",
"\n",
"# Step 3\n",
"evaluation_repository = ArgillaEvaluationRepository(\n",
" InMemoryEvaluationRepository(), client, workspace_id\n",
")\n",
"aggregation_repository = InMemoryAggregationRepository()\n",
"\n",
"# Step 4\n",
"aggregator = ArgillaAggregator(\n",
" evaluation_repository,\n",
" aggregation_repository,\n",
" \"My aggregation description\",\n",
" aggregation_logic,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%script false --no-raise-error\n",
"# we skip this as we do not have a dataset or run in this example\n",
"\n",
"aggregation = aggregator.aggregate_evaluation(eval_id)"
]
}
],
"metadata": {
@@ -260,7 +206,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.8"
}
},
"nbformat": 4,
@@ -12,9 +12,9 @@
"from dotenv import load_dotenv\n",
"from pydantic import BaseModel\n",
"\n",
"from intelligence_layer.evaluation.aggregation.aggregator import AggregationLogic\n",
"from intelligence_layer.evaluation.dataset.domain import Example\n",
"from intelligence_layer.evaluation.evaluation.evaluator import (\n",
"from intelligence_layer.evaluation import (\n",
" AggregationLogic,\n",
" Example,\n",
" SingleOutputEvaluationLogic,\n",
")\n",
"\n",
