Commit

fix: Adapt Changelog to reflect removal of non-incremental Elo methods and classes

TASK: IL-394
MerlinKallenbornAA committed May 17, 2024
1 parent 57837e5 commit df92fa2
Showing 2 changed files with 14 additions and 7 deletions.
CHANGELOG.md (5 changes: 2 additions & 3 deletions)
@@ -6,9 +6,8 @@
 - Changed the behavior of `IncrementalEvaluator::do_evaluate` such that it now sends all `SuccessfulExampleOutput`s to `do_incremental_evaluate` instead of only the new `SuccessfulExampleOutput`s.
 -
 ### New Features
-- Add generic `EloEvaluator` class and `EloEvaluationLogic` for implementation of Elo evaluation use cases.
-- Add `EloQaEvaluator` and `EloQaEvaluationLogic` for Elo evaluation of QA runs.
-- Add `IncrementalEloQaEvaluator` and `IncrementalEloQaEvaluationLogic` for Elo evaluation of QA runs with later addition of more runs to an existing evaluation.
+- Add generic `EloEvaluationLogic` class for implementation of Elo evaluation use cases.
+- Add `EloQaEvaluationLogic` for Elo evaluation of QA runs, with optional later addition of more runs to an existing evaluation.
 - Add `EloAggregationAdapter` class to simplify using the `ComparisonEvaluationAggregationLogic` for different Elo use cases.
 - Add `elo_qa_eval` tutorial notebook describing the use of an (incremental) Elo evaluation use case for QA models.
 ### Fixes
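Aside (not part of the commit): a hedged sketch of how the classes named in these changelog entries might be wired together after the consolidation. The class names `EloQaEvaluationLogic` and `IncrementalEvaluator` come from the entries above and the notebook; the import paths, constructor parameters, repository objects, and the Llama3 model class are assumptions and may not match the library's actual API.

```python
# Hypothetical usage sketch, NOT taken from this repository: it only illustrates
# how the classes listed in the changelog above might fit together. Import paths,
# parameter names, and the referee-model class are assumptions.
from intelligence_layer.core import Llama3InstructModel        # assumed model class
from intelligence_layer.evaluation import (                    # assumed module path
    EloQaEvaluationLogic,
    IncrementalEvaluator,
)

referee_model = Llama3InstructModel()                           # "referee" that grades answer pairs
evaluation_logic = EloQaEvaluationLogic(model=referee_model)    # assumed parameter name

evaluator = IncrementalEvaluator(                               # assumed constructor signature
    dataset_repository,        # repositories assumed to be set up elsewhere
    run_repository,
    evaluation_repository,
    "elo-qa-evaluation",       # free-form description of this evaluation
    evaluation_logic,
)

# Evaluate an initial set of runs; further runs could be added later without
# repeating comparisons that were already made.
evaluation_overview = evaluator.evaluate_runs(*run_ids)         # assumed method name; run_ids from earlier runs
```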
src/documentation/elo_qa_eval.ipynb (16 changes: 12 additions & 4 deletions)
@@ -266,14 +266,22 @@
"Now that we have generated the answers of all models for all examples in the `dataset_repository`, the next step is to evaluate those answers.\n",
"The evaluation is done by an `Evaluator`. In this notebook we choose an `IncrementalEvaluator` which has the capability to later add additional runs or models without repeating old comparisons, which will come in handy later.\n",
"\n",
"Since we are interested in the ELO score, we use an ELO evaluation logic, which, in general, compares two outputs against each other and chooses fitting better option. In order to deem which of the two options is \"better\", we need to provide a use case specific evaluation logic, in our QA case an `EloQaEvaluationLogic`, and a \"referee mo del\" which compares and grades the individual outputs. Here we choose Llama3."
"Since we are interested in the ELO score, we use an `EloEvaluationLogic` with our `Evaluator`. This logic compares two outputs against each other and chooses a winner. In order to deem which of the two options is \"better\" we need to provide a use case specific evaluation logic. In our QA case, this is the `EloQaEvaluationLogic`. We just need to tell the `EloQaEvaluationLogic` which \"referee model\" it should use to perform the comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"IDs of stored evaluations: ['0fddc8bf-7785-4a6b-9541-4407f47a5e1b', '447071ed-12b5-4e51-a7df-f1495bb60475']\n"
]
}
],
"source": [
"# this should demonstrate that there are no stored evaluations yet in our repository\n",
"print(f\"IDs of stored evaluations: {evaluation_repository.evaluation_overview_ids()}\")"
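Aside (not part of the commit): the markdown cell changed above describes pairwise comparisons judged by a referee model, plus the ability to add runs later without repeating old comparisons. The self-contained sketch below illustrates that idea with the standard Elo update formula; it is a generic illustration, not the intelligence-layer implementation, and the model names, start rating, and K-factor are arbitrary.

```python
# Generic illustration of the Elo idea described in the notebook cell above.
# This is NOT the intelligence-layer implementation, just the standard rating
# update driven by pairwise "which answer is better?" decisions.

K = 32  # update step size (a common Elo default)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Apply one pairwise comparison result to the ratings in place."""
    gain = K * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Suppose a referee model decided this pairwise winner for one example.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
update(ratings, winner="model_a", loser="model_b")

# "Incremental" addition of a new run: only pairs involving model_c are new;
# the model_a vs model_b comparison above is not repeated.
ratings["model_c"] = 1500.0
for existing in ("model_a", "model_b"):
    update(ratings, winner="model_c", loser=existing)

print(ratings)
```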
