Commit

fix: Adapt Changelog to reflect removal of non-incremental Elo methods and classes

TASK: IL-394
MerlinKallenbornAA committed May 17, 2024
1 parent 57837e5 commit df92fa2
Showing 2 changed files with 14 additions and 7 deletions.
CHANGELOG.md (5 changes: 2 additions & 3 deletions)
@@ -6,9 +6,8 @@
 - Changed the behavior of `IncrementalEvaluator::do_evaluate` such that it now sends all `SuccessfulExampleOutput`s to `do_incremental_evaluate` instead of only the new `SuccessfulExampleOutput`s.
 -
 ### New Features
-- Add generic `EloEvaluator` class and `EloEvaluationLogic` for implementation of Elo evaluation use cases.
-- Add `EloQaEvaluator` and `EloQaEvaluationLogic` for Elo evaluation of QA runs.
-- Add `IncrementalEloQaEvaluator` and `IncrementalEloQaEvaluationLogic` for Elo evaluation of QA runs with later addition of more runs to an existing evaluation.
+- Add generic `EloEvaluationLogic` class for implementation of Elo evaluation use cases.
+- Add `EloQaEvaluationLogic` for Elo evaluation of QA runs, with optional later addition of more runs to an existing evaluation.
 - Add `EloAggregationAdapter` class to simplify using the `ComparisonEvaluationAggregationLogic` for different Elo use cases.
 - Add `elo_qa_eval` tutorial notebook describing the use of an (incremental) Elo evaluation use case for QA models.
 ### Fixes
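Aside (not part of the commit): a hedged sketch of how the classes named in these changelog entries might be wired together after the consolidation. The class names `EloQaEvaluationLogic` and `IncrementalEvaluator` come from the entries above and the notebook; the import paths, constructor parameters, repository objects, and the Llama3 model class are assumptions and may not match the library's actual API.

```python
# Hypothetical usage sketch, NOT taken from this repository: it only illustrates
# how the classes listed in the changelog above might fit together. Import paths,
# parameter names, and the referee-model class are assumptions.
from intelligence_layer.core import Llama3InstructModel        # assumed model class
from intelligence_layer.evaluation import (                    # assumed module path
    EloQaEvaluationLogic,
    IncrementalEvaluator,
)

referee_model = Llama3InstructModel()                           # "referee" that grades answer pairs
evaluation_logic = EloQaEvaluationLogic(model=referee_model)    # assumed parameter name

evaluator = IncrementalEvaluator(                               # assumed constructor signature
    dataset_repository,        # repositories assumed to be set up elsewhere
    run_repository,
    evaluation_repository,
    "elo-qa-evaluation",       # free-form description of this evaluation
    evaluation_logic,
)

# Evaluate an initial set of runs; further runs could be added later without
# repeating comparisons that were already made.
evaluation_overview = evaluator.evaluate_runs(*run_ids)         # assumed method name; run_ids from earlier runs
```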
src/documentation/elo_qa_eval.ipynb (16 changes: 12 additions & 4 deletions)
@@ -266,14 +266,22 @@
"Now that we have generated the answers of all models for all examples in the `dataset_repository`, the next step is to evaluate those answers.\n",
"The evaluation is done by an `Evaluator`. In this notebook we choose an `IncrementalEvaluator` which has the capability to later add additional runs or models without repeating old comparisons, which will come in handy later.\n",
"\n",
"Since we are interested in the ELO score, we use an ELO evaluation logic, which, in general, compares two outputs against each other and chooses fitting better option. In order to deem which of the two options is \"better\", we need to provide a use case specific evaluation logic, in our QA case an `EloQaEvaluationLogic`, and a \"referee mo del\" which compares and grades the individual outputs. Here we choose Llama3."
"Since we are interested in the ELO score, we use an `EloEvaluationLogic` with our `Evaluator`. This logic compares two outputs against each other and chooses a winner. In order to deem which of the two options is \"better\" we need to provide a use case specific evaluation logic. In our QA case, this is the `EloQaEvaluationLogic`. We just need to tell the `EloQaEvaluationLogic` which \"referee model\" it should use to perform the comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"IDs of stored evaluations: ['0fddc8bf-7785-4a6b-9541-4407f47a5e1b', '447071ed-12b5-4e51-a7df-f1495bb60475']\n"
]
}
],
"source": [
"# this should demonstrate that there are no stored evaluations yet in our repository\n",
"print(f\"IDs of stored evaluations: {evaluation_repository.evaluation_overview_ids()}\")"
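Aside (not part of the commit): the markdown cell changed above describes pairwise comparisons judged by a referee model, plus the ability to add runs later without repeating old comparisons. The self-contained sketch below illustrates that idea with the standard Elo update formula; it is a generic illustration, not the intelligence-layer implementation, and the model names, start rating, and K-factor are arbitrary.

```python
# Generic illustration of the Elo idea described in the notebook cell above.
# This is NOT the intelligence-layer implementation, just the standard rating
# update driven by pairwise "which answer is better?" decisions.

K = 32  # update step size (a common Elo default)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Apply one pairwise comparison result to the ratings in place."""
    gain = K * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Suppose a referee model decided this pairwise winner for one example.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
update(ratings, winner="model_a", loser="model_b")

# "Incremental" addition of a new run: only pairs involving model_c are new;
# the model_a vs model_b comparison above is not repeated.
ratings["model_c"] = 1500.0
for existing in ("model_a", "model_b"):
    update(ratings, winner="model_c", loser=existing)

print(ratings)
```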
