diff --git a/CHANGELOG.md b/CHANGELOG.md index 09086610a..aedddb08f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,13 +3,17 @@ ## Unreleased ### Breaking Changes -... + - Changed the behavior of `IncrementalEvaluator::do_evaluate` such that it now promotes all output to `do_incremental_evaluate` instead of only the new outputs. ### New Features -... + - Add generic `EloEvaluator` class and `EloEvaluationLogic` for implementation of Elo evaluation use cases. + - Add `EloQaEvaluator` and `EloQaEvaluationLogic` for Elo evaluation of QA runs. + - Add `IncrementalEloQaEvaluator` and `IncrementalEloQaEvaluationLogic` for Elo evaluation of QA runs with later addition of more runs to an existing evaluation. + - Add `EloAggregationAdapter` class to simplify using the `ComparisonEvaluationAggregationLogic` for different Elo use cases. + - Add `elo_qa_eval` tutorial notebook describing the use of an (incremental) Elo evaluation use case for QA models. ### Fixes ... ### Deprecations -... +... ## 1.2.0 diff --git a/README.md b/README.md index 3c2577557..a7dbe7f87 100644 --- a/README.md +++ b/README.md @@ -147,18 +147,19 @@ If you prefer you can also read about the [concepts](Concepts.md) first. ## Tutorials The tutorials aim to guide you through implementing several common use-cases with the Intelligence Layer SDK. They introduce you to key concepts and enable you to create your own use-cases. In general the tutorials are build in a way that you can simply hop into the topic you are most interested in. However, for starters we recommend to read through the `Summarization` tutorial first. It explains the core concepts of the intelligence layer in more depth while for the other tutorials we assume that these concepts are known. -| Order | Topic | Description | Notebook 📓 | -| ----- | ------------------ |------------------------------------------------------|-----------------------------------------------------------------| -| 1 | Summarization | Summarize a document | [summarization.ipynb](./src/documentation/summarization.ipynb) | -| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/documentation/qa.ipynb) | -| 3 | Classification | Learn about two methods of classification | [classification.ipynb](./src/documentation/classification.ipynb) | -| 4 | Evaluation | Evaluate LLM-based methodologies | [evaluation.ipynb](./src/documentation/evaluation.ipynb) | -| 5 | Quickstart Task | Build a custom `Task` for your use case | [quickstart_task.ipynb](./src/documentation/quickstart_task.ipynb) | -| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/documentation/document_index.ipynb) | -| 7 | Human Evaluation | Connect to Argilla for manual evaluation | [human_evaluation.ipynb](./src/documentation/human_evaluation.ipynb) | -| 8 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/documentation/performance_tips.ipynb) | -| 9 | Deployment | Shows how to deploy a Task in a minimal FastAPI app.
| [fastapi_tutorial.ipynb](./src/documentation/fastapi_tutorial.ipynb) | -| 10 | Issue Classification | Deploy a Task in Kubernetes to classify Jira issues | [Found in adjacent repository](https://github.com/Aleph-Alpha/IL-Classification-Journey) | +| Order | Topic | Description | Notebook 📓 | +|-------|----------------------|------------------------------------------------------|------------------------------------------------------------------------------------------| +| 1 | Summarization | Summarize a document | [summarization.ipynb](./src/documentation/summarization.ipynb) | +| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/documentation/qa.ipynb) | +| 3 | Classification | Learn about two methods of classification | [classification.ipynb](./src/documentation/classification.ipynb) | +| 4 | Evaluation | Evaluate LLM-based methodologies | [evaluation.ipynb](./src/documentation/evaluation.ipynb) | +| 5 | Elo QA Evaluation | Evaluate QA tasks in an Elo ranking | [elo_qa_eval.ipynb](./src/documentation/elo_qa_eval.ipynb) | +| 6 | Quickstart Task | Build a custom `Task` for your use case | [quickstart_task.ipynb](./src/documentation/quickstart_task.ipynb) | +| 7 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/documentation/document_index.ipynb) | +| 8 | Human Evaluation | Connect to Argilla for manual evaluation | [human_evaluation.ipynb](./src/documentation/human_evaluation.ipynb) | +| 9 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/documentation/performance_tips.ipynb) | +| 10 | Deployment | Shows how to deploy a Task in a minimal FastAPI app. | [fastapi_tutorial.ipynb](./src/documentation/fastapi_tutorial.ipynb) | +| 11 | Issue Classification | Deploy a Task in Kubernetes to classify Jira issues | [Found in adjacent repository](https://github.com/Aleph-Alpha/IL-Classification-Journey) | ## How-Tos The how-tos are quick lookups about how to do things. Compared to the tutorials, they are shorter and do not explain the concepts they are using in-depth. diff --git a/src/documentation/elo_qa_eval.ipynb b/src/documentation/elo_qa_eval.ipynb new file mode 100644 index 000000000..acefa67cf --- /dev/null +++ b/src/documentation/elo_qa_eval.ipynb @@ -0,0 +1,622 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# User Story for Calculating ELO Scores of QA Configurations for Ranking \n", + "\n", + "As a user of the Intelligence Layer (IL), I want to evaluate how well different configurations perform on a QA task with the given input data.\n", + "A configuration is a combination of a model with a fixed set of parameters.\n", + "In the following, we focus on comparing setups which differ only in the chosen model.\n", + "\n", + "We provide multiple inputs consisting of a longer texts and a questions related to each of those texts, as well as the expected answers.\n", + "A Llama-model is used as a grader to decide which answer of two different models is better.\n", + "The aggregation of all comparisons results in [ELO](https://en.wikipedia.org/wiki/Elo_rating_system) scores and win rates of the models.\n", + "\n", + "In this notebook, we go through the following steps: First, we create a set of examples of texts with a relevant question for each (Step 0), after which we use the models to generate answers (Step 1). 
The given answers are then compared against each other and judged by the Llama model (Step 2), which will result in a final ELO ranking and win rate (Step 3). Lastly, we include a new model in the evaluation without having to re-evaluate the previous models against each other, as is typically done in ELO rankings (Step 4).\n", + "\n", + "## Evaluating QA use-cases\n", + "\n", + "Before we can begin, we need to load the Aleph-Alpha access token from the environment and create the client." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from os import getenv\n", + "\n", + "from aleph_alpha_client import Client\n", + "from dotenv import load_dotenv\n", + "\n", + "from intelligence_layer.connectors import LimitedConcurrencyClient\n", + "from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import Matches\n", + "\n", + "load_dotenv()\n", + "\n", + "aa_client = Client(getenv(\"AA_TOKEN\"))\n", + "limited_concurrency_client = LimitedConcurrencyClient(aa_client)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 0 – Data set\n", + "\n", + "During the four steps of determining the ELO scores, we make use of the following four repositories for managing the intermediate data.\n", + "\n", + "First, we create and store an input dataset into a so-called `dataset_repository`.\n", + "\n", + "The IL will read the input dataset and produce outputs for each model, which will be stored in a `run_repository`.\n", + "\n", + "The result from the previous step can now be evaluated, in this case with an ELO evaluator (`EloQaEvaluator`). The evaluation is stored in the `eval_repository`.\n", + "\n", + "Finally, the evaluations are aggregated and stored in the `aggregation_repository`. The aggregation contains the ELO score and win rate of each model along with additional metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.evaluation import (\n", + " InMemoryAggregationRepository,\n", + " InMemoryDatasetRepository,\n", + " InMemoryEvaluationRepository,\n", + " InMemoryRunRepository,\n", + ")\n", + "\n", + "dataset_repository = InMemoryDatasetRepository()\n", + "run_repository = InMemoryRunRepository()\n", + "evaluation_repository = InMemoryEvaluationRepository()\n", + "aggregation_repository = InMemoryAggregationRepository()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, we fill the `dataset_repository` with two `Example`s. Each `Example` contains a text, a question regarding said text, as well as an expected answer.\n", + "The newly created dataset in the repository has a unique id, which is stored in the `dataset_id` variable."
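If you want to extend the dataset with your own data, an additional `Example` can be built in exactly the same way; here is a minimal sketch (the text, question, and expected answer are purely illustrative and not part of the dataset used in this notebook):

```python
from intelligence_layer.core import Language
from intelligence_layer.evaluation import Example
from intelligence_layer.examples.qa.single_chunk_qa import SingleChunkQaInput

# Illustrative only: any text/question/expected-answer triple follows the same pattern.
extra_example = Example(
    input=SingleChunkQaInput(
        chunk="Gallium arsenide is a compound semiconductor used in high-frequency electronics and optoelectronics.",
        question="What is gallium arsenide used for?",
        language=Language("en"),
    ),
    expected_output="Gallium arsenide is used in high-frequency electronics and optoelectronics.",
)
# Appending `extra_example` to the `examples` list built in the next cells would include it in the evaluation.
```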
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.core import Language\n", + "from intelligence_layer.evaluation import Example\n", + "from intelligence_layer.examples.qa.single_chunk_qa import SingleChunkQaInput\n", + "\n", + "qa_input_text_1 = \"\"\"Surface micromachining\n", + "\n", + "Surface micromachining builds microstructures by deposition and etching structural layers over a substrate.[1] This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.\n", + "\n", + "Layers\n", + "\n", + "Generally, polysilicon is used as one of the substrate layers while silicon dioxide is used as a sacrificial layer. The sacrificial layer is removed or etched out to create any necessary void in the thickness direction. Added layers tend to vary in size from 2-5 micrometres. The main advantage of this machining process is the ability to build electronic and mechanical components (functions) on the same substrate. Surface micro-machined components are smaller compared to their bulk micro-machined counterparts.\n", + "\n", + "As the structures are built on top of the substrate and not inside it, the substrate's properties are not as important as in bulk micro-machining. Expensive silicon wafers can be replaced by cheaper substrates, such as glass or plastic. The size of the substrates may be larger than a silicon wafer, and surface micro-machining is used to produce thin-film transistors on large area glass substrates for flat panel displays. This technology can also be used for the manufacture of thin film solar cells, which can be deposited on glass, polyethylene terepthalate substrates or other non-rigid materials.\n", + "\n", + "Fabrication process\n", + "\n", + "Micro-machining starts with a silicon wafer or other substrate upon which new layers are grown. These layers are selectively etched by photo-lithography; either a wet etch involving an acid, or a dry etch involving an ionized gas (or plasma). Dry etching can combine chemical etching with physical etching or ion bombardment. Surface micro-machining involves as many layers as are needed with a different mask (producing a different pattern) on each layer. Modern integrated circuit fabrication uses this technique and can use as many as 100 layers. Micro-machining is a younger technology and usually uses no more than 5 or 6 layers. Surface micro-machining uses developed technology (although sometimes not enough for demanding applications) which is easily repeatable for volume production.\"\"\"\n", + "\n", + "example_1 = Example(\n", + " input=SingleChunkQaInput(\n", + " chunk=qa_input_text_1,\n", + " question=\"What is micromachining?\",\n", + " language=Language(\"en\"),\n", + " ),\n", + " expected_output=\"Surface micromachining builds microstructures by deposition and etching structural layers over a substrate. This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.\",\n", + ")\n", + "\n", + "qa_input_text_2 = \"\"\"\n", + "Silicon is a chemical element; it has symbol Si and atomic number 14. It is a hard, brittle crystalline solid with a blue-grey metallic luster, and is a non metal and semiconductor. It is a member of group 14 in the periodic table: carbon is above it; and germanium, tin, lead, and flerovium are below it. 
It is relatively unreactive.\n", + "\n", + "Because of its high chemical affinity for oxygen, it was not until 1823 that Jöns Jakob Berzelius was first able to prepare it and characterize it in pure form. Its oxides form a family of anions known as silicates. Its melting and boiling points of 1414 °C and 3265 °C, respectively, are the second highest among all the metalloids and nonmetals, being surpassed only by boron.[a]\n", + "\n", + "Silicon is the eighth most common element in the universe by mass, but very rarely occurs as the pure element in the Earth's crust. It is widely distributed in space in cosmic dusts, planetoids, and planets as various forms of silicon dioxide (silica) or silicates. More than 90% of the Earth's crust is composed of silicate minerals, making silicon the second most abundant element in the Earth's crust (about 28% by mass), after oxygen. \n", + "\"\"\"\n", + "\n", + "example_2 = Example(\n", + " input=SingleChunkQaInput(\n", + " chunk=qa_input_text_2, question=\"What is silicon?\", language=Language(\"en\")\n", + " ),\n", + " expected_output=\"Silicon is a chemical element.\",\n", + ")\n", + "\n", + "examples = [example_1, example_2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_id = dataset_repository.create_dataset(\n", + " examples=examples, dataset_name=\"My-test-dataset\"\n", + ").id" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that we got a valid dataset ID\n", + "assert len(dataset_id) > 0, f\"The dataset with ID {dataset_id} is empty\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we stored the examples into the `dataset_repository`, we can retrieve them by the `dataset_id`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for example in dataset_repository.examples(dataset_id, SingleChunkQaInput, str):\n", + " print(example)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 1 - Run Models\n", + "\n", + "Given a `dataset_repository` with examples, we can now generate the output of the models for all examples.\n", + "First, we have to define which models we want to use. In this example, we use the _\"luminous-base-control\"_ model and the _\"luminous-supreme-control\"_ model.\n", + " \n", + "The previously created client is used to create a `Task` for each model. We use a `SingleChunkQa` task, meaning that in each task a model will give an answer to a question regarding a single chunk of text.\n", + "These tasks are executed by a `Runner`, using the input dataset via the previously stored `dataset_id`.\n", + "\n", + "Tasks require a `run_repository` to store the output. The generated output is automatically stored when calling `run_dataset` on the `runners`. The output for each model will have a unique _\"run id\"_.\n", + "In general, the output for each model consists of two parts. One part is a collection of example outputs. Each example outputs contains the `run_id`, `example_id`, and a field `output`. In this specific use-case, the `output` field contains the `answer` to the question. The other part is a _\"run overview\"_ with the run id stored as `id`, the `dataset_id`, and a description of the task, plus other metadata. 
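Once the runs below have finished, this structure can be inspected directly. The following sketch lists the stored answer of every example for each run, using the same repository accessors as the cells that follow; `output.answer` may be `None` when a model finds no answer in the chunk.

```python
from intelligence_layer.examples.qa.single_chunk_qa import SingleChunkQaOutput

# Sketch: list every stored answer per run (requires the run cells below to have been executed).
for run_overview in run_repository.run_overviews():
    print(f"Run {run_overview.id} ({run_overview.description}):")
    for example_output in run_repository.example_outputs(run_overview.id, SingleChunkQaOutput):
        # `output.answer` is None if the model did not find an answer in the chunk.
        print(f"  {example_output.example_id}: {example_output.output.answer}")
```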
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.core import LuminousControlModel\n", + "from intelligence_layer.evaluation.run.runner import Runner\n", + "from intelligence_layer.examples.qa.single_chunk_qa import (\n", + " SingleChunkQa,\n", + " SingleChunkQaOutput,\n", + ")\n", + "\n", + "models = [\n", + " LuminousControlModel(name=\"luminous-base-control-20240215\", client=aa_client),\n", + " LuminousControlModel(name=\"luminous-supreme-control-20240215\", client=aa_client),\n", + "]\n", + "\n", + "for model in models:\n", + " runner = Runner[SingleChunkQaInput, SingleChunkQaOutput](\n", + " task=SingleChunkQa(model=model),\n", + " dataset_repository=dataset_repository,\n", + " run_repository=run_repository,\n", + " description=f\"QA with model {model.name}\",\n", + " )\n", + " runner.run_dataset(dataset_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that all examples succeeded\n", + "for run_overview in run_repository.run_overviews():\n", + " assert (\n", + " run_overview.failed_example_count == 0\n", + " ), f\"There are failed runs for run overview ID {run_overview.id}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The overviews and outputs can be retrieved via the unique run ids:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\n", + " f\"Run overview IDs saved in the run repository: {run_repository.run_overview_ids()}\\n\"\n", + ")\n", + "\n", + "for run_overview in run_repository.run_overviews():\n", + " print(run_overview)\n", + " for example_output in run_repository.example_outputs(\n", + " run_overview.id, SingleChunkQaOutput\n", + " ):\n", + " print(example_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 2 – Run Evaluation\n", + "\n", + "Now that we have generated the answers of all models for all examples in the `dataset_repository`, the next step is to evaluate those answers.\n", + "The evaluation is done by an `Evaluator`. Here we are interested in the ELO score, which can be calculated using the `IncrementalEloQaEvaluator`. For each example, the `IncrementalEloQaEvaluator` takes the two answers of two different models and uses Llama to decide which answer is better. It further has the capability to later add additional runs or models without repeating old comparisons, which will come in handy later. You can also implement your own `Evaluator` to exactly match your use case." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# this should demonstrate that there are no stored evaluations yet in our repository\n", + "print(f\"IDs of stored evaluations: {evaluation_repository.evaluation_overview_ids()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.core.model import Llama3InstructModel\n", + "from intelligence_layer.evaluation import IncrementalEvaluator\n", + "from intelligence_layer.examples import IncrementalEloQaEvaluationLogic\n", + "\n", + "elo_qa_evaluation_logic = IncrementalEloQaEvaluationLogic(\n", + " model=Llama3InstructModel(name=\"llama-3-8b-instruct\")\n", + ")\n", + "\n", + "evaluator = IncrementalEvaluator(\n", + " dataset_repository=dataset_repository,\n", + " run_repository=run_repository,\n", + " evaluation_repository=evaluation_repository,\n", + " description=\"ELO QA evaluation\", # this description will be used later to query for specific evaluations\n", + " incremental_evaluation_logic=elo_qa_evaluation_logic,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evaluation_overview = evaluator.evaluate_runs(*run_repository.run_overview_ids())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that for each example there are evaluated comparisons\n", + "for example_evaluation in evaluation_repository.example_evaluations(\n", + " evaluation_overview.id, Matches\n", + "):\n", + " assert isinstance(example_evaluation.result, Matches)\n", + " assert (\n", + " len(example_evaluation.result.comparison_evaluations) > 0\n", + " ), f\"There are no matches (comparisons) for example ID {example_evaluation.example_id}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The evaluation results can be retrieved via their unique ids:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for evaluation_overview in evaluation_repository.evaluation_overviews():\n", + " print(evaluation_overview)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 3 – Run Aggregation\n", + "\n", + "Finally, all individual evaluations are aggregated into metrics for each model - here, an ELO score and a win rate.\n", + "The `MatchesAggregationLogic` defines how the evaluations should be aggregated for the ELO use case and can be customized." 
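To illustrate what such a customization could look like, here is a sketch of an alternative aggregation logic that only counts outright wins per run instead of computing ELO scores. The names `WinCounts` and `WinCountAggregationLogic` are made up for this example; the `AggregationLogic` import path is the one used elsewhere in this changeset.

```python
from collections import Counter
from typing import Iterable, Mapping

from pydantic import BaseModel

from intelligence_layer.evaluation import Matches, MatchOutcome
from intelligence_layer.evaluation.aggregation.aggregator import AggregationLogic


class WinCounts(BaseModel):
    wins: Mapping[str, int]


class WinCountAggregationLogic(AggregationLogic[Matches, WinCounts]):
    """Counts how often each run (player) won an outright comparison."""

    def aggregate(self, evaluations: Iterable[Matches]) -> WinCounts:
        wins: Counter[str] = Counter()
        for matches in evaluations:
            for comparison in matches.comparison_evaluations:
                if comparison.outcome == MatchOutcome.A_WINS:
                    wins[comparison.first_player] += 1
                elif comparison.outcome == MatchOutcome.B_WINS:
                    wins[comparison.second_player] += 1
        return WinCounts(wins=dict(wins))
```

Passing an instance of it as `aggregation_logic` to the `Aggregator` below would then yield `WinCounts` as the `statistics` of the aggregation overview.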
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# this should demonstrate that there are no stored aggregated evaluations yet in our repository\n", + "print(\n", + " f\"IDs of stored aggregated evaluations: {aggregation_repository.aggregation_overview_ids()}\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.evaluation import Aggregator\n", + "from intelligence_layer.evaluation.aggregation.elo_aggregation import (\n", + " MatchesAggregationLogic,\n", + ")\n", + "\n", + "aggregator = Aggregator(\n", + " evaluation_repository=evaluation_repository,\n", + " aggregation_repository=aggregation_repository,\n", + " description=\"ELO QA aggregation\",\n", + " aggregation_logic=MatchesAggregationLogic(),\n", + ")\n", + "\n", + "aggregated_evaluation = aggregator.aggregate_evaluation(evaluation_overview.id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that there are no failed (aggregated) evaluations\n", + "assert (\n", + " aggregated_evaluation.crashed_during_evaluation_count == 0\n", + "), f\"There are crashed evaluations for aggregated evaluation ID {aggregated_evaluation.id}\"\n", + "assert (\n", + " aggregated_evaluation.failed_evaluation_count == 0\n", + "), f\"There are failed evaluations for aggregated evaluation ID {aggregated_evaluation.id}\"\n", + "# ensure that the result contains ELO scores\n", + "assert hasattr(\n", + " aggregated_evaluation.statistics, \"scores\"\n", + "), f\"There are no scores for aggregated evaluation ID {aggregated_evaluation.id}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can get an overview of each aggregation from the aggregation repository as follows. Note that we need to provide the type of the aggregation to enable the deserialization. The `statistics` field of the aggregation result contains only the aggregated metrics for each model. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.evaluation import AggregatedEvaluation\n", + "\n", + "for aggregation_overview in aggregation_repository.aggregation_overviews(\n", + " AggregatedEvaluation\n", + "):\n", + " print(aggregation_overview)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 4 – Addition of New Models\n", + "\n", + "Now let us consider the case where we want to add new models to our existing evaluation.\n", + "Since the comparison of answers is rather time-consuming, we want to avoid recalculating the evaluation for the previous models, and just compare the new models to the old ones. This is why we used the `IncrementalEloQaEvaluator` to begin with.\n", + "\n", + "For this example, we first define the new models _\"luminous-base-control-20230501\"_ and _\"luminous-supreme-control-20230501\"_, and generate their answers."
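To see why this matters, here is a back-of-the-envelope sketch of the number of pairwise comparisons per example: with the two existing runs already graded, adding two new runs only requires the pairings that involve at least one new run (the run ids below are placeholders).

```python
from itertools import combinations

old_runs = {"run-old-1", "run-old-2"}  # placeholder ids for the already evaluated runs
new_runs = {"run-new-1", "run-new-2"}  # placeholder ids for the newly added runs

all_pairs = list(combinations(sorted(old_runs | new_runs), 2))
# Pairs in which both runs were already evaluated are skipped by the incremental logic.
remaining_pairs = [pair for pair in all_pairs if not set(pair) <= old_runs]

print(f"{len(all_pairs)} pairings in total, {len(remaining_pairs)} still to grade per example")
# -> 6 pairings in total, 5 still to grade per example
```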
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "newly_added_models = [\n", + " LuminousControlModel(name=\"luminous-base-control-20230501\", client=aa_client),\n", + " LuminousControlModel(name=\"luminous-supreme-control-20230501\", client=aa_client),\n", + "]\n", + "\n", + "for model in newly_added_models:\n", + " runner = Runner[\n", + " SingleChunkQaInput, SingleChunkQaOutput\n", + " ](\n", + " task=SingleChunkQa(model),\n", + " dataset_repository=dataset_repository,\n", + " run_repository=run_repository,\n", + " description=f\"New QA with model {model.name}\", # used to query for new runs only later in the code\n", + " )\n", + " runner.run_dataset(dataset_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that all examples succeeded\n", + "for run_overview in run_repository.run_overviews():\n", + " assert (\n", + " run_overview.failed_example_count == 0\n", + " ), f\"There are failed runs for run overview ID {run_overview.id}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for run_overview in run_repository.run_overviews():\n", + " # skip runs done for previous models\n", + " if not run_overview.description.startswith(\"New\"):\n", + " continue\n", + " # print runs for the added models\n", + " print(run_overview)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thanks to the `IncrementalEloQaEvaluator`, we can now easily extend our existing evaluation with the comparisons of new model runs against the previous runs, without re-running the previous comparisons. To this end, we use the same evaluator instance as for our first evaluation, but use the `evaluate_additional_runs` method, which takes a list of previous evaluation_overview IDs and uses them to filter the resulting comparisons. In this case, only comparisons of new pairings will be performed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_evaluation_overview = evaluator.evaluate_additional_runs(\n", + " *run_repository.run_overview_ids(), previous_evaluation_ids=[evaluation_overview.id]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that for each example there are evaluated comparisons\n", + "for example_evaluation in evaluation_repository.example_evaluations(\n", + " new_evaluation_overview.id, Matches\n", + "):\n", + " assert isinstance(example_evaluation.result, Matches)\n", + " assert (\n", + " len(example_evaluation.result.comparison_evaluations) > 0\n", + " ), f\"There are no matches (comparisons) for example ID {example_evaluation.example_id}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition to the previous `evaluation_overview`, we now also have the newly generated `new_evaluation_overview` which includes our new model.\n", + "Finally, the aggregated evaluation of all models is calculated by passing in the evaluation ids of both evaluations into `aggregate_evaluation`. By doing so, the previously calculated ELO scores will be updated with the comparisons to the new models' answers." 
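For intuition on how the added comparisons move the ratings, this is the textbook Elo update that the aggregation builds on. It is a simplified sketch with an assumed K-factor of 20; the exact constants and the repeated shuffling performed by `EloCalculator` are internal to the library.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that player A beats player B under the standard Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, actual_a: float, k: float = 20.0) -> tuple[float, float]:
    # `actual_a` is A's payoff: 1 for a win, 0.5 for a draw, 0 for a loss
    # (mirroring `MatchOutcome.payoff`).
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two runs starting at 1500; the first one wins a single comparison.
print(elo_update(1500, 1500, actual_a=1.0))  # -> (1510.0, 1490.0)
```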
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# get the IDs of all the evaluation overviews which we created for the QA task\n", + "evaluation_overview_ids = [\n", + " evaluation_overview.id\n", + " for evaluation_overview in evaluation_repository.evaluation_overviews()\n", + " if \"QA\" in evaluation_overview.description\n", + "]\n", + "print(f\"Evaluation overviews to aggregate: {evaluation_overview_ids}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# run the aggregation\n", + "aggregated_evaluation_with_new_model = aggregator.aggregate_evaluation(\n", + " *evaluation_overview_ids\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ensure that there are no failed (aggregated) evaluations\n", + "assert (\n", + " aggregated_evaluation_with_new_model.crashed_during_evaluation_count == 0\n", + "), f\"There are crashed evaluations for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}\"\n", + "assert (\n", + " aggregated_evaluation_with_new_model.failed_evaluation_count == 0\n", + "), f\"There are failed evaluations for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}\"\n", + "# ensure that the result contains ELO scores\n", + "assert hasattr(\n", + " aggregated_evaluation_with_new_model.statistics, \"scores\"\n", + "), f\"There are no scores for aggregated evaluation ID {aggregated_evaluation_with_new_model.id}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A look at the new aggregated evaluation shows that the runs of the new models have been added to the evaluation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(aggregated_evaluation_with_new_model)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "intelligence-layer-tfT-HG2V-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/intelligence_layer/evaluation/__init__.py b/src/intelligence_layer/evaluation/__init__.py index a16f4d8df..afbd226a0 100644 --- a/src/intelligence_layer/evaluation/__init__.py +++ b/src/intelligence_layer/evaluation/__init__.py @@ -6,11 +6,11 @@ from .aggregation.aggregator import Aggregator as Aggregator from .aggregation.domain import AggregatedEvaluation as AggregatedEvaluation from .aggregation.domain import AggregationOverview as AggregationOverview -from .aggregation.elo import ComparisonAggregationLogic as ComparisonAggregationLogic -from .aggregation.elo import ComparisonEvaluation as ComparisonEvaluation -from .aggregation.elo import EloCalculator as EloCalculator -from .aggregation.elo import MatchOutcome as MatchOutcome -from .aggregation.elo import WinRateCalculator as WinRateCalculator +from .aggregation.elo_aggregation import ( + ComparisonEvaluationAggregationLogic as ComparisonEvaluationAggregationLogic, +) +from .aggregation.elo_aggregation import EloCalculator as EloCalculator +from .aggregation.elo_aggregation import WinRateCalculator as
WinRateCalculator from .aggregation.file_aggregation_repository import ( FileAggregationRepository as FileAggregationRepository, ) @@ -60,6 +60,12 @@ from .evaluation.evaluator.async_evaluator import ( AsyncEvaluationRepository as AsyncEvaluationRepository, ) +from .evaluation.evaluator.elo_evaluator import ( + ComparisonEvaluation as ComparisonEvaluation, +) +from .evaluation.evaluator.elo_evaluator import EloEvaluationLogic as EloEvaluationLogic +from .evaluation.evaluator.elo_evaluator import Matches as Matches +from .evaluation.evaluator.elo_evaluator import MatchOutcome as MatchOutcome from .evaluation.evaluator.evaluator import EvaluationLogic as EvaluationLogic from .evaluation.evaluator.evaluator import Evaluator as Evaluator from .evaluation.evaluator.evaluator import ( diff --git a/src/intelligence_layer/evaluation/aggregation/elo.py b/src/intelligence_layer/evaluation/aggregation/elo_aggregation.py similarity index 63% rename from src/intelligence_layer/evaluation/aggregation/elo.py rename to src/intelligence_layer/evaluation/aggregation/elo_aggregation.py index 76324630d..006beeeb3 100644 --- a/src/intelligence_layer/evaluation/aggregation/elo.py +++ b/src/intelligence_layer/evaluation/aggregation/elo_aggregation.py @@ -1,6 +1,5 @@ import random from collections import Counter, defaultdict -from enum import Enum from typing import Iterable, Mapping, Sequence import numpy as np @@ -8,32 +7,62 @@ from intelligence_layer.evaluation.aggregation.accumulator import MeanAccumulator from intelligence_layer.evaluation.aggregation.aggregator import AggregationLogic +from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import ( + ComparisonEvaluation, + Matches, + MatchOutcome, +) -class MatchOutcome(str, Enum): - A_WINS = "a_wins" - DRAW = "draw" - B_WINS = "b_wins" +class PlayerScore(BaseModel): + elo: float + elo_standard_error: float + win_rate: float + num_matches: int - @property - def payoff(self) -> tuple[float, float]: - if self == self.A_WINS: - return (1, 0) - if self == self.DRAW: - return (0.5, 0.5) - return (0, 1) +class AggregatedComparison(BaseModel): + scores: Mapping[str, PlayerScore] + + +class EloAggregationAdapter: @staticmethod - def from_rank_literal(rank: int) -> "MatchOutcome": - match rank: - case 1: - return MatchOutcome.A_WINS - case 2: - return MatchOutcome.B_WINS - case 3: - return MatchOutcome.DRAW - case _: - raise ValueError(f"Got unexpected rank {rank}") + def aggregate(evaluations: Iterable[ComparisonEvaluation]) -> AggregatedComparison: + evaluations = list(evaluations) + player_counter = Counter( + player + for comparison_evaluation in evaluations + for player in [ + comparison_evaluation.first_player, + comparison_evaluation.second_player, + ] + ) + + player_counts = dict(player_counter) + players = player_counts.keys() + + accumulators = {p: MeanAccumulator() for p in players} + for _ in range(100): + elo_calc = EloCalculator(players) + random.shuffle(evaluations) + elo_calc.calculate(evaluations) + for p in players: + accumulators[p].add(elo_calc.ratings[p]) + + win_rate_calc = WinRateCalculator(players) + win_rate = win_rate_calc.calculate(evaluations) + + return AggregatedComparison( + scores={ + p: PlayerScore( + elo=acc.extract(), + elo_standard_error=acc.standard_error(), + win_rate=win_rate[p], + num_matches=player_counts[p], + ) + for p, acc in accumulators.items() + }, + ) class EloCalculator: @@ -74,13 +103,15 @@ def _calc_difs( actual_b - expected_win_rate_b ) - def calculate(self, matches: Sequence[tuple[str, str, 
MatchOutcome]]) -> None: - for a, b, o in matches: - dif_a, dif_b = self._calc_difs(o, a, b) - self.ratings[a] += dif_a - self.ratings[b] += dif_b - self._match_counts[a] += 1 - self._match_counts[b] += 1 + def calculate(self, matches: Sequence[ComparisonEvaluation]) -> None: + for match in matches: + dif_a, dif_b = self._calc_difs( + match.outcome, match.first_player, match.second_player + ) + self.ratings[match.first_player] += dif_a + self.ratings[match.second_player] += dif_b + self._match_counts[match.first_player] += 1 + self._match_counts[match.second_player] += 1 class WinRateCalculator: @@ -88,14 +119,12 @@ def __init__(self, players: Iterable[str]) -> None: self.match_count: dict[str, int] = {p: 0 for p in players} self.win_count: dict[str, float] = {p: 0 for p in players} - def calculate( - self, matches: Sequence[tuple[str, str, MatchOutcome]] - ) -> Mapping[str, float]: - for a, b, o in matches: - self.match_count[a] += 1 - self.match_count[b] += 1 - self.win_count[a] += o.payoff[0] - self.win_count[b] += o.payoff[1] + def calculate(self, matches: Sequence[ComparisonEvaluation]) -> Mapping[str, float]: + for match in matches: + self.match_count[match.first_player] += 1 + self.match_count[match.second_player] += 1 + self.win_count[match.first_player] += match.outcome.payoff[0] + self.win_count[match.second_player] += match.outcome.payoff[1] return { player: self.win_count[player] / match_count @@ -103,62 +132,20 @@ def calculate( } -class PlayerScore(BaseModel): - elo: float - elo_standard_error: float - win_rate: float - num_matches: int - - -class AggregatedComparison(BaseModel): - scores: Mapping[str, PlayerScore] - - -class ComparisonEvaluation(BaseModel): - first: str - second: str - winner: MatchOutcome - - -class ComparisonAggregationLogic( +class ComparisonEvaluationAggregationLogic( AggregationLogic[ComparisonEvaluation, AggregatedComparison] ): def aggregate( self, evaluations: Iterable[ComparisonEvaluation] ) -> AggregatedComparison: - flattened_evaluations = [ - ( - evaluation.first, - evaluation.second, - evaluation.winner, - ) - for evaluation in evaluations - ] - player_counter = Counter( - player for match in flattened_evaluations for player in [match[0], match[1]] - ) - player_counts = dict(player_counter) - players = player_counts.keys() - - accumulators = {p: MeanAccumulator() for p in players} - for _ in range(100): - elo_calc = EloCalculator(players) - random.shuffle(flattened_evaluations) - elo_calc.calculate(flattened_evaluations) - for p in players: - accumulators[p].add(elo_calc.ratings[p]) + return EloAggregationAdapter.aggregate(evaluations) - win_rate_calc = WinRateCalculator(players) - win_rate = win_rate_calc.calculate(flattened_evaluations) - return AggregatedComparison( - scores={ - p: PlayerScore( - elo=acc.extract(), - elo_standard_error=acc.standard_error(), - win_rate=win_rate[p], - num_matches=player_counts[p], - ) - for p, acc in accumulators.items() - }, - ) +class MatchesAggregationLogic(AggregationLogic[Matches, AggregatedComparison]): + def aggregate(self, evaluations: Iterable[Matches]) -> AggregatedComparison: + flattened_matches = [ + comparison_evaluation + for match in evaluations + for comparison_evaluation in match.comparison_evaluations + ] + return EloAggregationAdapter.aggregate(flattened_matches) diff --git a/src/intelligence_layer/evaluation/evaluation/__init__.py b/src/intelligence_layer/evaluation/evaluation/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git 
a/src/intelligence_layer/evaluation/evaluation/evaluator/__init__.py b/src/intelligence_layer/evaluation/evaluation/evaluator/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/intelligence_layer/evaluation/evaluation/evaluator/argilla_evaluator.py b/src/intelligence_layer/evaluation/evaluation/evaluator/argilla_evaluator.py index 991a67654..efa2e02ab 100644 --- a/src/intelligence_layer/evaluation/evaluation/evaluator/argilla_evaluator.py +++ b/src/intelligence_layer/evaluation/evaluation/evaluator/argilla_evaluator.py @@ -14,10 +14,6 @@ RecordData, ) from intelligence_layer.core import CompleteOutput, Input, InstructInput, Output -from intelligence_layer.evaluation.aggregation.elo import ( - ComparisonEvaluation, - MatchOutcome, -) from intelligence_layer.evaluation.dataset.dataset_repository import DatasetRepository from intelligence_layer.evaluation.dataset.domain import Example, ExpectedOutput from intelligence_layer.evaluation.evaluation.domain import ( @@ -34,6 +30,10 @@ from intelligence_layer.evaluation.evaluation.evaluator.base_evaluator import ( EvaluationLogicBase, ) +from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import ( + ComparisonEvaluation, + MatchOutcome, +) from intelligence_layer.evaluation.run.domain import SuccessfulExampleOutput from intelligence_layer.evaluation.run.run_repository import RunRepository @@ -303,9 +303,9 @@ def from_record( self, argilla_evaluation: ArgillaEvaluation ) -> ComparisonEvaluation: return ComparisonEvaluation( - first=argilla_evaluation.metadata["first"], - second=argilla_evaluation.metadata["second"], - winner=MatchOutcome.from_rank_literal( + first_player=argilla_evaluation.metadata["first"], + second_player=argilla_evaluation.metadata["second"], + outcome=MatchOutcome.from_rank_literal( int(argilla_evaluation.responses["winner"]) ), ) diff --git a/src/intelligence_layer/evaluation/evaluation/evaluator/elo_evaluator.py b/src/intelligence_layer/evaluation/evaluation/evaluator/elo_evaluator.py new file mode 100644 index 000000000..e79ae14ef --- /dev/null +++ b/src/intelligence_layer/evaluation/evaluation/evaluator/elo_evaluator.py @@ -0,0 +1,82 @@ +from abc import abstractmethod +from enum import Enum +from itertools import combinations +from typing import Sequence, final + +from pydantic import BaseModel + +from intelligence_layer.core import Input, Output +from intelligence_layer.evaluation.dataset.domain import Example, ExpectedOutput +from intelligence_layer.evaluation.evaluation.evaluator.evaluator import EvaluationLogic +from intelligence_layer.evaluation.run.domain import SuccessfulExampleOutput + + +class MatchOutcome(str, Enum): + A_WINS = "a_wins" + DRAW = "draw" + B_WINS = "b_wins" + + @property + def payoff(self) -> tuple[float, float]: + if self == self.A_WINS: + return (1, 0) + if self == self.DRAW: + return (0.5, 0.5) + return (0, 1) + + @staticmethod + def from_rank_literal(rank: int) -> "MatchOutcome": + match rank: + case 1: + return MatchOutcome.A_WINS + case 2: + return MatchOutcome.B_WINS + case 3: + return MatchOutcome.DRAW + case _: + raise ValueError(f"Got unexpected rank {rank}") + + +class ComparisonEvaluation(BaseModel): + first_player: str + second_player: str + outcome: MatchOutcome + + +class Matches(BaseModel): + comparison_evaluations: Sequence[ComparisonEvaluation] + + +class EloGradingInput(BaseModel): + instruction: str + first_completion: str + second_completion: str + + +class EloEvaluationLogic(EvaluationLogic[Input, Output, ExpectedOutput, Matches]): + 
@final + def do_evaluate( + self, + example: Example[Input, ExpectedOutput], + *output: SuccessfulExampleOutput[Output], + ) -> Matches: + pairs = combinations(output, 2) + return Matches( + comparison_evaluations=[ + ComparisonEvaluation( + first_player=player_a.run_id, + second_player=player_b.run_id, + outcome=self.grade(player_a, player_b, example), + ) + for [player_a, player_b] in pairs + ] + ) + + @abstractmethod + def grade( + self, + output_a: SuccessfulExampleOutput[Output], + output_b: SuccessfulExampleOutput[Output], + example: Example[Input, ExpectedOutput], + ) -> MatchOutcome: + pass diff --git a/src/intelligence_layer/evaluation/evaluation/evaluator/incremental_evaluator.py b/src/intelligence_layer/evaluation/evaluation/evaluator/incremental_evaluator.py index bcab4124a..284e6ee5c 100644 --- a/src/intelligence_layer/evaluation/evaluation/evaluator/incremental_evaluator.py +++ b/src/intelligence_layer/evaluation/evaluation/evaluator/incremental_evaluator.py @@ -50,20 +50,16 @@ def do_evaluate( Returns: :class:`Evaluation`: The metrics that come from the evaluated :class:`Task`. """ - flattened_run_output_ids: set[str] = set() - evaluated_outputs = [] + + already_evaluated_outputs = [] for run_output_ids in self._previous_run_output_ids: - flattened_run_output_ids = flattened_run_output_ids.union(run_output_ids) - evaluated_outputs.append( + already_evaluated_outputs.append( [output for output in outputs if output.run_id in run_output_ids] ) - new_outputs = [ - output - for output in outputs - if output.run_id not in flattened_run_output_ids - ] - return self.do_incremental_evaluate(example, new_outputs, evaluated_outputs) + return self.do_incremental_evaluate( + example, list(outputs), already_evaluated_outputs + ) @abstractmethod def do_incremental_evaluate( diff --git a/src/intelligence_layer/examples/__init__.py b/src/intelligence_layer/examples/__init__.py index d3132b138..c0a75ba40 100644 --- a/src/intelligence_layer/examples/__init__.py +++ b/src/intelligence_layer/examples/__init__.py @@ -44,6 +44,10 @@ from .classify.prompt_based_classify_with_definitions import ( PromptBasedClassifyWithDefinitions as PromptBasedClassifyWithDefinitions, ) +from .qa.elo_qa_evaluation_logic import EloQaEvaluationLogic as EloQaEvaluationLogic +from .qa.incremental_elo_qa_evaluation_logic import ( + IncrementalEloQaEvaluationLogic as IncrementalEloQaEvaluationLogic, +) from .qa.long_context_qa import LongContextQa as LongContextQa from .qa.long_context_qa import LongContextQaInput as LongContextQaInput from .qa.multiple_chunk_qa import MultipleChunkQa as MultipleChunkQa diff --git a/src/intelligence_layer/examples/qa/elo_qa_evaluation_logic.py b/src/intelligence_layer/examples/qa/elo_qa_evaluation_logic.py new file mode 100644 index 000000000..002c07e58 --- /dev/null +++ b/src/intelligence_layer/examples/qa/elo_qa_evaluation_logic.py @@ -0,0 +1,144 @@ +import math +from typing import Mapping, Sequence + +from aleph_alpha_client import Prompt +from liquid import Template + +from intelligence_layer.core.detect_language import Language +from intelligence_layer.core.model import CompleteInput, CompleteOutput, ControlModel +from intelligence_layer.core.tracer.tracer import NoOpTracer, TaskSpan, Tracer +from intelligence_layer.evaluation import MatchOutcome +from intelligence_layer.evaluation.dataset.domain import Example +from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import ( + EloEvaluationLogic, + EloGradingInput, +) +from intelligence_layer.evaluation.run.domain 
import SuccessfulExampleOutput +from intelligence_layer.examples.qa.single_chunk_qa import ( + QA_INSTRUCTIONS, + SingleChunkQaInput, + SingleChunkQaOutput, +) + + +class EloQaEvaluationLogic( + EloEvaluationLogic[SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput] +): + INPUT_TEMPLATE = """ +Your task is to compare two answers to an instruction on one metric. + +Please make sure you read and understand these instruction carefully. Please keep this document open while reviewing, and refer to it as needed. + +The Instruction for the answers was:{instruction} + +Evaluation Procedure: +1. Read both answers carefully and identify the main facts and details they present. +2. Check if the answers contain any factual errors that are not supported by the instruction. +3. Evaluate which answer is more correct. + +Answer A:{first_completion} + +Answer B:{second_completion} + +Which answer is more correct given the Instruction and Evaluation Procedure, Answer A or Answer B? + +Response: Answer """ + VALUES = [ + " A", + " B", + ] # The space before the A and B is important due to tokenization + + def __init__( + self, + model: ControlModel, + tracer: Tracer = NoOpTracer(), + ): + super().__init__() + self._model = model + self.tracer = tracer + + @staticmethod + def _create_grading_input( + first: SuccessfulExampleOutput[SingleChunkQaOutput], + second: SuccessfulExampleOutput[SingleChunkQaOutput], + example: Example[SingleChunkQaInput, SingleChunkQaOutput], + ) -> EloGradingInput: + qa_instruction = Template( + QA_INSTRUCTIONS[Language("en")].unformatted_instruction + ).render(question=example.input.question) + + no_answer = "There is no answer." + return EloGradingInput( + instruction=f"{example.input.chunk} {qa_instruction}", + first_completion=( + first.output.answer if first.output.answer is not None else no_answer + ), + second_completion=( + second.output.answer if second.output.answer is not None else no_answer + ), + ) + + def do_run(self, input: EloGradingInput, task_span: TaskSpan) -> MatchOutcome: + text = self.INPUT_TEMPLATE.format( + instruction=input.instruction, + first_completion=input.first_completion, + second_completion=input.second_completion, + ) + + complete_input = CompleteInput( + prompt=Prompt.from_text(text), + maximum_tokens=1, + log_probs=3, + disable_optimizations=True, + ) + complete_output = self._model.complete_task().run(complete_input, task_span) + + return self.calculate_winners(complete_output) + + def grade( + self, + first: SuccessfulExampleOutput[SingleChunkQaOutput], + second: SuccessfulExampleOutput[SingleChunkQaOutput], + example: Example[SingleChunkQaInput, SingleChunkQaOutput], + ) -> MatchOutcome: + grading_input = self._create_grading_input(first, second, example) + + return MatchOutcome( + self.do_run( + grading_input, + self.tracer.task_span( + task_name="elo_qa_run_grader", input=grading_input + ), + ) + ) + + def calculate_winners(self, complete_output: CompleteOutput) -> MatchOutcome: + default_log_prob = float("-inf") + + def get_normalized_prob( + log_prob_list: Sequence[Mapping[str, float | None]] | None, + ) -> float: + assert log_prob_list is not None + log_probs = log_prob_list[0] + values = [ + math.exp(log_probs.get(str(key), default_log_prob) or default_log_prob) + for key in self.VALUES + ] + if all(v == 0 for v in values): + raise ValueError( + f"LLM evaluation response does not contain logprobs for the required tokens for the values: {self.VALUES}" + ) + return values[0] / sum(values) + + def categorize_value(value: float) -> 
MatchOutcome: + if value > 0.7: + return MatchOutcome.A_WINS + elif 0.3 > value: + return MatchOutcome.B_WINS + else: + return MatchOutcome.DRAW + + normalized_probability = get_normalized_prob( + complete_output.completions[0].log_probs + ) + return categorize_value(normalized_probability) diff --git a/src/intelligence_layer/examples/qa/incremental_elo_qa_evaluation_logic.py b/src/intelligence_layer/examples/qa/incremental_elo_qa_evaluation_logic.py new file mode 100644 index 000000000..3cc52ec5d --- /dev/null +++ b/src/intelligence_layer/examples/qa/incremental_elo_qa_evaluation_logic.py @@ -0,0 +1,183 @@ +import math +from itertools import combinations +from typing import Mapping, Sequence + +from aleph_alpha_client import Prompt +from liquid import Template + +from intelligence_layer.core.detect_language import Language +from intelligence_layer.core.model import CompleteInput, CompleteOutput, ControlModel +from intelligence_layer.core.tracer.tracer import NoOpTracer, TaskSpan, Tracer +from intelligence_layer.evaluation.dataset.domain import Example +from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import ( + ComparisonEvaluation, + EloGradingInput, + Matches, + MatchOutcome, +) +from intelligence_layer.evaluation.evaluation.evaluator.incremental_evaluator import ( + IncrementalEvaluationLogic, +) +from intelligence_layer.evaluation.run.domain import SuccessfulExampleOutput +from intelligence_layer.examples.qa.single_chunk_qa import ( + QA_INSTRUCTIONS, + SingleChunkQaInput, + SingleChunkQaOutput, +) + + +class IncrementalEloQaEvaluationLogic( + IncrementalEvaluationLogic[ + SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput, Matches + ] +): + INPUT_TEMPLATE = """ +Your task is to compare two answers to an instruction on one metric. + +Please make sure you read and understand these instruction carefully. Please keep this document open while reviewing, and refer to it as needed. + +The Instruction for the answers was:{instruction} + +Evaluation Procedure: +1. Read both answers carefully and identify the main facts and details they present. +2. Check if the answers contain any factual errors that are not supported by the instruction. +3. Evaluate which answer is more correct. + +Answer A:{first_completion} + +Answer B:{second_completion} + +Which answer is more correct given the Instruction and Evaluation Procedure, Answer A or Answer B? 
+ +Response: Answer """ + VALUES = [ + " A", + " B", + ] # The space before the A and B is important due to tokenization + + def __init__( + self, + model: ControlModel, + tracer: Tracer = NoOpTracer(), + ): + super().__init__() + self._model = model + self.tracer = tracer + + def do_incremental_evaluate( + self, + example: Example[SingleChunkQaInput, SingleChunkQaOutput], + outputs: list[SuccessfulExampleOutput[SingleChunkQaOutput]], + already_evaluated_outputs: list[ + list[SuccessfulExampleOutput[SingleChunkQaOutput]] + ], + ) -> Matches: + pairs = combinations(outputs, 2) + unique_pre_evaluated_runs: set[str] = set() + + for pre_run_output in already_evaluated_outputs: + for current_output in pre_run_output: + unique_pre_evaluated_runs.add(current_output.run_id) + + return Matches( + comparison_evaluations=[ + ComparisonEvaluation( + first_player=player_a.run_id, + second_player=player_b.run_id, + outcome=self.grade(player_a, player_b, example), + ) + for [player_a, player_b] in pairs + if unique_pre_evaluated_runs is None + or len(unique_pre_evaluated_runs) == 0 + or not ( + player_a.run_id in unique_pre_evaluated_runs + and player_b.run_id in unique_pre_evaluated_runs + ) + ] + ) + + def grade( + self, + first: SuccessfulExampleOutput[SingleChunkQaOutput], + second: SuccessfulExampleOutput[SingleChunkQaOutput], + example: Example[SingleChunkQaInput, SingleChunkQaOutput], + ) -> MatchOutcome: + grading_input = self._create_grading_input(first, second, example) + + return MatchOutcome( + self.do_run( + grading_input, + self.tracer.task_span( + task_name="elo_qa_run_grader", input=grading_input + ), + ) + ) + + @staticmethod + def _create_grading_input( + first: SuccessfulExampleOutput[SingleChunkQaOutput], + second: SuccessfulExampleOutput[SingleChunkQaOutput], + example: Example[SingleChunkQaInput, SingleChunkQaOutput], + ) -> EloGradingInput: + qa_instruction = Template( + QA_INSTRUCTIONS[Language("en")].unformatted_instruction + ).render(question=example.input.question) + + no_answer = "There is no answer." 
+ return EloGradingInput( + instruction=f"{example.input.chunk} {qa_instruction}", + first_completion=( + first.output.answer if first.output.answer is not None else no_answer + ), + second_completion=( + second.output.answer if second.output.answer is not None else no_answer + ), + ) + + def do_run(self, input: EloGradingInput, task_span: TaskSpan) -> MatchOutcome: + text = self.INPUT_TEMPLATE.format( + instruction=input.instruction, + first_completion=input.first_completion, + second_completion=input.second_completion, + ) + + complete_input = CompleteInput( + prompt=Prompt.from_text(text), + maximum_tokens=1, + log_probs=3, + disable_optimizations=True, + ) + complete_output = self._model.complete_task().run(complete_input, task_span) + + return self.calculate_winners(complete_output) + + def calculate_winners(self, complete_output: CompleteOutput) -> MatchOutcome: + default_log_prob = float("-inf") + + def get_normalized_prob( + log_prob_list: Sequence[Mapping[str, float | None]] | None, + ) -> float: + assert log_prob_list is not None + log_probs = log_prob_list[0] + values = [ + math.exp(log_probs.get(str(key), default_log_prob) or default_log_prob) + for key in self.VALUES + ] + if all(v == 0 for v in values): + raise ValueError( + f"LLM evaluation response does not contain logprobs for the required tokens for the values: {self.VALUES}" + ) + return values[0] / sum(values) + + def categorize_value(value: float) -> MatchOutcome: + if value > 0.7: + return MatchOutcome.A_WINS + elif 0.3 > value: + return MatchOutcome.B_WINS + else: + return MatchOutcome.DRAW + + normalized_probability = get_normalized_prob( + complete_output.completions[0].log_probs + ) + return categorize_value(normalized_probability) diff --git a/tests/evaluation/test_argilla_evaluator.py b/tests/evaluation/test_argilla_evaluator.py index ee4ea9d2b..c0167d641 100644 --- a/tests/evaluation/test_argilla_evaluator.py +++ b/tests/evaluation/test_argilla_evaluator.py @@ -16,8 +16,8 @@ ArgillaEvaluationLogic, ArgillaEvaluator, AsyncInMemoryEvaluationRepository, - ComparisonAggregationLogic, ComparisonEvaluation, + ComparisonEvaluationAggregationLogic, DatasetRepository, Example, InMemoryDatasetRepository, @@ -327,12 +327,12 @@ def test_argilla_evaluator_abort_on_error_works( def test_argilla_aggregation_logic_works() -> None: - argilla_aggregation_logic = ComparisonAggregationLogic() + argilla_aggregation_logic = ComparisonEvaluationAggregationLogic() evaluations = ( ComparisonEvaluation( - first="player_1", - second="player_2" if i < 9000 else "player_3", - winner=MatchOutcome.from_rank_literal( + first_player="player_1", + second_player="player_2" if i < 9000 else "player_3", + outcome=MatchOutcome.from_rank_literal( random.choices([1, 2, 3], [0.5, 0.25, 0.25], k=1)[0] ), ) diff --git a/tests/evaluation/test_elo.py b/tests/evaluation/test_elo_calculator.py similarity index 79% rename from tests/evaluation/test_elo.py rename to tests/evaluation/test_elo_calculator.py index b32525308..a045208ab 100644 --- a/tests/evaluation/test_elo.py +++ b/tests/evaluation/test_elo_calculator.py @@ -5,6 +5,9 @@ from pytest import fixture from intelligence_layer.evaluation import EloCalculator, MatchOutcome, WinRateCalculator +from intelligence_layer.evaluation.evaluation.evaluator.elo_evaluator import ( + ComparisonEvaluation, +) @fixture @@ -13,9 +16,11 @@ def players() -> Sequence[str]: @fixture -def matches(players: Sequence[str]) -> Sequence[tuple[str, str, MatchOutcome]]: +def matches(players: Sequence[str]) -> 
     return [
-        (player_a, player_b, MatchOutcome.A_WINS)
+        ComparisonEvaluation(
+            first_player=player_a, second_player=player_b, outcome=MatchOutcome.A_WINS
+        )
         for player_a, player_b in combinations(players, 2)
     ]
 
@@ -33,7 +38,7 @@ def test_match_outcome_serializes() -> None:
 
 
 def test_elo_calculator_works(
-    players: Sequence[str], matches: Sequence[tuple[str, str, MatchOutcome]]
+    players: Sequence[str], matches: Sequence[ComparisonEvaluation]
 ) -> None:
     elo_calculator = EloCalculator(players)
     elo_calculator.calculate(matches)
@@ -52,7 +57,7 @@ def test_elo_calculator_works(
 
 
 def test_win_rate_calculator_works(
-    players: Sequence[str], matches: Sequence[tuple[str, str, MatchOutcome]]
+    players: Sequence[str], matches: Sequence[ComparisonEvaluation]
 ) -> None:
     win_rate_calculator = WinRateCalculator(players)
     scores = win_rate_calculator.calculate(matches)
diff --git a/tests/evaluation/test_elo_evaluator.py b/tests/evaluation/test_elo_evaluator.py
new file mode 100644
index 000000000..51cec543a
--- /dev/null
+++ b/tests/evaluation/test_elo_evaluator.py
@@ -0,0 +1,193 @@
+from typing import Sequence, Tuple
+
+from dotenv import load_dotenv
+from pytest import fixture
+
+from intelligence_layer.connectors import AlephAlphaClientProtocol
+from intelligence_layer.core import (
+    ControlModel,
+    Language,
+    LuminousControlModel,
+    TextChunk,
+    utc_now,
+)
+from intelligence_layer.core.tracer.tracer import NoOpTracer, Tracer
+from intelligence_layer.evaluation import (
+    ComparisonEvaluation,
+    EloEvaluationLogic,
+    EvaluationLogic,
+    Evaluator,
+    Example,
+    ExampleOutput,
+    InMemoryDatasetRepository,
+    InMemoryEvaluationRepository,
+    InMemoryRunRepository,
+    Matches,
+    MatchOutcome,
+    RunOverview,
+    SuccessfulExampleOutput,
+)
+from intelligence_layer.examples import SingleChunkQaInput, SingleChunkQaOutput
+
+load_dotenv()
+
+
+class DummyEloQaEvalLogic(
+    EloEvaluationLogic[SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput]
+):
+    def __init__(
+        self,
+        model: ControlModel,
+        tracer: Tracer = NoOpTracer(),
+    ):
+        super().__init__()
+        self._model = model
+        self.tracer = tracer
+
+    def grade(
+        self,
+        first: SuccessfulExampleOutput[SingleChunkQaOutput],
+        second: SuccessfulExampleOutput[SingleChunkQaOutput],
+        example: Example[SingleChunkQaInput, SingleChunkQaOutput],
+    ) -> MatchOutcome:
+        _ = example
+        if first.run_id < second.run_id:
+            return MatchOutcome.A_WINS
+        elif first.run_id > second.run_id:
+            return MatchOutcome.B_WINS
+        else:
+            return MatchOutcome.DRAW
+
+
+@fixture
+def model(client: AlephAlphaClientProtocol) -> ControlModel:
+    return LuminousControlModel(client=client, name="luminous-base-control")
+
+
+@fixture
+def in_memory_dataset_repository() -> InMemoryDatasetRepository:
+    return InMemoryDatasetRepository()
+
+
+@fixture
+def in_memory_run_repository() -> InMemoryRunRepository:
+    return InMemoryRunRepository()
+
+
+@fixture
+def in_memory_evaluation_repository() -> InMemoryEvaluationRepository:
+    return InMemoryEvaluationRepository()
+
+
+@fixture
+def dummy_eval_logic(model: ControlModel) -> DummyEloQaEvalLogic:
+    return DummyEloQaEvalLogic(model=model)
+
+
+@fixture
+def elo_evaluator(
+    in_memory_dataset_repository: InMemoryDatasetRepository,
+    in_memory_run_repository: InMemoryRunRepository,
+    in_memory_evaluation_repository: InMemoryEvaluationRepository,
+    dummy_eval_logic: EvaluationLogic[
+        SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput, Matches
+    ],
+) -> Evaluator[SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput, Matches]:
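+    # The dummy grading logic above decides matches by comparing run ids, so this
+    # Evaluator can be exercised entirely in memory without any model calls.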
+    return Evaluator(
+        in_memory_dataset_repository,
+        in_memory_run_repository,
+        in_memory_evaluation_repository,
+        "Testing",
+        dummy_eval_logic,
+    )
+
+
+@fixture
+def dummy_qa_input() -> SingleChunkQaInput:
+    return SingleChunkQaInput(chunk=TextChunk(""), question="", language=Language("en"))
+
+
+@fixture
+def dummy_qa_output() -> SingleChunkQaOutput:
+    return SingleChunkQaOutput(answer=None, highlights=[])
+
+
+@fixture
+def qa_outputs() -> Sequence[SingleChunkQaOutput]:
+    return [
+        SingleChunkQaOutput(answer=answer, highlights=[])
+        for answer in [
+            "Surface micromachining builds microstructures.",
+            "Surface micromachining builds microstructures. This is done by deposition and etching structural layers over a substrate.",
+            "Surface micromachining builds microstructures by deposition and etching structural layers over a substrate. This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures.",
+        ]
+    ]
+
+
+@fixture
+def qa_setup(
+    in_memory_dataset_repository: InMemoryDatasetRepository,
+    in_memory_run_repository: InMemoryRunRepository,
+    qa_outputs: Sequence[SingleChunkQaOutput],
+) -> Tuple[Sequence[str], str]:
+    qa_input_text = TextChunk(
+        """Surface micromachining builds microstructures by deposition and etching structural layers over a substrate.[1] This is different from Bulk micromachining, in which a silicon substrate wafer is selectively etched to produce structures."""
+    )
+    #
+    qa_input = SingleChunkQaInput(
+        chunk=qa_input_text, question="What is micromachining?", language=Language("en")
+    )
+    expected_output = "Surface micromachining builds microstructures by deposition and etching structural layers over a substrate."
+    #
+    example_id = "some-example-id"
+    dataset_id = in_memory_dataset_repository.create_dataset(
+        examples=[
+            Example(input=qa_input, expected_output=expected_output, id=example_id)
+        ],
+        dataset_name="some-example-dataset-name",
+    ).id
+    #
+    run_ids = [f"some-run-id-{i}" for i in range(len(qa_outputs))]
+    for i, output in enumerate(qa_outputs):
+        in_memory_run_repository.store_example_output(
+            example_output=ExampleOutput(
+                run_id=run_ids[i],
+                example_id=example_id,
+                output=output,
+            )
+        )
+        in_memory_run_repository.store_run_overview(
+            RunOverview(
+                dataset_id=dataset_id,
+                id=run_ids[i],
+                start=utc_now(),
+                end=utc_now(),
+                failed_example_count=0,
+                successful_example_count=len(qa_outputs),
+                description="runner",
+            )
+        )
+    return run_ids, dataset_id
+
+
+def test_evaluate_runs_creates_correct_matches_for_elo_qa_eval(
+    qa_setup: Tuple[Sequence[str], str],
+    elo_evaluator: Evaluator[
+        SingleChunkQaInput, SingleChunkQaOutput, SingleChunkQaOutput, Matches
+    ],
+) -> None:
+    run_ids, _ = qa_setup
+    evaluation_overview = elo_evaluator.evaluate_runs(*run_ids)
+
+    eval_result = list(elo_evaluator.evaluation_lineages(evaluation_overview.id))[
+        0
+    ].evaluation.result
+    assert isinstance(eval_result, Matches)
+    matches = eval_result.comparison_evaluations
+
+    for match in matches:
+        assert isinstance(match, ComparisonEvaluation)
+        if match.first_player < match.second_player:
+            assert match.outcome == MatchOutcome.A_WINS
+        elif match.first_player > match.second_player:
+            assert match.outcome == MatchOutcome.B_WINS
diff --git a/tests/evaluation/test_diff_evaluator.py b/tests/evaluation/test_incremental_evaluator.py
similarity index 88%
rename from tests/evaluation/test_diff_evaluator.py
rename to tests/evaluation/test_incremental_evaluator.py
index 6a96ad004..a6cdb4201
--- a/tests/evaluation/test_diff_evaluator.py
+++ b/tests/evaluation/test_incremental_evaluator.py
@@ -14,7 +14,7 @@
 
 
 class DummyEvaluation(BaseModel):
-    new_run_ids: list[str]
+    all_run_ids: list[str]
     old_run_ids: list[list[str]]
 
 
@@ -29,7 +29,7 @@ def do_incremental_evaluate(
         already_evaluated_outputs: list[list[SuccessfulExampleOutput[str]]],
     ) -> DummyEvaluation:
         return DummyEvaluation(
-            new_run_ids=[output.run_id for output in outputs],
+            all_run_ids=[output.run_id for output in outputs],
             old_run_ids=[
                 [output.run_id for output in evaluated_output]
                 for evaluated_output in already_evaluated_outputs
@@ -46,7 +46,7 @@ def do_run(self, input: str, tracer: Tracer) -> str:
         return f"{input} {self._info}"
 
 
-def test_incremental_evaluator_should_filter_previous_run_ids() -> None:
+def test_incremental_evaluator_separates_all_runs_and_previous_runs() -> None:
     # Given
     examples = [Example(input="a", expected_output="0", id="id_0")]
     dataset_repository = InMemoryDatasetRepository()
@@ -89,8 +89,8 @@ def create_run(name: str) -> str:
         iter(evaluator.evaluation_lineages(second_evaluation_overview.id))
     ).evaluation.result
     assert isinstance(second_result, DummyEvaluation)
-    assert second_result.new_run_ids == [second_run_id]
-    assert second_result.old_run_ids == [[first_run_id]]
+    assert sorted(second_result.all_run_ids) == sorted([first_run_id, second_run_id])
+    assert sorted(second_result.old_run_ids) == sorted([[first_run_id]])
 
     independent_run_id = create_run("independent")
 
@@ -115,6 +115,8 @@ def create_run(name: str) -> str:
         iter(evaluator.evaluation_lineages(third_evaluation_overview.id))
     ).evaluation.result
     assert isinstance(third_result, DummyEvaluation)
-    assert third_result.new_run_ids == [third_run_id]
+    assert sorted(third_result.all_run_ids) == sorted(
+        [first_run_id, second_run_id, independent_run_id, third_run_id]
+    )
     assert sorted(third_result.old_run_ids[0]) == sorted([first_run_id, second_run_id])
     assert sorted(third_result.old_run_ids[1]) == sorted([independent_run_id])
diff --git a/tests/evaluation/test_instruct_comparison_argilla_evaluator.py b/tests/evaluation/test_instruct_comparison_argilla_evaluator.py
index 3b0db8e65..d6e85e2be 100644
--- a/tests/evaluation/test_instruct_comparison_argilla_evaluator.py
+++ b/tests/evaluation/test_instruct_comparison_argilla_evaluator.py
@@ -19,8 +19,8 @@
     Aggregator,
     ArgillaEvaluator,
     AsyncInMemoryEvaluationRepository,
-    ComparisonAggregationLogic,
     ComparisonEvaluation,
+    ComparisonEvaluationAggregationLogic,
     EloCalculator,
     Example,
     ExampleOutput,
@@ -109,8 +109,8 @@ def any_instruct_output() -> CompleteOutput:
 
 
 @fixture
-def argilla_aggregation_logic() -> ComparisonAggregationLogic:
-    return ComparisonAggregationLogic()
+def argilla_aggregation_logic() -> ComparisonEvaluationAggregationLogic:
+    return ComparisonEvaluationAggregationLogic()
 
 
 def create_dummy_dataset(
@@ -165,7 +165,7 @@ def test_evaluate_run_submits_pairwise_comparison_records(
     in_memory_run_repository: InMemoryRunRepository,
     async_in_memory_evaluation_repository: AsyncInMemoryEvaluationRepository,
     in_memory_aggregation_repository: InMemoryAggregationRepository,
-    argilla_aggregation_logic: ComparisonAggregationLogic,
+    argilla_aggregation_logic: ComparisonEvaluationAggregationLogic,
     any_instruct_output: CompleteOutput,
     argilla_fake: ArgillaFake,
 ) -> None:
@@ -244,10 +244,10 @@ def test_elo_calculating_works_as_expected() -> None:
     player1 = "player1"
     player2 = "player2"
     matches = [
-        (
-            player1,
-            player2,
-            MatchOutcome.A_WINS,
+        ComparisonEvaluation(
+            first_player=player1,
+            second_player=player2,
+            outcome=MatchOutcome.A_WINS,
         )
         for _ in range(10)
     ]
@@ -258,10 +258,10 @@
     assert elo.ratings[player2] < 1500
 
     comeback_matches = [
-        (
-            player1,
-            player2,
-            MatchOutcome.B_WINS,
+        ComparisonEvaluation(
+            first_player=player1,
+            second_player=player2,
+            outcome=MatchOutcome.B_WINS,
         )
         for i in range(10)
     ]