IL-374 Better rendering (#698)
Co-authored-by: FelixFehse <[email protected]>
FelixFehse and FelixFehseTNG authored Apr 4, 2024
1 parent 50317ec commit 053725a
Showing 2 changed files with 89 additions and 88 deletions.
138 changes: 89 additions & 49 deletions src/examples/user_journey.ipynb
@@ -17,6 +17,7 @@
" InMemoryEvaluationRepository,\n",
" InMemoryRunRepository,\n",
" Runner,\n",
" evaluation_lineages_to_pandas,\n",
")\n",
"from intelligence_layer.use_cases import (\n",
" ClassifyInput,\n",
@@ -38,13 +39,15 @@
"\n",
"In the fast-paced world of business, effectively managing incoming support emails is crucial. The ability to quickly and accurately classify these emails into the appropriate department and determine their urgency is not just a matter of operational efficiency; it directly impacts customer satisfaction and overall business success. Given the high stakes, it's essential to rigorously evaluate any solution designed to automate this process. This tutorial focuses on the evaluation of a LLM-based program developed to automate the classification of support emails.\n",
"\n",
"In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance.\n"
"In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup and Evaluation\n",
"\n",
"To start off, we are only given a few anecdotal examples.\n",
"Firstly, there are two e-mails, and secondly a number of potential departments to which they should be sent.\n",
"\n",
@@ -130,18 +133,6 @@
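The setup cells that define the department labels and the two example e-mails are collapsed in this hunk. For orientation, here is a minimal sketch of what such a setup could look like; the department names, e-mail texts, and the use of a frozenset are hypothetical assumptions, and only `ClassifyInput` and `TextChunk` are taken from the imports shown in the notebook.

```python
from intelligence_layer.core import TextChunk
from intelligence_layer.use_cases import ClassifyInput

# Hypothetical department labels; the notebook defines its own set (collapsed here).
# That labels is a frozenset is also an assumption about the expected type.
labels = frozenset(
    {
        "Customer Service",
        "Finance and Accounting",
        "Human Resources",
        "IT Support",
        "Sales",
    }
)

# Two hypothetical support e-mails standing in for the anecdotal examples.
emails = [
    "Hello, I still have not received the refund for my cancelled order.",
    "Hi, my laptop no longer connects to the office VPN since this morning.",
]

# Each e-mail becomes a ClassifyInput: the text chunk plus the set of allowed labels.
classify_inputs = [
    ClassifyInput(chunk=TextChunk(email), labels=labels) for email in emails
]
```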
"For this, we need to do some eval. Luckily, we have by now got access to a few more examples...\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/classify_examples.json\", \"r\") as file:\n",
" labeled_examples: list[dict[str, str]] = json.load(file)\n",
"\n",
"labeled_examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -182,14 +173,16 @@
"source": [
"dataset_repository = InMemoryDatasetRepository()\n",
"\n",
"examples = [\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
"]\n",
"\n",
"dataset_id = dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
" ],\n",
" examples=examples,\n",
" dataset_name=\"MyDataset\",\n",
").id"
]
@@ -286,6 +279,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The evaluation throws many warnings and we will take care of them below.\n",
"\n",
"Finally, let's aggregate all individual evaluations to get some eval statistics."
]
},
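The cell that runs the aggregation is collapsed in this hunk. As a rough sketch of what that step can look like with the Intelligence Layer's evaluation module — the exact constructor arguments and the `SingleLabelClassifyAggregationLogic` wiring are assumptions, so check them against the installed version — one might write:

```python
from intelligence_layer.evaluation import Aggregator, InMemoryAggregationRepository
from intelligence_layer.use_cases import SingleLabelClassifyAggregationLogic

# Assumption: `evaluation_repository` and `eval_overview` exist from the
# (collapsed) evaluation cells earlier in the notebook.
aggregation_repository = InMemoryAggregationRepository()
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify-aggregation",
    SingleLabelClassifyAggregationLogic(),
)

# Aggregate the individual example evaluations into summary statistics (accuracy etc.).
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
aggregation_overview
```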
@@ -305,10 +300,11 @@
"source": [
"It looks like we only predicted around 25% of classes correctly.\n",
"\n",
"However, a closer look at the overview suggests that we have a bunch of incorrect labels in our test dataset.\n",
"We will fix this later.\n",
"Again, we get warnings that there are examples for which the expected labels are not part of the labels that the model can predict.\n",
"\n",
"## Fixing the Data\n",
"\n",
"First, let's have a look at a few failed examples in detail."
"Let's have a look at a few failed examples in detail:"
]
},
{
@@ -317,23 +313,61 @@
"metadata": {},
"outputs": [],
"source": [
"from intelligence_layer.use_cases.classify.classify import (\n",
" SingleLabelClassifyFailedExampleIterator,\n",
")\n",
"# from intelligence_layer.evaluation import evaluation_lineages_to_pandas\n",
"\n",
"failed_example_iterator = SingleLabelClassifyFailedExampleIterator(\n",
" dataset_repository, run_repository, evaluation_repository\n",
")\n",
"list(failed_example_iterator.get_examples(eval_overview.id))"
"\n",
"from intelligence_layer.evaluation import FailedExampleEvaluation\n",
"\n",
"passed_lineages = [\n",
" lineage\n",
" for lineage in evaluator.evaluation_lineages(eval_overview.id)\n",
" if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)\n",
"]\n",
"\n",
"\n",
"lineages = [\n",
" lineage for lineage in passed_lineages if not lineage.evaluation.result.correct\n",
"][:2]\n",
"\n",
"\n",
"for lineage in lineages:\n",
" display(lineage)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This confirms it: some expected labels are missing. Let's try fixing this.\n",
"This confirms it: The first example has an expected label \"IT Support\". However, this label is not listed in the set of labels our model can predict for that example.\n",
"\n",
"We can do this two ways: Adjust our set of labels or adjust the eval set. In this case, we'll do the latter.\n"
"Let's see how often this is the case and which are the invalid expected labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lineages = [\n",
" lineage\n",
" for lineage in passed_lineages\n",
" if lineage.evaluation.result.expected_label_missing\n",
"]\n",
"\n",
"print(\n",
" f\"Number of examples with invalid expected label: {len(lineages)} out of {len(passed_lineages)}\"\n",
")\n",
"print(\n",
" f\"Invalid expected labels: {set([lineage.example.expected_output for lineage in lineages])}\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can fix this in two ways: Add the missing labels to the set of allowed labels, or change the expected label to the closes matching available label. In this case, we'll do the latter."
]
},
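For completeness, here is a sketch of the first option — extending the allowed label set instead of rewriting the expected outputs. It assumes `labels`, `examples`, and `dataset_repository` from the earlier cells, and the extra department names simply mirror the invalid expected labels found above.

```python
from intelligence_layer.evaluation import Example
from intelligence_layer.use_cases import ClassifyInput

# Extra departments mirroring the invalid expected labels found above.
additional_labels = {"IT Support", "Customer", "Finance"}

# The label set is treated as immutable, so build an extended copy.
extended_labels = frozenset(labels) | additional_labels

# Rebuild the examples so each ClassifyInput carries the extended label set.
extended_examples = [
    Example(
        input=ClassifyInput(chunk=example.input.chunk, labels=extended_labels),
        expected_output=example.expected_output,
    )
    for example in examples
]

# As with the cleaned dataset below, a new dataset has to be created.
extended_dataset_id = dataset_repository.create_dataset(
    examples=extended_examples,
    dataset_name="ExtendedLabelsDataset",
).id
```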
{
@@ -351,20 +385,14 @@
" \"Finance\": \"Finance and Accounting\",\n",
"}\n",
"\n",
"for example in labeled_examples:\n",
" label = example[\"label\"]\n",
" if label in label_map.keys():\n",
" example[\"label\"] = label_map[label]\n",
"# we update the existing examples inplace with the correct labels\n",
"for example in examples:\n",
" if example.expected_output in label_map.keys():\n",
" example.expected_output = label_map[example.expected_output]\n",
"\n",
"# datasets in the IL are immutable, so we must create a new one\n",
"cleaned_dataset_id = dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
" ],\n",
" examples=examples,\n",
" dataset_name=\"CleanedDataset\",\n",
").id"
]
@@ -373,6 +401,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improving the Prompt\n",
"\n",
"The prompt used for the `PromptBasedClassify`-task looks as follows:"
]
},
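The cell that prints the prompt is collapsed here. If you want to experiment with the wording, below is a hedged sketch of how an adjusted instruction might be passed to the task; whether the constructor actually accepts an `instruction` argument is an assumption, so verify it against the `PromptBasedClassify` signature in your installed version.

```python
from intelligence_layer.use_cases import PromptBasedClassify

# Hypothetical adjusted instruction; the `instruction` keyword is an assumption
# about the constructor and may differ in the actual library.
adjusted_instruction = (
    "Identify the department that best matches the customer message. "
    "Reply with exactly one of the given labels."
)
prompt_adjusted_classify = PromptBasedClassify(instruction=adjusted_instruction)
```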
@@ -442,7 +472,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool, this already got us up to 58%!\n",
"Our adjustments improved the accuracy to 58%!\n",
"\n",
"So far, we only used the `luminous-base-control` model. Let's see if we can improve our classifications by upgrading to a bigger model!"
]
@@ -497,9 +527,7 @@
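The cell that swaps in the larger model is collapsed in this hunk. A minimal sketch of what such a swap might look like — `LuminousControlModel`, the `luminous-supreme-control` name, and the `model` keyword are assumptions drawn from the Aleph Alpha model family rather than from the collapsed cell itself:

```python
from intelligence_layer.core import LuminousControlModel
from intelligence_layer.use_cases import PromptBasedClassify

# Assumption: PromptBasedClassify accepts a control model instance via `model=`;
# the collapsed notebook cell may wire this differently.
bigger_model = LuminousControlModel("luminous-supreme-control")
classify_with_bigger_model = PromptBasedClassify(model=bigger_model)
```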
"cell_type": "markdown",
"metadata": {},
"source": [
"So using a bigger model further improved our results to 66.66%.\n",
"\n",
"As you can see there are plenty of option on how to further enhance the accuracy of our classify task. Notice, for instance, that so far we did not tell our classification task what each class means."
"So using a bigger model further improved our results to 67%. But there are still wrongly predicted labels:"
]
},
{
@@ -508,13 +536,25 @@
"metadata": {},
"outputs": [],
"source": [
"list(failed_example_iterator.get_examples(eval_overview_prompt_adjusted.id))"
"lineages = [\n",
" lineage\n",
" for lineage in evaluator.evaluation_lineages(eval_overview_prompt_adjusted.id)\n",
" if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)\n",
" and not lineage.evaluation.result.correct\n",
"]\n",
"\n",
"df = evaluation_lineages_to_pandas(lineages)\n",
"df[\"input\"] = [i.chunk for i in df[\"input\"]]\n",
"df[\"predicted\"] = [r.predicted for r in df[\"result\"]]\n",
"df.reset_index()[[\"example_id\", \"input\", \"expected_output\", \"predicted\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see there are plenty of option on how to further enhance the accuracy of our classify task. Notice, for instance, that so far we did not tell our classification task what each class means.\n",
"\n",
"The model had to 'guess' what we mean by each class purely from the given labels. In order to tackle this issue you could use the `PromptBasedClassifyWithDefinitions` task. This task allows you to also provide a short description for each class.\n",
"\n",
"Feel free to further play around and improve our classification example. "
@@ -537,7 +577,7 @@
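As a starting point for that experiment, here is a hedged sketch of how class definitions might be supplied. `LabelWithDefinition`, its import location, and the constructor shape of `PromptBasedClassifyWithDefinitions` are assumptions about the classify module, and the department descriptions are made up.

```python
from intelligence_layer.use_cases import (
    LabelWithDefinition,
    PromptBasedClassifyWithDefinitions,
)

# Hypothetical label definitions; names and descriptions are illustrative only.
labels_with_definitions = [
    LabelWithDefinition(
        name="Customer Service",
        definition="Questions and complaints about orders, deliveries and refunds.",
    ),
    LabelWithDefinition(
        name="IT Support",
        definition="Technical issues with devices, accounts, networks or software.",
    ),
]

# Assumption: the task takes the label definitions at construction time.
classify_with_definitions = PromptBasedClassifyWithDefinitions(labels_with_definitions)
```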
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.11.8"
}
},
"nbformat": 4,
39 changes: 0 additions & 39 deletions src/intelligence_layer/use_cases/classify/classify.py
@@ -7,15 +7,10 @@
from intelligence_layer.core import TextChunk
from intelligence_layer.evaluation import (
AggregationLogic,
DatasetRepository,
EvaluationRepository,
Example,
MeanAccumulator,
RepositoryNavigator,
RunRepository,
SingleOutputEvaluationLogic,
)
from intelligence_layer.evaluation.evaluation.domain import FailedExampleEvaluation

Probability = NewType("Probability", float)

@@ -168,40 +163,6 @@ def do_evaluate_single_output(
)


class SingleLabelClassifyFailedExampleIterator:
def __init__(
self,
dataset_repository: DatasetRepository,
run_repository: RunRepository,
evaluation_repository: EvaluationRepository,
):
self.repository_navigator = RepositoryNavigator(
dataset_repository, run_repository, evaluation_repository
)

# TODO: Add test
def get_examples(
self, evaluation_overview_id: str, first_n: int = 0
) -> Iterable[Example[ClassifyInput, str]]:
evaluation_lineages = self.repository_navigator.evaluation_lineages(
evaluation_id=evaluation_overview_id,
input_type=ClassifyInput,
expected_output_type=str,
output_type=SingleLabelClassifyOutput,
evaluation_type=SingleLabelClassifyEvaluation,
)
count_yielded = 0
for lineage in evaluation_lineages:
if first_n != 0 and count_yielded >= first_n:
break
if (
isinstance(lineage.evaluation.result, FailedExampleEvaluation)
or not lineage.evaluation.result.correct
):
count_yielded += 1
yield lineage.example


class MultiLabelClassifyEvaluation(BaseModel):
"""The evaluation of a single multi-label classification example.
