IL-374 Better rendering (#698)
Co-authored-by: FelixFehse <[email protected]>
FelixFehse and FelixFehseTNG authored Apr 4, 2024
1 parent 50317ec commit 053725a
Showing 2 changed files with 89 additions and 88 deletions.
138 changes: 89 additions & 49 deletions src/examples/user_journey.ipynb
@@ -17,6 +17,7 @@
" InMemoryEvaluationRepository,\n",
" InMemoryRunRepository,\n",
" Runner,\n",
" evaluation_lineages_to_pandas,\n",
")\n",
"from intelligence_layer.use_cases import (\n",
" ClassifyInput,\n",
@@ -38,13 +39,15 @@
"\n",
"In the fast-paced world of business, effectively managing incoming support emails is crucial. The ability to quickly and accurately classify these emails into the appropriate department and determine their urgency is not just a matter of operational efficiency; it directly impacts customer satisfaction and overall business success. Given the high stakes, it's essential to rigorously evaluate any solution designed to automate this process. This tutorial focuses on the evaluation of a LLM-based program developed to automate the classification of support emails.\n",
"\n",
"In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance.\n"
"In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup and Evaluation\n",
"\n",
"To start off, we are only given a few anecdotal examples.\n",
"Firstly, there are two e-mails, and secondly a number of potential departments to which they should be sent.\n",
"\n",
@@ -130,18 +133,6 @@
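The setup cells that define the department labels and the two example e-mails are collapsed in this hunk. For orientation, here is a minimal sketch of what such a setup could look like; the department names, e-mail texts, and the use of a frozenset are hypothetical assumptions, and only `ClassifyInput` and `TextChunk` are taken from the imports shown in the notebook.

```python
from intelligence_layer.core import TextChunk
from intelligence_layer.use_cases import ClassifyInput

# Hypothetical department labels; the notebook defines its own set (collapsed here).
# That labels is a frozenset is also an assumption about the expected type.
labels = frozenset(
    {
        "Customer Service",
        "Finance and Accounting",
        "Human Resources",
        "IT Support",
        "Sales",
    }
)

# Two hypothetical support e-mails standing in for the anecdotal examples.
emails = [
    "Hello, I still have not received the refund for my cancelled order.",
    "Hi, my laptop no longer connects to the office VPN since this morning.",
]

# Each e-mail becomes a ClassifyInput: the text chunk plus the set of allowed labels.
classify_inputs = [
    ClassifyInput(chunk=TextChunk(email), labels=labels) for email in emails
]
```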
"For this, we need to do some eval. Luckily, we have by now got access to a few more examples...\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/classify_examples.json\", \"r\") as file:\n",
" labeled_examples: list[dict[str, str]] = json.load(file)\n",
"\n",
"labeled_examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -182,14 +173,16 @@
"source": [
"dataset_repository = InMemoryDatasetRepository()\n",
"\n",
"examples = [\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
"]\n",
"\n",
"dataset_id = dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
" ],\n",
" examples=examples,\n",
" dataset_name=\"MyDataset\",\n",
").id"
]
@@ -286,6 +279,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The evaluation throws many warnings and we will take care of them below.\n",
"\n",
"Finally, let's aggregate all individual evaluations to get some eval statistics."
]
},
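The cell that runs the aggregation is collapsed in this hunk. As a rough sketch of what that step can look like with the Intelligence Layer's evaluation module — the exact constructor arguments and the `SingleLabelClassifyAggregationLogic` wiring are assumptions, so check them against the installed version — one might write:

```python
from intelligence_layer.evaluation import Aggregator, InMemoryAggregationRepository
from intelligence_layer.use_cases import SingleLabelClassifyAggregationLogic

# Assumption: `evaluation_repository` and `eval_overview` exist from the
# (collapsed) evaluation cells earlier in the notebook.
aggregation_repository = InMemoryAggregationRepository()
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify-aggregation",
    SingleLabelClassifyAggregationLogic(),
)

# Aggregate the individual example evaluations into summary statistics (accuracy etc.).
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
aggregation_overview
```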
@@ -305,10 +300,11 @@
"source": [
"It looks like we only predicted around 25% of classes correctly.\n",
"\n",
"However, a closer look at the overview suggests that we have a bunch of incorrect labels in our test dataset.\n",
"We will fix this later.\n",
"Again, we get warnings that there are examples for which the expected labels are not part of the labels that the model can predict.\n",
"\n",
"## Fixing the Data\n",
"\n",
"First, let's have a look at a few failed examples in detail."
"Let's have a look at a few failed examples in detail:"
]
},
{
@@ -317,23 +313,61 @@
"metadata": {},
"outputs": [],
"source": [
"from intelligence_layer.use_cases.classify.classify import (\n",
" SingleLabelClassifyFailedExampleIterator,\n",
")\n",
"# from intelligence_layer.evaluation import evaluation_lineages_to_pandas\n",
"\n",
"failed_example_iterator = SingleLabelClassifyFailedExampleIterator(\n",
" dataset_repository, run_repository, evaluation_repository\n",
")\n",
"list(failed_example_iterator.get_examples(eval_overview.id))"
"\n",
"from intelligence_layer.evaluation import FailedExampleEvaluation\n",
"\n",
"passed_lineages = [\n",
" lineage\n",
" for lineage in evaluator.evaluation_lineages(eval_overview.id)\n",
" if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)\n",
"]\n",
"\n",
"\n",
"lineages = [\n",
" lineage for lineage in passed_lineages if not lineage.evaluation.result.correct\n",
"][:2]\n",
"\n",
"\n",
"for lineage in lineages:\n",
" display(lineage)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This confirms it: some expected labels are missing. Let's try fixing this.\n",
"This confirms it: The first example has an expected label \"IT Support\". However, this label is not listed in the set of labels our model can predict for that example.\n",
"\n",
"We can do this two ways: Adjust our set of labels or adjust the eval set. In this case, we'll do the latter.\n"
"Let's see how often this is the case and which are the invalid expected labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lineages = [\n",
" lineage\n",
" for lineage in passed_lineages\n",
" if lineage.evaluation.result.expected_label_missing\n",
"]\n",
"\n",
"print(\n",
" f\"Number of examples with invalid expected label: {len(lineages)} out of {len(passed_lineages)}\"\n",
")\n",
"print(\n",
" f\"Invalid expected labels: {set([lineage.example.expected_output for lineage in lineages])}\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can fix this in two ways: Add the missing labels to the set of allowed labels, or change the expected label to the closes matching available label. In this case, we'll do the latter."
]
},
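For completeness, here is a sketch of the first option — extending the allowed label set instead of rewriting the expected outputs. It assumes `labels`, `examples`, and `dataset_repository` from the earlier cells, and the extra department names simply mirror the invalid expected labels found above.

```python
from intelligence_layer.evaluation import Example
from intelligence_layer.use_cases import ClassifyInput

# Extra departments mirroring the invalid expected labels found above.
additional_labels = {"IT Support", "Customer", "Finance"}

# The label set is treated as immutable, so build an extended copy.
extended_labels = frozenset(labels) | additional_labels

# Rebuild the examples so each ClassifyInput carries the extended label set.
extended_examples = [
    Example(
        input=ClassifyInput(chunk=example.input.chunk, labels=extended_labels),
        expected_output=example.expected_output,
    )
    for example in examples
]

# As with the cleaned dataset below, a new dataset has to be created.
extended_dataset_id = dataset_repository.create_dataset(
    examples=extended_examples,
    dataset_name="ExtendedLabelsDataset",
).id
```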
{
@@ -351,20 +385,14 @@
" \"Finance\": \"Finance and Accounting\",\n",
"}\n",
"\n",
"for example in labeled_examples:\n",
" label = example[\"label\"]\n",
" if label in label_map.keys():\n",
" example[\"label\"] = label_map[label]\n",
"# we update the existing examples inplace with the correct labels\n",
"for example in examples:\n",
" if example.expected_output in label_map.keys():\n",
" example.expected_output = label_map[example.expected_output]\n",
"\n",
"# datasets in the IL are immutable, so we must create a new one\n",
"cleaned_dataset_id = dataset_repository.create_dataset(\n",
" examples=[\n",
" Example(\n",
" input=ClassifyInput(chunk=TextChunk(example[\"message\"]), labels=labels),\n",
" expected_output=example[\"label\"],\n",
" )\n",
" for example in labeled_examples\n",
" ],\n",
" examples=examples,\n",
" dataset_name=\"CleanedDataset\",\n",
").id"
]
@@ -373,6 +401,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improving the Prompt\n",
"\n",
"The prompt used for the `PromptBasedClassify`-task looks as follows:"
]
},
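The cell that prints the prompt is collapsed here. If you want to experiment with the wording, below is a hedged sketch of how an adjusted instruction might be passed to the task; whether the constructor actually accepts an `instruction` argument is an assumption, so verify it against the `PromptBasedClassify` signature in your installed version.

```python
from intelligence_layer.use_cases import PromptBasedClassify

# Hypothetical adjusted instruction; the `instruction` keyword is an assumption
# about the constructor and may differ in the actual library.
adjusted_instruction = (
    "Identify the department that best matches the customer message. "
    "Reply with exactly one of the given labels."
)
prompt_adjusted_classify = PromptBasedClassify(instruction=adjusted_instruction)
```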
@@ -442,7 +472,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool, this already got us up to 58%!\n",
"Our adjustments improved the accuracy to 58%!\n",
"\n",
"So far, we only used the `luminous-base-control` model. Let's see if we can improve our classifications by upgrading to a bigger model!"
]
@@ -497,9 +527,7 @@
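The cell that swaps in the larger model is collapsed in this hunk. A minimal sketch of what such a swap might look like — `LuminousControlModel`, the `luminous-supreme-control` name, and the `model` keyword are assumptions drawn from the Aleph Alpha model family rather than from the collapsed cell itself:

```python
from intelligence_layer.core import LuminousControlModel
from intelligence_layer.use_cases import PromptBasedClassify

# Assumption: PromptBasedClassify accepts a control model instance via `model=`;
# the collapsed notebook cell may wire this differently.
bigger_model = LuminousControlModel("luminous-supreme-control")
classify_with_bigger_model = PromptBasedClassify(model=bigger_model)
```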
"cell_type": "markdown",
"metadata": {},
"source": [
"So using a bigger model further improved our results to 66.66%.\n",
"\n",
"As you can see there are plenty of option on how to further enhance the accuracy of our classify task. Notice, for instance, that so far we did not tell our classification task what each class means."
"So using a bigger model further improved our results to 67%. But there are still wrongly predicted labels:"
]
},
{
@@ -508,13 +536,25 @@
"metadata": {},
"outputs": [],
"source": [
"list(failed_example_iterator.get_examples(eval_overview_prompt_adjusted.id))"
"lineages = [\n",
" lineage\n",
" for lineage in evaluator.evaluation_lineages(eval_overview_prompt_adjusted.id)\n",
" if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)\n",
" and not lineage.evaluation.result.correct\n",
"]\n",
"\n",
"df = evaluation_lineages_to_pandas(lineages)\n",
"df[\"input\"] = [i.chunk for i in df[\"input\"]]\n",
"df[\"predicted\"] = [r.predicted for r in df[\"result\"]]\n",
"df.reset_index()[[\"example_id\", \"input\", \"expected_output\", \"predicted\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see there are plenty of option on how to further enhance the accuracy of our classify task. Notice, for instance, that so far we did not tell our classification task what each class means.\n",
"\n",
"The model had to 'guess' what we mean by each class purely from the given labels. In order to tackle this issue you could use the `PromptBasedClassifyWithDefinitions` task. This task allows you to also provide a short description for each class.\n",
"\n",
"Feel free to further play around and improve our classification example. "
@@ -537,7 +577,7 @@
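As a starting point for that experiment, here is a hedged sketch of how class definitions might be supplied. `LabelWithDefinition`, its import location, and the constructor shape of `PromptBasedClassifyWithDefinitions` are assumptions about the classify module, and the department descriptions are made up.

```python
from intelligence_layer.use_cases import (
    LabelWithDefinition,
    PromptBasedClassifyWithDefinitions,
)

# Hypothetical label definitions; names and descriptions are illustrative only.
labels_with_definitions = [
    LabelWithDefinition(
        name="Customer Service",
        definition="Questions and complaints about orders, deliveries and refunds.",
    ),
    LabelWithDefinition(
        name="IT Support",
        definition="Technical issues with devices, accounts, networks or software.",
    ),
]

# Assumption: the task takes the label definitions at construction time.
classify_with_definitions = PromptBasedClassifyWithDefinitions(labels_with_definitions)
```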
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.11.8"
}
},
"nbformat": 4,
39 changes: 0 additions & 39 deletions src/intelligence_layer/use_cases/classify/classify.py
@@ -7,15 +7,10 @@
from intelligence_layer.core import TextChunk
from intelligence_layer.evaluation import (
AggregationLogic,
DatasetRepository,
EvaluationRepository,
Example,
MeanAccumulator,
RepositoryNavigator,
RunRepository,
SingleOutputEvaluationLogic,
)
from intelligence_layer.evaluation.evaluation.domain import FailedExampleEvaluation

Probability = NewType("Probability", float)

@@ -168,40 +163,6 @@ def do_evaluate_single_output(
)


class SingleLabelClassifyFailedExampleIterator:
def __init__(
self,
dataset_repository: DatasetRepository,
run_repository: RunRepository,
evaluation_repository: EvaluationRepository,
):
self.repository_navigator = RepositoryNavigator(
dataset_repository, run_repository, evaluation_repository
)

# TODO: Add test
def get_examples(
self, evaluation_overview_id: str, first_n: int = 0
) -> Iterable[Example[ClassifyInput, str]]:
evaluation_lineages = self.repository_navigator.evaluation_lineages(
evaluation_id=evaluation_overview_id,
input_type=ClassifyInput,
expected_output_type=str,
output_type=SingleLabelClassifyOutput,
evaluation_type=SingleLabelClassifyEvaluation,
)
count_yielded = 0
for lineage in evaluation_lineages:
if first_n != 0 and count_yielded >= first_n:
break
if (
isinstance(lineage.evaluation.result, FailedExampleEvaluation)
or not lineage.evaluation.result.correct
):
count_yielded += 1
yield lineage.example


class MultiLabelClassifyEvaluation(BaseModel):
"""The evaluation of a single multi-label classification example.
