Commit
feat: Add FailedExampleIterator for retrieval of failed examples and adapt user_journey.ipynb

Task IL-367
SebastianNiehusTNG committed Apr 3, 2024
1 parent ab51357 commit d8f0970
Showing 2 changed files with 77 additions and 57 deletions.
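
At a glance, the commit replaces the notebook's hand-rolled get_failed_examples helper with a reusable FailedExampleIterator in classify.py. A minimal usage sketch, mirroring the notebook cell added below (dataset_repository, run_repository, evaluation_repository and eval_overview are assumed to be set up in earlier notebook cells):

from intelligence_layer.use_cases.classify.classify import FailedExampleIterator

# Wire the iterator to the same repositories the run and evaluation used.
failed_example_iterator = FailedExampleIterator(
    dataset_repository, run_repository, evaluation_repository
)

# get_examples yields every example whose evaluation either failed outright or was
# judged incorrect; first_n (default 0, meaning no limit) caps how many are yielded.
failed_examples = list(failed_example_iterator.get_examples(eval_overview.id, first_n=3))
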
90 changes: 33 additions & 57 deletions src/examples/user_journey.ipynb
@@ -259,7 +259,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's aggregate all individual evaluations to get seom eval statistics."
"As you can see, there are quite some warnings telling us that some expected label is missing from the input labels for the evaluation.\n",
"We will have to take care of those cases at some point, but first let's look at some of the failed examples:"
]
},
{
@@ -268,20 +269,20 @@
"metadata": {},
"outputs": [],
"source": [
"aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)\n",
"aggregation_overview"
"from intelligence_layer.use_cases.classify.classify import FailedExampleIterator\n",
"\n",
"failed_example_iterator = FailedExampleIterator(\n",
" dataset_repository, run_repository, evaluation_repository\n",
")\n",
"list(failed_example_iterator.get_examples(eval_overview.id))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we only predicted around 25% of classes correctly.\n",
"\n",
"However, a closer look at the overview suggests that we have a bunch of incorrect labels in our test dataset.\n",
"We will fix this later.\n",
"\n",
"First, let's have a look at a few failed examples in detail."
"As you can see, for some of the examples the expected output is missing from the set of labels.\n",
"Let's aggregate our results to get some more statistics:"
]
},
{
@@ -290,53 +291,24 @@
"metadata": {},
"outputs": [],
"source": [
"from intelligence_layer.use_cases import (\n",
" SingleLabelClassifyOutput,\n",
" SingleLabelClassifyEvaluation,\n",
")\n",
"\n",
"\n",
"def get_failed_examples(run_id: str, eval_id: str, dataset_id: str, first_n: int):\n",
" overview = [\n",
" {\n",
" \"input\": example.input,\n",
" \"expected_output\": example.expected_output,\n",
" \"result\": sorted(\n",
" list(\n",
" next(\n",
" e\n",
" for e in run_repository.example_outputs(\n",
" run_id, SingleLabelClassifyOutput\n",
" )\n",
" if e.example_id == example.id\n",
" ).output.scores.items()\n",
" ),\n",
" key=lambda i: i[1],\n",
" reverse=True,\n",
" )[0],\n",
" \"eval\": evaluation_repository.example_evaluation(\n",
" evaluation_id=eval_id,\n",
" example_id=example.id,\n",
" evaluation_type=SingleLabelClassifyEvaluation,\n",
" ).result,\n",
" }\n",
" for example in dataset_repository.examples(\n",
" dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str\n",
" )\n",
" ]\n",
" return [e for e in overview if not e[\"eval\"].correct][:first_n]\n",
"\n",
"\n",
"get_failed_examples(run_overview.id, eval_overview.id, dataset_id, 3)"
"aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)\n",
"aggregation_overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This confirms it: some expected labels are missing. Let's try fixing this.\n",
"\n",
"We can do this two ways: Adjust our set of labels or adjust the eval set. In this case, we'll do the latter.\n"
"It looks like we only predicted around 30% of classes correctly.\n",
"And again, we get a warning about missing labels. We can even see those labels in the 'missing_labels' field of the 'statistics' section of the aggregation_overview.\n",
"So let's fix this. We can do this two ways: Adjust our set of labels or adjust the eval set. In this case, we'll do the latter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
@@ -447,7 +419,9 @@
"source": [
"Cool, this already got us up to 62%!\n",
"\n",
"Notice, how we don't actually tell our classification task, what each class means; we only supply it with all the labels.\n"
"Notice, how we don't actually tell our classification task, what each class means; we only supply it with all the labels.\n",
"\n",
"Another look at the remaining failed examples confirms that the failures are no longer because of the missing labels:"
]
},
{
@@ -456,12 +430,7 @@
"metadata": {},
"outputs": [],
"source": [
"get_failed_examples(\n",
" run_overview_prompt_adjusted.id,\n",
" eval_overview_prompt_adjusted.id,\n",
" cleaned_dataset_id,\n",
" 3,\n",
")"
"list(failed_example_iterator.get_examples(eval_overview_prompt_adjusted.id))"
]
},
{
@@ -518,6 +487,13 @@
"source": [
"aggregation_overview_with_extended"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
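
The notebook's actual dataset fix sits in a collapsed part of the diff above, but the markdown cell states the approach: adjust the eval set rather than the label set. A rough, hypothetical sketch of that cleaning step (labels is assumed to be the label set given to the classify task and dataset_id the original dataset id from earlier cells; only the examples() call is taken from the diff itself):

# Keep only examples whose expected label the classifier can actually predict.
cleaned_examples = [
    example
    for example in dataset_repository.examples(
        dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
    )
    if example.expected_output in labels
]
# The filtered examples would then be stored as a new, cleaned dataset against which
# the run, evaluation, and aggregation steps are repeated.
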
44 changes: 44 additions & 0 deletions src/intelligence_layer/use_cases/classify/classify.py
@@ -7,10 +7,15 @@
from intelligence_layer.core import TextChunk
from intelligence_layer.evaluation import (
AggregationLogic,
DatasetRepository,
EvaluationRepository,
Example,
MeanAccumulator,
RepositoryNavigator,
RunRepository,
SingleOutputEvaluationLogic,
)
from intelligence_layer.evaluation.evaluation.domain import FailedExampleEvaluation

Probability = NewType("Probability", float)

@@ -111,6 +116,11 @@ def aggregate(
confusion_matrix[(evaluation.predicted, evaluation.expected)] += 1
by_label[evaluation.predicted]["predicted"] += 1
by_label[evaluation.expected]["expected"] += 1

if len(missing_labels) > 0:
warn_message = "[WARNING] There were examples with expected labels missing in the evaluation inputs. For a detailed list, see the 'statistics.missing_labels' field of the returned `AggregationOverview`."
warnings.warn(warn_message, RuntimeWarning)

return AggregatedSingleLabelClassifyEvaluation(
percentage_correct=acc.extract(),
confusion_matrix=confusion_matrix,
@@ -158,6 +168,40 @@ def do_evaluate_single_output(
)


class FailedExampleIterator:
def __init__(
self,
dataset_repository: DatasetRepository,
run_repository: RunRepository,
evaluation_repository: EvaluationRepository,
):
self.repository_navigator = RepositoryNavigator(
dataset_repository, run_repository, evaluation_repository
)

# TODO: Add test
def get_examples(
self, evaluation_overview_id: str, first_n: int = 0
) -> Iterable[Example[ClassifyInput, str]]:
evaluation_lineages = self.repository_navigator.evaluation_lineages(
evaluation_id=evaluation_overview_id,
input_type=ClassifyInput,
expected_output_type=str,
output_type=SingleLabelClassifyOutput,
evaluation_type=SingleLabelClassifyEvaluation,
)
count_yielded = 0
for lineage in evaluation_lineages:
if first_n != 0 and count_yielded >= first_n:
break
if (
isinstance(lineage.evaluation.result, FailedExampleEvaluation)
or not lineage.evaluation.result.correct
):
count_yielded += 1
yield lineage.example


class MultiLabelClassifyEvaluation(BaseModel):
"""The evaluation of a single multi-label classification example.
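
One consequence of the new aggregation warning above: because it is emitted as a RuntimeWarning, callers can escalate it when missing labels should never pass silently. A small sketch (not part of the commit) using only the standard warnings module, with aggregator and eval_overview as in the notebook:

import warnings

# Turn the missing-label RuntimeWarning into a hard error, e.g. in CI, so a dataset
# whose expected labels are unknown to the classifier cannot slip through unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
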
