diff --git a/docs/docs/examples/evaluation/retrieval/retriever_eval.ipynb b/docs/docs/examples/evaluation/retrieval/retriever_eval.ipynb
index 20125832002c7..04fdac36ddf99 100644
--- a/docs/docs/examples/evaluation/retrieval/retriever_eval.ipynb
+++ b/docs/docs/examples/evaluation/retrieval/retriever_eval.ipynb
@@ -18,7 +18,7 @@
"\n",
"This notebook uses our `RetrieverEvaluator` to evaluate the quality of any Retriever module defined in LlamaIndex.\n",
"\n",
- "We specify a set of different evaluation metrics: this includes hit-rate and MRR. For any given question, these will compare the quality of retrieved results from the ground-truth context.\n",
+ "We specify a set of evaluation metrics, including hit-rate, MRR, and NDCG. For any given question, these compare the quality of the retrieved results against the ground-truth context.\n",
"\n",
"To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation."
]
@@ -40,13 +40,14 @@
"metadata": {},
"outputs": [],
"source": [
- "%pip install llama-index-llms-openai"
+ "%pip install llama-index-llms-openai\n",
+ "%pip install llama-index-readers-file"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "bb6fecf4-7215-4ae9-b02b-3cb7c6000f2c",
+ "id": "285cfab2",
"metadata": {},
"outputs": [],
"source": [
@@ -62,7 +63,6 @@
"metadata": {},
"outputs": [],
"source": [
- "from llama_index.core.evaluation import generate_question_context_pairs\n",
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"from llama_index.llms.openai import OpenAI"
@@ -82,7 +82,25 @@
"execution_count": null,
"id": "589c112d",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--2024-06-12 23:57:02-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
+ "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n",
+ "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 75042 (73K) [text/plain]\n",
+ "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n",
+ "\n",
+ "data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.08s \n",
+ "\n",
+ "2024-06-12 23:57:03 (864 KB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n",
+ "\n"
+ ]
+ }
+ ],
"source": [
"!mkdir -p 'data/paul_graham/'\n",
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
@@ -171,15 +189,11 @@
{
"data": {
"text/markdown": [
- "**Node ID:** node_0<br>**Similarity:** 0.8181379514114543<br>**Text:** What I Worked On\n",
- "\n",
- "February 2021\n",
+ "**Node ID:** node_38<br>**Similarity:** 0.814377909267451<br>**Text:** I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.\n",
"\n",
- "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n",
- "\n",
- "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\" This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n",
+ "One night in October 2003 there was a big party at my house. It was a clever idea of my friend Maria Daniels, who was one of the thursday diners. Three separate hosts would all invite their friends to one party. So for every guest, two thirds of the other guests would be people they didn't know but would probably like. One of the guests was someone I didn't know but would turn out to like a lot: a woman called Jessica Livingston. A couple days later I asked her out.\n",
"\n",
- "The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...<br>"
+ "Jessica was in charge of marketing at a Boston investment bank. This bank thought it understood startups, but over the next year, as she met friends of mine from the startup world, she was surprised how different reality was. And ho...<br>"
],
"text/plain": [
""
@@ -191,13 +205,15 @@
{
"data": {
"text/markdown": [
- "**Node ID:** node_52<br>**Similarity:** 0.8143530600618721<br>**Text:** It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n",
+ "**Node ID:** node_0<br>**Similarity:** 0.8122448657654567<br>**Text:** What I Worked On\n",
"\n",
- "In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n",
+ "February 2021\n",
+ "\n",
+ "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n",
"\n",
- "In the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n",
+ "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\" This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n",
"\n",
- "Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that ques...<br>"
+ "The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...<br>"
],
"text/plain": [
""
@@ -246,7 +262,15 @@
"execution_count": null,
"id": "2d29a159-9a4f-4d44-9c0d-1cd683f8bb9b",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 61/61 [04:59<00:00, 4.91s/it]\n"
+ ]
+ }
+ ],
"source": [
"qa_dataset = generate_question_context_pairs(\n",
" nodes, llm=llm, num_questions_per_chunk=2\n",
@@ -263,7 +287,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "\"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. What were the key differences in terms of user interaction and programming capabilities?\"\n"
+ "\"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. How did this change impact the way programs were written and executed?\"\n"
]
}
],
@@ -319,7 +343,7 @@
"metadata": {},
"outputs": [],
"source": [
- "include_cohere_rerank = True\n",
+ "include_cohere_rerank = False\n",
"\n",
"if include_cohere_rerank:\n",
" !pip install cohere -q"
@@ -334,7 +358,7 @@
"source": [
"from llama_index.core.evaluation import RetrieverEvaluator\n",
"\n",
- "metrics = [\"mrr\", \"hit_rate\"]\n",
+ "metrics = [\"mrr\", \"hit_rate\", \"ndcg\"]\n",
"\n",
"if include_cohere_rerank:\n",
" metrics.append(\n",
@@ -356,8 +380,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Query: In the context provided, the author describes his early experiences with programming on an IBM 1401. Based on his description, what were some of the limitations and challenges he faced while trying to write programs on this machine?\n",
- "Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'cohere_rerank_relevancy': 0.99620515}\n",
+ "Query: In the context, the author mentions his early experiences with programming on an IBM 1401. Describe the process he used to write and run a program on this machine, and explain why he found it challenging to create meaningful programs on this system.\n",
+ "Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'ndcg': 0.6131471927654584}\n",
"\n"
]
}
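Reviewer note (not part of the diff): as a sanity check on the per-query `ndcg` value printed above — `mrr` and `hit_rate` of 1.0 indicate the expected node came back at rank 1, and this is the top-2 retriever — the linear, binary-relevance NDCG added in this patch reduces to the small computation below.

```python
from math import log2

# One relevant document, retrieved at rank 1 of the two returned results.
dcg = 1 / log2(1 + 1)
# Ideal DCG is summed over the same number of retrieved positions (two here).
idcg = 1 / log2(1 + 1) + 1 / log2(2 + 1)
print(dcg / idcg)  # 0.6131471927654584 -- matches the query output above
```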
@@ -402,9 +426,10 @@
"\n",
" full_df = pd.DataFrame(metric_dicts)\n",
"\n",
- " hit_rate = full_df[\"hit_rate\"].mean()\n",
- " mrr = full_df[\"mrr\"].mean()\n",
- " columns = {\"retrievers\": [name], \"hit_rate\": [hit_rate], \"mrr\": [mrr]}\n",
+ " columns = {\n",
+ " \"retrievers\": [name],\n",
+ " **{k: [full_df[k].mean()] for k in metrics},\n",
+ " }\n",
"\n",
" if include_cohere_rerank:\n",
" crr_relevancy = full_df[\"cohere_rerank_relevancy\"].mean()\n",
@@ -443,26 +468,26 @@
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>retrievers</th>\n",
- "      <th>hit_rate</th>\n",
"      <th>mrr</th>\n",
- "      <th>cohere_rerank_relevancy</th>\n",
+ "      <th>hit_rate</th>\n",
+ "      <th>ndcg</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>top-2 eval</td>\n",
- "      <td>0.801724</td>\n",
- "      <td>0.685345</td>\n",
- "      <td>0.946009</td>\n",
+ "      <td>0.643443</td>\n",
+ "      <td>0.745902</td>\n",
+ "      <td>0.410976</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
- " retrievers hit_rate mrr cohere_rerank_relevancy\n",
- "0 top-2 eval 0.801724 0.685345 0.946009"
+ " retrievers mrr hit_rate ndcg\n",
+ "0 top-2 eval 0.643443 0.745902 0.410976"
]
},
"execution_count": null,
diff --git a/llama-index-core/llama_index/core/evaluation/retrieval/metrics.py b/llama-index-core/llama_index/core/evaluation/retrieval/metrics.py
index c4f880ddf5327..a74049f513de2 100644
--- a/llama-index-core/llama_index/core/evaluation/retrieval/metrics.py
+++ b/llama-index-core/llama_index/core/evaluation/retrieval/metrics.py
@@ -1,3 +1,4 @@
+import math
import os
from typing import Any, Callable, ClassVar, Dict, List, Literal, Optional, Type
@@ -7,6 +8,7 @@
BaseRetrievalMetric,
RetrievalMetricResult,
)
+from typing_extensions import assert_never
_AGG_FUNC: Dict[str, Callable] = {"mean": np.mean, "median": np.median, "max": np.max}
@@ -18,8 +20,8 @@ class HitRate(BaseRetrievalMetric):
- The more granular method checks for all potential matches between retrieved docs and expected docs.
Attributes:
- use_granular_hit_rate (bool): Determines whether to use the granular method for calculation.
metric_name (str): The name of the metric.
+ use_granular_hit_rate (bool): Determines whether to use the granular method for calculation.
"""
metric_name: ClassVar[str] = "hit_rate"
@@ -77,11 +79,11 @@ class MRR(BaseRetrievalMetric):
- The more granular method sums the reciprocal ranks of all relevant retrieved documents and divides by the count of relevant documents.
Attributes:
- use_granular_mrr (bool): Determines whether to use the granular method for calculation.
metric_name (str): The name of the metric.
+ use_granular_mrr (bool): Determines whether to use the granular method for calculation.
"""
- metric_name: str = "mrr"
+ metric_name: ClassVar[str] = "mrr"
use_granular_mrr: bool = False
def compute(
@@ -140,6 +142,91 @@ def compute(
return RetrievalMetricResult(score=mrr_score)
+DiscountedGainMode = Literal["linear", "exponential"]
+
+
+def discounted_gain(*, rel: float, i: int, mode: DiscountedGainMode) -> float:
+ # Avoid unnecessary calculations. Note that `False == 0` and `True == 1`.
+ if rel == 0:
+ return 0
+ if rel == 1:
+ return 1 / math.log2(i + 1)
+
+ if mode == "linear":
+ return rel / math.log2(i + 1)
+ elif mode == "exponential":
+ return (2**rel - 1) / math.log2(i + 1)
+ else:
+ assert_never(mode)
+
+
+class NDCG(BaseRetrievalMetric):
+ """NDCG (Normalized Discounted Cumulative Gain) metric.
+
+ The position `p` is taken as the size of the query results (which is usually
+ `top_k` of the retriever).
+
+ Currently only binary relevance is supported
+ (``rel=1`` if a document is in ``expected_ids``, otherwise ``rel=0``),
+ since ``expected_ids`` is treated as an unordered set with no relevance grades.
+
+ Attributes:
+ metric_name (str): The name of the metric.
+ mode (DiscountedGainMode): Determines the formula for each item in the summation.
+ """
+
+ metric_name: ClassVar[str] = "ndcg"
+ mode: DiscountedGainMode = "linear"
+
+ def compute(
+ self,
+ query: Optional[str] = None,
+ expected_ids: Optional[List[str]] = None,
+ retrieved_ids: Optional[List[str]] = None,
+ expected_texts: Optional[List[str]] = None,
+ retrieved_texts: Optional[List[str]] = None,
+ ) -> RetrievalMetricResult:
+ """Compute NDCG based on the provided inputs and the selected mode.
+
+ Parameters:
+ query (Optional[str]): The query string (not used in the current implementation).
+ expected_ids (Optional[List[str]]): Expected document IDs, unordered by relevance.
+ retrieved_ids (Optional[List[str]]): Retrieved document IDs, ordered by relevance from highest to lowest.
+ expected_texts (Optional[List[str]]): Expected texts (not used in the current implementation).
+ retrieved_texts (Optional[List[str]]): Retrieved texts (not used in the current implementation).
+
+ Raises:
+ ValueError: If the necessary IDs are not provided.
+
+ Returns:
+ RetrievalMetricResult: The result with the computed NDCG score.
+ """
+ # Checking for the required arguments
+ if (
+ retrieved_ids is None
+ or expected_ids is None
+ or not retrieved_ids
+ or not expected_ids
+ ):
+ raise ValueError("Retrieved ids and expected ids must be provided")
+
+ mode = self.mode
+ expected_set = set(expected_ids)
+
+ dcg = sum(
+ discounted_gain(rel=docid in expected_set, i=i, mode=mode)
+ for i, docid in enumerate(retrieved_ids, start=1)
+ )
+ idcg = sum(
+ discounted_gain(rel=True, i=i, mode=mode)
+ for i in range(1, len(retrieved_ids) + 1)
+ )
+
+ ndcg_score = dcg / idcg
+
+ return RetrievalMetricResult(score=ndcg_score)
+
+
class CohereRerankRelevancyMetric(BaseRetrievalMetric):
"""Cohere rerank relevancy metric."""
@@ -209,6 +296,7 @@ def compute(
METRIC_REGISTRY: Dict[str, Type[BaseRetrievalMetric]] = {
"hit_rate": HitRate,
"mrr": MRR,
+ "ndcg": NDCG,
"cohere_rerank_relevancy": CohereRerankRelevancyMetric,
}
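Reviewer note (not part of the diff): a minimal standalone sketch of the new metric, assuming only the module path and `compute` signature shown above.

```python
from math import log2

from llama_index.core.evaluation.retrieval.metrics import METRIC_REGISTRY, NDCG

# retrieved_ids are ordered best-first; expected_ids act as an unordered set of relevant IDs.
ndcg = NDCG()  # mode defaults to "linear"
result = ndcg.compute(
    expected_ids=["id1", "id2"],
    retrieved_ids=["id2", "id7", "id1"],
)
expected = (1 / log2(2) + 1 / log2(4)) / (1 / log2(2) + 1 / log2(3) + 1 / log2(4))
print(result.score, expected)  # both ~0.7039

# The registry entry is what makes the string "ndcg" usable in metric-name lists.
assert METRIC_REGISTRY["ndcg"] is NDCG
```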
diff --git a/llama-index-core/tests/evaluation/test_rr_mrr_hitrate.py b/llama-index-core/tests/evaluation/test_metrics.py
similarity index 56%
rename from llama-index-core/tests/evaluation/test_rr_mrr_hitrate.py
rename to llama-index-core/tests/evaluation/test_metrics.py
index 448245f29b060..ee309dbfb56e5 100644
--- a/llama-index-core/tests/evaluation/test_rr_mrr_hitrate.py
+++ b/llama-index-core/tests/evaluation/test_metrics.py
@@ -1,5 +1,7 @@
+from math import log2
+
import pytest
-from llama_index.core.evaluation.retrieval.metrics import HitRate, MRR
+from llama_index.core.evaluation.retrieval.metrics import HitRate, MRR, NDCG
# Test cases for the updated HitRate class using instance attribute
@@ -49,6 +51,66 @@ def test_mrr(expected_ids, retrieved_ids, use_granular, expected_result):
assert result.score == pytest.approx(expected_result)
+# Test cases for the new NDCG class, setting the mode via instance attribute
+@pytest.mark.parametrize(
+ ("expected_ids", "retrieved_ids", "mode", "expected_result"),
+ [
+ (
+ ["id1", "id2", "id3"],
+ ["id3", "id1", "id2", "id4"],
+ "linear",
+ (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1))
+ / (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1) + 1 / log2(4 + 1)),
+ ),
+ (
+ ["id1", "id2", "id3", "id4"],
+ ["id5", "id1"],
+ "linear",
+ (1 / log2(2 + 1)) / (1 / log2(1 + 1) + 1 / log2(2 + 1)),
+ ),
+ (
+ ["id1", "id2"],
+ ["id3", "id4"],
+ "linear",
+ 0.0,
+ ),
+ (
+ ["id1", "id2"],
+ ["id2", "id1", "id7"],
+ "linear",
+ (1 / log2(1 + 1) + 1 / log2(2 + 1))
+ / (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1)),
+ ),
+ (
+ ["id1", "id2", "id3"],
+ ["id3", "id1", "id2", "id4"],
+ "exponential",
+ (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1))
+ / (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1) + 1 / log2(4 + 1)),
+ ),
+ (
+ ["id1", "id2", "id3", "id4"],
+ ["id1", "id2", "id5"],
+ "exponential",
+ (1 / log2(1 + 1) + 1 / log2(2 + 1))
+ / (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1)),
+ ),
+ (
+ ["id1", "id2"],
+ ["id1", "id7", "id15", "id2"],
+ "exponential",
+ (1 / log2(1 + 1) + 1 / log2(4 + 1))
+ / (1 / log2(1 + 1) + 1 / log2(2 + 1) + 1 / log2(3 + 1) + 1 / log2(4 + 1)),
+ ),
+ ],
+)
+def test_ndcg(expected_ids, retrieved_ids, mode, expected_result):
+ ndcg = NDCG()
+ ndcg.mode = mode
+ result = ndcg.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
+ assert result.score == pytest.approx(expected_result)
+
+
# Test cases for exceptions handling for both HitRate and MRR
@pytest.mark.parametrize(
("expected_ids", "retrieved_ids", "use_granular"),
@@ -72,6 +134,11 @@ def test_exceptions(expected_ids, retrieved_ids, use_granular):
hr.use_granular_hit_rate = use_granular
hr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
+ with pytest.raises(ValueError):
mrr = MRR()
mrr.use_granular_mrr = use_granular
mrr.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
+
+ with pytest.raises(ValueError):
+ ndcg = NDCG()
+ ndcg.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
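Reviewer note on the exponential-mode cases above: with binary relevance the two gain modes coincide (`2**1 - 1 == 1`), which is why their expected values reuse the linear-style formulas; the modes only diverge once graded relevance is supported. A quick standalone check:

```python
from math import log2

# For rel in {0, 1}, 2**rel - 1 == rel, so "linear" and "exponential"
# discounted gains are identical at every rank.
for i in range(1, 5):
    assert 1 / log2(i + 1) == (2**1 - 1) / log2(i + 1)
    assert 0 / log2(i + 1) == (2**0 - 1) / log2(i + 1)
```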