feat(weave): Project level leaderboards (#2634)
* stubbed

* First cursor implementation

* Cursor Checkpoint 2

* Cursor Checkpoint 3

* Cursor Checkpoint 4

* Cursor Checkpoint 5

* Cursor Checkpoint 6

* Cursor Checkpoint 7

* Cleanup

* Human: Starting to extract real data from the system

* Human: Starting to extract real data from the system

* Human: Starting to extract real data from the system

* Human: Ok, getting to a checkpoint

* Human: Cleanup

* Human: more work

* Human: getting the config going

* Human: finish the basic stubbing

* Human: finish the basic stubbing

* Cursor: Template Config

* Human + AI: Config editor

* Cursor: Template Config

* Human + AI: Config Styling

* Human + AI: Config Styling

* Human: Leaderboard state

* Human: Rename file

* Lint

* Human: dataset lookups

* Human: Kind of a bad checkpoint, but need to pause

* Human: Making incremental progress

* Human: Finished basic config

* Human: Getting things wired up

* Human: Ok, really forked hard

* Human: Ok, really forked hard

* addressed comments

* addressed comments

* lint city

* REFACTOR START

* Massive refactor step 1

* Massive refactor step 2

* Massive refactor step 3

* Massive refactor step 4

* Fixed clicking

* Fixed clicking

* Fixed clicking

* Added eval sources

* Added eval sources - lint

* Added eval * support

* Added eval * support - lint

* Human: Make room for the AI

* AI: Make room for the human

* AI: Doing the hard work

* AI: Doing the hard work

* AI: Doing the hard work 3

* Human: Name refactors

* Human: Name refactors - lint

* Human: Better sidebar

* Human: Nearly there - just need to do data fetch now

* Human: A bunch of little style changes because i am crazy ocd

* Human: Layer 1 of queries complete

* Human: Little styling

* Human: Subtle improvements

* Human: Made subtle logic fixes to the query

* Checkpoint

* Yuk that was a lot

* Checkpoint

* blend it all together

* pulled initial files

* MAYBE CHECKPOINT

* MAYBE CHECKPOINT 2

* MAYBE CHECKPOINT 2

* Cursor 1, human 0

* Cursor 1, human 1 + lint

* basic listing layout

* added button

* a little bit

* ok, getting things a bit closer

* cleanup leaderboard

* a little cleanup

* ok, getting close

* lint

* Ok, big changes, but good changes

* basically done - except for editing

* Done checkpoint

* buttons added

* updated the client

* initial config + lint

* REMOVED OLD CONFIG EDITOR

* TODO: Make the editor pull live data, Add Create new, code clean

* Lint and a few changes

* almost there

* working

* nearly complete

* everything done except make now

* Ok, all the building blocks are now there

* Ok, all the building blocks are now there

* Keep working

* Keep working

* Keep working

* Keep working

* Keep working

* finished merge

* code clean

* more cleaning

* more cleaning

* more cleaning

* more cleaning

* more cleaning

* Cleanup and Refactoring of Names

* Small cleanup

* Cleanups

* Cleanups

* Small cleanup

* fixed

* Part 1 of removing old styles

* Part 1 of removing old styles

* Part 1 of removing old styles

* Almost done

* Almost done

* Fixed state change

* REVIEW CANDIDATE

* Moved Docs

* Fixed tests

* init

* init

* init

* generation complete

* beginning ts implementation

* Initial TS implementation complete

* Initial TS tests complete

* Initial TS tests complete

* Initial python tests complete

* Typescript improvements

* Python Tests complete

* Attempted fix

* Attempted fix

* clean

* clean

* fixed bug

* maybe fix

* maybe fix

* Added diagram

* Removed first hack

* Removed second hack

* lint

* Initial uptake of changes - tests passing

* Pydantic fix + schema gen

* typescript

* style

* convert to path instead of path parts

* Uptake formal type

* Uptake formal type

* split out hooks

* Uptake the first of many hooks

* Fixed generation

* Fixed generation 2

* More code removal

* Fixed types

* Refactor complete

* Small ts error

* Release Candidate

* Type fixes

* Lint

* Moved to generated

* Moved to gen

* Addressed comments

* Addressed comments

* empty state

* Lint

* Lint
tssweeney authored Nov 4, 2024
1 parent 9f3383e commit fc26837
Showing 22 changed files with 2,691 additions and 62 deletions.
343 changes: 343 additions & 0 deletions docs/docs/reference/gen_notebooks/leaderboard_quickstart.md
@@ -0,0 +1,343 @@
---
title: Leaderboard Quickstart
---


:::tip[This is a notebook]

<a href="https://colab.research.google.com/github/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/archive/d/d0/20221103151430%21Google_Colaboratory_SVG_Logo.svg" alt="Open In Colab" height="20px" /><div>Open in Colab</div></div></a>

<a href="https://github.com/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="View in Github" height="15px" /><div>View in Github</div></div></a>

:::



<!--- @wandbcode{leaderboard-demo} -->

# Leaderboard Quickstart

In this notebook we will learn to use Weave's Leaderboard to compare model performance across different datasets and scoring functions. Specifically, we will:

1. Generate a dataset of fake zip code data.
2. Author some scoring functions and evaluate a baseline model.
3. Use these techniques to evaluate a matrix of models vs evaluations.
4. Review the leaderboard in the Weave UI.

## Step 1: Generate a dataset of fake zip code data

First we will create a function `generate_dataset_rows` that generates a list of fake zip code data.


```python
import json

from openai import OpenAI
from pydantic import BaseModel


class Row(BaseModel):
    zip_code: str
    city: str
    state: str
    avg_temp_f: float
    population: int
    median_income: int
    known_for: str


class Rows(BaseModel):
    rows: list[Row]


def generate_dataset_rows(
    location: str = "United States", count: int = 5, year: int = 2022
):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Please generate {count} rows of data for random zip codes in {location} for the year {year}.",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Rows.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)["rows"]
```
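
If you want to sanity-check the generator before wiring it into an evaluation, you can call it directly. This is a quick sketch, not part of the original notebook, and it assumes `OPENAI_API_KEY` is set in your environment:


```python
# Hypothetical sanity check: generate a couple of rows and confirm the keys
# line up with the Row schema defined above.
sample_rows = generate_dataset_rows(location="California", count=2, year=2020)
print(sorted(sample_rows[0].keys()))
# Expected keys: avg_temp_f, city, known_for, median_income, population, state, zip_code
```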


```python
import weave

weave.init("leaderboard-demo")
```

## Step 2: Author scoring functions

Next we will author 3 scoring functions:

1. `check_concrete_fields`: Checks if the model output matches the expected city and state.
2. `check_value_fields`: Computes the relative error between the model output and the expected average temperature, population, and median income.
3. `check_subjective_fields`: Uses an LLM to check if the model output matches the expected "known for" field.



```python
@weave.op
def check_concrete_fields(city: str, state: str, output: dict):
    return {
        "city_match": city == output["city"],
        "state_match": state == output["state"],
    }


@weave.op
def check_value_fields(
    avg_temp_f: float, population: int, median_income: int, output: dict
):
    return {
        "avg_temp_f_err": abs(avg_temp_f - output["avg_temp_f"]) / avg_temp_f,
        "population_err": abs(population - output["population"]) / population,
        "median_income_err": abs(median_income - output["median_income"])
        / median_income,
    }


@weave.op
def check_subjective_fields(zip_code: str, known_for: str, output: dict):
    client = OpenAI()

    class Response(BaseModel):
        correct_known_for: bool

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"My student was asked what the zip code {zip_code} is best known for. The right answer is '{known_for}', and they said '{output['known_for']}'. Is their answer correct?",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Response.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)
```
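
Because the scorers are ordinary `@weave.op` functions, they can also be exercised directly before running a full evaluation. A minimal sketch with made-up values (not from the original notebook):


```python
# Call a scorer on a hand-written row; the values here are purely illustrative.
print(
    check_concrete_fields(
        city="Beverly Hills",
        state="CA",
        output={"city": "Beverly Hills", "state": "CA"},
    )
)
# -> {'city_match': True, 'state_match': True}
```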

## Step 3: Create a simple Evaluation

Next we define a simple evaluation using our fake data and scoring functions.



```python
rows = generate_dataset_rows()
evaluation = weave.Evaluation(
    name="United States - 2022",
    dataset=rows,
    scorers=[
        check_concrete_fields,
        check_value_fields,
        check_subjective_fields,
    ],
)
```

## Step 4: Evaluate a baseline model

Now we will evaluate a baseline model which returns a static response.



```python
@weave.op
def baseline_model(zip_code: str):
    return {
        "city": "New York",
        "state": "NY",
        "avg_temp_f": 50.0,
        "population": 1000000,
        "median_income": 100000,
        "known_for": "The Big Apple",
    }


await evaluation.evaluate(baseline_model)
```
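
The bare `await` works as written because notebooks already run inside an event loop. If you adapt this guide to a standalone script, one option (a sketch, not part of the original notebook) is to drive the coroutine with `asyncio`:


```python
import asyncio

# Equivalent to `await evaluation.evaluate(baseline_model)` outside a notebook.
results = asyncio.run(evaluation.evaluate(baseline_model))
print(results)
```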

## Step 5: Create more Models

Now we will create 2 more models to compare against the baseline.


```python
@weave.op
def gpt_4o_mini_no_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Zip code {zip_code}"""}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_no_context)
```


```python
@weave.op
def gpt_4o_mini_with_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"""Please answer the following questions about the zip code {zip_code}:
                1. What is the city?
                2. What is the state?
                3. What is the average temperature in Fahrenheit?
                4. What is the population?
                5. What is the median income?
                6. What is the most well known thing about this zip code?
                """,
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_with_context)
```

## Step 6: Create more Evaluations

Now we will evaluate a matrix of models vs evaluations.



```python
scorers = [
    check_concrete_fields,
    check_value_fields,
    check_subjective_fields,
]
evaluations = [
    weave.Evaluation(
        name="United States - 2022",
        dataset=weave.Dataset(
            name="United States - 2022",
            rows=generate_dataset_rows("United States", 5, 2022),
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="California - 2022",
        dataset=weave.Dataset(
            name="California - 2022", rows=generate_dataset_rows("California", 5, 2022)
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="United States - 2000",
        dataset=weave.Dataset(
            name="United States - 2000",
            rows=generate_dataset_rows("United States", 5, 2000),
        ),
        scorers=scorers,
    ),
]
models = [
    baseline_model,
    gpt_4o_mini_no_context,
    gpt_4o_mini_with_context,
]

for evaluation in evaluations:
    for model in models:
        await evaluation.evaluate(
            model, __weave={"display_name": evaluation.name + ":" + model.__name__}
        )
```

## Step 7: Review the Leaderboard

You can create a new leaderboard by navigating to the leaderboard tab in the UI and clicking "Create Leaderboard".

We can also generate a leaderboard directly from Python:


```python
from weave.flow import leaderboard
from weave.trace.weave_client import get_ref

spec = leaderboard.Leaderboard(
    name="Zip Code World Knowledge",
    description="""
This leaderboard compares the performance of models in terms of world knowledge about zip codes.
### Columns
1. **State Match against `United States - 2022`**: The fraction of zip codes for which the model correctly identified the state.
2. **Avg Temp F Error against `California - 2022`**: The mean absolute error of the model's average temperature prediction.
3. **Correct Known For against `United States - 2000`**: The fraction of zip codes for which the model correctly identified what the zip code is most well known for.
""",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[0]).uri(),
            scorer_name="check_concrete_fields",
            summary_metric_path="state_match.true_fraction",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[1]).uri(),
            scorer_name="check_value_fields",
            should_minimize=True,
            summary_metric_path="avg_temp_f_err.mean",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[2]).uri(),
            scorer_name="check_subjective_fields",
            summary_metric_path="correct_known_for.true_fraction",
        ),
    ],
)

ref = weave.publish(spec)
```
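
`weave.publish` returns a reference to the stored leaderboard object. As a quick check (a sketch; the exact URI depends on your entity and project), you can print the reference and then open the project's Leaderboards tab in the UI to see the new entry:

```python
# Print the reference to the published leaderboard object; the leaderboard
# itself is viewable in the Weave UI under this project's Leaderboards tab.
print(ref.uri())
```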