feat(weave): Project level leaderboards (#2634)
* stubbed

* First cursor implementation

* Cursor Checkpoint 2

* Cursor Checkpoint 3

* Cursor Checkpoint 4

* Cursor Checkpoint 5

* Cursor Checkpoint 6

* Cursor Checkpoint 7

* Cleanup

* Human: Starting to extract real data from the system

* Human: Starting to extract real data from the system

* Human: Starting to extract real data from the system

* Human: Ok, getting to a checkpoint

* Human: Cleanup

* Human: more work

* Human: getting the config going

* Human: finish the basic stubbing

* Human: finish the basic stubbing

* Cursor: Template Config

* Human + AI: Config editor

* Cursor: Template Config

* Human + AI: Config Styling

* Human + AI: Config Styling

* Human: Leaderboard state

* Human: Rename file

* Lint

* Human: dataset lookups

* Human: Kind of a bad checkpoint, but need to pause

* Human: Making incremental progress

* Human: Finished basic config

* Human: Getting things wired up

* Human: Ok, really forked hard

* Human: Ok, really forked hard

* addressed comments

* addressed comments

* lint city

* REFACTOR START

* Massive refactor step 1

* Massive refactor step 2

* Massive refactor step 3

* Massive refactor step 4

* Fixed clicking

* Fixed clicking

* Fixed clicking

* Added eval sources

* Added eval sources - lint

* Added eval * support

* Added eval * support - lint

* Human: Make room for the AI

* AI: Make room for the human

* AI: Doing the hard work

* AI: Doing the hard work

* AI: Doing the hard work 3

* Human: Name refactors

* Human: Name refactors - lint

* Human: Better sidebar

* Human: Nearly there - just need to do data fetch now

* Human: A bunch of little style changes because i am crazy ocd

* Human: Layer 1 of queries complete

* Human: Little styling

* Human: Subtle improvements

* Human: Made subtle logic fixes to the query

* Checkpoint

* Yuk that was a lot

* Checkpoint

* blend it all together

* pulled initial files

* MAYBE CHECKPOINT

* MAYBE CHECKPOINT 2

* MAYBE CHECKPOINT 2

* Cursor 1, human 0

* Cursor 1, human 1 + lint

* basic listing layout

* added button

* a little bit

* ok, getting things a bit closer

* cleanup leaderboard

* a little cleanup

* ok, getting close

* lint

* Ok, big changes, but good changes

* basically done - except for editing

* Done checkpoint

* buttons added

* updated the client

* initial config + lint

* REMOVED OLD CONFIG EDITOR

* TODO: Make the editor pull live data, Add Create new, code clean

* Lint and a few changes

* almost there

* working

* nearly complete

* everything done except make now

* Ok, all the building blocks are now there

* Ok, all the building blocks are now there

* Keep working

* Keep working

* Keep working

* Keep working

* Keep working

* finished merge

* code clean

* more cleaning

* more cleaning

* more cleaning

* more cleaning

* more cleaning

* Cleanup and Refactoring of Names

* Small cleanup

* Cleanups

* Cleanups

* Small cleanup

* fixed

* Part 1 of removing old styles

* Part 1 of removing old styles

* Part 1 of removing old styles

* Almost done

* Almost done

* Fixed state change

* REVIEW CANDIDATE

* Moved Docs

* Fixed tests

* init

* init

* init

* generation complete

* beginning ts implementation

* Initial TS implementation complete

* Initial TS tests complete

* Initial TS tests complete

* Initial python tests complete

* Typescript improvements

* Python Tests complete

* Attempted fix

* Attempted fix

* clean

* clean

* fixed bug

* maybe fix

* maybe fix

* Added diagram

* Removed first hack

* Removed second hack

* lint

* Initial uptake of changes - tests passing

* Pydantic fix + schema gen

* typescript

* style

* convert to path instead of path parts

* Uptake formal type

* Uptake formal type

* split out hooks

* Uptake the first of many hooks

* Fixed generation

* Fixed generation 2

* More code removal

* Fixed types

* Refactor complete

* Small ts error

* Release Candidate

* Type fixes

* Lint

* Moved to generated

* Moved to gen

* Addressed comments

* Addressed comments

* empty state

* Lint

* Lint
tssweeney authored Nov 4, 2024
1 parent 9f3383e commit fc26837
Showing 22 changed files with 2,691 additions and 62 deletions.
343 changes: 343 additions & 0 deletions docs/docs/reference/gen_notebooks/leaderboard_quickstart.md
@@ -0,0 +1,343 @@
---
title: Leaderboard Quickstart
---


:::tip[This is a notebook]

<a href="https://colab.research.google.com/github/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/archive/d/d0/20221103151430%21Google_Colaboratory_SVG_Logo.svg" alt="Open In Colab" height="20px" /><div>Open in Colab</div></div></a>

<a href="https://github.com/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="View in Github" height="15px" /><div>View in Github</div></div></a>

:::



<!--- @wandbcode{leaderboard-demo} -->

# Leaderboard Quickstart

In this notebook we will learn to use Weave's Leaderboard to compare model performance across different datasets and scoring functions. Specifically, we will:

1. Generate a dataset of fake zip code data.
2. Author some scoring functions and evaluate a baseline model.
3. Use these techniques to evaluate a matrix of models vs evaluations.
4. Review the leaderboard in the Weave UI.

## Step 1: Generate a dataset of fake zip code data

First we will create a function `generate_dataset_rows` that generates a list of fake zip code data.


```python
import json

from openai import OpenAI
from pydantic import BaseModel


class Row(BaseModel):
    zip_code: str
    city: str
    state: str
    avg_temp_f: float
    population: int
    median_income: int
    known_for: str


class Rows(BaseModel):
    rows: list[Row]


def generate_dataset_rows(
    location: str = "United States", count: int = 5, year: int = 2022
):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Please generate {count} rows of data for random zip codes in {location} for the year {year}.",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Rows.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)["rows"]
```
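
If you want to sanity-check the generator before wiring it into an evaluation, you can call it directly. This is a quick sketch, not part of the original notebook, and it assumes `OPENAI_API_KEY` is set in your environment:


```python
# Hypothetical sanity check: generate a couple of rows and confirm the keys
# line up with the Row schema defined above.
sample_rows = generate_dataset_rows(location="California", count=2, year=2020)
print(sorted(sample_rows[0].keys()))
# Expected keys: avg_temp_f, city, known_for, median_income, population, state, zip_code
```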


```python
import weave

weave.init("leaderboard-demo")
```

## Step 2: Author scoring functions

Next we will author 3 scoring functions:

1. `check_concrete_fields`: Checks if the model output matches the expected city and state.
2. `check_value_fields`: Computes the relative error between the model output and the expected average temperature, population, and median income.
3. `check_subjective_fields`: Uses an LLM to check if the model output matches the expected "known for" field.



```python
@weave.op
def check_concrete_fields(city: str, state: str, output: dict):
    return {
        "city_match": city == output["city"],
        "state_match": state == output["state"],
    }


@weave.op
def check_value_fields(
    avg_temp_f: float, population: int, median_income: int, output: dict
):
    return {
        "avg_temp_f_err": abs(avg_temp_f - output["avg_temp_f"]) / avg_temp_f,
        "population_err": abs(population - output["population"]) / population,
        "median_income_err": abs(median_income - output["median_income"])
        / median_income,
    }


@weave.op
def check_subjective_fields(zip_code: str, known_for: str, output: dict):
    client = OpenAI()

    class Response(BaseModel):
        correct_known_for: bool

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"My student was asked what the zip code {zip_code} is best known for. The right answer is '{known_for}', and they said '{output['known_for']}'. Is their answer correct?",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Response.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)
```
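
Because the scorers are ordinary `@weave.op` functions, they can also be exercised directly before running a full evaluation. A minimal sketch with made-up values (not from the original notebook):


```python
# Call a scorer on a hand-written row; the values here are purely illustrative.
print(
    check_concrete_fields(
        city="Beverly Hills",
        state="CA",
        output={"city": "Beverly Hills", "state": "CA"},
    )
)
# -> {'city_match': True, 'state_match': True}
```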

## Step 3: Create a simple Evaluation

Next we define a simple evaluation using our fake data and scoring functions.



```python
rows = generate_dataset_rows()
evaluation = weave.Evaluation(
    name="United States - 2022",
    dataset=rows,
    scorers=[
        check_concrete_fields,
        check_value_fields,
        check_subjective_fields,
    ],
)
```

## Step 4: Evaluate a baseline model

Now we will evaluate a baseline model which returns a static response.



```python
@weave.op
def baseline_model(zip_code: str):
    return {
        "city": "New York",
        "state": "NY",
        "avg_temp_f": 50.0,
        "population": 1000000,
        "median_income": 100000,
        "known_for": "The Big Apple",
    }


await evaluation.evaluate(baseline_model)
```
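
The bare `await` works as written because notebooks already run inside an event loop. If you adapt this guide to a standalone script, one option (a sketch, not part of the original notebook) is to drive the coroutine with `asyncio`:


```python
import asyncio

# Equivalent to `await evaluation.evaluate(baseline_model)` outside a notebook.
results = asyncio.run(evaluation.evaluate(baseline_model))
print(results)
```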

## Step 5: Create more Models

Now we will create 2 more models to compare against the baseline.


```python
@weave.op
def gpt_4o_mini_no_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Zip code {zip_code}"""}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_no_context)
```


```python
@weave.op
def gpt_4o_mini_with_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"""Please answer the following questions about the zip code {zip_code}:
                1. What is the city?
                2. What is the state?
                3. What is the average temperature in Fahrenheit?
                4. What is the population?
                5. What is the median income?
                6. What is the most well known thing about this zip code?
                """,
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_with_context)
```

## Step 6: Create more Evaluations

Now we will evaluate a matrix of models vs evaluations.



```python
scorers = [
    check_concrete_fields,
    check_value_fields,
    check_subjective_fields,
]
evaluations = [
    weave.Evaluation(
        name="United States - 2022",
        dataset=weave.Dataset(
            name="United States - 2022",
            rows=generate_dataset_rows("United States", 5, 2022),
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="California - 2022",
        dataset=weave.Dataset(
            name="California - 2022", rows=generate_dataset_rows("California", 5, 2022)
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="United States - 2000",
        dataset=weave.Dataset(
            name="United States - 2000",
            rows=generate_dataset_rows("United States", 5, 2000),
        ),
        scorers=scorers,
    ),
]
models = [
    baseline_model,
    gpt_4o_mini_no_context,
    gpt_4o_mini_with_context,
]

for evaluation in evaluations:
    for model in models:
        await evaluation.evaluate(
            model, __weave={"display_name": evaluation.name + ":" + model.__name__}
        )
```

## Step 7: Review the Leaderboard

You can create a new leaderboard by navigating to the leaderboard tab in the UI and clicking "Create Leaderboard".

We can also generate a leaderboard directly from Python:


```python
from weave.flow import leaderboard
from weave.trace.weave_client import get_ref

spec = leaderboard.Leaderboard(
    name="Zip Code World Knowledge",
    description="""
This leaderboard compares the performance of models in terms of world knowledge about zip codes.
### Columns
1. **State Match against `United States - 2022`**: The fraction of zip codes for which the model correctly identified the state.
2. **Avg Temp F Error against `California - 2022`**: The mean absolute error of the model's average temperature prediction.
3. **Correct Known For against `United States - 2000`**: The fraction of zip codes for which the model correctly identified what the zip code is most well known for.
""",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[0]).uri(),
            scorer_name="check_concrete_fields",
            summary_metric_path="state_match.true_fraction",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[1]).uri(),
            scorer_name="check_value_fields",
            should_minimize=True,
            summary_metric_path="avg_temp_f_err.mean",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[2]).uri(),
            scorer_name="check_subjective_fields",
            summary_metric_path="correct_known_for.true_fraction",
        ),
    ],
)

ref = weave.publish(spec)
```
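
`weave.publish` returns a reference to the stored leaderboard object. As a quick check (a sketch; the exact URI depends on your entity and project), you can print the reference and then open the project's Leaderboards tab in the UI to see the new entry:

```python
# Print the reference to the published leaderboard object; the leaderboard
# itself is viewable in the Weave UI under this project's Leaderboards tab.
print(ref.uri())
```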