feat(weave): Project level leaderboards #2634

Merged
merged 184 commits into from
Nov 4, 2024
84e5fad
stubbed
tssweeney Oct 8, 2024
d563e0c
First cursor implementation
tssweeney Oct 8, 2024
dcb3dc6
Cursor Checkpoint 2
tssweeney Oct 8, 2024
470c657
Cursor Checkpoint 3
tssweeney Oct 8, 2024
422636b
Cursor Checkpoint 4
tssweeney Oct 8, 2024
f28ae1c
Cursor Checkpoint 5
tssweeney Oct 8, 2024
784cc9d
Cursor Checkpoint 6
tssweeney Oct 8, 2024
dd019e9
Cursor Checkpoint 7
tssweeney Oct 8, 2024
4beb237
Cleanup
tssweeney Oct 8, 2024
5237650
Human: Starting to extract real data from the system
tssweeney Oct 8, 2024
6f6f75d
Human: Starting to extract real data from the system
tssweeney Oct 8, 2024
c2714d5
Human: Starting to extract real data from the system
tssweeney Oct 8, 2024
91f56ea
Human: Ok, getting to a checkpoint
tssweeney Oct 8, 2024
a386945
Human: Cleanup
tssweeney Oct 8, 2024
35fed91
Human: more work
tssweeney Oct 9, 2024
2754a02
Human: getting the config going
tssweeney Oct 9, 2024
c36415b
Human: finish the basic stubbing
tssweeney Oct 9, 2024
44d168c
Human: finish the basic stubbing
tssweeney Oct 9, 2024
6e2e58e
Cursor: Template Config
tssweeney Oct 9, 2024
6651013
Human + AI: Config editor
tssweeney Oct 9, 2024
f6f9db4
Cursor: Template Config
tssweeney Oct 9, 2024
a2a945c
Human + AI: Config Styling
tssweeney Oct 9, 2024
f185297
Human + AI: Config Styling
tssweeney Oct 9, 2024
9d3a403
Human: Leaderboard state
tssweeney Oct 9, 2024
c464fb4
Humnan: Rename file
tssweeney Oct 9, 2024
3f5b119
Lint
tssweeney Oct 9, 2024
6e60298
Human: dataset lookups
tssweeney Oct 9, 2024
ce8cc44
Human: Kind of a bad checkpoint, but need to pause
tssweeney Oct 9, 2024
4c78162
Human: Making incremental progress
tssweeney Oct 9, 2024
938dd9a
Human: Finished basic config
tssweeney Oct 9, 2024
cb86ea6
Human: Getting things wired uo
tssweeney Oct 9, 2024
5b0d9d4
Human: Ok, really forked hard
tssweeney Oct 9, 2024
30483c6
Human: Ok, really forked hard
tssweeney Oct 9, 2024
cf78571
Merge branch 'master' into tim/leaderboard_mvp
tssweeney Oct 10, 2024
6c368d5
addressed comments
tssweeney Oct 10, 2024
17077d3
addressed comments
tssweeney Oct 10, 2024
13f9a76
lint city
tssweeney Oct 10, 2024
0537bd5
REFACTOR START
tssweeney Oct 10, 2024
c8f542e
Massive refactor step 1
tssweeney Oct 10, 2024
8869086
Massive refactor step 2
tssweeney Oct 10, 2024
805451e
Massive refactor step 3
tssweeney Oct 10, 2024
7029a99
Massive refactor step 4
tssweeney Oct 10, 2024
e1fc438
Fixed clicking
tssweeney Oct 10, 2024
1e26f1a
Fixed clicking
tssweeney Oct 10, 2024
4c07720
Fixed clicking
tssweeney Oct 10, 2024
14ed5b2
Added eval sources
tssweeney Oct 10, 2024
aad1232
Added eval sources - lint
tssweeney Oct 10, 2024
3e1b0a9
Added eval * support
tssweeney Oct 10, 2024
5f2bd3e
Added eval * support - lint
tssweeney Oct 10, 2024
bdded0f
Human: Make room for the AI
tssweeney Oct 10, 2024
ba9b468
AI: Make room for the human
tssweeney Oct 10, 2024
927253d
AI: Doing the hard work
tssweeney Oct 10, 2024
9530671
AI: Doing the hard work
tssweeney Oct 10, 2024
1d84870
AI: Doing the hard work 3
tssweeney Oct 10, 2024
3448763
Human: Name refactorrs
tssweeney Oct 10, 2024
de45ea2
Human: Name refactorrs - lint
tssweeney Oct 10, 2024
785c29e
Human: Better sidebar
tssweeney Oct 10, 2024
19f5107
Human: Nearly there - just need to do data fetch now
tssweeney Oct 10, 2024
4142f04
Human: A bunch of little style changes because i am crazy ocd
tssweeney Oct 10, 2024
30beb31
Human: Layer 1 of queries complete
tssweeney Oct 10, 2024
ae2f9c4
Human: Little styling
tssweeney Oct 10, 2024
3a02321
Human: Subtle improvements
tssweeney Oct 10, 2024
455101a
Human: Made subtle logic fixes to the query
tssweeney Oct 10, 2024
a185e3b
Checkpoint
tssweeney Oct 11, 2024
04523e4
Yuk that was a lot
tssweeney Oct 11, 2024
0411249
Checkpoint
tssweeney Oct 11, 2024
f606d67
Merge branch 'master' into tim/leaderboard_mvp
tssweeney Oct 18, 2024
c266f1e
blend it all together
tssweeney Oct 18, 2024
d1fcc9b
pulled initial files
tssweeney Oct 18, 2024
fe63f28
MAYBE CHECKPOINT
tssweeney Oct 18, 2024
b6fda47
MAYBE CHECKPOINT 2
tssweeney Oct 18, 2024
c75be77
MAYBE CHECKPOINT 2
tssweeney Oct 18, 2024
407ebdf
Cursor 1, human 0
tssweeney Oct 18, 2024
20947ed
Cursor 1, human 1 + lint
tssweeney Oct 18, 2024
ff8c958
basic listing layout
tssweeney Oct 18, 2024
43ec46c
added button
tssweeney Oct 18, 2024
ba039a1
a little bit
tssweeney Oct 18, 2024
19c9316
ok, getting things a bit closer
tssweeney Oct 18, 2024
f327b02
cleanup leaderbaord
tssweeney Oct 18, 2024
0a20392
a little cleanuo
tssweeney Oct 18, 2024
61e59c1
ok, getting close
tssweeney Oct 18, 2024
1d67f51
lint
tssweeney Oct 18, 2024
5de4b6d
Ok, big changes, but good changes
tssweeney Oct 18, 2024
a83c440
basically done - except for editting
tssweeney Oct 18, 2024
2294f66
Done checkpoint
tssweeney Oct 18, 2024
14fe665
buttons added
tssweeney Oct 21, 2024
d607390
updated the client
tssweeney Oct 21, 2024
3d1e9bd
initial config + lint
tssweeney Oct 21, 2024
5efa848
REMOVED OLD CONFIG EFITOR
tssweeney Oct 21, 2024
dc307b5
TODO: Make the editor pull live data, Add Create new, code clean
tssweeney Oct 21, 2024
d103448
Lint and a few changes
tssweeney Oct 21, 2024
973dead
almost there
tssweeney Oct 21, 2024
c68d3e5
working
tssweeney Oct 21, 2024
f1b6b72
nearly complete
tssweeney Oct 21, 2024
3f7669c
everything done except make now
tssweeney Oct 21, 2024
f210380
Ok, all the building blocks are now there
tssweeney Oct 21, 2024
71ae5a3
Ok, all the building blocks are now there
tssweeney Oct 21, 2024
97453fc
Merge branch 'master' into tim/leaderboard_mvp
tssweeney Oct 23, 2024
b738739
Keep working
tssweeney Oct 23, 2024
ed702cb
Keep working
tssweeney Oct 23, 2024
f44cc58
Keep working
tssweeney Oct 23, 2024
05dc0d3
Keep working
tssweeney Oct 23, 2024
92bedfd
Keep working
tssweeney Oct 23, 2024
ce94f54
merged in master
tssweeney Oct 28, 2024
baf6a03
finished merge
tssweeney Oct 28, 2024
5c47b8d
code clean
tssweeney Oct 28, 2024
b9544a2
more cleaning
tssweeney Oct 28, 2024
300be67
more cleaning
tssweeney Oct 28, 2024
2d9522f
more cleaning
tssweeney Oct 28, 2024
08e6687
more cleaning
tssweeney Oct 28, 2024
000f50a
more cleaning
tssweeney Oct 28, 2024
d9996d8
Cleanup and Refactoring of Names
tssweeney Oct 28, 2024
9ce1d6c
Small cleanup
tssweeney Oct 28, 2024
7ce789d
Cleanups
tssweeney Oct 29, 2024
1b6f4f4
Cleanups
tssweeney Oct 29, 2024
c0c281b
Small cleanup
tssweeney Oct 29, 2024
6ddb9d2
fixed
tssweeney Oct 29, 2024
16a4a79
Part 1 of removing old styles
tssweeney Oct 29, 2024
9492c0e
Part 1 of removing old styles
tssweeney Oct 29, 2024
b583d6c
Part 1 of removing old styles
tssweeney Oct 29, 2024
f15172e
Almost done
tssweeney Oct 29, 2024
975f52a
Almost done
tssweeney Oct 29, 2024
6a4292a
Fixed state change
tssweeney Oct 29, 2024
e87860f
REVIEW CANDIDATE
tssweeney Oct 29, 2024
2fc2f39
Moved Docs
tssweeney Oct 29, 2024
7951875
Fixed tests
tssweeney Oct 29, 2024
7e76f51
merged
tssweeney Oct 30, 2024
14fc3fe
init
tssweeney Oct 30, 2024
94cc668
Merge branch 'master' into tim/improved_object_schemas
tssweeney Oct 30, 2024
298ffc5
init
tssweeney Oct 30, 2024
f6be1ce
init
tssweeney Oct 30, 2024
29da8cb
generation complete
tssweeney Oct 30, 2024
2a6b2ac
beginning ts implementation
tssweeney Oct 31, 2024
92cf5cd
Initial TS implementation complete
tssweeney Oct 31, 2024
259e4c0
Initial TS tests complete
tssweeney Oct 31, 2024
ac471c4
Initial TS tests complete
tssweeney Oct 31, 2024
e326b17
Initial python tests complete
tssweeney Oct 31, 2024
154de47
Typescript improvements
tssweeney Oct 31, 2024
454058c
Python Tests complete
tssweeney Oct 31, 2024
ec3959f
Attempted fix
tssweeney Oct 31, 2024
00516d0
Attempted fix
tssweeney Oct 31, 2024
6bfe115
merged
tssweeney Oct 31, 2024
a81daed
clean
tssweeney Oct 31, 2024
45c1756
clean
tssweeney Oct 31, 2024
5e4457d
fixed bug
tssweeney Oct 31, 2024
10f7e1d
maybe fix
tssweeney Oct 31, 2024
166dbdb
maybe fix
tssweeney Oct 31, 2024
185f247
Added diagram
tssweeney Oct 31, 2024
e44d634
Removed first hack
tssweeney Oct 31, 2024
c723b32
Removed second hack
tssweeney Oct 31, 2024
ec7c0b7
lint
tssweeney Oct 31, 2024
1bd9308
merged in master
tssweeney Oct 31, 2024
6100a54
BREAKING CHANGES - MERGE IN SCHEMA VALIDATION
tssweeney Oct 31, 2024
26e6ecc
Initial uptake of changes - tests passing
tssweeney Oct 31, 2024
2e3cb7a
Paydantic fix + schema gen
tssweeney Oct 31, 2024
2b67837
typescript
tssweeney Oct 31, 2024
f8bf2a6
style
tssweeney Oct 31, 2024
7b9a49c
convert to path instead of path parts
tssweeney Oct 31, 2024
c217b60
Uptake formal type
tssweeney Oct 31, 2024
5a51ef1
Uptake formal type
tssweeney Oct 31, 2024
998f467
split out hooks
tssweeney Oct 31, 2024
d3e0e00
Uptake the first of many hooks
tssweeney Oct 31, 2024
c85b6f4
Fixed generation
tssweeney Oct 31, 2024
ea49fb1
Fixed generation 2
tssweeney Oct 31, 2024
e9cd4e9
Merged in with Schema Gen
tssweeney Oct 31, 2024
d6406eb
More code removal
tssweeney Oct 31, 2024
d84fb36
Fixed types
tssweeney Oct 31, 2024
e3cd06e
Merge branch 'tim/improved_object_schemas' into tim/leaderboard_mvp
tssweeney Oct 31, 2024
ec9f5f6
Refactor complete
tssweeney Oct 31, 2024
e0ab86c
Small ts error
tssweeney Oct 31, 2024
1532ffd
Release Candidate
tssweeney Oct 31, 2024
99f58d1
Type fixes
tssweeney Oct 31, 2024
4e49cd3
Merge branch 'tim/improved_object_schemas' into tim/leaderboard_mvp
tssweeney Oct 31, 2024
7b76d3e
Lint
tssweeney Oct 31, 2024
7f95417
Moved to generated
tssweeney Oct 31, 2024
9d35f13
Moved to gen
tssweeney Oct 31, 2024
5e74e5b
Addressed comments
tssweeney Oct 31, 2024
a155629
Addressed comments
tssweeney Oct 31, 2024
1a0044e
Merge branch 'tim/improved_object_schemas' into tim/leaderboard_mvp
tssweeney Oct 31, 2024
b280954
Merged in master
tssweeney Oct 31, 2024
5001612
Merge branch 'master' into tim/leaderboard_mvp
tssweeney Oct 31, 2024
9ccdcd0
empty state
tssweeney Oct 31, 2024
42fcea8
Lint
tssweeney Oct 31, 2024
0e65c6b
Lint
tssweeney Nov 1, 2024
343 changes: 343 additions & 0 deletions docs/docs/reference/gen_notebooks/leaderboard_quickstart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,343 @@
---
title: Leaderboard Quickstart
---


:::tip[This is a notebook]

<a href="https://colab.research.google.com/github/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/archive/d/d0/20221103151430%21Google_Colaboratory_SVG_Logo.svg" alt="Open In Colab" height="20px" /><div>Open in Colab</div></div></a>

<a href="https://github.com/wandb/weave/blob/master/docs/./notebooks/leaderboard_quickstart.ipynb" target="_blank" rel="noopener noreferrer" class="navbar__item navbar__link button button--secondary button--med margin-right--sm notebook-cta-button"><div><img src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="View in Github" height="15px" /><div>View in Github</div></div></a>

:::



<!--- @wandbcode{leaderboard-demo} -->

# Leaderboard Quickstart

In this notebook we will learn to use Weave's Leaderboard to compare model performance across different datasets and scoring functions. Specifically, we will:

1. Generate a dataset of fake zip code data
2. Author some scoring functions and evaluate a baseline model.
3. Use these techniques to evaluate a matrix of models vs evaluations.
4. Review the leaderboard in the Weave UI.

## Step 1: Generate a dataset of fake zip code data

First we will create a function `generate_dataset_rows` that generates a list of fake zip code data.


```python
import json

from openai import OpenAI
from pydantic import BaseModel


class Row(BaseModel):
    zip_code: str
    city: str
    state: str
    avg_temp_f: float
    population: int
    median_income: int
    known_for: str


class Rows(BaseModel):
    rows: list[Row]


def generate_dataset_rows(
    location: str = "United States", count: int = 5, year: int = 2022
):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Please generate {count} rows of data for random zip codes in {location} for the year {year}.",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Rows.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)["rows"]
```


```python
import weave

weave.init("leaderboard-demo")
```

## Step 2: Author scoring functions

Next we will author 3 scoring functions:

1. `check_concrete_fields`: Checks if the model output matches the expected city and state.
2. `check_value_fields`: Computes the relative error of the model output against the expected average temperature, population, and median income.
3. `check_subjective_fields`: Uses an LLM to check if the model output matches the expected "known for" field.



```python
@weave.op
def check_concrete_fields(city: str, state: str, output: dict):
    return {
        "city_match": city == output["city"],
        "state_match": state == output["state"],
    }


@weave.op
def check_value_fields(
    avg_temp_f: float, population: int, median_income: int, output: dict
):
    return {
        "avg_temp_f_err": abs(avg_temp_f - output["avg_temp_f"]) / avg_temp_f,
        "population_err": abs(population - output["population"]) / population,
        "median_income_err": abs(median_income - output["median_income"])
        / median_income,
    }


@weave.op
def check_subjective_fields(zip_code: str, known_for: str, output: dict):
    client = OpenAI()

    class Response(BaseModel):
        correct_known_for: bool

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"My student was asked what the zip code {zip_code} is best known for. The right answer is '{known_for}', and they said '{output['known_for']}'. Is their answer correct?",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Response.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)
```
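
To make the numeric scorer concrete, here is a minimal standalone sketch of the same relative-error arithmetic, without Weave or OpenAI. The helper name `relative_error` and the sample values are hypothetical and exist only for illustration.

```python
def relative_error(expected: float, actual: float) -> float:
    """Absolute difference expressed as a fraction of the expected value."""
    return abs(expected - actual) / expected


# Hypothetical expected values for one dataset row (not real data).
expected = {"avg_temp_f": 60.0, "population": 800000, "median_income": 80000}
# Hypothetical model output for the same row.
predicted = {"avg_temp_f": 50.0, "population": 1000000, "median_income": 100000}

# Same key naming as check_value_fields: "<field>_err".
errors = {
    f"{key}_err": relative_error(expected[key], predicted[key]) for key in expected
}
print(errors)
```

Note that dividing by the expected value means a row whose expected value is `0` would raise `ZeroDivisionError`, so a scorer like this assumes non-zero ground truth.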

## Step 3: Create a simple Evaluation

Next we define a simple evaluation using our fake data and scoring functions.



```python
rows = generate_dataset_rows()
evaluation = weave.Evaluation(
    name="United States - 2022",
    dataset=rows,
    scorers=[
        check_concrete_fields,
        check_value_fields,
        check_subjective_fields,
    ],
)
```

## Step 4: Evaluate a baseline model

Now we will evaluate a baseline model which returns a static response.



```python
@weave.op
def baseline_model(zip_code: str):
    return {
        "city": "New York",
        "state": "NY",
        "avg_temp_f": 50.0,
        "population": 1000000,
        "median_income": 100000,
        "known_for": "The Big Apple",
    }


await evaluation.evaluate(baseline_model)
```

## Step 5: Create more Models

Now we will create 2 more models to compare against the baseline.


```python
@weave.op
def gpt_4o_mini_no_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Zip code {zip_code}"""}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_no_context)
```


```python
@weave.op
def gpt_4o_mini_with_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"""Please answer the following questions about the zip code {zip_code}:
1. What is the city?
2. What is the state?
3. What is the average temperature in Fahrenheit?
4. What is the population?
5. What is the median income?
6. What is the most well known thing about this zip code?
""",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)


await evaluation.evaluate(gpt_4o_mini_with_context)
```

## Step 6: Create more Evaluations

Now we will evaluate a matrix of models vs evaluations.



```python
scorers = [
    check_concrete_fields,
    check_value_fields,
    check_subjective_fields,
]
evaluations = [
    weave.Evaluation(
        name="United States - 2022",
        dataset=weave.Dataset(
            name="United States - 2022",
            rows=generate_dataset_rows("United States", 5, 2022),
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="California - 2022",
        dataset=weave.Dataset(
            name="California - 2022", rows=generate_dataset_rows("California", 5, 2022)
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="United States - 2000",
        dataset=weave.Dataset(
            name="United States - 2000",
            rows=generate_dataset_rows("United States", 5, 2000),
        ),
        scorers=scorers,
    ),
]
models = [
    baseline_model,
    gpt_4o_mini_no_context,
    gpt_4o_mini_with_context,
]

for evaluation in evaluations:
    for model in models:
        await evaluation.evaluate(
            model, __weave={"display_name": evaluation.name + ":" + model.__name__}
        )
```
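
The nested loop above runs every model against every evaluation, so 3 evaluations and 3 models produce 9 runs. As a standalone sketch, with plain strings standing in for the real evaluation and model objects (an assumption made so the snippet runs without Weave), the display names it generates can be enumerated like this:

```python
from itertools import product

# Names copied from the cells above; the real objects are replaced by strings.
evaluation_names = [
    "United States - 2022",
    "California - 2022",
    "United States - 2000",
]
model_names = [
    "baseline_model",
    "gpt_4o_mini_no_context",
    "gpt_4o_mini_with_context",
]

# Same "<evaluation name>:<model name>" format as the display_name in the loop.
display_names = [
    f"{evaluation_name}:{model_name}"
    for evaluation_name, model_name in product(evaluation_names, model_names)
]

for name in display_names:
    print(name)
```

`product` iterates in the same order as the nested loop: all models for the first evaluation, then all models for the second, and so on.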

## Step 7: Review the Leaderboard

You can create a new leaderboard by navigating to the leaderboard tab in the UI and clicking "Create Leaderboard".

We can also generate a leaderboard directly from Python:


```python
from weave.flow import leaderboard
from weave.trace.weave_client import get_ref

spec = leaderboard.Leaderboard(
    name="Zip Code World Knowledge",
    description="""
This leaderboard compares the performance of models in terms of world knowledge about zip codes.
### Columns
1. **State Match against `United States - 2022`**: The fraction of zip codes that the model correctly identified the state for.
2. **Avg Temp F Error against `California - 2022`**: The mean relative error of the model's average temperature prediction.
3. **Correct Known For against `United States - 2000`**: The fraction of zip codes that the model correctly identified the most well known thing about the zip code.
""",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[0]).uri(),
            scorer_name="check_concrete_fields",
            summary_metric_path="state_match.true_fraction",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[1]).uri(),
            scorer_name="check_value_fields",
            should_minimize=True,
            summary_metric_path="avg_temp_f_err.mean",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[2]).uri(),
            scorer_name="check_subjective_fields",
            summary_metric_path="correct_known_for.true_fraction",
        ),
    ],
)

ref = weave.publish(spec)
```
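
Each column's `summary_metric_path` looks like a dotted path into a scorer's summarized results, for example `state_match.true_fraction` or `avg_temp_f_err.mean`. This notebook does not show how Weave computes those summaries, so the sketch below is only an assumption about the idea: boolean outputs are reduced to a fraction of `True` values, numeric outputs to a mean, and a dotted path selects one value. The helpers `summarize` and `resolve_path` and the sample rows are hypothetical and are not part of Weave's API.

```python
from statistics import mean


def summarize(rows: list[dict]) -> dict:
    """Reduce per-row scorer outputs: booleans to true_fraction, numbers to mean."""
    summary = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        if all(isinstance(value, bool) for value in values):
            summary[key] = {"true_fraction": sum(values) / len(values)}
        else:
            summary[key] = {"mean": mean(values)}
    return summary


def resolve_path(summary: dict, path: str):
    """Follow a dotted path such as 'state_match.true_fraction' into nested dicts."""
    node = summary
    for part in path.split("."):
        node = node[part]
    return node


# Hypothetical per-row outputs of check_concrete_fields for four rows.
rows = [
    {"city_match": True, "state_match": True},
    {"city_match": False, "state_match": True},
    {"city_match": False, "state_match": False},
    {"city_match": True, "state_match": True},
]

summary = summarize(rows)
print(resolve_path(summary, "state_match.true_fraction"))
```

Under this reading, `should_minimize=True` on the temperature column would mean that a smaller error ranks higher, while the two `true_fraction` columns rank larger values higher.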