Commit: Update README.md
IlyaGusev authored Sep 14, 2024
1 parent 1cbe3a5 commit 63c81cb
Showing 1 changed file with 5 additions and 5 deletions.
README.md
@@ -6,15 +6,15 @@ Website: [link](https://ilyagusev.github.io/ping_pong_bench/)
 
 Paper: [link](https://arxiv.org/abs/2409.06820)
 
-[LLM-as-a-Judge](https://arxiv.org/abs/2306.05685) is an evaluation method that relies on solid LLMs such as GPT-4 instead of humans. In this benchmark, we rely on LLMs not only to judge the answer but also to ask the questions.
+LLM-as-a-Judge is an evaluation method that relies on solid LLMs such as GPT-4 instead of humans. In this benchmark, we rely on LLMs not only to judge the answer but also to ask the questions.
 
-We believe the only way to evaluate a language model's conversational abilities is to talk with it. However, humans usually don't have enough time to talk with new models, and many popular benchmarks are single-turn. So, the main idea of this benchmark is to use LLMs **to emulate users** in role-playing conversations.
+We believe talking with a language model's conversational abilities is the only way to evaluate it. However, humans usually don't have enough time to talk with new models, and many popular benchmarks are single-turn. So, the main idea of this benchmark is to use LLMs **to emulate users** in role-playing conversations.
 
-For that, we have a set of characters and test situations. A strong enough model interacts with characters pretending to be users with different goals. After each interaction, the responder model answers are rated. See the example below.
+For that, we have a set of characters and test situations. A strong enough model interacts with characters pretending to be users with different goals. After each interaction, the responder model answers are rated. Please take a look at the example below.
 
-For now, we use three criteria for evaluation: whether the bot was in-character, entertaining, and fluent.
+For now, we use three criteria for evaluation: whether the bot was in character, entertaining, and fluent.
 
-To compose the final rating, we average numbers across criteria, characters, and situations.
+We average numbers across criteria, characters, and situations to compose the final rating.
 
 ### Character
 ```
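
The README text in this diff describes the final rating as a plain average over criteria, characters, and situations. As a rough illustration only (this is not code from the repository; the score layout, names, and values below are assumed), the aggregation could look like this:

```python
# Hypothetical sketch of the final-rating aggregation described above:
# each interaction is scored on three criteria (in character, entertaining,
# fluent), and the final rating is the mean over criteria, characters, and
# situations. Names and numbers are invented for illustration.
from statistics import mean

# scores[(character, situation)] -> {criterion: score}
scores = {
    ("character_a", "situation_1"): {"in_character": 9, "entertaining": 8, "fluent": 10},
    ("character_a", "situation_2"): {"in_character": 7, "entertaining": 8, "fluent": 9},
    ("character_b", "situation_1"): {"in_character": 8, "entertaining": 9, "fluent": 9},
}

# Flatten every (character, situation, criterion) score and average once.
final_rating = mean(
    value
    for per_criterion in scores.values()
    for value in per_criterion.values()
)
print(f"Final rating: {final_rating:.2f}")
```

A single flat mean like this matches the description only when every character/situation pair is scored on the same set of criteria; if some cells were missing, a nested per-level averaging would be needed instead.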
