Merge pull request #168 from AlibabaResearch/flowbench
Flowbench
tnlin authored Nov 19, 2024
2 parents 610a882 + 7660771 commit dac5bd0
Showing 14 changed files with 1,765 additions and 0 deletions.
112 changes: 112 additions & 0 deletions FlowBench/README.md
@@ -0,0 +1,112 @@


<div align="center">
<h1 align="center"> 🌊 FlowBench 🌊</h1>
<b>FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents</b>

<p align="center"><font size=6>📃</font> <a target="_self" href="https://arxiv.org/abs/2406.14884"> <img style="height:14pt" src="https://img.shields.io/badge/-Paper-red?style=flat&logo=arxiv"></a> <font size=6>•</font> <font size=6>🔔</font> <a target="_self" href="https://github.com/Justherozen/FlowBench"> <img style="height:14pt" src="https://img.shields.io/badge/-Code-pink?style=flat&logo=github"></a></p>

</div>


## Overview

This repository contains the source data and code for our EMNLP 2024 paper [FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents](https://arxiv.org/abs/2406.14884). We propose FlowBench, a comprehensive benchmark for workflow-guided agent planning. We first revisit and formalize different workflow knowledge formats for agent planning. FlowBench covers an extensive taxonomy (6 domains, 22 roles, 51 scenarios) and different knowledge formats (text, code, flowchart) to align with real-world applications. The benchmark data is constructed through a three-phase pipeline of task collection, workflow organization, and session generation. FlowBench features broad coverage, high difficulty, expert-level annotation, and support for multi-round user-agent interaction. Through extensive experiments on FlowBench, we find that even the best-performing model, GPT-4o, fails to deliver satisfactory results on this challenging benchmark. We hope that our work provides meaningful insights for future research on workflow-guided agent planning. An overview of FlowBench is shown below:

![overview of flowbench](./resources/flowbench.png)

> *Please find more details of this work in our paper.*






### Dataset Introduction

Download `turn_data.zip` and `session_data.zip` from [Google Drive](https://drive.google.com/drive/folders/1PFzA5e-fuKpVZvAHP-otBhWPdU60O3d4?usp=sharing). After extraction you will get two folders, `turn_data` and `session_data`; move both into the `data` directory. They contain the benchmark data at the session level and turn level, respectively. All workflow knowledge, in its different formats, is organized in `knowledge.json`.
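
For orientation, the snippet below is a minimal sketch of how the extracted data might be inspected. It assumes that the two folders and `knowledge.json` live under `./data`, and the exact location and schema of `knowledge.json` may differ.

```
import json
from pathlib import Path

data_dir = Path("data")

# Workflow knowledge in its different formats (text, code, flowchart).
with open(data_dir / "knowledge.json", encoding="utf-8") as f:
    knowledge = json.load(f)
print(f"Loaded workflow knowledge for {len(knowledge)} entries")

# Count the files in the turn-level and session-level splits.
for split in ("turn_data", "session_data"):
    n_files = sum(1 for p in (data_dir / split).rglob("*") if p.is_file())
    print(f"{split}: {n_files} files")
```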





### Evaluating workflow-guided agent planning

##### Dependencies

To install the requirements:

    pip install -r requirements.txt

##### API preparation

Set up your OpenAI API key in `./utils/keys.json`:

```
{
    "api_key": "Your OpenAI API key"
}
```
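
As a sanity check, the sketch below shows one way the key file could be read and passed to the OpenAI client. It assumes `keys.json` is a JSON object with an `api_key` field and uses the openai>=1.0 client interface; the repository's own loading logic and pinned client version may differ, and the model name is only illustrative.

```
import json
from openai import OpenAI

# Assumes ./utils/keys.json looks like {"api_key": "..."}
with open("./utils/keys.json", encoding="utf-8") as f:
    api_key = json.load(f)["api_key"]

client = OpenAI(api_key=api_key)
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```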

After that, you can conduct the turn-level and session-level evaluations.

##### Turn-level evaluation

- To generate the single-turn predictions for different test samples, please run

```
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
```

- Then you can calculate and display the evaluation metrics with the following command, where `OUTPUT_FOLDER` is the output path from the previous generation step.

```
python ./turn_level/turn_metric_display.py --output_path OUTPUT_FOLDER
```
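
If you want to sweep several input folders (for example, one per knowledge format), a small driver like the one below can chain the two commands. It is a convenience sketch only; the folder names are placeholders and should be adjusted to your local layout.

```
import subprocess

# Placeholder folders: one run per workflow-knowledge format.
runs = {
    "text": ("data/turn_data/text", "outputs/turn_text"),
    "code": ("data/turn_data/code", "outputs/turn_code"),
}

for name, (input_folder, output_folder) in runs.items():
    print(f"=== turn-level run: {name} ===")
    # Step 1: generate single-turn predictions.
    subprocess.run(
        ["python", "./turn_level/turn_inference.py",
         "--input_path", input_folder, "--output_path", output_folder],
        check=True,
    )
    # Step 2: compute and display the evaluation metrics.
    subprocess.run(
        ["python", "./turn_level/turn_metric_display.py",
         "--output_path", output_folder],
        check=True,
    )
```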



##### Session-level evaluation

- To simulate the predicted sessions, use the following command in simulate mode, where `INPUT_PATH`, `OUTPUT_PATH`, and `EVAL_PATH` are the paths for the test input, the generated simulations, and the simulation evaluation results, respectively.

```
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- After session simulation, you can calculate and save the evaluation metrics using the eval mode as follows.

```
python ./session_level/session_simulate.py --mode eval --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file (via `--output_excel`).

```
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
```

You can specify the LLM used for generation, the LLM used as a judge, and the LLM used for environment simulation from the command line.
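
For reference, each line of the evaluation files written to `EVAL_PATH` carries the fields consumed by `session_metric_display.py` (`success_gpt`, `progress_gpt`, `right_api_num`, `all_api_num_gt`, `all_api_num_pre`). The sketch below recomputes the per-scenario metrics for a single file, mirroring that script; the file path is a placeholder.

```
import jsonlines

def scenario_metrics(eval_file):
    """Recompute the per-scenario metrics reported by session_metric_display.py."""
    success, progress = [], []
    right_api = all_api_gt = all_api_pre = 0
    with jsonlines.open(eval_file) as reader:
        for obj in reader:
            success.append(int(obj["success_gpt"]))
            progress.append(float(obj["progress_gpt"]))
            right_api += obj["right_api_num"]
            all_api_gt += obj["all_api_num_gt"]
            all_api_pre += obj["all_api_num_pre"]
    return {
        "success_rate": sum(success) / len(success) if success else 0,
        "avg_progress": sum(progress) / len(progress) if progress else 0,
        "tool_precision": right_api / all_api_pre if all_api_pre else 0,
        "tool_recall": right_api / all_api_gt if all_api_gt else 0,
    }

print(scenario_metrics("EVAL_PATH/some_scenario.jsonl"))  # placeholder path
```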




##### Future plans

Beyond the scenarios presented in the paper, we will incorporate additional scenarios. We will also keep refining the benchmark quality and the evaluation framework as part of our future work.



### Citation

If you use or extend our work, please cite the paper as follows:

```
@misc{xiao2024flowbenchrevisitingbenchmarkingworkflowguided,
title={FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents},
author={Ruixuan Xiao and Wentao Ma and Ke Wang and Yuchuan Wu and Junbo Zhao and Haobo Wang and Fei Huang and Yongbin Li},
year={2024},
eprint={2406.14884},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.14884},
}
```
6 changes: 6 additions & 0 deletions FlowBench/requirements.txt
@@ -0,0 +1,6 @@
regex
pandas
numpy
openai
jsonlines
xlsxwriter
Binary file added FlowBench/resources/flowbench.png
6 changes: 6 additions & 0 deletions FlowBench/script/session_level.sh
@@ -0,0 +1,6 @@
# To simulate the predicted sessions, use the following command.
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
# After session simulation, you can calculate and save the evaluation metrics as follows.
python ./session_level/session_simulate.py --mode eval --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
# Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file.
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
4 changes: 4 additions & 0 deletions FlowBench/script/turn_level.sh
@@ -0,0 +1,4 @@
# To generate the single-turn predictions for different test samples, run:
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
# Then you can calculate and display the evaluation metrics with the following command.
python ./turn_level/turn_metric_display.py --input_path OUTPUT_FOLDER
84 changes: 84 additions & 0 deletions FlowBench/session_level/session_metric_display.py
@@ -0,0 +1,84 @@
import os
import jsonlines
import pandas as pd
import argparse


def compute_session_metrics(input_directory, output_excel=''):
    """Aggregate per-scenario and overall session-level metrics from the .jsonl evaluation files."""
    final_progress = []
    final_all_session = 0
    final_right_session = 0
    final_right_api_num = 0
    final_all_api_num_gt = 0
    final_all_api_num_pre = 0
    if output_excel:
        excel_writer = pd.ExcelWriter(output_excel, engine='xlsxwriter')
    jsonl_files = [f for f in os.listdir(input_directory) if f.endswith('.jsonl')]

    for file_name in jsonl_files:
        file_path = os.path.join(input_directory, file_name)
        gpt_success = []
        gpt_progress = []
        api_num_right = []
        api_num_all_gt = []
        api_num_all_pre = []

        data = []
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                data.append(obj)
                gpt_success.append(int(obj.get('success_gpt')))
                gpt_progress.append(float(obj.get('progress_gpt')))
                api_num_right.append(obj.get('right_api_num'))
                api_num_all_gt.append(obj.get('all_api_num_gt'))
                api_num_all_pre.append(obj.get('all_api_num_pre'))

        # Per-scenario metrics: success rate, average progress, and tool-call precision/recall.
        tmp_output = {
            "scenarios": file_name,
            "success_rate": sum(gpt_success) / len(gpt_success) if gpt_success else 0,
            "avg_progress": sum(gpt_progress) / len(gpt_progress) if gpt_progress else 0,
            "tool_precision": sum(api_num_right) / sum(api_num_all_pre) if api_num_all_pre else 0,
            "tool_recall": sum(api_num_right) / sum(api_num_all_gt) if api_num_all_gt else 0,
        }

        print(tmp_output)
        final_progress.extend(gpt_progress)
        final_all_session += len(gpt_success)
        final_right_session += sum(gpt_success)

        final_right_api_num += sum(api_num_right)
        final_all_api_num_gt += sum(api_num_all_gt)
        final_all_api_num_pre += sum(api_num_all_pre)
        if output_excel:
            df = pd.DataFrame(data)
            # Excel sheet names are limited to 31 characters.
            df.to_excel(excel_writer, sheet_name=file_name.split('.')[0][:31], index=False)

    # Overall metrics aggregated across all scenarios.
    final_gpt_success = final_right_session / final_all_session if final_all_session > 0 else 0
    final_gpt_progress = sum(final_progress) / len(final_progress) if final_progress else 0
    final_api_prec = final_right_api_num / final_all_api_num_pre if final_all_api_num_pre > 0 else 0
    final_api_recall = final_right_api_num / final_all_api_num_gt if final_all_api_num_gt > 0 else 0

    final_tmp_output = {
        "scenarios": "All",
        "success_rate": final_gpt_success,
        "avg_progress": final_gpt_progress,
        "tool_precision": final_api_prec,
        "tool_recall": final_api_recall
    }
    print("--------------")
    print(final_tmp_output)
    print(final_all_session)  # total number of evaluated sessions
    if output_excel:
        df_final = pd.DataFrame([final_tmp_output])
        df_final.to_excel(excel_writer, sheet_name='Overall Metrics', index=False)
        excel_writer.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Display session-level evaluation metrics.")

    # Add arguments
    parser.add_argument("--output_excel", help="Path to the output Excel file")
    parser.add_argument("--eval_path", required=True, help="Path to the input directory for metric display")

    # Parse arguments
    args = parser.parse_args()
    compute_session_metrics(args.eval_path, args.output_excel)
