This repository has been archived by the owner on Mar 16, 2024. It is now read-only.

Commit

Feature/add march models (#484)
* cleanup readme

* update readme results to include march models

emrgnt-cmplxty authored Aug 11, 2023
1 parent ba74732 commit 609bb93
Showing 32 changed files with 181 additions and 33 deletions.
The following sections detail the results, system prompts, instructions, and agents.

## Results

### GPT-3.5-0301-turbo

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 64.63 | 53.08 |

### GPT-4-0314

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 85.98 | 75.00 |

Note: the agent workflow cannot be run on the March models due to its dependence on the recently introduced function-calling API.
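The constraint above can be enforced with a simple guard before dispatching a run; a minimal sketch (the snapshot names and helper functions are illustrative assumptions, not code from this repository):

```python
# June 2023 (0613) snapshots were the first to expose the
# function-calling API; the March (0301/0314) snapshots do not.
FUNCTION_CALLING_MODELS = {"gpt-3.5-turbo-0613", "gpt-4-0613"}

def supports_function_calling(model: str) -> bool:
    """Return True if this snapshot can run the agent workflow."""
    return model in FUNCTION_CALLING_MODELS

def select_workflow(model: str) -> str:
    # Fall back to plain zero-shot prompting on March models.
    return "agent" if supports_function_calling(model) else "zero-shot"
```

This keeps the March models in the benchmark while restricting them to the Zero-Shot rows shown in the tables.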

### GPT-3.5-0613-turbo

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 62.20 | 54.88 |
| Vanilla Agent, No Tools | 64.02* | 58.54* |
| Advanced Agent, No Tools | 62.20* | 56.71* |
| Advanced Agent, With Interpreter | 63.41* | 55.49* |

### GPT-4-0613

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 81.71 | 76.22 |
| Vanilla Agent, No Tools | 82.93* | 70.12* |
| Advanced Agent, No Tools | 79.26* | 69.51* |
| Advanced Agent, With Interpreter | 79.87* | 70.12* |

---

The tabulated results show the performance of different configurations of the Large Language Models (LLMs) on the HumanEval and HumanEval+ benchmarks. Both GPT-3.5 and GPT-4 snapshots are evaluated across several modes, including Zero-Shot, Vanilla Agent, and Advanced Agent, with and without specific tools. An asterisk (*) denotes imputed values, where malformed agent completions are replaced with the corresponding Zero-Shot solutions, leading to improved scores in some cases. These results provide insight into the capabilities of the models and their ability to handle complex algorithmic challenges.
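The imputation step described above can be sketched in a few lines; this is a minimal illustration assuming malformed completions are represented as `None` (the function name and data shapes are illustrative, not the repository's actual implementation):

```python
def impute_results(agent_solutions: dict, zero_shot_solutions: dict) -> dict:
    """For each task, keep the agent's completion when it parsed
    cleanly, otherwise fall back to the zero-shot solution."""
    imputed = {}
    for task_id, zero_shot in zero_shot_solutions.items():
        agent = agent_solutions.get(task_id)
        imputed[task_id] = agent if agent is not None else zero_shot
    return imputed
```

Because the fallback is itself a valid solution, imputed scores (the asterisked rows) can exceed the raw agent scores listed under "Unimputed Results".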

## Approach Overview

In addition to any specific instructions, the agent's message buffer will be injected with the following prompt:
> ```python
> {PROMPT}
> ```
</details>
## References
[4] [Human Eval](https://github.com/openai/human-eval)
[5] [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
## Unimputed Results (*see above)

### GPT-3.5-0613-turbo (unimputed)

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 62.20 | 54.88 |
| Vanilla Agent, No Tools | 59.15 | 50.61 |
| Advanced Agent, No Tools | 57.32 | 57.32 |
| Advanced Agent, With Interpreter | 59.76 | 52.44 |

### GPT-4-0613-turbo (unimputed)

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 81.71 | 76.22 |
| Vanilla Agent, No Tools | 75.61 | 62.80 |


{"task_id": "HumanEval/96", "completion": "\ndef is_prime(num):\n if num < 2:\n return False\n for i in range(2, int(num**0.5) + 1):\n if num % i == 0:\n return False\n return True\n\ndef count_up_to(n):\n primes = []\n for i in range(n):\n if is_prime(i):\n primes.append(i)\n return primes\n", "raw_completion": "Observation:\n```python\ndef is_prime(num):\n if num < 2:\n return False\n for i in range(2, int(num**0.5) + 1):\n if num % i == 0:\n return False\n return True\n\ndef count_up_to(n):\n primes = []\n for i in range(n):\n if is_prime(i):\n primes.append(i)\n return primes\n```\n"}
{"task_id": "HumanEval/97", "completion": "\ndef multiply(a, b):\n return (a % 10) * (b % 10)", "raw_completion": "Observation:\n```python\ndef multiply(a, b):\n return (a % 10) * (b % 10)```\n"}
{"task_id": "HumanEval/98", "completion": "\ndef count_upper(s):\n count = 0\n vowels = ['A', 'E', 'I', 'O', 'U']\n for i, c in enumerate(s):\n if i % 2 == 0 and c in vowels:\n count += 1\n return count", "raw_completion": "Observation:\n```python\ndef count_upper(s):\n count = 0\n vowels = ['A', 'E', 'I', 'O', 'U']\n for i, c in enumerate(s):\n if i % 2 == 0 and c in vowels:\n count += 1\n return count```\n"}
{"task_id": "HumanEval/99", "completion": "\ndef closest_integer(value):\n n = float(value)\n if n >= 0:\n return int(n) if n - int(n) < 0.5 else int(n) + 1\n else:\n return int(n) if abs(n - int(n)) < 0.5 else int(n) - 1\n", "raw_completion": "```python\ndef closest_integer(value):\n n = float(value)\n if n >= 0:\n return int(n) if n - int(n) < 0.5 else int(n) + 1\n else:\n return int(n) if abs(n - int(n)) < 0.5 else int(n) - 1\n```"}
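Each line of these result files pairs a `completion` with the model's `raw_completion`, where the code sits inside a markdown fence. Extracting the former from the latter can be sketched as follows (a minimal illustration under that assumption; the helper names are not from the repository):

```python
import json
import re

# A fenced block as it appears in raw_completion: ```python ... ```
CODE_FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_completion(raw_completion: str) -> str:
    """Return the body of the first code fence, or the raw text
    unchanged when no fence is present."""
    match = CODE_FENCE.search(raw_completion)
    return match.group(1) if match else raw_completion

def load_results(path: str) -> dict:
    """Map task_id -> extracted completion for one JSONL results file."""
    results = {}
    with open(path) as handle:
        for line in handle:
            record = json.loads(line)
            results[record["task_id"]] = extract_completion(record["raw_completion"])
    return results
```

Note that some raw completions close the fence on the same line as the last statement (as in the HumanEval/97 entry above), which the non-greedy pattern still handles.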

