This repository has been archived by the owner on Mar 16, 2024. It is now read-only.

Commit

Feature/add march models (#484)
* cleanup readme

* update readme results to include march models

emrgnt-cmplxty authored Aug 11, 2023
1 parent ba74732 commit 609bb93
Showing 32 changed files with 181 additions and 33 deletions.
The following sections detail the results, system prompts, instructions, and agents.

## Results

### GPT-3.5-0301-turbo

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 64.63 | 53.08 |

### GPT-4-0314

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 85.98 | 75.00 |

Note: the agent workflow cannot be run on the March models due to its dependence on the recently introduced function-calling API.
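The constraint above can be enforced with a simple guard before dispatching a run; a minimal sketch (the snapshot names and helper functions are illustrative assumptions, not code from this repository):

```python
# June 2023 (0613) snapshots were the first to expose the
# function-calling API; the March (0301/0314) snapshots do not.
FUNCTION_CALLING_MODELS = {"gpt-3.5-turbo-0613", "gpt-4-0613"}

def supports_function_calling(model: str) -> bool:
    """Return True if this snapshot can run the agent workflow."""
    return model in FUNCTION_CALLING_MODELS

def select_workflow(model: str) -> str:
    # Fall back to plain zero-shot prompting on March models.
    return "agent" if supports_function_calling(model) else "zero-shot"
```

This keeps the March models in the benchmark while restricting them to the Zero-Shot rows shown in the tables.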

### GPT-3.5-0613-turbo

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 62.20 | 54.88 |
| Vanilla Agent, No Tools | 64.02* | 58.54* |
| Advanced Agent, No Tools | 62.20* | 56.71* |
| Advanced Agent, With Interpreter | 63.41* | 55.49* |

### GPT-4-0613

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 81.71 | 76.22 |
| Vanilla Agent, No Tools | 82.93* | 70.12* |
| Advanced Agent, No Tools | 79.26* | 69.51* |
| Advanced Agent, With Interpreter | 79.87* | 70.12* |

---

The tabulated results show the performance of different configurations of the Large Language Models (LLMs) on the HumanEval and HumanEval+ benchmarks. Both GPT-3.5 and GPT-4 snapshots are evaluated across several modes, including Zero-Shot, Vanilla Agent, and Advanced Agent, with and without specific tools. An asterisk (*) denotes imputed values, where malformed agent completions are replaced with the corresponding Zero-Shot solutions, leading to improved scores in some cases. These results provide insight into the capabilities of the models and their ability to handle complex algorithmic challenges.
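The imputation step described above can be sketched in a few lines; this is a minimal illustration assuming malformed completions are represented as `None` (the function name and data shapes are illustrative, not the repository's actual implementation):

```python
def impute_results(agent_solutions: dict, zero_shot_solutions: dict) -> dict:
    """For each task, keep the agent's completion when it parsed
    cleanly, otherwise fall back to the zero-shot solution."""
    imputed = {}
    for task_id, zero_shot in zero_shot_solutions.items():
        agent = agent_solutions.get(task_id)
        imputed[task_id] = agent if agent is not None else zero_shot
    return imputed
```

Because the fallback is itself a valid solution, imputed scores (the asterisked rows) can exceed the raw agent scores listed under "Unimputed Results".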

## Approach Overview

In addition to any specific instructions, the agent's message buffer will be injected with the following prompt:
> ```python
> {PROMPT}
> ```
</details>
## References
[4] [Human Eval](https://github.com/openai/human-eval)
[5] [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
## Unimputed Results (*see above)

### GPT-3.5-0613-turbo (unimputed)

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 62.20 | 54.88 |
| Vanilla Agent, No Tools | 59.15 | 50.61 |
| Advanced Agent, No Tools | 57.32 | 57.32 |
| Advanced Agent, With Interpreter | 59.76 | 52.44 |

### GPT-4-0613-turbo (unimputed)

| Category | HumanEval | HumanEval+ |
|----------------------------------|-----------|------------|
| Zero-Shot | 81.71 | 76.22 |
| Vanilla Agent, No Tools | 75.61 | 62.80 |


{"task_id": "HumanEval/96", "completion": "\ndef is_prime(num):\n if num < 2:\n return False\n for i in range(2, int(num**0.5) + 1):\n if num % i == 0:\n return False\n return True\n\ndef count_up_to(n):\n primes = []\n for i in range(n):\n if is_prime(i):\n primes.append(i)\n return primes\n", "raw_completion": "Observation:\n```python\ndef is_prime(num):\n if num < 2:\n return False\n for i in range(2, int(num**0.5) + 1):\n if num % i == 0:\n return False\n return True\n\ndef count_up_to(n):\n primes = []\n for i in range(n):\n if is_prime(i):\n primes.append(i)\n return primes\n```\n"}
{"task_id": "HumanEval/97", "completion": "\ndef multiply(a, b):\n return (a % 10) * (b % 10)", "raw_completion": "Observation:\n```python\ndef multiply(a, b):\n return (a % 10) * (b % 10)```\n"}
{"task_id": "HumanEval/98", "completion": "\ndef count_upper(s):\n count = 0\n vowels = ['A', 'E', 'I', 'O', 'U']\n for i, c in enumerate(s):\n if i % 2 == 0 and c in vowels:\n count += 1\n return count", "raw_completion": "Observation:\n```python\ndef count_upper(s):\n count = 0\n vowels = ['A', 'E', 'I', 'O', 'U']\n for i, c in enumerate(s):\n if i % 2 == 0 and c in vowels:\n count += 1\n return count```\n"}
{"task_id": "HumanEval/99", "completion": "\ndef closest_integer(value):\n n = float(value)\n if n >= 0:\n return int(n) if n - int(n) < 0.5 else int(n) + 1\n else:\n return int(n) if abs(n - int(n)) < 0.5 else int(n) - 1\n", "raw_completion": "```python\ndef closest_integer(value):\n n = float(value)\n if n >= 0:\n return int(n) if n - int(n) < 0.5 else int(n) + 1\n else:\n return int(n) if abs(n - int(n)) < 0.5 else int(n) - 1\n```"}
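Each line of these result files pairs a `completion` with the model's `raw_completion`, where the code sits inside a markdown fence. Extracting the former from the latter can be sketched as follows (a minimal illustration under that assumption; the helper names are not from the repository):

```python
import json
import re

# A fenced block as it appears in raw_completion: ```python ... ```
CODE_FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_completion(raw_completion: str) -> str:
    """Return the body of the first code fence, or the raw text
    unchanged when no fence is present."""
    match = CODE_FENCE.search(raw_completion)
    return match.group(1) if match else raw_completion

def load_results(path: str) -> dict:
    """Map task_id -> extracted completion for one JSONL results file."""
    results = {}
    with open(path) as handle:
        for line in handle:
            record = json.loads(line)
            results[record["task_id"]] = extract_completion(record["raw_completion"])
    return results
```

Note that some raw completions close the fence on the same line as the last statement (as in the HumanEval/97 entry above), which the non-greedy pattern still handles.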

