
Commit

Adding Simplified Coding Tasks (#645)
* adding simple tasks

* add simple human_eval

* fix yaml

* fix yaml

* remove breakpoint

* remove breakpoint

* change bsz

* merge main

* add udpated readme

* fix precommit

* restor line

* restor line

* add link to codegeex

* restor hf eval

---------

Co-authored-by: Jeremy Dohmann <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: Daniel King <[email protected]>
4 people authored Oct 11, 2023
1 parent 0045ae6 commit cdb1c28
Showing 10 changed files with 792 additions and 15 deletions.
39 changes: 37 additions & 2 deletions scripts/eval/local_data/MODEL_GAUNTLET.md
@@ -257,8 +257,43 @@ Language understanding tasks evaluate the model’s ability to understand the st
### Programming
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more.

35. HumanEval code generation
- Description: HumanEval consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
35. HumanEval Python code generation
- Description: HumanEval Python consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
36. HumanEval C++ code generation
- Description: HumanEval C++ consists of 161 C++ programming challenges, in which the model is presented with the method signature and docstring comment for a C++ program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs. The C++ translation of HumanEval comes from the [CodeGeex](https://huggingface.co/datasets/THUDM/humaneval-x/viewer/cpp) project.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
37. HumanEval JS code generation
- Description: HumanEval JS consists of 164 Javascript programming challenges, in which the model is presented with the method signature and docstring comment for a Javascript program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs. The JS translation of HumanEval comes from the [CodeGeex](https://huggingface.co/datasets/THUDM/humaneval-x/viewer/cpp) project.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
38. HumanEval Python 25% code generation
- Description: HumanEval Python 25% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided 25% of the lines in the canonical solution and expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
39. HumanEval Python 50% code generation
- Description: HumanEval Python 50% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided 50% of the lines in the canonical solution and expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
40. HumanEval Python 75% code generation
- Description: HumanEval Python 75% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided 75% of the lines in the canonical solution and expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
41. HumanEval Python simple return statement code generation
- Description: HumanEval Python simple return statement is an easier variant of HumanEval Python in which the model is provided all of the canonical solution with the exception of the return statement and is expected to complete the return statement. Additionally, this set contains only the problems for which the canonical solution has a "simple" return statement consisting only of a line of the form `return VARIABLE_NAME`. There are 37 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
42. HumanEval Python complex return statement code generation
- Description: HumanEval Python complex return statement is an easier variant of HumanEval Python in which the model is provided all of the canonical solution with the exception of the return statement and is expected to complete the return statement. Additionally, this set contains only the problems for which the canonical solution does not have a "simple" return statement as defined above. There are 127 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
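The truncation and return-statement split described above can be sketched as follows. This is an illustrative sketch only: `truncate_solution` and `is_simple_return` are hypothetical names, not functions from this repository, and the harness's actual prompt construction may differ.

```python
import re

def truncate_solution(solution: str, fraction: float) -> str:
    # Keep the first `fraction` of the canonical solution's lines,
    # as in the 25%/50%/75% easier variants described above.
    lines = solution.splitlines()
    keep = int(len(lines) * fraction)
    return "\n".join(lines[:keep])

# A "simple" return statement is a lone line of the form `return VARIABLE_NAME`.
SIMPLE_RETURN = re.compile(r"^\s*return\s+[A-Za-z_][A-Za-z0-9_]*\s*$")

def is_simple_return(line: str) -> bool:
    # Classify a return line as simple (variable only) or complex (any expression).
    return bool(SIMPLE_RETURN.match(line))
```

For example, truncating a 4-line solution at 50% keeps its first 2 lines, and `return c` is classified as simple while `return a + b` is not.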
164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.25.jsonl

Large diffs are not rendered by default.

164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.5.jsonl

Large diffs are not rendered by default.

164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.75.jsonl

Large diffs are not rendered by default.

127 changes: 127 additions & 0 deletions scripts/eval/local_data/programming/human_eval_return_complex.jsonl

Large diffs are not rendered by default.

37 changes: 37 additions & 0 deletions scripts/eval/local_data/programming/human_eval_return_simple.jsonl

Large diffs are not rendered by default.

This file was deleted.

44 changes: 42 additions & 2 deletions scripts/eval/yamls/coding_tasks.yaml
@@ -5,22 +5,62 @@ icl_tasks:
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation

-
label: human_eval_cpp
dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_js
dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_simple
dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_complex
dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_25
dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_50
dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_75
dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
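In the config above, `num_beams: 20` generates 20 candidate completions per problem and `pass_at_k: 1` scores them with pass@1. The standard unbiased pass@k estimator (introduced in the original HumanEval paper) can be sketched as below; this is the textbook formula, and the harness's exact implementation may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # samples drawn (without replacement) from n generated completions,
    # c of which are correct, passes the tests.
    if n - c < k:
        # Fewer incorrect samples than k: a correct one is always drawn.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 2 generations of which 1 is correct, pass@1 is 0.5; with 20 correct out of 20, pass@1 is 1.0.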
15 changes: 15 additions & 0 deletions scripts/eval/yamls/eval_gauntlet.yaml
@@ -123,6 +123,21 @@ eval_gauntlet:
- name: human_eval_js
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_return_simple
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_return_complex
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_25
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_50
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_75
num_fewshot: 0
random_baseline: 0.0
- name: world_knowledge_lm_task_subscore
benchmarks:
- name: jeopardy
44 changes: 42 additions & 2 deletions scripts/eval/yamls/tasks.yaml
@@ -179,21 +179,61 @@ icl_tasks:
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_cpp
dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_js
dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_simple
dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_complex
dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_25
dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_50
dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_75
dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
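Each `dataset_uri` above points to a JSONL file: one JSON object per line, one eval problem per object. A minimal loader sketch, where `load_tasks` is an illustrative name rather than a function from this repository:

```python
import json

def load_tasks(path: str) -> list[dict]:
    # Read a JSONL eval dataset, skipping any blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The field names inside each record (prompt, canonical solution, tests) depend on the harness's `code_evaluation` task schema, so inspect one line of the file before relying on a particular key.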
