To reproduce our experiments, we publicly release our experimental code and data here, organized by the corresponding ability.
We also welcome contributions of computing resources for conducting more comprehensive experiments.
- Language Generation experiments.
- Knowledge Utilization experiments.
- Knowledge Reasoning experiments.
- Symbolic Reasoning experiments.
- Mathematical Reasoning experiments.
- Human Alignment experiments.
- Tool Manipulation experiments.
Below are the results of the instruction-tuning experiments (all in a single-turn conversation) based on the LLaMA (7B) model under the chat and QA settings. We employ four instruction improvement strategies on the Self-Instruct-52K dataset, i.e., enhancing the complexity (w/ complexity), increasing the diversity (w/ diversity), balancing the difficulty (w/ difficulty), and scaling the instruction number (w/ scaling). ∗Since we select the LLaMA-7B model fine-tuned on Self-Instruct-52K as the baseline, we omit the win rate of this fine-tuned model against itself.
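As a clarification of the evaluation protocol, here is a minimal sketch of how such pairwise win rates against the Self-Instruct-52K baseline can be computed. The file paths, JSONL layout, and `judge_prefers` function are hypothetical illustrations, not part of the released code; a real setup would use a stronger LLM or human annotators as the judge.

```python
# Sketch of pairwise win-rate computation against the baseline model.
# All names here (judge_prefers, the JSONL layout) are hypothetical.
import json

def win_rate(candidate_outputs, baseline_outputs, judge_prefers):
    """Fraction of single-turn prompts where the judge prefers the
    candidate model's response over the baseline's response."""
    assert len(candidate_outputs) == len(baseline_outputs)
    wins = sum(
        judge_prefers(cand, base)  # e.g., an LLM-based pairwise judge
        for cand, base in zip(candidate_outputs, baseline_outputs)
    )
    return wins / len(candidate_outputs)

if __name__ == "__main__":
    # Responses of the model fine-tuned w/ complexity vs. the baseline
    # fine-tuned on the original Self-Instruct-52K (hypothetical paths).
    with open("outputs/w_complexity.jsonl") as f:
        cand = [json.loads(line)["response"] for line in f]
    with open("outputs/self_instruct_52k.jsonl") as f:
        base = [json.loads(line)["response"] for line in f]
    # Toy judge for illustration only: prefer the longer response.
    print(win_rate(cand, base, lambda c, b: len(c) > len(b)))
```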
Below are the results of the evaluation of LLMs on the eight abilities with specially selected tasks. The shades of the orange and blue fonts denote the performance rankings of the results among closed-source and open-source models, respectively. This table will be continuously updated as we incorporate the results of more models.
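To illustrate how the per-task orderings behind the font shades are derived, here is a short sketch; the model names and scores below are placeholders, not real results from the table.

```python
# Sketch: rank models within each group on a given task; darker shades
# in the table correspond to better ranks. Scores are placeholders.
scores = {
    "closed-source": {"Model-A": 71.2, "Model-B": 65.4},
    "open-source":   {"LLaMA-7B": 33.1, "Model-C": 41.8},
}

for group, results in scores.items():
    # Sort models from best to worst within the group.
    ranked = sorted(results, key=results.get, reverse=True)
    print(group, ranked)
```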