This folder contains configuration files for quickly refining the Alpaca-CoT dataset.
The raw data files can be downloaded from Alpaca-CoT on HuggingFace.
Use `raw_alpaca_cot_merge_add_meta.py` to select the `instruction`, `input`, and `output` columns, merge them into a `text` field joined by spaces, and add extra META info to the dataset:
python tools/preprocess/raw_alpaca_cot_merge_add_meta.py \
--src_dir <Alpaca-CoT_src_dir> \
--target_dir <target_dir> \
--num_proc <num_proc>
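The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the function name `merge_to_text`, the nested `meta` key, and the sample values are assumptions made for the example, and how the real script treats empty fields is not stated here.

```python
def merge_to_text(sample, meta):
    """Join instruction, input and output with a space and attach meta info.

    Assumption: meta info is stored under a hypothetical "meta" key.
    """
    parts = [sample.get(key, "") for key in ("instruction", "input", "output")]
    return {"text": " ".join(parts), "meta": dict(meta)}


# Illustrative sample and meta values only.
sample = {"instruction": "Add the numbers.", "input": "1 + 2", "output": "3"}
meta = {"Dataset": "alpaca", "Lang": "EN"}

result = merge_to_text(sample, meta)
print(result["text"])
```

Copying the meta dict for each sample keeps a shared tag dictionary from being mutated by later per-sample changes.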
Use `dataset_split_by_language.py` to split the dataset into EN and ZH sub-datasets:
python tools/preprocess/dataset_split_by_language.py \
--src_dir <src_dir> \
--target_dir <target_dir> \
--suffixes jsonl \
--num_proc <num_proc>
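To give a feel for what such a split produces, the sketch below partitions samples by the share of CJK characters in their `text` field. This heuristic is a stand-in chosen for the example and is not claimed to be what `dataset_split_by_language.py` does; the function names and the `threshold` value are hypothetical.

```python
def cjk_ratio(text):
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def split_by_language(samples, threshold=0.3):
    """Partition samples into EN and ZH lists (threshold is an assumed value)."""
    split = {"EN": [], "ZH": []}
    for sample in samples:
        lang = "ZH" if cjk_ratio(sample["text"]) >= threshold else "EN"
        split[lang].append(sample)
    return split


samples = [{"text": "Translate this sentence."}, {"text": "请翻译这句话。"}]
split = split_by_language(samples)
print(len(split["EN"]), len(split["ZH"]))
```

A character-ratio rule is cheap but coarse; a proper language-identification model would be needed for mixed or short texts.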
After preprocessing, modify the dataset path in `alpaca-cot-en-refine.yaml` and `alpaca-cot-zh-refine.yaml`, then run the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset:
# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
Each sample in the refined Alpaca-CoT data contains the meta info listed below:
- Language Tags:
  - EN: Instruction datasets in English
  - CN: Instruction datasets in Chinese
  - ML: [Multi-lingual] Instruction datasets in multiple languages
- Task Tags:
  - MT: [Multi-task] Datasets containing multiple tasks
  - TS: [Task-specific] Datasets tailored for specific tasks
- Generation-method:
  - HG: [Human Generated Dataset] Datasets created by humans
  - SI: [Self-Instruct] Datasets generated using self-instruct methods
  - MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
  - COL: [Collection of Dataset] Datasets made from a collection of other datasets
- `Dataset`: dataset name in Alpaca-CoT
- `origin_path`: original file path in Alpaca-CoT
- `IFT`: tagged as Instruct Fine-Tuning datasets
- `CFT`: tagged as Chat Fine-Tuning datasets
  - `CFT-SR`: tagged as Single-round Dialog datasets
  - `CFT-MR`: tagged as Multi-round Dialog datasets
  - `CFT-P`: tagged as Preference datasets

Source | Task | Gen | Lang | Dataset | IFT | CFT-SR | CFT-MR | CFT-P
---|---|---|---|---|---|---|---|---
Chain-of-Thought | MT | HG | EN/CN | Chain-of-Thought | ✅ | |||
GPT4all | MT | COL | EN | GPT4all | ✅ | ✅ | ||
GPTeacher | MT | SI | EN | GPTeacher | ✅ | |||
Guanaco | MT | SI | ML | Guanaco | ✅ | |||
HC3 | TS | MIX | EN/CN | HC3 | ✅ | ✅ | ||
alpaca | MT | SI | EN | alpaca | ✅ | |||
Natural-Instructions | MT | COL | ML | Natural-Instructions | ✅ | |||
belle_cn | TS/MT | SI | CN | belle_cn | ✅ | |||
instinwild | MT | SI | EN/CN | instinwild | ✅ | |||
prosocial-dialog | TS | MIX | EN | prosocial-dialog | ✅ | |||
finance | TS | COL | EN | finance | ✅ | |||
xP3 | MT | COL | ML | xP3 | ✅ | |||
firefly | MT | COL | CN | firefly | ✅ | |||
instruct | MT | COL | EN | instruct | ✅ | |||
CodeAlpaca | TS | SI | EN | CodeAlpaca | ✅ | |||
alpacaGPT4 | MT | SI | EN/CN | alpacaGPT4 | ✅ | ✅ | ||
webGPT | TS | MIX | EN | webGPT | ✅ | ✅ | ||
dolly | TS | HG | EN | dolly | ✅ | |||
baize | MT | COL | EN | baize | ✅ | |||
hh-rlhf | TS | MIX | EN | hh-rlhf | ✅ | ✅ | ✅ | |
OIG | MT | COL | EN | OIG | ✅ | |||
GAOKAO | MT | COL | CN | GAOKAO | ✅ | |||
camel | MT | SI | EN | camel | ✅ | |||
FLAN-Muffin | MT | COL | EN | FLAN-Muffin | ✅ | |||
COIG | MT | COL | CN | COIG | ✅ | |||
gpt4tools | MT | SI | EN | gpt4tools | ✅ | |||
ShareGPT | MT | MIX | EN | ShareGPT | ✅ | ✅ | ||
Auto-CoT | MT | COL | EN | Auto-CoT | ✅ | |||
MOSS | TS | SI | EN/CN | MOSS | ✅ | |||
ultrachat | TS | SI | EN | ultrachat | ✅ | |||
Chinese-medical | TS | COL | CN | Chinese-medical | ✅ | |||
CSL | MT | COL | CN | CSL | ✅ | |||
pCLUE | MT | COL | CN | pCLUE | ✅ | |||
news_commentary | TS | COL | CN | news_commentary | ✅ | |||
StackExchange | MT | COL | EN | StackExchange | ✅ | ✅ | ||
ConvAI2 | TS | HG | EN | ConvAI2 | ✅ | |||
FastChat | MT | SI | EN | FastChat | ✅ | |||
Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data | ✅ | |||
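
The tags listed above make it possible to select subsets of the refined data. The sketch below is only an illustration: it assumes the tags live in a `meta` dict on each sample with boolean flags, and the dataset names `example_a` and `example_b` and their tag values are invented for the example rather than taken from the table.

```python
def has_tags(sample, lang, required_flags):
    """Return True if the sample has the given language tag and all flags set.

    Assumption: tags are stored in a "meta" dict with boolean flag values.
    """
    meta = sample["meta"]
    return meta.get("Lang") == lang and all(
        meta.get(flag, False) for flag in required_flags
    )


# Hypothetical samples; names and tag values are made up for illustration.
samples = [
    {"text": "...", "meta": {"Dataset": "example_a", "Lang": "EN", "IFT": True}},
    {"text": "...", "meta": {"Dataset": "example_b", "Lang": "CN", "IFT": True}},
    {"text": "...", "meta": {"Dataset": "example_b", "Lang": "EN", "CFT-MR": True}},
]

ift_en = [s for s in samples if has_tags(s, "EN", ["IFT"])]
print([s["meta"]["Dataset"] for s in ift_en])
```

Treating a missing flag as `False` lets the same filter work on samples that carry only some of the tags.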