Refine Alpaca-CoT Config Files

This folder contains configuration files for refining the Alpaca-CoT dataset quickly and easily.

Preprocess

The raw data files can be downloaded from Alpaca-CoT on HuggingFace.

Convert raw Alpaca-CoT data to jsonl

Use raw_alpaca_cot_merge_add_meta.py to select the instruction, input, and output columns, merge them into a single text field separated by spaces, and add extra meta info to the dataset:

python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
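As a rough illustration of the merge step described above, the following Python sketch joins the three columns into one text field and attaches meta info. It is not the actual raw_alpaca_cot_merge_add_meta.py implementation: the names merge_record and to_jsonl_lines, the flat layout of the meta fields, and the choice to skip empty columns are all assumptions made for this example.

```python
import json


def merge_record(record, meta):
    """Merge instruction/input/output into one `text` field and attach meta info.

    Assumption: empty columns are skipped so the result has no double spaces.
    The real script may treat empty fields differently.
    """
    columns = ("instruction", "input", "output")
    parts = [record.get(name, "") for name in columns]
    merged = dict(meta)  # hypothetical flat meta layout
    merged["text"] = " ".join(p for p in parts if p)
    return merged


def to_jsonl_lines(records, meta):
    """Serialize merged records as JSON Lines, one JSON object per line."""
    return [json.dumps(merge_record(r, meta), ensure_ascii=False) for r in records]
```

Each returned line can be written directly to a .jsonl file; ensure_ascii=False keeps Chinese text readable in the output.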

Split the dataset into sub-datasets by language

Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:

python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
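To make the idea of a language split concrete, here is a minimal Python sketch. How dataset_split_by_language.py actually detects language is not described here, so this example substitutes a simple heuristic: the share of CJK characters in the text field. The function names, the 0.3 threshold, and the heuristic itself are assumptions, not the tool's method.

```python
def cjk_ratio(text):
    """Return the fraction of non-space characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def split_by_language(samples, threshold=0.3):
    """Partition samples into EN and ZH buckets.

    Assumption: a sample counts as ZH when at least `threshold` of its
    non-space characters are CJK ideographs; everything else goes to EN.
    """
    buckets = {"EN": [], "ZH": []}
    for sample in samples:
        lang = "ZH" if cjk_ratio(sample.get("text", "")) >= threshold else "EN"
        buckets[lang].append(sample)
    return buckets
```

A character-ratio heuristic is cheap but crude; mixed-language samples near the threshold would need a proper language identification model.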

Process

After preprocessing, update the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, then run the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset.

# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml

Meta Info

Each sample in the refined Alpaca-CoT data contains the meta info listed below:

Alpaca-CoT original meta info

  • Language Tags:
    • EN: Instruction datasets in English
    • CN: Instruction datasets in Chinese
    • ML: [Multi-lingual] Instruction datasets in multiple languages
  • Task Tags:
    • MT: [Multi-task] Datasets containing multiple tasks
    • TS: [Task-specific] Datasets tailored for specific tasks
  • Generation-method:
    • HG: [Human Generated Dataset] Datasets created by humans
    • SI: [Self-Instruct] Datasets generated using self-instruct methods
    • MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
    • COL: [Collection of Dataset] Datasets made from a collection of other datasets

Data-Juicer Meta info

  • Dataset: dataset name in Alpaca-CoT

  • origin_path: original file path in Alpaca-CoT

  • IFT: tagged as Instruct Fine-Tuning datasets

  • CFT: tagged as Chat Fine-Tuning datasets

    • CFT-SR: tagged as Single-round Dialog datasets

    • CFT-MR: tagged as Multi-round Dialog datasets

    • CFT-P: tagged as Preference datasets
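The tags above lend themselves to simple filtering of the refined data. The sketch below selects samples carrying a given set of tags. The storage layout is an assumption: it supposes each sample keeps its tags as booleans in a dict under a hypothetical meta key, which may not match the real files, and the sample records are invented for illustration.

```python
def has_tags(sample, required):
    """Return True if every tag in `required` is set to a truthy value.

    Assumption: tags live as booleans under a hypothetical `meta` dict.
    """
    meta = sample.get("meta", {})
    return all(bool(meta.get(tag)) for tag in required)


def select_by_tags(samples, required):
    """Keep only the samples that carry all of the required tags."""
    return [s for s in samples if has_tags(s, required)]


# Invented records, for illustration only.
samples = [
    {"text": "a", "meta": {"Dataset": "example-a", "IFT": True}},
    {"text": "b", "meta": {"Dataset": "example-b", "CFT-SR": True, "CFT-P": True}},
]
preference_samples = select_by_tags(samples, ["CFT-P"])
```

Passing several tags, for example ["CFT-SR", "CFT-P"], keeps only samples that carry all of them.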

Refined Alpaca-CoT dataset Meta info

Task Gen Lang Dataset IFT CFT-SR CFT-MR CFT-P
Chain-of-Thought MT HG EN/CN Chain-of-Thought ✅
GPT4all MT COL EN GPT4all ✅ ✅
GPTeacher MT SI EN GPTeacher ✅
Guanaco MT SI ML Guanaco ✅
HC3 TS MIX EN/CN HC3 ✅ ✅
alpaca MT SI EN alpaca ✅
Natural-Instructions MT COL ML Natural-Instructions ✅
belle_cn TS/MT SI CN belle_cn ✅
instinwild MT SI EN/CN instinwild ✅
prosocial-dialog TS MIX EN prosocial-dialog ✅
finance TS COL EN finance ✅
xP3 MT COL ML xP3 ✅
firefly MT COL CN firefly ✅
instruct MT COL EN instruct ✅
CodeAlpaca TS SI EN CodeAlpaca ✅
alpacaGPT4 MT SI EN/CN alpacaGPT4 ✅ ✅
webGPT TS MIX EN webGPT ✅ ✅
dolly TS HG EN dolly ✅
baize MT COL EN baize ✅
hh-rlhf TS MIX EN hh-rlhf ✅ ✅ ✅
OIG MT COL EN OIG ✅
GAOKAO MT COL CN GAOKAO ✅
camel MT SI EN camel ✅
FLAN-Muffin MT COL EN FLAN-Muffin ✅
COIG MT COL CN COIG ✅
gpt4tools MT SI EN gpt4tools ✅
ShareGPT MT MIX EN ShareGPT ✅ ✅
Auto-CoT MT COL EN Auto-CoT ✅
MOSS TS SI EN/CN MOSS ✅
ultrachat TS SI EN ultrachat ✅
Chinese-medical TS COL CN Chinese-medical ✅
CSL MT COL CN CSL ✅
pCLUE MT COL CN pCLUE ✅
news_commentary TS COL CN news_commentary ✅
StackExchange MT COL EN StackExchange ✅ ✅
ConvAI2 TS HG EN ConvAI2 ✅
FastChat MT SI EN FastChat ✅
Tabular-LLM-Data MT COL EN/CN Tabular-LLM-Data ✅