Merge branch 'main' into release/2.0
Jintao-Huang committed Apr 23, 2024
2 parents 480b4eb + 6d40dd1 commit 7519e10
Showing 13 changed files with 70 additions and 110 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -292,6 +292,7 @@ swift sft \
```

#### Deepspeed Training
Deepspeed supports training GPTQ- and AWQ-quantized models.

ZeRO2:
```shell
@@ -432,6 +433,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
```

### Supported Models
The complete list of supported models and datasets can be found at [Supported Models and Datasets List](https://idealab.alibaba-inc.com/docs/source/LLM/Supported-Models-and-Datasets.md).

#### LLMs

@@ -470,7 +472,8 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| c4ai-command-r | [c4ai](https://cohere.com/command) | Multilingual | 35B-104B | chat model |
| WizardLM2 | [WizardLM2 series models](https://github.com/nlpxucan/WizardLM) | English | 7B-8x22B<br>including quantized versions | chat model<br>MoE model |
| Atom | [Atom](https://github.com/LlamaFamily/Llama-Chinese) | Chinese | 7B | base model<br>chat model |
| Chinese-LLaMA-Alpaca-2 | [Chinese-LLaMA-Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2) | Chinese | 1.3B-13B | base model<br>chat model<br>long text model |
| ModelScope-Agent | [ModelScope Agent series models](https://github.com/modelscope/modelscope-agent) | Chinese | 7B-14B | agent model |

#### MLLMs

5 changes: 4 additions & 1 deletion README_CN.md
@@ -290,6 +290,7 @@ swift sft \
```

#### Deepspeed Training
Deepspeed supports training GPTQ- and AWQ-quantized models.

ZeRO2:
```shell
@@ -429,6 +430,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
```

### Supported Models
The complete list of supported models and datasets can be found in the [Supported Models and Datasets List](docs/source/LLM/支持的模型和数据集.md).

#### LLMs

@@ -467,7 +469,8 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| c4ai-command-r | [c4ai](https://cohere.com/command) | Multilingual | 35B-104B | chat model |
| WizardLM2 | [WizardLM2 series models](https://github.com/nlpxucan/WizardLM) | Multilingual | 7B-8x22B<br>including quantized versions | chat model<br>MoE model |
| Atom | [Atom](https://github.com/LlamaFamily/Llama-Chinese) | Chinese | 7B | base model<br>chat model |
| Chinese-LLaMA-Alpaca-2 | [Chinese-LLaMA-Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2) | Chinese | 1.3B-13B | base model<br>chat model<br>long text model |
| ModelScope-Agent | [ModelScope Agent series models](https://github.com/modelscope/modelscope-agent) | Chinese | 7B-14B | agent model |


#### MLLMs
56 changes: 0 additions & 56 deletions ROADMAP.md

This file was deleted.

8 changes: 4 additions & 4 deletions docs/source/LLM/自定义与拓展.md
@@ -26,11 +26,11 @@
2. `--custom_val_dataset_path`: The default value is `[]`, indicating that no custom validation dataset is used. If you specify `custom_train_dataset_path`, the validation set of the custom dataset will be split according to the command-line argument `dataset_test_ratio`.

The script supports the `csv`, `json`, and `jsonl` file formats. You need to make the files you pass in conform to the following dataset formats. The csv format only supports instruction tuning, i.e. the case without history. The json and jsonl formats support system and history.
The script supports the `csv`, `json`, and `jsonl` file formats. You need to make the files you pass in conform to the following dataset formats. All of the formats below support system. The `json` and `jsonl` formats support multi-turn dialogue (`csv` does not).

**Format 1:**

Pre-Training:

```csv
response
AAAAA
```

```jsonl
{"response": "AAAAA"}
```

Single-Round Dialogue:

```csv
query,response
AAAAA,BBBBB
```

```jsonl
{"query": "AAAAA", "response": "BBBBB"}
```

Multi-Round Dialogue:

```jsonl
{"query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```
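
Such a `jsonl` file can be produced with a few lines of Python. A minimal sketch (the file name and records are illustrative), writing the query/response schema above with the optional `history` and `system` fields; the result can then be passed via `--custom_train_dataset_path`:

```python
import json

# Illustrative records in the schema described above: a plain
# single-turn row, and a multi-turn row with history and system.
rows = [
    {'query': '55555', 'response': '66666'},
    {'query': 'EEEEE', 'response': 'FFFFF',
     'history': [['AAAAA', 'BBBBB'], ['CCCCC', 'DDDDD']],
     'system': 'You are a helpful assistant.'},
]
with open('custom_train.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
```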
2 changes: 1 addition & 1 deletion docs/source_en/LLM/Customization.md
@@ -26,7 +26,7 @@ The corresponding example sh script can be found [here](https://github.com/model
2. `--custom_val_dataset_path`: The default value is `[]`, indicating not to use a custom validation dataset. If you specify `custom_train_dataset_path`, then the validation set of the custom dataset will be split according to the command line argument `dataset_test_ratio`.

The script supports file formats including `csv`, `json`, and `jsonl`. You need to ensure the passed in files conform to the following dataset formats. csv files only support instruction tuning, i.e. the case without history. json and jsonl files support system and history.
The supported file formats for the script include `csv`, `json`, and `jsonl`. You need to ensure that the incoming files conform to the following dataset formats. Both `json` and `jsonl` formats support multi-turn dialogues (`csv` does not support this).

**Format 1:**

@@ -8,7 +8,7 @@ torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 29500 \
    llm_sft.py \
    --model_id_or_path OpenBuddy/openbuddy-mistral-7b-v13.1 \
    --model_id_or_path OpenBuddy/openbuddy-mistral-7b-v17.1-32k \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
@@ -8,7 +8,7 @@ torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 29500 \
    llm_sft.py \
    --model_id_or_path OpenBuddy/openbuddy-mistral-7b-v13.1 \
    --model_id_or_path OpenBuddy/openbuddy-mistral-7b-v17.1-32k \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
4 changes: 3 additions & 1 deletion swift/llm/utils/argument.py
@@ -309,7 +309,7 @@ class SftArguments(ArgumentsBase):
        })
    output_dir: str = 'output'
    add_output_dir_suffix: Optional[bool] = None
    ddp_backend: Literal['nccl', 'gloo', 'mpi', 'ccl'] = None
    ddp_backend: Optional[Literal['nccl', 'gloo', 'mpi', 'ccl']] = None
    ddp_find_unused_parameters: Optional[bool] = None
    ddp_broadcast_buffers: Optional[bool] = None

@@ -658,6 +658,8 @@ def __post_init__(self) -> None:
            else:
                torch.cuda.set_device(local_rank)
            self.seed += rank  # Avoid the same dropout
            if self.ddp_backend is None:
                self.ddp_backend = 'nccl'
            if self.ddp_backend == 'gloo' and self.quantization_bit != 0:
                raise ValueError('not supported, please use `nccl`')
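
# A hedged, standalone restatement of the logic above (the helper name is
# hypothetical and not part of this commit): default to `nccl` when unset,
# and reject `gloo` for quantized training.
from typing import Literal, Optional


def resolve_ddp_backend(
        ddp_backend: Optional[Literal['nccl', 'gloo', 'mpi', 'ccl']],
        quantization_bit: int) -> str:
    if ddp_backend is None:
        ddp_backend = 'nccl'
    if ddp_backend == 'gloo' and quantization_bit != 0:
        raise ValueError('not supported, please use `nccl`')
    return ddp_backend


assert resolve_ddp_backend(None, 0) == 'nccl'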

34 changes: 28 additions & 6 deletions swift/llm/utils/dataset.py
@@ -67,6 +67,8 @@ class DatasetName:
    open_orca_gpt4 = 'open-orca-gpt4'
    sharegpt_gpt4 = 'sharegpt-gpt4'
    sharegpt_gpt4_mini = 'sharegpt-gpt4-mini'
    deepctrl_sft_zh = 'deepctrl-sft-zh'
    deepctrl_sft_en = 'deepctrl-sft-en'
    # agent
    ms_agent = 'ms-agent'
    ms_agent_for_agentfabric_default = 'ms-agent-for-agentfabric-default'
@@ -549,7 +551,7 @@ def _preprocess_aishell1_dataset(dataset: HfDataset) -> HfDataset:


def _repair_agent_conversations(conversations: str,
                                use_mini: bool) -> List[Dict[str, str]]:
    if use_mini:
        pattern = r'\d\. {"plugin_name": "(.+?)"'
    else:
@@ -562,13 +564,14 @@ def _repair_agent_conversations(conversations: str,
    find_list = re.findall(pattern, conversations[:idx])
    if len(set(find_list)) <= 1:
        return
    if isinstance(conversations, str):
        conversations = ast.literal_eval(conversations)
    if len(conversations) == 1:
        return
    return conversations


def _repair_ms_bench(conversations: str) -> List[Dict[str, str]]:
    if isinstance(conversations, str):
        conversations = ast.literal_eval(conversations)
    default_system = 'You are a helpful assistant.'
@@ -684,6 +687,22 @@ def map_row(row):
    get_dataset_from_repo,
    tags=['chat', 'agent', 'multi-round'])

register_dataset(
    DatasetName.deepctrl_sft_zh,
    'AI-ModelScope/deepctrl-sft-data', [['default', 'train']],
    None,
    SmartPreprocessor(),
    get_dataset_from_repo,
    tags=['chat', 'general', 'sft', 'multi-round'])

register_dataset(
    DatasetName.deepctrl_sft_en,
    'AI-ModelScope/deepctrl-sft-data', [['en', 'train']],
    None,
    SmartPreprocessor(),
    get_dataset_from_repo,
    tags=['chat', 'general', 'sft', 'multi-round'])
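
# A hedged usage sketch for the two new registrations (not part of this
# commit): it assumes `get_dataset` and `DatasetName` are exported from
# swift.llm and that `get_dataset` accepts a list of registered dataset
# names and returns a (train, val) pair.
from swift.llm import DatasetName, get_dataset

train_dataset, val_dataset = get_dataset([DatasetName.deepctrl_sft_zh])
print(train_dataset[0])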

advertise_gen_prompt = """Task: Generating advertisements based on keywords.
Keywords: {query}
Advertisements:"""
@@ -1066,7 +1085,8 @@ def _preprocess_sharegpt(dataset: HfDataset) -> HfDataset:
    response = []
    history: List[History] = []
    for d in tqdm(dataset):
        conversation = d['conversation']
        # Rows may already be decoded lists; only literal_eval raw strings.
        if isinstance(conversation, str):
            conversation = ast.literal_eval(conversation)
        query.append(conversation[-1]['human'])
        response.append(conversation[-1]['assistant'])
        h = []
@@ -1316,9 +1336,11 @@ def _preprocess_leetcode_python(dataset: HfDataset) -> HfDataset:
]


def _repair_conversations_agent_instruct(s: str) -> List[Dict[str, Any]]:
    # Repair and parse only raw strings; already-decoded rows pass through.
    if isinstance(s, str):
        s = s.replace('}\n {', '},\n {')
        s = ast.literal_eval(s)
    return s
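
# The shared pattern behind these repair helpers, as a standalone sketch
# (the helper name and sample row are illustrative, not part of this
# commit): hub rows may arrive either as raw strings or as already-decoded
# lists, so ast.literal_eval must only run on strings.
import ast


def _ensure_parsed(conversations):
    if isinstance(conversations, str):
        conversations = ast.literal_eval(conversations)
    return conversations


assert (_ensure_parsed("[{'from': 'user', 'value': 'hi'}]") ==
        _ensure_parsed([{'from': 'user', 'value': 'hi'}]))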


register_dataset(
1 change: 1 addition & 0 deletions swift/llm/utils/model.py
@@ -2377,6 +2377,7 @@ def _git_clone_github(github_url: str,
    command = f'git -C {git_cache_dir} clone {github_url} {local_repo_name}'
    logger.info(f'Run the command: `{command}`')
    os.system(command)
    logger.info(f'local_repo_path: {local_repo_path}')
    return local_repo_path


24 changes: 20 additions & 4 deletions swift/llm/utils/preprocess.py
@@ -39,9 +39,16 @@ def __init__(self,
    def __call__(self, dataset: HfDataset) -> HfDataset:
        query: List[str] = []
        response = []
        system = None
        history = None
        for i, d in tqdm(enumerate(dataset)):
            inst, inp = d['instruction'], d.get('input', None)
            h, output = d.pop('history', None), d['output']
            sys = d.pop('system', None)
            if history is None and h is not None:
                # Backfill placeholders for the rows collected so far; the
                # committed `range(i - 1)` under-counts by one.
                history = [None for _ in range(len(query))]
            if system is None and sys is not None:
                system = [None for _ in range(len(query))]
            if output is None:
                continue
            if inp is None or len(inp) == 0:
@@ -52,7 +59,16 @@ def __call__(self, dataset: HfDataset) -> HfDataset:
                q = f'{inst}\n{inp}'
            query.append(q)
            response.append(output)
            if history is not None:
                history.append(h)
            if system is not None:
                system.append(sys)
        d_dict = {'query': query, 'response': response}
        if history is not None:
            d_dict['history'] = history
        if system is not None:
            d_dict['system'] = system
        dataset = HfDataset.from_dict(d_dict)
        return dataset
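
# A hedged demonstration of the new passthrough (not part of this commit).
# It assumes AlpacaPreprocessor is importable from swift.llm and
# constructible with no arguments; the rows mirror the updated
# tests/llm/data/alpaca.jsonl fixture below.
from datasets import Dataset as HfDataset

from swift.llm import AlpacaPreprocessor

rows = [
    {'instruction': 'aaaaa', 'input': None, 'output': 'ccccc'},
    {'instruction': '11111', 'input': '22222', 'output': '33333',
     'history': [['aaaaa', 'bbbbb']], 'system': 'system123'},
]
processed = AlpacaPreprocessor()(HfDataset.from_list(rows))
# Expect query/response plus the new optional history/system columns.
print(processed.column_names)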


35 changes: 2 additions & 33 deletions swift/trainers/trainers.py
@@ -268,40 +268,9 @@ def compute_loss(self, model, inputs, return_outputs=None):

    def get_train_dataloader(self):

        def __iter__(self):
            self._num_yielded = 0
            if self._iterator is None:
                self._iterator = self.__original_iter__()
            return self

        def __next__(self):
            if self._num_yielded >= len(self):
                raise StopIteration
            self._num_yielded += 1
            try:
                return next(self._iterator)
            except StopIteration:
                self._iterator = self.__original_iter__()
                return next(self._iterator)

        if not use_torchacc():
            origin_loader = super().get_train_dataloader()
            grad_acc_steps = self.args.gradient_accumulation_steps
            if grad_acc_steps is None or grad_acc_steps <= 1:
                return origin_loader

            length = len(origin_loader) // grad_acc_steps * grad_acc_steps
            origin_loader_type = type(origin_loader)
            loader = type(
                origin_loader_type.__name__, (origin_loader_type, ), {
                    '__len__': lambda _: length,
                    '__iter__': __iter__,
                    '__next__': __next__
                })(
                    origin_loader.dataset)
            loader.__dict__.update(origin_loader.__dict__)
            loader.__original_iter__ = origin_loader.__iter__
            return loader
            return super().get_train_dataloader()

        else:
            if trainer.is_datasets_available():
                import datasets
2 changes: 1 addition & 1 deletion tests/llm/data/alpaca.jsonl
@@ -1,3 +1,3 @@
{"instruction": "11111", "input": "22222", "output": "33333"}
{"instruction": "11111", "input": "22222", "output": "33333", "history": [["aaaaa", "bbbbb"]], "system": "system123"}
{"instruction": "aaaaa", "output": "ccccc"}
{"instruction": "AAAAA", "input": "BBBBB", "output": "CCCCC"}
