Commit

Merge branch 'main' into release/3.0
Jintao-Huang committed Dec 26, 2024
2 parents dac37de + 37cb3c6 commit 80ac762
Showing 3 changed files with 17 additions and 3 deletions.
7 changes: 7 additions & 0 deletions docs/source/Customization/自定义数据集.md
@@ -67,6 +67,13 @@ query-response format:
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1 more?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
```

### Sequence Classification
```jsonl
{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
```

### 多模态

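For multimodal datasets, the format is the same as for the tasks above. The difference is the addition of the keys `images`, `videos`, and `audios`, which hold the multimodal resources; the `<image>`, `<video>`, and `<audio>` tags mark the positions where images, videos, and audio are inserted. The four examples below show the data format for plain text and for data containing images, videos, and audio.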
7 changes: 7 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -66,6 +66,13 @@ The following provides the recommended dataset format for ms-swift, where the sy
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
```

### Sequence Classification
```jsonl
{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
```
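
A minimal sketch of how records in the sequence classification format above could be generated and serialized to JSONL. The helper names `make_cls_record` and `to_jsonl` are hypothetical and not part of ms-swift, and the check that labels are non-negative integers is an assumption based only on the three examples shown.

```python
import json


def make_cls_record(text: str, label: int) -> dict:
    """Build one record in the messages + label layout shown above (hypothetical helper)."""
    # Assumption: labels are non-negative integer class ids, as in the examples.
    if isinstance(label, bool) or not isinstance(label, int) or label < 0:
        raise ValueError(f"label must be a non-negative int, got {label!r}")
    return {"messages": [{"role": "user", "content": text}], "label": label}


def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


samples = [
    ("The weather is really nice today", 1),
    ("Today is really unlucky", 0),
    ("So happy", 1),
]
jsonl_text = to_jsonl(make_cls_record(text, label) for text, label in samples)
print(jsonl_text)
```

Writing `jsonl_text` to a file with a `.jsonl` extension would give a dataset file in the layout documented above.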

### Multimodal

For multimodal datasets, the format is the same as the tasks mentioned above. The difference is the addition of several keys: `images`, `videos`, and `audios`, which represent multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate the positions where images, videos, and audio are inserted, respectively. The four examples provided below demonstrate the data format for pure text, as well as formats that include image, video, and audio data.
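
Independently of the documentation's own examples, the relationship between tags and resource keys described above can be sketched as a small consistency check. It assumes that each `<image>`, `<video>`, or `<audio>` tag corresponds to exactly one entry in the matching list and that every `content` value is a plain string; both are assumptions made for illustration. The function name `count_mismatches` and the file paths are hypothetical.

```python
# Tag -> resource key, as named in the paragraph above.
TAG_TO_KEY = {"<image>": "images", "<video>": "videos", "<audio>": "audios"}


def count_mismatches(record: dict) -> dict:
    """Return {key: (tags_found, resources_listed)} for every key whose counts differ."""
    # Assumption: every message's content is a plain string.
    text = "".join(message["content"] for message in record["messages"])
    mismatches = {}
    for tag, key in TAG_TO_KEY.items():
        n_tags = text.count(tag)
        n_resources = len(record.get(key, []))
        if n_tags != n_resources:
            mismatches[key] = (n_tags, n_resources)
    return mismatches


good = {
    "messages": [
        {"role": "user", "content": "<image>What is in the picture?"},
        {"role": "assistant", "content": "A cat."},
    ],
    "images": ["/path/to/cat.jpg"],  # hypothetical path
}
bad = {
    "messages": [{"role": "user", "content": "<image><image>Compare these."}],
    "images": ["/path/to/a.jpg"],  # hypothetical path; one resource for two tags
}

print(count_mismatches(good))
print(count_mismatches(bad))
```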
6 changes: 3 additions & 3 deletions swift/llm/dataset/loader.py
@@ -170,11 +170,11 @@ def _load_dataset_path(dataset_meta: DatasetMeta,
     dataset_path = dataset_meta.dataset_path

     ext = os.path.splitext(dataset_path)[1].lstrip('.')
-    ext = ext if ext != 'jsonl' else 'json'
+    file_type = {'jsonl': 'json', 'txt': 'text'}.get(ext) or ext
     kwargs = {'split': 'train', 'streaming': streaming, 'num_proc': num_proc}
-    if ext == 'csv':
+    if file_type == 'csv':
         kwargs['na_filter'] = False
-    dataset = hf_load_dataset(ext, data_files=dataset_path, **kwargs)
+    dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)

     dataset = dataset_meta.preprocess_func(
         dataset, num_proc=num_proc, strict=strict, load_from_cache_file=load_from_cache_file)
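
The extension lookup introduced in this hunk can be exercised in isolation. The sketch below copies the mapping and the keyword-argument logic from the diff into two standalone functions; the names `resolve_file_type` and `build_load_kwargs` are hypothetical, and the actual call to `hf_load_dataset` is left out so the example runs without any dataset files.

```python
import os

# Mapping copied from the diff: extensions whose loader name differs from the extension.
EXT_TO_FILE_TYPE = {'jsonl': 'json', 'txt': 'text'}


def resolve_file_type(dataset_path: str) -> str:
    """Return the loader name for a local dataset file, based on its extension."""
    ext = os.path.splitext(dataset_path)[1].lstrip('.')
    # Extensions not in the mapping (for example csv or json) pass through unchanged.
    return EXT_TO_FILE_TYPE.get(ext) or ext


def build_load_kwargs(file_type: str, streaming: bool = False, num_proc: int = 1) -> dict:
    """Hypothetical helper mirroring how the diff assembles keyword arguments."""
    kwargs = {'split': 'train', 'streaming': streaming, 'num_proc': num_proc}
    if file_type == 'csv':
        # As in the diff: only CSV files get na_filter=False.
        kwargs['na_filter'] = False
    return kwargs


for path in ['train.jsonl', 'corpus.txt', 'table.csv', 'data.json']:
    file_type = resolve_file_type(path)
    print(path, '->', file_type, build_load_kwargs(file_type))
```

Compared with the removed line, which only rewrote `jsonl` to `json`, the dictionary lookup also maps `txt` to `text`, so `.txt` files are no longer passed to the loader under the raw name `txt`. This reading assumes `hf_load_dataset` expects Hugging Face `datasets` loader names such as `json`, `csv`, and `text`.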