We found that some "bad" samples remain in existing processed datasets (e.g., RedPajama, The Pile), so we used Data-Juicer to refine them and feed the refined data to LLMs for better performance.
We use a simple 3-σ rule to set the hyperparameters of the OPs in each recipe: for each stat, samples whose values fall outside [μ − 3σ, μ + 3σ] of the dataset's distribution are filtered out.
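The YAML excerpt below is a minimal sketch of how such 3-σ thresholds typically surface in a refine recipe. The OP names follow Data-Juicer's filter naming, but the concrete threshold values are illustrative assumptions, not taken from any shipped recipe.

```yaml
process:
  - perplexity_filter:     # keep samples whose text perplexity stays below μ + 3σ
      lang: en
      max_ppl: 5500        # illustrative value: roughly μ + 3σ of the perplexity distribution
  - words_num_filter:      # keep samples whose word count lies within μ ± 3σ
      min_num: 20          # illustrative lower bound
      max_num: 10000       # illustrative upper bound
```

The refined pre-training datasets are listed below.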
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml<br>stack-code-refine.yaml<br>redpajama-stack-code-deduplicate.yaml | Aliyun / ModelScope / HuggingFace | Redpajama / The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama / The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
The following fine-tuning subsets are refined from the Alpaca-CoT collection:

subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun / ModelScope / HuggingFace | 39 Subsets of Alpaca-CoT |
Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun / ModelScope / HuggingFace | 28 Subsets of Alpaca-CoT |
The following multimodal subsets are refined as well:

subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun / ModelScope / HuggingFace | LLaVA-1.5 |
Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | 2_multi_op_pipline.yaml | Aliyun / ModelScope / HuggingFace | InternVid (606k) / Panda-70M (605k) / MSR-VTT (6k) |
- LLaVA pretrain (LCS-558k): models pretrained on the refined dataset and fine-tuned on the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
We provide an example video dataset processing recipe, general-video-refine-example.yaml, to help users better utilize video-related OPs. It applies three types of OPs:
- Text-Only: to improve the dataset quality according to the video captions.
- Video-Only: to improve the dataset quality according to the video features.
- Text-Video: to improve the dataset quality according to the alignment between text and videos.

Users can start processing their own video datasets based on this recipe; a sketch of how the three OP types combine follows below.
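The YAML sketch below illustrates how the three OP types might be combined in one recipe. The OP names and parameters are assumptions modeled on Data-Juicer's video-related OPs, with illustrative threshold values; consult general-video-refine-example.yaml and the OP documentation for the authoritative names and defaults.

```yaml
process:
  # Text-Only: judge quality from the video captions alone
  - language_id_score_filter:             # assumed OP: keep captions confidently identified as English
      lang: en
      min_score: 0.8                      # illustrative confidence threshold
  # Video-Only: judge quality from video features alone
  - video_duration_filter:                # assumed OP: keep clips of moderate length
      min_duration: 2                     # seconds; illustrative bounds
      max_duration: 60
  # Text-Video: judge quality from text-video alignment
  - video_frames_text_similarity_filter:  # assumed OP: keep samples whose sampled frames match the caption
      min_score: 0.25                     # illustrative similarity threshold
```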