We found that some "bad" samples remain in existing processed datasets (e.g., RedPajama, The Pile), so we used Data-Juicer to refine them and feed the refined data to LLMs for better performance.
We use a simple 3-σ rule to set the hyperparameters of the OPs in each recipe: for each stat, samples whose values fall outside [μ − 3σ, μ + 3σ] of the dataset's distribution are filtered out.
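The YAML excerpt below is a minimal sketch of how such 3-σ thresholds typically surface in a refine recipe. The OP names follow Data-Juicer's filter naming, but the concrete threshold values are illustrative assumptions, not taken from any shipped recipe.

```yaml
process:
  - perplexity_filter:     # keep samples whose text perplexity stays below μ + 3σ
      lang: en
      max_ppl: 5500        # illustrative value: roughly μ + 3σ of the perplexity distribution
  - words_num_filter:      # keep samples whose word count lies within μ ± 3σ
      min_num: 20          # illustrative lower bound
      max_num: 10000       # illustrative upper bound
```

The refined pre-training datasets are listed below.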
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml<br>stack-code-refine.yaml<br>redpajama-stack-code-deduplicate.yaml | Aliyun / ModelScope / HuggingFace | Redpajama / The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun / ModelScope / HuggingFace | Redpajama / The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun / ModelScope / HuggingFace | The Pile |
The following fine-tuning subsets are refined from the Alpaca-CoT collection:

subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun / ModelScope / HuggingFace | 39 Subsets of Alpaca-CoT |
Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun / ModelScope / HuggingFace | 28 Subsets of Alpaca-CoT |
The following multimodal subsets are refined as well:

subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun / ModelScope / HuggingFace | LLaVA-1.5 |
Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | 2_multi_op_pipline.yaml | Aliyun / ModelScope / HuggingFace | InternVid (606k) / Panda-70M (605k) / MSR-VTT (6k) |
- LLaVA pretrain (LCS-558k): models pretrained on the refined dataset and fine-tuned on the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
We provide an example video dataset processing recipe, general-video-refine-example.yaml, to help users better utilize video-related OPs. It applies three types of OPs:
- Text-Only: to improve the dataset quality according to the video captions.
- Video-Only: to improve the dataset quality according to the video features.
- Text-Video: to improve the dataset quality according to the alignment between text and videos.

Users can start processing their own video datasets based on this recipe; a sketch of how the three OP types combine follows below.
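The YAML sketch below illustrates how the three OP types might be combined in one recipe. The OP names and parameters are assumptions modeled on Data-Juicer's video-related OPs, with illustrative threshold values; consult general-video-refine-example.yaml and the OP documentation for the authoritative names and defaults.

```yaml
process:
  # Text-Only: judge quality from the video captions alone
  - language_id_score_filter:             # assumed OP: keep captions confidently identified as English
      lang: en
      min_score: 0.8                      # illustrative confidence threshold
  # Video-Only: judge quality from video features alone
  - video_duration_filter:                # assumed OP: keep clips of moderate length
      min_duration: 2                     # seconds; illustrative bounds
      max_duration: 60
  # Text-Video: judge quality from text-video alignment
  - video_frames_text_similarity_filter:  # assumed OP: keep samples whose sampled frames match the caption
      min_score: 0.25                     # illustrative similarity threshold
```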