The following introduction comes from the abstract of *Compression Represents Intelligence Linearly*:

> There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of large language models (LLMs): the development of more advanced language models is essentially enhancing compression which facilitates intelligence. ...our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.
- Paper: Compression Represents Intelligence Linearly
- GitHub Repository: llm-compression-intelligence
The dataset, which consists of three external corpora, can be downloaded with the following Python script:

```python
import os
import os.path as osp

from datasets import load_dataset

# Local directory where the per-subset JSONL files will be written
data_path = "data/llm-compression"
os.makedirs(data_path, exist_ok=True)

# Maps the local file name to the corresponding subset on the Hugging Face Hub
subset_mapping = {
    'arxiv_math': ['arxiv_math'],
    'commoncraw': ['cc'],
    'python': ['python'],
}

for key, value in subset_mapping.items():
    llmc_dataset = load_dataset("hkust-nlp/llm-compression", name=value[0])
    llmc_dataset["test"].to_json(osp.join(data_path, f"{key}.jsonl"))
```
Note: Refer to the original repository for more details on data collection and design.
The inference stage (`SWCELossInferencer`) consists of the following key steps:

- For each candidate model, obtain the encodings of each sample in the dataset using its tokenizer.
- Concatenate the encodings of all samples into a single array and construct a PyTorch Dataset whose `__getitem__` returns a chunk of the array based on a sliding window (see the sketch after this list). To reproduce the results from the original paper, set `block_size=1900` and `stride=512`.
- For each batch, calculate the cross-entropy loss from the model logits and targets. The losses within each batch are reduced to a single loss by summation.
- Output the losses and `total_chr_num` to `BPCEvaluator` for evaluation.
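For illustration, here is a minimal sketch of such a sliding-window dataset. The class name, tensor shapes, and handling of the final partial window are assumptions made for this example; refer to `SWCELossInferencer` in the repository for the actual implementation.

```python
import torch
from torch.utils.data import Dataset


class SlidingWindowChunks(Dataset):
    """Illustrative only: fixed-size windows over one long array of token ids."""

    def __init__(self, token_ids: torch.Tensor, block_size: int = 1900, stride: int = 512):
        self.token_ids = token_ids      # concatenated encodings of all samples
        self.block_size = block_size    # length of each window fed to the model
        self.stride = stride            # step between consecutive window starts
        self.starts = range(0, max(len(token_ids) - block_size, 0) + 1, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        start = self.starts[idx]
        window = self.token_ids[start:start + self.block_size]
        # Inputs and next-token targets for the cross-entropy loss
        return window[:-1], window[1:]
```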
`BPCEvaluator`: Using the total loss of each batch and the total number of characters in the original dataset from the inference stage, calculate the Bits per Character (BPC) metric for each model.
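Following the paper's definition, BPC converts the summed cross-entropy loss (in nats) into bits and normalizes by the character count of the original text. A minimal sketch of this computation (not the exact `BPCEvaluator` code):

```python
import math

def bits_per_character(batch_losses, total_chr_num):
    """BPC = (summed cross-entropy loss in nats) / (ln 2 * total number of characters)."""
    total_loss = sum(batch_losses)  # sum of the per-batch losses from the inference stage
    return total_loss / (math.log(2) * total_chr_num)
```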
- Dataset config: `configs/datasets/llm-compression.py`
- Evaluation config: `configs/eval_llm_compression.py`
| metric | version | model | commoncraw | python | arxiv_math | average |
|---|---|---|---:|---:|---:|---:|
| bpc | af04af | qwen1.5-32b-hf | 0.5910 | 0.2584 | 0.4080 | 0.4191 |
| bpc | af04af | qwen1.5-14b-hf | 0.6459 | 0.2766 | 0.4310 | 0.4512 |
| bpc | af04af | qwen-14b-hf | 0.6197 | 0.2849 | 0.4498 | 0.4515 |
| bpc | af04af | llama-30b-hf | 0.5773 | 0.3212 | 0.4562 | 0.4516 |
| bpc | af04af | llama-2-13b-hf | 0.5807 | 0.3336 | 0.4752 | 0.4632 |
| bpc | af04af | qwen1.5-7b-hf | 0.6658 | 0.2935 | 0.4500 | 0.4698 |
| bpc | af04af | qwen-7b-hf | 0.6453 | 0.3088 | 0.4830 | 0.4790 |
| bpc | af04af | llama-13b-hf | 0.6083 | 0.3555 | 0.4865 | 0.4834 |
| bpc | af04af | llama-2-7b-hf | 0.6117 | 0.3536 | 0.4995 | 0.4883 |
| bpc | af04af | llama-7b-hf | 0.6285 | 0.3794 | 0.5096 | 0.5058 |
| bpc | af04af | qwen1.5-1.8b-hf | 0.7448 | 0.4029 | 0.5625 | 0.5701 |
| bpc | af04af | qwen-1.8b-hf | 0.7542 | 0.4175 | 0.5842 | 0.5853 |
| bpc | af04af | qwen1.5-0.5b-hf | 0.8102 | 0.4520 | 0.6181 | 0.6268 |
Q: I am getting the following warning during inference. Should I truncate long samples to `max_seq_len` to avoid further errors?

```
Token indices sequence length is longer than the specified maximum sequence length for this model. Running this sequence through the model will result in indexing errors
```
A: This warning comes from the tokenizer and indicates that the input sequence length exceeds the model's maximum input length, but it does not affect the operation of the tokenizer. For loss calculation, as long as the `block_size` of the sliding window is set to less than `max_seq_len`, this warning can be safely ignored.
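As a hypothetical illustration (the tokenizer name and lengths below are arbitrary): encoding a long sample without truncation emits the warning, but the model only ever sees `block_size`-sized chunks of the resulting ids.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with a small max length, for illustration
long_text = "word " * 5000

# Tokenizing without truncation triggers the warning, but still returns all token ids.
ids = tokenizer(long_text, truncation=False)["input_ids"]

# Only windows of at most block_size tokens are ever passed to the model,
# so no indexing error occurs as long as block_size <= max_seq_len.
block_size, stride = 1024, 512
chunks = [ids[i:i + block_size] for i in range(0, len(ids), stride)]
```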
```bibtex
@misc{huang2024compression,
      title={Compression Represents Intelligence Linearly},
      author={Yuzhen Huang and Jinghan Zhang and Zifei Shan and Junxian He},
      year={2024},
      eprint={2404.09937},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```