This is a post for the paper Information-Theoretic Probing with Minimum Description Length.
Probing classifiers often fail to adequately reflect differences in representations and can show different results depending on hyperparameters.
As an alternative to the standard probes,
  • we propose information-theoretic probing, which measures the minimum description length (MDL) of labels given representations;
  • we show that MDL characterizes both probe quality and the amount of effort needed to achieve it;
  • we explain how to easily measure MDL on top of standard probe-training pipelines;
  • we show that results of MDL probes are more informative and stable than those of standard probes.
March 2020


How to understand if a model captures a linguistic property?

How would you understand whether a model (e.g., ELMo, BERT) has learned to encode some linguistic property? Usually this question is narrowed down to the following.

We have: data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i$ are representations from a model and $y_i$ are labels for some linguistic task.
We want: to measure to what extent the representations $x$ capture the labels $y$.


Standard Probing: train a classifier, use its accuracy

The most popular approach is to use probing classifiers (aka probes, probing tasks, diagnostic classifiers). These classifiers are trained to predict a linguistic property from frozen representations, and the accuracy of the classifier is used to measure how well these representations encode the property.
Looks reasonable and simple, right? Yes, but...


Wait, how about some sanity checks?

While such probes are extremely popular, several 'sanity checks' showed that differences in accuracies fail to reflect differences in representations.

These sanity checks are different kinds of random baselines. For example, Zhang & Bowman (2018) compared probe scores for trained models and randomly initialized ones. They were able to see reasonable differences in the scores only when reducing the amount of classifier training data.

Later, Hewitt & Liang (2019) proposed to put probe accuracy in context with the probe's ability to memorize word types. They introduced so-called 'control tasks', which are constructed manually for each linguistic task. A control task defines a random output for a word type, regardless of context. For example, for POS tags, each word type is assigned a random label based on the empirical distribution of tags.
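For concreteness, here is a minimal sketch of how such a control task could be constructed; the function and the `tagged_corpus` input are hypothetical illustrations, not code from either paper.

```python
import random
from collections import Counter

def make_control_task(tagged_corpus, seed=0):
    """Assign each word type a random tag drawn from the empirical tag distribution.

    tagged_corpus: list of (word, tag) pairs, e.g. from a POS-annotated treebank.
    Returns a dict mapping each word type to its fixed, context-independent control label.
    """
    rng = random.Random(seed)
    tag_counts = Counter(tag for _, tag in tagged_corpus)
    tags, weights = zip(*tag_counts.items())
    word_types = {word for word, _ in tagged_corpus}
    # Every occurrence of a word gets the same random label, regardless of context.
    return {w: rng.choices(tags, weights=weights, k=1)[0] for w in word_types}
```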
Control tasks were used to perform an exhaustive search over hyperparameters for training probes and to find a setting with the largest difference in the scores. This was achieved by reducing the size of the probing model (e.g., fewer neurons).


Houston, we have a problem.

We see that the accuracy of a probe does not always reflect what we want it to reflect, at least not without an explicit search for a setting where it does. But then another problem arises: if we tune a probe to behave in a certain way for one task (e.g., we tune its setting to have low accuracy for a certain control task), how do we know if it is still reasonable for other tasks? Well, we do not (if you do, please tell!).

All in all, we want a probing method which does not require a human to tell it what to say.


Information-Theoretic Viewpoint

Let us come back to the initial task: we have data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with representations $x$ and labels $y$, and we want to measure to what extent $x$ encode $y$. Let us look at this task from a new perspective, different from the current most popular approach.
Our idea can be summarized in two sentences.

Regularity in representations with respect to labels can be exploited to compress the data.
Better compression ↔ stronger regularity ↔ representations better encode labels.

Basically, this is it. All that is left is to make it clear, theoretically justified, and illustrated experimentally. For this, I will need some help from Alice and Bob.


Adventures of Alice, Bob and the Data

Let us imagine that Alice has all pairs $(x_i, y_i)$ from $\mathcal{D}$, Bob has just the $x_i$, and that Alice wants to communicate the labels $y_i$ to Bob. Transmitting data is a lot of work, and surely Alice has much better things to do: let's help Alice compress the data!

Formally, the task is to encode labels $y_{1:n}$ knowing representations $x_{1:n}$ in an optimal way, i.e., with the minimal codelength (in bits) needed to transmit $y_{1:n}$. The resulting minimal number of bits is the minimum description length (MDL) of labels given representations, and we will refer to probes measuring MDL as description length probes or MDL probes.


Changing a probe's goal converts it into an MDL probe

It turns out that to evaluate MDL we don't need to change much in standard probe-training pipelines. This is done via a simple trick: we know that data can be compressed using some probabilistic model $p$, and we just set this model $p$ to be a probing classifier. This is how we turn a standard probe into an MDL probe: by changing its goal from predicting labels to transmitting data.

Note that since Bob does not know the precise model Alice is using, we will also have to transmit the model, either explicitly or implicitly. Thus, the overall codelength is a combination of the quality-of-fit of the model (compressed data length) and the cost of transmitting the model itself.

Intuitively, codelength characterizes not only the final quality of a probe (the data codelength), but also the amount of effort needed to achieve this quality.


The Data Part


TL;DR: cross-entropy is the data codelength

Let's assume for now that Alice and Bob have agreed in advance on a model $p$, and that both know the inputs $x_{1:n}$; we just need to transmit the data using this known model. Then there exists a code to transmit the labels $y_{1:n}$ losslessly with codelength

$$L_p(y_{1:n} \mid x_{1:n}) = -\sum_{i=1}^{n} \log_2 p(y_i \mid x_i). \qquad (1)$$

This is the Shannon-Huffman code, which gives the optimal bound on the codelength if the data are independent and come from the conditional probability distribution $p(y \mid x)$.


Learning is Compression

Note that (1) is exactly the categorical cross-entropy loss evaluated on the model $p$. This shows that the task of compressing the labels $y_{1:n}$ is equivalent to learning a model $p(y \mid x)$ of the data: the quality of a learned model $p(y \mid x)$ is the codelength needed to transmit the data.
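To make this equivalence concrete, here is a minimal sketch of the data codelength from (1) computed as summed cross-entropy converted to bits; it assumes a PyTorch probe and is only an illustration, not the code released with the paper.

```python
import math
import torch
import torch.nn.functional as F

def data_codelength_bits(probe, representations, labels):
    """Data codelength from Eq. (1): cross-entropy of the probe on the data, in bits.

    probe: any torch.nn.Module mapping representations [n, d] to class logits [n, K].
    representations: float tensor of shape [n, d]; labels: long tensor of shape [n].
    """
    with torch.no_grad():
        logits = probe(representations)
        # Summed cross-entropy is in nats; divide by ln(2) to convert to bits.
        nats = F.cross_entropy(logits, labels, reduction="sum")
    return nats.item() / math.log(2)
```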


Compression is compared against Uniform Code

Compression is usually compared against the uniform encoding, which does not require any learning from the data. For $K$ classes, it assumes $p(y \mid x) = \frac{1}{K}$ and yields codelength $L_{\text{unif}}(y_{1:n} \mid x_{1:n}) = n \log_2 K$ bits.
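For reference, the uniform baseline (and the compression ratio usually reported against it) is a one-liner; the usage line continues the hypothetical sketch above.

```python
import math

def uniform_codelength_bits(n_examples, n_classes):
    # No learning: every label costs log2(K) bits.
    return n_examples * math.log2(n_classes)

# Hypothetical usage: compression = uniform codelength / probe codelength.
# compression = uniform_codelength_bits(len(labels), K) / data_codelength_bits(probe, representations, labels)
```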


The Amount of Effort Part: Strength of the Regularity in the Data

Figure: Based on real events! These are t-SNE projections of representations of different occurrences of the token "is" from the last layer of an MT encoder (strong regularity) and the last LM layer (weak regularity). The colors show CCG tags (only the 4 most frequent tags).
Curious why MT and LM behave this way? Look at our Evolution of Representations post!

Now we come to the other component of the total codelength, which characterizes how hard it is to achieve the final quality of a trained probe. Intuitively, if representations have some clear structure with respect to labels, this structure can be understood with less effort; for example,
  • the "rule" explaining this structure (i.e., a probing model) can be simple, and/or
  • the amount of data needed to reveal this structure can be small.
This is exactly how our vague (so far) notion of the "amount of effort" is translated into codelength. We explain this more formally when describing the two methods for evaluating MDL that we use, variational coding and online coding; they differ in the way they incorporate model cost: directly or indirectly.


Variational Code: Explicit Transmission of the Model

The variational code is an instance of two-part codes, where a model (probe weights) is transmitted explicitly and then used to encode the data. It has been known for quite a while that this joint cost (model and data codes) is exactly the loss function of variational learning methods.


The Idea: Informal version

What is happening in this part:
  • model weights are transmitted explicitly;
  • each weight is treated as a random variable;
  • weights with higher variance can be transmitted with lower precision (i.e., fewer bits);
  • by choosing sparsity-inducing priors, we will obtain induced (sparse) probe architectures.

Codelength: Formal Derivations


May the force be with you!

Here we assume that Alice and Bob have agreed on a model class $\mathcal{H} = \{p_\theta(y \mid x) \mid \theta \in \Theta\}$, and for any model $p_\theta$, Alice first transmits its parameters $\theta$ and then encodes the data while relying on the model. The description length decomposes accordingly:

$$L(y_{1:n} \mid x_{1:n}) = L(\theta) + L_{p_\theta}(y_{1:n} \mid x_{1:n}). \qquad (2)$$

To compute the description length of the parameters, we can further assume that Alice and Bob have agreed on a prior distribution over the parameters, $\alpha(\theta)$. Now we can rewrite the total description length as

$$-\log_2 \alpha(\theta) + m \cdot c + L_{p_\theta}(y_{1:n} \mid x_{1:n}),$$

where $m$ is the number of parameters and $c$ is a prearranged precision for each parameter. With deep learning models (in our case, probing classifiers), such straightforward codes for the parameters are highly inefficient. Instead, in the variational approach, weights are treated as random variables, and the description length is given by the expectation

$$L^{\text{var}}(y_{1:n} \mid x_{1:n}) = \mathbb{E}_{\theta \sim \beta}\Big[-\log_2 \alpha(\theta) + \log_2 \beta(\theta) - \sum_{i=1}^{n}\log_2 p_\theta(y_i \mid x_i)\Big] = KL(\beta \,\|\, \alpha) + \mathbb{E}_{\theta \sim \beta}\, L_{p_\theta}(y_{1:n} \mid x_{1:n}), \qquad (3)$$

where $\beta(\theta)$ is a distribution encoding uncertainty about the parameter values. The distribution $\beta(\theta)$ is chosen by minimizing the codelength given in Eq. (3). If you are interested in the formal justification of this description length, it relies on the bits-back argument and can be found, for example, here. However, the underlying intuition is straightforward: parameters we are uncertain about can be transmitted at a lower cost, as the uncertainty can be used to determine the required precision. The entropy term in equation (3), $H(\beta) = -\mathbb{E}_{\theta \sim \beta}\log_2 \beta(\theta)$, quantifies this discount.

The negated codelength $-L^{\text{var}}(y_{1:n} \mid x_{1:n})$ is known as the evidence lower bound (ELBO) and is used as the objective in variational inference. The distribution $\beta(\theta)$ approximates the intractable posterior distribution $p(\theta \mid x_{1:n}, y_{1:n})$. Consequently, any variational method can in principle be used to estimate the codelength.
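As a simplified illustration of equation (3), here is a sketch of the variational codelength for a probe with a fully factorized Gaussian posterior β and a standard Gaussian prior α. This is a minimal mean-field example with a hypothetical `probe_fn`; it is not the Bayesian compression setup we actually use (described next).

```python
import math
import torch
import torch.nn.functional as F

def variational_codelength_bits(mu, log_sigma, probe_fn, representations, labels, n_samples=8):
    """Eq. (3) as KL(beta || alpha) plus the expected data codelength, in bits.

    mu, log_sigma: tensors parameterizing the mean-field Gaussian posterior beta over probe weights.
    probe_fn: function(weights, representations) -> logits (a hypothetical flat-weight probe).
    The prior alpha is a standard Gaussian N(0, 1); nats are converted to bits at the end.
    """
    sigma = log_sigma.exp()
    # Closed-form KL between diagonal Gaussians N(mu, sigma^2) and N(0, 1), in nats.
    kl_nats = 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma).sum()
    # Monte Carlo estimate of the expected data codelength under beta.
    data_nats = 0.0
    for _ in range(n_samples):
        weights = mu + sigma * torch.randn_like(mu)  # sample theta ~ beta
        logits = probe_fn(weights, representations)
        data_nats = data_nats + F.cross_entropy(logits, labels, reduction="sum")
    data_nats = data_nats / n_samples
    return (kl_nats + data_nats).item() / math.log(2)
```

Minimizing this quantity with respect to mu and log_sigma is exactly variational inference: the same objective yields both the trained probe and its codelength.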


We Use: Bayesian Compression

You can choose priors for the weights to get something interesting. For example, if you choose sparsity-inducing priors on the parameters, you get variational dropout. What is more interesting, you can do this in such a way that whole neurons (not just individual weights) are pruned: this is the Bayesian network compression method we use. It allows us to assess probe complexity both using its description length and by inspecting the discovered architecture.


Intuition

If regularity in the data is strong, it can be explained with a "simple rule" (i.e., a small probing model). A small probing model is easy to communicate, i.e., only a few parameters require high precision. As we will see in the experiments, similar probe accuracies often come at a very different model cost, i.e., a different total codelength.

Note that the variational code gives us the probe architecture (and model size) as a byproduct of training and does not require a manual search over probe hyperparameters.


Online Code: Implicit Transmission of the Model

Online code provides a way to transmit data without directly transmitting the model. Intuitively, it measures the ability to learn from different amounts of data.

In this setting, the data is transmitted in a sequence of portions; at each step, the data transmitted so far is used to understand the regularity in this data and compress the following portion.


Codelength: Formal Derivations

Formally, Alice and Bob agree on the form of the model $p_\theta(y \mid x)$ with learnable parameters $\theta$, its initial random seeds, and its learning algorithm. They also choose timesteps $t_1 < t_2 < \dots < t_S = n$ and encode the data by blocks. Alice starts by communicating $y_{1:t_1}$ with a uniform code; then both Alice and Bob learn a model $p_{\theta_1}(y \mid x)$ that predicts $y$ from $x$ using the data $\{(x_i, y_i)\}_{i=1}^{t_1}$, and Alice uses that model to communicate the next data block $y_{t_1+1:t_2}$. Then both Alice and Bob learn a model $p_{\theta_2}(y \mid x)$ from the larger block $\{(x_i, y_i)\}_{i=1}^{t_2}$ and use it to encode $y_{t_2+1:t_3}$. This process continues until the entire dataset is transmitted. The resulting online codelength is

$$L^{\text{online}}(y_{1:n} \mid x_{1:n}) = t_1 \log_2 K - \sum_{i=1}^{S-1} \log_2 p_{\theta_i}\big(y_{t_i+1:t_{i+1}} \mid x_{t_i+1:t_{i+1}}\big). \qquad (4)$$
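A minimal sketch of this protocol on top of a standard probe-training pipeline; the helpers `train_probe` and `data_codelength_bits`, as well as the timestep fractions, are illustrative assumptions rather than the exact implementation from the paper.

```python
import math

def online_codelength_bits(train_probe, data_codelength_bits, xs, ys, n_classes,
                           fractions=(0.001, 0.002, 0.004, 0.008, 0.016, 0.032,
                                      0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Eq. (4): transmit labels block by block, each block encoded by a probe
    trained only on the data transmitted so far.

    train_probe(xs, ys) -> trained probe; data_codelength_bits(probe, xs, ys) -> bits.
    """
    n = len(ys)
    timesteps = sorted({max(1, int(f * n)) for f in fractions})
    # The first block is sent with the uniform code: t_1 * log2(K) bits.
    total_bits = timesteps[0] * math.log2(n_classes)
    for t_prev, t_next in zip(timesteps, timesteps[1:]):
        probe = train_probe(xs[:t_prev], ys[:t_prev])  # learn from the prefix
        total_bits += data_codelength_bits(probe, xs[t_prev:t_next], ys[t_prev:t_next])
    return total_bits
```

Note that each block is encoded by a probe that has never seen it, so the total directly reflects how quickly the probe learns as the amount of training data grows.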

Data and Model components of Online Code

While the online code does not incorporate the model cost explicitly, we can still evaluate it by interpreting the difference between the online codelength and the cross-entropy of a model trained on all data, $L_{p_{\theta_S}}(y_{1:n} \mid x_{1:n})$, as the cost of the model:

$$L^{\text{online}}_{\text{model}}(y_{1:n} \mid x_{1:n}) = L^{\text{online}}(y_{1:n} \mid x_{1:n}) - L_{p_{\theta_S}}(y_{1:n} \mid x_{1:n}). \qquad (5)$$

Indeed, $L_{p_{\theta_S}}(y_{1:n} \mid x_{1:n})$ is the codelength of the data if one knows the model parameters, while $L^{\text{online}}(y_{1:n} \mid x_{1:n})$ is the codelength if one does not know them.
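Continuing with the same hypothetical helpers, the split into data and model components is then just:

```python
# Data component: codelength under a probe trained on the full dataset, in bits.
full_probe = train_probe(xs, ys)
data_bits = data_codelength_bits(full_probe, xs, ys)

# Model component (Eq. 5): whatever the online code pays on top of that.
online_bits = online_codelength_bits(train_probe, data_codelength_bits, xs, ys, n_classes)
model_bits = online_bits - data_bits
```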


Intuition

Online code measures the ability to learn from different amounts of data, and it is related to the area under the learning curve, which plots quality as a function of the number of training examples.

Intuitively, if the regularity in the data is strong, it can be revealed using a small subset of the data, i.e., early in the transmission process, and can be exploited to efficiently transmit the rest of the dataset.


Description Length and Control Tasks



Compression methods agree in results:

  • for the linguistic task, the best layer is the first;
  • for the control task, codes become larger as we move up from the embedding layer (this is expected since the control task measures the ability to remember word types);
  • codelengths for the control task are substantially larger than for the linguistic task.

Layer 0: MDL is correct, accuracy is not.

What is even more surprising, codelength identifies the control task even when accuracy indicates the opposite: for layer 0, accuracy for the control task is higher, but the codelength is twice as long as for the linguistic task. This is because codelength characterizes how hard it is to achieve this accuracy: for the control task, accuracy is higher, but the cost of achieving this score is very high. We will illustrate this a bit later when looking at the model component of the code.


Embedding vs contextual: drastic difference.

For the linguistic task, note that the codelength for the embedding layer is approximately twice as large as that for the first layer. Later, when looking at the random model baseline, we will see the same trends for several other tasks and will show that even contextualized representations obtained with a randomly initialized model are a lot better than the embedding layer alone.


Model: small for linguistic, large for control.

Let us now look at the data and model components of the code separately (as defined in formulas (3) and (5) for the variational and online codes, respectively).

The trends for the two compression methods are similar: for the control task, model size is several times larger than for the linguistic task. This is something that probe accuracy alone is not able to reflect: while representations have structure with respect to the linguistic task labels, the same representations do not have structure with respect to random labels.

For the variational code, this means that the linguistic task labels can be "explained" with a small model, while random labels can be learned only by a larger model.
For the online code, it shows that the linguistic task labels can be learned from a small amount of data, while random labels can be learned (memorized) well only using a large dataset.
For the online code, we can also recall that it is related to the area under the learning curve, which plots quality as a function of the number of training examples. The figure shows the learning curves corresponding to the online code; we can clearly see the difference between the behavior of the linguistic and control tasks.



Architecture: sparse for linguistic, dense for control.

Let us now recall that Bayesian compression gives us not only variational codelength, but also the induced probe architecture.

Probes learned for linguistic tasks have small architectures with only 33-75 neurons at the second and third layers of the probe. In contrast, models learned for control tasks are quite dense.

This relates to the control tasks paper, which proposed using random label baselines. The authors considered several predefined probe architectures and picked one of them based on a manually defined criterion. In contrast, the variational code gives the probe architecture as a byproduct of training and does not require human guidance.


MDL results are stable, accuracy is not


Hyperparameters: change results for accuracy but not for MDL.

While here we discussed in detail the results for the default settings, we also tried 10 different settings and saw that accuracy varies greatly across them. For example, for layer 0 there are settings with contradictory results: accuracy can be higher either for the linguistic or for the control task, depending on the settings (look at the figure). In striking contrast to accuracy, MDL results are stable across settings; thus, MDL does not require a search for probe settings.


Random Seed: affects accuracy but not MDL.

The figure shows results for 5 random seeds (linguistic task). We see that using accuracy can lead to different rankings of layers depending on the random seed, making it hard to draw conclusions about their relative qualities. For example, the accuracies for layer 1 and layer 2 are 97.48 and 97.31 for seed 1, but 97.38 and 97.48 for seed 0.
On the contrary, the MDL results are stable, and the scores given to different layers are well separated.


Description Length and Random Models


In the paper, we also consider another type of random baseline: randomly initialized models. Following previous work, we experiment with ELMo and compare it with a version of the ELMo model in which all weights above the lexical layer (layer 0) are replaced with random orthonormal matrices (while the embedding layer, layer 0, is retained from the trained ELMo).

We experiment with 7 edge probing tasks (5 of them are listed above). Here we will only briefly mention our findings: if you are interested in more details, look at the paper.


We find that:

  • Layer 0 vs contextual: even random contextual representations are better. As we already saw in the previous part, codelength shows a drastic difference between the embedding layer (layer 0) and contextualized representations: codelengths differ roughly by a factor of two for most tasks. Both compression methods show that even for the randomly initialized model, contextualized representations are better than lexical representations.
  • Codelengths for the randomly initialized model are larger than for the trained one. This is more prominent when not just looking at the bare scores, but comparing compression against context-agnostic representations. For all tasks, compression bounds for the randomly initialized model are closer to those of the context-agnostic layer 0 than to those of the trained model. This shows that the gain from using context for the randomly initialized model is at most half of the gain for the trained model.
  • Randomly initialized layers do not evolve. For all tasks, MDL (and the data/model components of the code separately) is identical across the layers of the randomly initialized model. For the trained model, this is not the case: layer 2 is worse than layer 1 for all tasks. This is one more illustration of the general process explained in our Evolution of Representations post: the way representations evolve between layers is defined by the training objective. For the randomly initialized model, since no training objective has been optimized, no evolution happens.

Conclusions

We propose information-theoretic probing, which measures the minimum description length (MDL) of labels given representations. We show that MDL naturally characterizes not only probe quality, but also the amount of effort needed to achieve it (or, intuitively, the strength of the regularity in the representations with respect to the labels); this is done in a theoretically justified way, without a manual search for settings. We explain how to easily measure MDL on top of standard probe-training pipelines. We show that the results of MDL probing are more informative and stable than those of standard probes.

Want to know more? Read the paper.
