A Tutorial On Multi-View Information Bottleneck

In this tutorial we will learn about the information bottleneck (IB) and the setting in which it is defined. I will then introduce a new setting, “Multi-View Unsupervised Learning”, and we will see how IB can be extended to it. This tutorial is based on Learning Robust Representations via Multi-View Information Bottleneck.

Information Bottleneck

Consider the standard supervised learning setting: we have an input RV $x$ and a target RV $y$. We are given a dataset of samples from their joint distribution $p(x, y)$ and are tasked to find a model for predicting $y$ from $x$. What properties of $x$ and $y$ affect the performance of our model? Answer: $I(x;y)$ is a critical value in this setting. The higher it is, the better we can expect to perform, and in the limit of $I(x;y) = H(y)$ there exists a function from $x$ to $y$ that predicts $y$ with no error.

But should $I(x;y)$ be the only measure of the difficulty of a task? Well, no. Consider two supervised tasks with identical $I(x;y)$; in practice we are likely to perform better on the task where the relation between $x$ and $y$ is simpler. Why? Because we can look at the problem as finding a point (the true conditional distribution $p(y|x)$) in a space of probability distributions (the probability simplex). Simpler relations require a smaller search space, so we need less data to find these points and we are more confident in what we find (bias-variance trade-off).

Having established that, how can we measure the simplicity of the relation between $x$ and $y$? Answer: one proxy is $H(x)$, because (assuming $y$ is binary) the effective number of possible relations between $x$ and $y$ is about $2^{2^{H(x)}}$: $x$ takes roughly $2^{H(x)}$ typical values and each of them can be assigned either label. Out of this insight comes the following: if we can turn the problem of predicting $y$ from $x$ into the problem of predicting $y$ from $z$ such that $H(z) < H(x)$ while still $I(z;y) = I(x;y)$, we have turned the original problem into a simpler one and are thus more likely to achieve better performance. This is “Information Bottleneck” in a nutshell. Now we reformulate this:

Information Bottleneck Setting

Given two RVs $x$ and $y$, find a representation of $x$ (called $z$) such that $z$ is minimally sufficient for $y$.

In this formulation we have condensed the two conditions listed above into “minimally sufficient”. We now elaborate on this:

Sufficiency

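Consider the Markov chain $y \to x \to z$; $z$ is sufficient for $y$ if $y \to z \to x$ is also a Markov chain.

By applying the Data Processing Inequality (DPI) to both Markov chains we get:

$$I(z;y) \le I(x;y) \quad \text{and} \quad I(x;y) \le I(z;y)$$

Therefore sufficiency leads to $I(z;y) = I(x;y)$, and thus sufficiency is what we meant by saying a good representation should keep $I(z;y) = I(x;y)$.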
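To make the sufficiency condition concrete, here is a minimal sketch that checks it on a toy problem. The task itself is an assumption made purely for illustration: the input is uniform on {0, 1, 2, 3}, the label is the parity of the input, and the representation keeps only the parity. The helper names (`entropy`, `marginal`, `mutual_information`) are likewise hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint dict {(a, b, ...): p} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), with A and B given as index tuples."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

# Toy task (assumption): x uniform on {0,1,2,3}, y = x mod 2, z = x mod 2.
# Each outcome of the joint distribution is a tuple (x, y, z).
joint = {(x, x % 2, x % 2): 0.25 for x in range(4)}
X, Y, Z = (0,), (1,), (2,)

i_xy = mutual_information(joint, X, Y)  # information the input has about the label
i_zy = mutual_information(joint, Z, Y)  # information the representation keeps
h_x = entropy(marginal(joint, X))       # complexity of the input
h_z = entropy(marginal(joint, Z))       # complexity of the representation

print(f"I(x;y)={i_xy:.3f}  I(z;y)={i_zy:.3f}  H(x)={h_x:.3f}  H(z)={h_z:.3f}")
```

In this toy case the representation keeps all the label information while having lower entropy than the input, which is exactly the situation the bottleneck idea aims for.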

More on Sufficiency

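Actually $I(z;y) = I(x;y)$ is equivalent to sufficiency. To see this, note that the Markov chain $y \to z \to x$ means $I(x;y|z) = 0$, and since $z$ is a representation of $x$ we assume $z$ is independent of any other random variable conditioned on observing $x$, so $I(y;z|x) = 0$. Then, by the chain rule of mutual information, we have:

$$I(x;y|z) = I(x;y) - I(z;y) + I(y;z|x) = I(x;y) - I(z;y)$$

Thus $I(x;y|z) = 0$ if and only if $I(z;y) = I(x;y)$.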

Minimal Sufficiency

A sufficient statistic ($z$) is a minimal sufficient statistic (MSS) if it is a function of every other sufficient statistic $z'$, or equivalently if $x \to z' \to z$ forms a Markov chain.

Again by applying DPI we get:

$$I(x;z) \le I(x;z')$$

So the minimal sufficient statistic is the one having the lowest $I(x;z)$ among all sufficient statistics. Since $I(x;z)$ is upper bounded by $H(z)$, this is what we actually meant by trying to get $z$ such that $H(z) < H(x)$.

A Note on Minimality

In our discussion so far we have motivated the pursuit of minimality by model complexity and data efficiency. We now give another argument, based on out-of-distribution generalization and robustness:

The more a representation is independent of $x$ (lower $I(x;z)$), the more robust it is to changes in $x$ and the more likely it is to generalize beyond the training distribution. We provide an intuition for this with an example:

Cats and Dogs 😺🐶

Consider a dog vs. cat classification task where we are given an input image $x$ and tasked to predict the label $y$, which corresponds to whether the input is an image of a dog or a cat. Additionally, assume that the training data has the following property:

Cats are more likely to be photographed in indoor environments, while dogs are more likely to be photographed in outdoor environments.

But this property is reversed in the test distribution. Now consider two models:

The first model uses both the background and the object shape to determine the label.
The second model relies only on the object shape.

Both models are going to perform well on the training data, but the second model is going to perform better on the test data. What separates the two? The second model's dependence on $x$ is lower than the first's.

Optimization

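Knowing more about minimal sufficient statistics, we can write the IB objective as:

$$\min_{z} \; I(x;z) \quad \text{s.t.} \quad I(z;y) = I(x;y)$$

Assuming a parametric family of conditional distributions $p_\theta(z|x)$, the Lagrangian of the above is:

$$\mathcal{L}(\theta;\lambda) = I_\theta(x;z) - \lambda \, I_\theta(z;y)$$

where $\lambda$ controls the trade-off between sufficiency and minimality. There are other ways to reformulate the above objective, and depending on the scenario, thinking about IB in those terms can be more useful.

Let's expand $I(z;x,y)$ with the chain rule in two ways:

$$I(z;x,y) = I(x;z) + I(z;y|x) = I(z;y) + I(x;z|y)$$

Since $z$ is a representation of $x$, $I(z;y|x) = 0$. Thus $I(x;z) = I(x;z|y) + I(z;y)$, and putting this into the IB Lagrangian we get:

$$\mathcal{L}(\theta;\lambda) = I_\theta(x;z|y) - (\lambda - 1) \, I_\theta(z;y)$$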
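
Assume $z_1$ and $z_2$ are both MSS. By the definition of MSS, $z_1$ must be a function of $z_2$ and vice versa, hence $I(x;z_1) = I(x;z_2)$. Fixing $I(x;z)$, minimizing $I(x;z|y)$ is equivalent to maximizing $I(z;y)$. Additionally, following Proposition 2.1 of Achille17, $x$ can be written as $x = f(y, n)$ where $n$ is independent of $y$. So for $z$ to be a function of $y$ means that $z$ cannot depend on $n$, and so $I(x;z|y) = 0$. Now we can write another objective for optimizing a parametric representation based on IB:

$$\max_{\theta} \; I_\theta(z;y) \quad \text{s.t.} \quad I_\theta(x;z|y) = 0$$

and its Lagrangian is:

$$\mathcal{L}(\theta;\lambda) = I_\theta(x;z|y) - \lambda \, I_\theta(z;y)$$

We can view $I(z;y)$ as the “informative” part of $I(x;z)$ and $I(x;z|y)$ as the “superfluous” part; thus IB tries to increase the dependence of $z$ on $y$ and decrease its dependence on $x$.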
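The split of the information in a representation into an informative and a superfluous part can be checked numerically. The sketch below does so for a small discrete problem; the joint distribution `p_xy`, the stochastic encoder `p_z_given_x`, and the helper names are all made up for illustration. The only structural assumption is the one used throughout this section: the representation is computed from the input alone.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint dict {(a, b, ...): p} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def cond_mi(joint, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); c=() gives plain I(A;B)."""
    def h(idx):
        return entropy(marginal(joint, idx))
    return h(a + c) + h(b + c) - h(a + b + c) - h(c)

# Hypothetical joint distribution over (x, y) and stochastic encoder p(z|x).
p_xy = {(0, 0): 0.30, (0, 1): 0.10,
        (1, 0): 0.10, (1, 1): 0.20,
        (2, 0): 0.05, (2, 1): 0.25}
p_z_given_x = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.5, 1: 0.5},
               2: {0: 0.2, 1: 0.8}}

# z is sampled from x only, so the joint factorizes as p(x, y) * p(z | x).
joint = {(x, y, z): p * q
         for (x, y), p in p_xy.items()
         for z, q in p_z_given_x[x].items()}
X, Y, Z = (0,), (1,), (2,)

i_xy = cond_mi(joint, X, Y)             # label information in the input
i_xz = cond_mi(joint, X, Z)             # total information kept by z
i_zy = cond_mi(joint, Z, Y)             # informative part
i_xz_given_y = cond_mi(joint, X, Z, Y)  # superfluous part

print(f"I(x;z)={i_xz:.4f}  I(x;z|y)+I(z;y)={i_xz_given_y + i_zy:.4f}")
```

The two printed numbers agree up to floating-point error for any choice of `p_xy` and `p_z_given_x`, since the identity only relies on the representation depending on the label through the input.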

Multi View Setting

In IB a principal assumption is that we have a dataset of labeled inputs; however, labeling large amounts of data is costly and time-consuming, so in practice we rarely have large labeled datasets. Instead, what we have are large unlabeled datasets. Two scenarios are of interest to us in such settings:

multi-view: our observations are comprised of multiple views. For example, we might be interested in classifying a 3D object based on a 2D image of it, and we are provided with multiple images of each object taken from different angles. Or we want to tag a social media post and we are provided with its image and the author's short description. The underlying assumption in this scenario is that the down-stream tasks we are interested in are invariant to the views. To make this more formal, we define $x$ to be the original object of interest, $v$ to be a view of $x$ and $y$ a property of interest. Then: $I(v;y) = I(x;y)$.
self-supervised: we do not have access to multiple views of $x$, but we are aware of some invariances of the label w.r.t. the input in the form of a set of transformations $\mathcal{T}$, meaning that we can create our views by applying transformations from $\mathcal{T}$ to our observations.

Multi-View IB

A natural question now is: how can one extend the IB framework to the multi-view setting?

Method

Building Intuition

IB tries to find a representation $z$ that, while faithful (sufficient), maximally discards the information present in $x$ that is not helpful for predicting $y$ (superfluous). So it needs to know $y$ to answer “is this information predictive of $y$?”. But in unsupervised settings we do not have $y$. Instead, what we have are some invariance assumptions about $y$; in essence these invariances indicate that some parts of $x$ can be discarded for the prediction of $y$. Which parts? The parts that vary across views. In what follows we build upon this intuition.

We define redundancy to formalize the assumption that all views have the same predictive information regarding $y$:

Redundancy

We say $v_1$ is redundant with respect to $v_2$ for $y$ if $I(y;v_1|v_2) = 0$. We also say $v_1$ and $v_2$ are mutually redundant if $I(y;v_1|v_2) = I(y;v_2|v_1) = 0$.

The following Corollary formalizes our intuition that the variant parts of views can be discarded for the prediction of $y$:

Corollary

Let $v_1$ and $v_2$ be two mutually redundant views for a target $y$ and let $z_1$ be a representation of $v_1$. If $z_1$ is sufficient for $v_2$, then $z_1$ is as predictive for $y$ as the joint observation of the two views.

Based on this Corollary we can now extend IB to the multi-view setting: a good representation in this setting is the representation found by IB in the supervised task of predicting $v_2$ from $v_1$.
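The Corollary can be illustrated on a small synthetic example. In the sketch below (a constructed toy case, not taken from the paper) the label is a fair bit, each view consists of the label plus its own independent noise bit, and the representation of the first view keeps only the label bit. All names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint dict {(a, b, ...): p} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def cond_mi(joint, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); c=() gives plain I(A;B)."""
    def h(idx):
        return entropy(marginal(joint, idx))
    return h(a + c) + h(b + c) - h(a + b + c) - h(c)

# Toy construction (assumption): label y and two independent noise bits.
# View 1 = (y, n1), view 2 = (y, n2), representation of view 1 = y.
joint = {}
for y in (0, 1):
    for n1 in (0, 1):
        for n2 in (0, 1):
            v1, v2, z1 = (y, n1), (y, n2), y
            joint[(y, v1, v2, z1)] = 1 / 8
Y, V1, V2, Z1 = (0,), (1,), (2,), (3,)

redundancy_12 = cond_mi(joint, Y, V1, V2)  # I(y; v1 | v2)
redundancy_21 = cond_mi(joint, Y, V2, V1)  # I(y; v2 | v1)
sufficiency = cond_mi(joint, V1, V2, Z1)   # I(v1; v2 | z1)
i_y_z1 = cond_mi(joint, Y, Z1)             # I(y; z1)
i_y_views = cond_mi(joint, Y, V1 + V2)     # I(y; v1, v2)

print(redundancy_12, redundancy_21, sufficiency, i_y_z1, i_y_views)
```

Both redundancy terms and the sufficiency term vanish, and the representation carries as much label information as the two views together, even though the label was never used to build it.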

How should views be?

The more the two views differ, the more informative they are about which information can be discarded, and hence the more we can lower $I(z_1;v_1)$. At the extreme, the only information shared between the views is the label information, and in that case our method reduces to traditional IB.

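Now we are in a position to define the optimization objective of Multi-View IB. Based on the above ideas, we need to focus on $z_1$ and $z_2$ that are MSS for predicting $v_2$ and $v_1$ respectively. For $z_1$ we have:

$$\mathcal{L}_1(\theta;\lambda_1) = I_\theta(z_1;v_1|v_2) - \lambda_1 \, I_\theta(v_2;z_1)$$

where $z_1 \sim p_\theta(z_1|v_1)$. Symmetrically we have:

$$\mathcal{L}_2(\psi;\lambda_2) = I_\psi(z_2;v_2|v_1) - \lambda_2 \, I_\psi(v_1;z_2)$$

Averaging the two losses:

$$\frac{\mathcal{L}_1 + \mathcal{L}_2}{2} = \frac{I_\theta(z_1;v_1|v_2) + I_\psi(z_2;v_2|v_1)}{2} - \frac{\lambda_1 I_\theta(v_2;z_1) + \lambda_2 I_\psi(v_1;z_2)}{2}$$

After re-parametrizing the Lagrange multipliers into a single coefficient $\beta$, this can be upper bounded by the following loss:

$$\mathcal{L}_{MIB}(\theta,\psi;\beta) = -I_{\theta\psi}(z_1;z_2) + \beta \, D_{SKL}\big(p_\theta(z_1|v_1) \,\|\, p_\psi(z_2|v_2)\big)$$

where $D_{SKL}$ is the symmetric KL divergence, defined as the average of $D_{KL}\big(p_\theta(z_1|v_1) \,\|\, p_\psi(z_2|v_2)\big)$ and $D_{KL}\big(p_\psi(z_2|v_2) \,\|\, p_\theta(z_1|v_1)\big)$.

Intuitively, this loss encourages $z_1$ and $z_2$ to have high mutual information, and thus to retain the redundancy present in $v_1$ and $v_2$, while having identical conditional distributions. The latter means that predicting which view generated a given $z$ can be done no better than chance, and thus $z_1$ and $z_2$ have discarded any view-specific information.

The symmetric KL can be computed directly when $p_\theta(z_1|v_1)$ and $p_\psi(z_2|v_2)$ have known densities, and $I_{\theta\psi}(z_1;z_2)$ can be maximized using any sample-based differentiable MI lower bound like the Jensen-Shannon lower bound or InfoNCE.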

Below we present the optimization algorithm corresponding to this objective, accompanied by a schematic describing the algorithm visually:
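As a rough illustration of how the two terms of the loss can be evaluated on a batch, here is a minimal sketch. It rests on several assumptions that are not fixed by the text: both encoders output diagonal Gaussians (so the symmetric KL has a closed form), the mutual information term is estimated with InfoNCE, and the critic is the negative squared Euclidean distance. The function names are hypothetical, and plain Python is used only to show the arithmetic; a real implementation would write the same expressions in an automatic differentiation framework.

```python
import math
import random

def kl_diag_gaussians(mu_p, sd_p, mu_q, sd_q):
    """KL(N(mu_p, sd_p^2) || N(mu_q, sd_q^2)) for diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sd_p, mu_q, sd_q)
    )

def symmetric_kl(mu1, sd1, mu2, sd2):
    """Average of the two KL directions between the encoder posteriors."""
    return 0.5 * (kl_diag_gaussians(mu1, sd1, mu2, sd2)
                  + kl_diag_gaussians(mu2, sd2, mu1, sd1))

def infonce_lower_bound(z1, z2):
    """InfoNCE estimate of I(z1;z2) from paired samples (z1[i], z2[i]).

    Critic (assumption): negative squared Euclidean distance."""
    n = len(z1)
    total = 0.0
    for i in range(n):
        scores = [-sum((a - b) ** 2 for a, b in zip(z1[i], z2[j]))
                  for j in range(n)]
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_norm
    return total / n + math.log(n)

def mib_loss(mu1, sd1, mu2, sd2, z1, z2, beta):
    """Negative MI estimate plus beta times the batch-averaged symmetric KL."""
    skl = sum(symmetric_kl(a, b, c, d)
              for a, b, c, d in zip(mu1, sd1, mu2, sd2)) / len(mu1)
    return -infonce_lower_bound(z1, z2) + beta * skl

def sample(mu, sd, rng):
    """Reparameterized sample z = mu + sd * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sd)]

# Made-up encoder outputs for a batch of 4 paired views, 2 latent dimensions.
rng = random.Random(0)
mu1 = [[0.0, 0.0], [2.0, 2.0], [-2.0, 1.0], [1.0, -2.0]]
mu2 = [[0.1, -0.1], [1.9, 2.1], [-2.1, 0.9], [1.2, -1.8]]
sd1 = [[0.1, 0.1] for _ in mu1]
sd2 = [[0.2, 0.2] for _ in mu2]
z1 = [sample(m, s, rng) for m, s in zip(mu1, sd1)]
z2 = [sample(m, s, rng) for m, s in zip(mu2, sd2)]

loss = mib_loss(mu1, sd1, mu2, sd2, z1, z2, beta=1.0)
print(loss)
```

Note that the InfoNCE estimate can never exceed the logarithm of the batch size, so larger batches are needed when the two representations share a lot of information.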

A Note on Single View Setting:

What can we do when we do not have access to multiple views? A simple solution is to come up with a set of transformations on the data that we believe down-stream tasks would be invariant to (e.g. rotation of the input image in the dog/cat classification example). Then we create the required multi-view dataset as follows:

Sample transformations $t_1$ and $t_2$ randomly and independently from the set of transformations $\mathcal{T}$.

Set $v_1 = t_1(x)$ and $v_2 = t_2(x)$. If our assumption about these transformations is true:

$$I(v_1;y) = I(v_2;y) = I(x;y)$$

and thus these two views are mutually redundant with respect to $y$.
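The two steps above can be sketched as follows. The transformation set used here (identity and cyclic shifts of a tiny image stored as a list of rows) is hypothetical and chosen only to keep the example short; in practice the set would encode whatever invariances one believes the down-stream task has.

```python
import random

def identity(image):
    """Return an unchanged copy of the image."""
    return [row[:] for row in image]

def shift_right(image):
    """Cyclically shift every row one pixel to the right."""
    return [row[-1:] + row[:-1] for row in image]

def shift_down(image):
    """Cyclically shift the rows one position down."""
    return [row[:] for row in image[-1:] + image[:-1]]

# Hypothetical transformation set; an assumption made for illustration.
TRANSFORMATIONS = [identity, shift_right, shift_down]

def make_views(x, rng):
    """Sample two transformations independently and apply them to the same x."""
    t1 = rng.choice(TRANSFORMATIONS)
    t2 = rng.choice(TRANSFORMATIONS)
    return t1(x), t2(x)

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
v1, v2 = make_views(x, random.Random(0))
print(v1)
print(v2)
```

Both views are generated from the same underlying observation and no label is touched, so a dataset of such pairs can be built from unlabeled data alone.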

Related Work

We employ the information plane to create a holistic view of this method and its relation to other representation learning methods in the multi-view setting. The information plane is a 2D figure whose horizontal and vertical axes represent $I(x;z)$ and $I(z;y)$ respectively. Based on IB, a good representation should be in the top left corner of the plane. We can view the proposed method and previous methods as points and curves in this plane:

Supervised IB: given the label $y$, optimizing the IB objective leads us to the top left corner of the plane. This representation has the best predictive performance and is the most generalizable.
Infomax: according to this method, good representations should have maximal mutual information with the input (rightmost corner). $z = x$ trivially satisfies this objective; to prevent this, different methods employ different architectural constraints on $z$.
MV-Infomax: these methods are extensions of the Infomax idea and they try to maximize $I(z_1;z_2)$, which can be proved to be a lower bound on $I(z_1;v_2)$. By maximizing $I(z_1;v_2)$ they maintain the sufficiency property but do not put any restrictions on $I(z_1;v_1|v_2)$, which leads to representations with different values of $I(x;z)$, ranging from MIB to Infomax.
$\beta$-VAE: this method finds a latent representation $z$ that balances compression and informativeness through a hyperparameter $\beta$. The main disadvantage of this method is that informativeness is measured by $I(x;z)$. When we increase $\beta$, $z$ becomes more compressed while trying to maintain $I(x;z)$, but there is no explicit preference as to which information about $x$ should be discarded to accomplish the compression of $z$; this preference is implicit in the choice of architecture. This means architectural choices can be made such that compression maintains sufficiency for $y$ (higher curves in the figure) or, on the contrary, such that compression predominantly discards information pertinent to the prediction of $y$ (lower curves in the figure).
MIB: our work has the advantage of discarding information like $\beta$-VAE, but explicitly forces $z$ to maintain sufficiency like the Infomax objectives. This leads to discarding only the information that is irrelevant for predicting $y$.

Experiments (Multi View)

Sketch-Based Image Retrieval

Dataset. The Sketchy dataset consists of 12,500 images and 75,471 hand-drawn sketches of objects from 125 classes. We also include another 60,502 images from ImageNet from the same classes, which results in a total of 73,002 natural object images.

Setup. The sketch-based image retrieval task is a ranking of natural images according to the query sketch. Retrieval is done for our model by generating representations for the query sketch as well as all natural images, and ranking the images by the Euclidean distance of their representation from the sketch representation. The baselines use various domain-specific ranking methodologies. Model performance is computed based on the class of the ranked pictures corresponding to the query sketch. The training set consists of pairs of image and sketch randomly selected from the same class, to ensure that both views contain equivalent label information (mutual redundancy).

We use features extracted from images and sketches by a VGG architecture trained for classification on the TU-Berlin dataset. The resulting flattened 4096-dimensional feature vectors are fed to our image and sketch encoders to produce a 64-dimensional representation.
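The ranking step of the retrieval setup can be sketched in a few lines. The function names and the tiny 2-dimensional representations below are made up for illustration; in the experiment the representations are 64-dimensional encoder outputs.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def rank_images(sketch_repr, image_reprs):
    """Return image indices ordered from nearest to farthest from the sketch."""
    return sorted(range(len(image_reprs)),
                  key=lambda i: euclidean(sketch_repr, image_reprs[i]))

# Hypothetical representations: one query sketch and three candidate images.
query = [0.0, 0.0]
images = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
print(rank_images(query, images))  # [1, 2, 0]
```

The distances here are 5, 1 and 2, so the second image is retrieved first. Performance is then scored by checking the classes of the top-ranked images against the class of the query sketch.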

MIR-Flickr

Dataset. The MIR-Flickr dataset consists of 1M images annotated with 800K distinct user tags. Each image is represented by a vector of 3,857 hand-crafted image features ($v_1$), while the 2,000 most frequent tags are used to produce a 2000-dimensional multi-hot encoding ($v_2$) for each picture.

Setup. We train our model on the unlabeled pairs of images and tags. Then a multi-label logistic classifier is trained from the representation of 10K labeled train images to the corresponding macro-categories. The quality of the representation is assessed based on the performance of the trained logistic classifier on the labeled test set.

Results. Our MIB model is compared with other popular multi-view learning models in the above Figure. Although the tuned MIB performs similarly to Multi-View InfoMax with a large number of labels, it outperforms it when fewer labels are available. Furthermore, by choosing a larger $\beta$ the accuracy of our model drastically increases in scarce label regimes, while slightly reducing the accuracy when all the labels are observed (see right side of Figure). This effect is likely due to a violation of the mutual redundancy constraint.

A possible reason for the effectiveness of MIB against some of the other baselines may be its ability to use mutual information estimators that do not require reconstruction. Both Multi-View VAE (MVAE) and Deep Variational CCA (VCCA) rely on a reconstruction term to capture cross-modal information, which can introduce bias that decreases performance.

Experiments (Single View)

In this section, we compare the performance of different unsupervised learning models by measuring their data efficiency and empirically estimating the coordinates of their representation on the Information Plane.

Dataset. The dataset is generated from MNIST by creating the two views, $v_1$ and $v_2$, via the application of data augmentation consisting of small affine transformations and independent pixel corruption to each image. These are kept small enough to ensure that label information is not affected. Each pair of views is generated from the same underlying image, so no label information is used in this process.

Setup. To evaluate, we train the encoders using the unlabeled multi-view dataset just described, and then fix the representation model. A logistic regression model is trained using the resulting representations along with a subset of labels for the training set, and we report the accuracy of this model on a disjoint test set, as is standard in the unsupervised representation learning literature. We estimate $I(x;z)$ and $I(y;z)$ using mutual information estimation networks trained from scratch on the final representations using batches of joint samples $\{(x^{(i)}, y^{(i)}, z^{(i)})\}_{i=1}^{B} \sim p(x,y)\,p_\theta(z|x)$.

Results. The empirical measurements of mutual information reported on the Information Plane are consistent with the theoretical analysis reported in the Related Work section: models that retain less information about the data while maintaining the maximal amount of predictive information result in better classification performance in low-label regimes, confirming the hypothesis that discarding irrelevant information yields robustness and more data-efficient representations. Notably, the MIB model with $\beta = 1$ retains almost exclusively label information, hardly decreasing the classification performance when only one label is used for each data point.

Future Work

MIB is only applicable to two-view settings, so a prominent direction for future work is to extend this idea to multi-view settings with no limit on the number of views.
An important drawback of the Information Bottleneck is that it is invariant to one-to-one transformations of the representation, and thus does not put any constraints on the distribution of the learned representation or on whether it should be disentangled. Other forms of representation learning, like VAEs and Normalizing Flows, explicitly regularize representations to conform to a simple prior distribution. I think a promising area of research is combining IB ideas with VAEs in the multi-view setting.
