Descriptor Vector Exchange

This repo provides code for learning dense landmarks without supervision. Our approach is described in the ICCV 2019 paper "Unsupervised learning of landmarks by exchanging descriptor vectors".

Figure: DVE diagram

High level Overview: The goal of this work is to learn a dense embedding Φ_u(x) ∈ R^C of image pixels without annotation. Our starting point was the Dense Equivariant Labelling approach of [3] (references follow at the end of the README), which tackles the same problem but is restricted to learning low-dimensional embeddings to achieve the key objective of generalisation across different identities. The key focus of Descriptor Vector Exchange (DVE) is to address this dimensionality issue, enabling the learning of more powerful, higher dimensional embeddings while still preserving their generalisation ability. To do so, we take inspiration from methods which enforce transitive/cyclic consistency constraints [4, 5, 6].

The embedding is learned from pairs of images (x, x′) related by a known warp v = g(u). In the image above, on the left we show the approach used by [3], which directly matches the embedding Φ_u(x) from the left image to the embeddings Φ_v(x′) in the right image to generate a loss. On the right, DVE replaces Φ_u(x) with its reconstruction Φ̂_u(x|x_α), obtained from the embeddings in a third auxiliary image x_α (the correspondence with x_α does not need to be known). This mechanism encourages the embeddings to act consistently across different instances, even when the dimensionality is increased (see the paper for more details).
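The exchange step can be illustrated with a small, dependency-free sketch. It assumes the reconstruction is a softmax-weighted average of the auxiliary descriptors, with weights given by inner products, and that the loss is the expected pixel distance to the warped location v = g(u). The function names and the `temperature` parameter are hypothetical and are not taken from the codebase; the paper gives the exact formulation.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length descriptor vectors."""
    return sum(x * y for x, y in zip(a, b))

def exchange_descriptor(query, aux_descriptors, temperature=1.0):
    """Rebuild `query` as a softmax-weighted average of auxiliary descriptors."""
    weights = softmax([dot(query, d) for d in aux_descriptors], temperature)
    return [
        sum(w * d[c] for w, d in zip(weights, aux_descriptors))
        for c in range(len(query))
    ]

def matching_loss(descriptor, target_descriptors, target_pixels, true_pixel,
                  temperature=1.0):
    """Expected pixel distance between soft matches in x' and v = g(u)."""
    probs = softmax([dot(descriptor, d) for d in target_descriptors], temperature)
    return sum(
        p * math.hypot(pix[0] - true_pixel[0], pix[1] - true_pixel[1])
        for p, pix in zip(probs, target_pixels)
    )

# Toy example: one query pixel, two auxiliary pixels, two target pixels.
phi_u = [1.0, 0.0]                       # descriptor of pixel u in x
aux = [[0.9, 0.1], [0.0, 1.0]]           # descriptors from the auxiliary image
phi_hat = exchange_descriptor(phi_u, aux, temperature=0.1)
loss = matching_loss(phi_hat, [[1.0, 0.0], [0.0, 1.0]], [(3, 4), (10, 10)],
                     true_pixel=(3, 4), temperature=0.1)
print(phi_hat, loss)
```

Because the loss only ever sees the reconstructed descriptor, a descriptor that cannot be expressed through another image's embeddings is penalised, which is the intuition behind the improved generalisation across identities.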

Requirements: The code assumes PyTorch 1.1 and Python 3.6/3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.
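A quick way to compare a local setup against these requirements is sketched below. It only reports the interpreter version and whether a `torch` package can be found; the helper name `check_environment` is hypothetical and it does not verify the PyTorch version itself.

```python
import importlib.util
import sys

def check_environment(tested_python=((3, 6), (3, 7))):
    """Report the Python version and whether PyTorch appears to be installed."""
    version = tuple(sys.version_info[:2])
    return {
        "python_version": "%d.%d" % version,
        # True only for the versions the README lists as tested.
        "python_tested": version in tested_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }

print(check_environment())
```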

Learned Embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1]. The training is performed with CelebA, a dataset of over 200k faces of celebrities that was originally described in this paper. We use this dataset to train our embedding function without annotations.

Each model is accompanied by training and evaluation logs and its mean pixel error performance on the task of matching annotated landmarks across the MAFL test set (described in more detail below). We use two architectures: the smallnet model of [3] and the more powerful hourglass model, inspired by its effectiveness in [7].

The goal of these initial experiments is to demonstrate that DVE allows models to generalise across identities even when using higher dimensional embeddings (e.g. 64d rather than 3d). By contrast, this does not occur when DVE is removed (see the ablation section below).

| Embed. Dim | Model | Same Identity | Different Identity | Params | Links |
| --- | --- | --- | --- | --- | --- |
| 3 | smallnet | 1.36 | 3.03 | 334.9k | config, model, log |
| 16 | smallnet | 1.28 | 2.79 | 338.2k | config, model, log |
| 32 | smallnet | 1.29 | 2.79 | 342.3k | config, model, log |
| 64 | smallnet | 1.28 | 2.77 | 350.6k | config, model, log |
| 64 | hourglass | 0.93 | 2.37 | 12.6M | config, model, log |

Notes: The error metrics for the hourglass model, which are included for completeness, are approximately (but are not exactly) comparable to the metrics for the smallnet due to very slight differences in the cropping ratios used by the two architectures (0.3 for smallnet, 0.294 for Hourglass).
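The matching error reported in the table above can be pictured roughly as follows. This is a minimal sketch that assumes nearest-neighbour matching under squared L2 distance; the evaluation code in the repo may use a different similarity. Embeddings are represented as hypothetical dicts from pixel coordinates to descriptors.

```python
import math

def nearest_pixel(query, embedding):
    """Return the pixel whose descriptor is closest (squared L2) to `query`."""
    def sq_dist(descriptor):
        return sum((a - b) ** 2 for a, b in zip(query, descriptor))
    return min(embedding, key=lambda pixel: sq_dist(embedding[pixel]))

def mean_pixel_error(landmarks_a, landmarks_b, embed_a, embed_b):
    """Average distance between matched and annotated landmarks in image B."""
    errors = []
    for source, truth in zip(landmarks_a, landmarks_b):
        predicted = nearest_pixel(embed_a[source], embed_b)
        errors.append(math.hypot(predicted[0] - truth[0], predicted[1] - truth[1]))
    return sum(errors) / len(errors)

# Toy embeddings: pixel (x, y) -> 2-d descriptor.
embed_a = {(0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0]}
embed_b = {(0, 0): [0.0, 1.0], (1, 0): [0.9, 0.1], (2, 0): [0.5, 0.5]}
print(mean_pixel_error([(0, 0), (1, 0)], [(1, 0), (2, 0)], embed_a, embed_b))
```

For the "Same Identity" column the second image would be a warped copy of the first; for "Different Identity" it would be a different face with its own annotated landmarks.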

Landmark Regression

Protocol Description: To transform the learned dense embeddings into landmark predictions, we use the same approach as [3]. For each target dataset, we freeze the dense embeddings and learn to peg onto them a collection of 50 "virtual" keypoints via a spatial softmax. These virtual keypoints are then used to regress the target keypoints of the dataset. We report the error as a percentage of inter-ocular distance (a metric defined by the landmarks of each dataset).
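The protocol above can be sketched in a few lines. The example assumes each virtual keypoint is the expected pixel coordinate under a spatial softmax of one score map, that a plain linear map stands in for the regressor, and that the inter-ocular distance is measured between two eye points supplied by the caller. All function names are hypothetical.

```python
import math

def spatial_softmax(heatmap):
    """Normalise a 2D score map (list of rows) into a probability map."""
    peak = max(max(row) for row in heatmap)
    exps = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def expected_keypoint(heatmap):
    """Return the (x, y) expectation of pixel coordinates under the softmax."""
    probs = spatial_softmax(heatmap)
    x = sum(p * col for row in probs for col, p in enumerate(row))
    y = sum(p * r for r, row in enumerate(probs) for p in row)
    return x, y

def regress_landmark(virtual_keypoints, weights, bias=(0.0, 0.0)):
    """Linear map from a set of virtual keypoints to one target landmark."""
    x = sum(w * k[0] for w, k in zip(weights, virtual_keypoints)) + bias[0]
    y = sum(w * k[1] for w, k in zip(weights, virtual_keypoints)) + bias[1]
    return x, y

def iod_error_percent(predicted, ground_truth, left_eye, right_eye):
    """Mean landmark error as a percentage of the inter-ocular distance."""
    iod = math.hypot(left_eye[0] - right_eye[0], left_eye[1] - right_eye[1])
    dists = [math.hypot(p[0] - g[0], p[1] - g[1])
             for p, g in zip(predicted, ground_truth)]
    return 100.0 * sum(dists) / (len(dists) * iod)

heatmap = [[0.0, 0.0, 0.0], [0.0, 50.0, 0.0], [0.0, 0.0, 0.0]]
keypoint = expected_keypoint(heatmap)            # close to (1, 1)
landmark = regress_landmark([keypoint, (3.0, 3.0)], [0.5, 0.5])
print(keypoint, landmark)
print(iod_error_percent([(0, 0), (10, 0)], [(3, 4), (10, 0)], (0, 0), (50, 0)))
```

In the actual pipeline the score maps come from a convolution over the frozen embeddings, and the weights of the final regression are learned from the annotated training set.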

MAFL landmark regression

MAFL is a dataset of 20k faces which includes landmark annotations. The dataset is partitioned into 19k training images and 1k testing images.

| Embed. Dim | Model | Error (%IOD) | Links |
| --- | --- | --- | --- |
| 3 | smallnet | 4.17 | config, model, log |
| 16 | smallnet | 3.97 | config, model, log |
| 32 | smallnet | 3.82 | config, model, log |
| 64 | smallnet | 3.42 | config, model, log |
| 64 | hourglass | 2.86 | config, model, log |

300-W landmark regression

The 300-W dataset contains 3,148 training images and 689 testing images with 68 facial landmark annotations for each face (with the split introduced in this CVPR 2014 paper). The dataset is described in this 2013 ICCV workshop paper.

| Embed. Dim | Model | Error (%IOD) | Links |
| --- | --- | --- | --- |
| 3 | smallnet | 7.66 | config, model, log |
| 16 | smallnet | 6.29 | config, model, log |
| 32 | smallnet | 6.13 | config, model, log |
| 64 | smallnet | 5.75 | config, model, log |
| 64 | hourglass | 4.65 | config, model, log |

AFLW landmark regression

The original AFLW contains around 25k images with up to 21 landmarks. For the purposes of evaluating five-landmark detectors, the authors of TCDCN introduced a test subset of almost 3K faces (for convenience, we include a mirror version of these images, but you can obtain the originals here).

There are two slightly different partitions of AFLW that have been used in prior work (we report numbers on both to allow for comparison). One is a set of recropped faces released by [7] (2991 test faces with 132 duplicates, 10122 train faces), which we call AFLWR. The second is the train/test partition of AFLW used in [2, 3], which took the existing crops from MTFL (2995 faces) for testing and 10122 AFLW faces for training; we call this split AFLWM.

Additionally, in the tables immediately below, each embedding is further fine-tuned on the AFLWR/AFLWM training sets (without annotations), as was done in [2], [3], [7], [8]. The rationale for this is that (i) it does not require any additional supervision; (ii) it allows the model to adjust for the differences in the face crops provided by the detector. To give an idea of how sensitive the method is to this step, we also report performance without finetuning in the ablation studies below.

AFLWR landmark regression

| Embed. Dim | Model | Error (%IOD) | Links |
| --- | --- | --- | --- |
| 3 | smallnet | 10.13 | config, model, log |
| 16 | smallnet | 8.40 | config, model, log |
| 32 | smallnet | 8.18 | config, model, log |
| 64 | smallnet | 7.79 | config, model, log |
| 64 | hourglass | 6.54 | config, model, log |

AFLWM landmark regression

AFLWM is a dataset of faces which also includes landmark annotations. We use the P = 5 landmark test split (10,122 training images and 2,991 test images). The dataset can be obtained here and is described in this 2011 ICCV workshop paper.

| Embed. Dim | Model | Error (%IOD) | Links |
| --- | --- | --- | --- |
| 3 | smallnet | 11.12 | config, model, log |
| 16 | smallnet | 9.15 | config, model, log |
| 32 | smallnet | 9.17 | config, model, log |
| 64 | smallnet | 8.60 | config, model, log |
| 64 | hourglass | 7.53 | config, model, log |

Ablation Studies

We can study the effect of the DVE method by removing it during training and assessing the resulting embeddings for landmark regression. The ablations are performed on the lighter SmallNet model.

| Embed. Dim | Model | DVE | Same Identity | Different Identity | Links |
| --- | --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 1.33 / 1.36 | 2.89 / 3.03 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 1.25 / 1.28 | 5.65 / 2.79 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 1.26 / 1.29 | 5.81 / 2.79 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 1.25 / 1.28 | 5.68 / 2.77 | (config, model, log) / (config, model, log) |

We see that without DVE, the learned embedding performs reasonably when the dimensionality is restricted to 3d. However, when we seek to learn higher dimensionality embeddings without DVE, they lose their ability to match across different identities. This inability to generalise at higher dimensions is similarly reflected when the embeddings are used to regress landmarks:

DVE Ablation: MAFL landmark regression

| Embed. Dim | Model | DVE | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 4.02 / 4.17 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 5.31 / 3.97 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 5.36 / 3.82 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 4.99 / 3.42 | (config, model, log) / (config, model, log) |

DVE Ablation: 300w landmark regression

| Embed. Dim | Model | DVE | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 8.23 / 7.66 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 10.66 / 6.29 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 10.33 / 6.13 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 9.33 / 5.75 | (config, model, log) / (config, model, log) |

DVE Ablation: AFLWM landmark regression

| Embed. Dim | Model | DVE | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 10.99 / 11.12 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 12.22 / 9.15 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 12.60 / 9.17 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 12.92 / 8.60 | (config, model, log) / (config, model, log) |

DVE Ablation: AFLWR landmark regression

| Embed. Dim | Model | DVE | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 10.14 / 10.13 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 10.73 / 8.40 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 11.05 / 8.18 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 11.43 / 7.79 | (config, model, log) / (config, model, log) |

Next we investigate how sensitive our approach is to finetuning on the target dataset (this is done for the AFLWR and AFLWM landmark regressions). We do two sets of experiments. First, we remove the finetuning for both AFLW dataset variants and re-evaluate on the landmark regression tasks. Second, we add a finetuning step for a different dataset, 300w, to see how the method is affected on a different benchmark. Note that all models for these experiments use DVE, and the finetuning consists of training the embeddings for an additional 50 epochs without annotations. We see that for the AFLW datasets, it makes a reasonable difference to performance. However, for 300w, particularly for stronger models, it adds little benefit (for this reason we do not use finetuning on 300w for the results reported in the paper).

Finetuning Ablation: AFLWM landmark regression

| Embed. Dim | Model | Finetune | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 11.82 / 11.12 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 10.22 / 9.15 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 9.80 / 9.17 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 9.28 / 8.60 | (config, model, log) / (config, model, log) |
| 64 | hourglass | ✖️ / ✔️ | 8.15 / 7.53 | (config, model, log) / (config, model, log) |

Finetuning Ablation: AFLWR landmark regression

| Embed. Dim | Model | Finetune | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 9.65 / 10.13 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 8.91 / 8.40 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 8.73 / 8.18 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 8.14 / 7.79 | (config, model, log) / (config, model, log) |
| 64 | hourglass | ✖️ / ✔️ | 6.88 / 6.54 | (config, model, log) / (config, model, log) |

Finetuning Ablation: 300w landmark regression

| Embed. Dim | Model | Finetune | Error (%IOD) | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | ✖️ / ✔️ | 7.66 / 7.20 | (config, model, log) / (config, model, log) |
| 16 | smallnet | ✖️ / ✔️ | 6.29 / 5.90 | (config, model, log) / (config, model, log) |
| 32 | smallnet | ✖️ / ✔️ | 6.13 / 5.75 | (config, model, log) / (config, model, log) |
| 64 | smallnet | ✖️ / ✔️ | 5.75 / 5.58 | (config, model, log) / (config, model, log) |
| 64 | hourglass | ✖️ / ✔️ | 4.65 / 4.65 | (config, model, log) / (config, model, log) |

To enable the finetuning experiments to be reproduced, the training logs for each of the three datasets are provided below, together with their performance on the matching task.

Finetuning on AFLWM

| Embed. Dim | Model | Same Identity | Different Identity | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | 5.99 | 7.16 | config, model, log |
| 16 | smallnet | 4.72 | 7.11 | config, model, log |
| 32 | smallnet | 6.42 | 8.71 | config, model, log |
| 64 | smallnet | 8.07 | 10.09 | config, model, log |
| 64 | hourglass | 1.53 | 3.65 | config, model, log |

Finetuning on AFLWR

| Embed. Dim | Model | Same Identity | Different Identity | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | 6.36 | 7.69 | config, model, log |
| 16 | smallnet | 6.34 | 8.62 | config, model, log |
| 32 | smallnet | 8.10 | 10.11 | config, model, log |
| 64 | smallnet | 4.08 | 5.21 | config, model, log |
| 64 | hourglass | 1.17 | 4.04 | config, model, log |

Finetuning on 300w

| Embed. Dim | Model | Same Identity | Different Identity | Links |
| --- | --- | --- | --- | --- |
| 3 | smallnet | 5.21 | 6.51 | config, model, log |
| 16 | smallnet | 5.55 | 7.30 | config, model, log |
| 32 | smallnet | 5.85 | 7.47 | config, model, log |
| 64 | smallnet | 6.58 | 8.19 | config, model, log |
| 64 | hourglass | 1.63 | 3.82 | config, model, log |

Annotation Ablation: AFLWM landmark regression with limited labels

We perform a final ablation to investigate how well the regressors are able to perform when their access to annotation is further reduced, and they are simply provided with a few images. The results, shown below, are reported as mean/std over three runs (because when there is only a single annotation, the performance is quite sensitive to which particular annotation is selected). Particularly for the stronger models, reasonable performance can be obtained with a small number of annotated images.

| Embed. Dim | Model | Num annos. | DVE | Error (%IOD) | Links |
| --- | --- | --- | --- | --- | --- |
| 3 | smallnet | 1 | ✖️ | 19.87 (+/- 3.10) | config, model, log |
| 3 | smallnet | 5 | ✖️ | 16.90 (+/- 1.04) | config, model, log |
| 3 | smallnet | 10 | ✖️ | 16.12 (+/- 1.07) | config, model, log |
| 3 | smallnet | 20 | ✖️ | 15.30 (+/- 0.59) | config, model, log |
| 64 | smallnet | 1 | ✔️ | 17.13 (+/- 1.78) | config, model, log |
| 64 | smallnet | 5 | ✔️ | 13.57 (+/- 2.08) | config, model, log |
| 64 | smallnet | 10 | ✔️ | 12.97 (+/- 2.36) | config, model, log |
| 64 | smallnet | 20 | ✔️ | 11.26 (+/- 0.93) | config, model, log |
| 64 | hourglass | 1 | ✔️ | 14.23 (+/- 1.54) | config, model, log |
| 64 | hourglass | 5 | ✔️ | 12.04 (+/- 2.03) | config, model, log |
| 64 | hourglass | 10 | ✔️ | 12.25 (+/- 2.42) | config, model, log |
| 64 | hourglass | 20 | ✔️ | 11.46 (+/- 0.83) | config, model, log |
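Each entry in the table above summarises three runs as a mean and a spread. A minimal sketch of that aggregation is given below; the per-run errors are made-up illustrative values, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics

def summarise_runs(errors):
    """Return (mean, sample standard deviation) of per-run errors."""
    return statistics.mean(errors), statistics.stdev(errors)

def format_result(errors):
    """Format per-run errors in the 'mean (+/- std)' style used in the table."""
    mean, std = summarise_runs(errors)
    return "%.2f (+/- %.2f)" % (mean, std)

# Hypothetical %IOD errors from three runs with different annotated images.
print(format_result([18.0, 20.0, 22.0]))
```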

Dataset mirrors

For each dataset used in the paper, we provide a preprocessed copy to allow the results described above to be reproduced directly. These can be downloaded and unpacked with a utility script, which will store them in the locations expected by the training code. Each dataset has a brief README, which also provides the citations for use with each dataset, together with a link from which it can be downloaded directly.

| Dataset | Details and links | Archive size | sha1sum |
| --- | --- | --- | --- |
| CelebA (+ MAFL) | README | 9.0 GiB | f6872ab0f2df8e5843abe99dc6d6100dd4fea29f |
| 300w | README | 3.0 GiB | 885b09159c61fa29998437747d589c65cfc4ccd3 |
| AFLWM | README | 252 MiB | 1ff31c07cef4f2777b416d896a65f6c17d8ae2ee |
| AFLWR | README | 1.1 GiB | 939fdce0e6262a14159832c71d4f84a9d516de5e |
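A downloaded archive can be checked against the sha1sum column before unpacking. The sketch below uses only the standard library; the helper names and the archive path in the usage comment are hypothetical, since the archive file names are not listed here.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_archive(path, expected_sha1):
    """Return True if the file at `path` matches the expected SHA-1 digest."""
    return sha1_of_file(path) == expected_sha1.strip().lower()

# Usage (hypothetical path; the digest is the AFLWM entry from the table):
# verify_archive("path/to/aflw-mtfl-archive",
#                "1ff31c07cef4f2777b416d896a65f6c17d8ae2ee")
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte CelebA archive.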

Additional Notes

In the codebase AFLW<sub>R</sub> is simply referred to as AFLW, while AFLW<sub>M</sub> is referred to as AFLW-MTFL. For 300w, we compute the inter-ocular distance according to the definition given by the dataset organizers here. Some of the logs are generated from existing logfiles that were created with a slightly older version of the codebase (these differences only affect the log format, rather than the training code itself - the log generator can be found here.)

Evaluating a pretrained embedding

Evaluating a pretrained model for a given dataset requires:

  1. The target dataset, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
  2. A config.json file.
  3. A checkpoint.pth file.

Evaluation is then performed with the following command:

python3 test_matching.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
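When scripting many evaluations, it can help to check the three prerequisites and assemble the command programmatically. The following is a sketch only: the helper names are hypothetical, and the flags are the ones shown in the command above.

```python
import shlex
from pathlib import Path

def missing_prerequisites(root, dataset_name, config_path, checkpoint_path):
    """Return the required paths (dataset dir, config, checkpoint) that do not exist."""
    required = [
        Path(root) / "data" / dataset_name,
        Path(config_path),
        Path(checkpoint_path),
    ]
    return [str(path) for path in required if not path.exists()]

def build_eval_command(config_path, checkpoint_path, device=None):
    """Build the test_matching.py argument list; no --device means CPU."""
    command = ["python3", "test_matching.py",
               "--config", str(config_path),
               "--resume", str(checkpoint_path)]
    if device is not None:
        command += ["--device", str(device)]
    return command

command = build_eval_command("config.json", "checkpoint.pth", device=0)
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.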

For example, to reproduce the smallnet-32d-dve results described above, run the following sequence of commands:

# fetch the mafl dataset (contained with celeba) 
python misc/sync_datasets.py --dataset celeba

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-02_06-19-59/checkpoint-epoch100.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"

# Evaluate the model
python3 test_matching.py --config configs/celeba/smallnet-32d-dve.json --resume ${MODEL} --device 0
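The download step in the shell example above could also be done from Python. This sketch mirrors the `mkdir` and `wget` commands using the standard library; the base URL is the one used above, while the function names are hypothetical.

```python
import os
import urllib.request

BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/DVE/"

def model_url(model_path):
    """Map a local model path (relative to the repo root) to its download URL."""
    return BASE_URL + model_path

def download_model(model_path):
    """Create the local directory and fetch the checkpoint into it."""
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    urllib.request.urlretrieve(model_url(model_path), model_path)
    return model_path

# Usage (performs a network request):
# download_model("data/models/celeba-smallnet-32d-dve/2019-08-02_06-19-59/checkpoint-epoch100.pth")
```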

Regressing landmarks

Learning a landmark regressor for a given pretrained embedding requires:

  1. The target dataset, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
  2. A config.json file.
  3. A checkpoint.pth file.

See the regressor code for details of how the regressor is implemented (it consists of a conv, then a spatial softmax, then a group conv).

Landmark learning is then performed with the following command:

python3 train.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to reproduce the smallnet-32d-dve landmark regression results described above, run the following sequence of commands:

# fetch the mafl dataset (contained with celeba) 
python misc/sync_datasets.py --dataset celeba

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/celeba-smallnet-32d-dve/2019-08-08_17-56-24/checkpoint-epoch100.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/DVE/${MODEL}"

# Evaluate the features by training a keypoint regressor 
python3 train.py --config configs/aflw-keypoints/celeba-smallnet-32d-dve.json --device 0

Learning new embeddings

Learning a new embedding requires:

  1. The dataset used for training, which should be located in <root>/data/<dataset-name> (this will be done automatically by the data fetching script, or can be done manually).
  2. A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a 16d-dve embedding on celeba, run the following sequence of commands:

# fetch the celeba dataset 
python misc/sync_datasets.py --dataset celeba

# Train the model
python3 train.py --config configs/celeba/smallnet-16d-dve.json --device 0

Dependencies

If you have enough disk space, the recommended approach to installing the dependencies for this project is to create a conda environment via the requirements/conda-freeze.txt:

conda env create -f requirements/conda-freeze.yml

Otherwise, if you'd prefer to take a leaner approach, you can either:

  1. pip/conda install each missing package each time you hit an ImportError
  2. manually inspect the slightly more readable requirements/pip-requirements.txt

Citation

If you find this code useful, please consider citing:

@inproceedings{Thewlis2019a,
  author    = {Thewlis, J. and Albanie, S. and Bilen, H. and Vedaldi, A.},
  booktitle = {International Conference on Computer Vision},
  title     = {Unsupervised learning of landmarks by exchanging descriptor vectors},
  date      = {2019},
}

Related useful codebases

Some other codebases you might like to check out if you are interested in self-supervised learning of object structure.

Acknowledgements

We would like to thank Almut Sophia Koepke for helpful discussions. The project structure uses the pytorch-template by @victoresque.

References

[1] James Thewlis, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of landmarks by exchanging descriptor vectors." ICCV 2019.

[2] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object landmarks by factorized spatial embeddings." ICCV 2017.

[3] James Thewlis, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object frames by dense equivariant image labelling." NeurIPS 2017.

[4] N. Sundaram, T. Brox, and K. Keutzer. "Dense point trajectories by GPU-accelerated large displacement optical flow." ECCV 2010.

[5] C. Zach, M. Klopschitz, and M. Pollefeys. "Disambiguating visual relations using loop constraints." CVPR 2010.

[6] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. "Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences." CVPR 2015.

[7] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. "Unsupervised discovery of object landmarks as structural representations." CVPR 2018.

[8] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. "Unsupervised learning of object landmarks through conditional image generation." NeurIPS 2018.

[9] Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. "Self-supervised learning of a facial attribute embedding from video." BMVC 2018.