A repository for render-and-compare machine learning pose estimation using a known CAD model, without the use of depth measurements. A neural network compares a real image with a rendering of the object under an initial pose estimate, and iteratively predicts pose updates until the rendering of the object matches the real image.
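The refinement loop can be summarized roughly as in the sketch below. The `renderer` and `network` callables and their signatures are hypothetical placeholders for illustration, not the actual interfaces in this repository.

```python
import torch

def refine_pose(real_image, T_CO_init, renderer, network, n_iters=3):
    """Iterative render-and-compare refinement (hypothetical interface).

    renderer(T_CO) is assumed to return a rendering of the CAD model under
    pose T_CO, and network(...) is assumed to predict a pose update delta_T
    such that T_CO_new = delta_T @ T_CO.
    """
    T_CO = T_CO_init
    for _ in range(n_iters):
        rendered_image = renderer(T_CO)                              # render current estimate
        net_input = torch.cat([real_image, rendered_image], dim=1)   # concatenate along channels
        delta_T = network(net_input, T_CO)                           # predicted pose update
        T_CO = delta_T @ T_CO                                        # apply update
    return T_CO
```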
Based on the work from DeepIM and CosyPose:
- DeepIM: https://arxiv.org/abs/1804.00175
- CosyPose: https://arxiv.org/abs/2008.08465, https://github.com/ylabbe/cosypose
Snippets of code are copied from the CosyPose GitHub repository. Copied functions contain an explicit comment stating the source.
- Install PyTorch with CUDA support from https://pytorch.org/get-started/locally/
- Install the remaining requirements with
pip install -r requirements.txt
- Create an image dataset from https://github.com/olaals/datasets-rgb-pose-estimation
- Create a symbolic link or copy the dataset to
- Create a config file in the configs directory
- An example config file is given in example_config.py (a rough sketch of such a file is shown after these setup steps)
- To train a model, run the following command
python train_model.py configs/example_config.py
- The training and validation loss can be tracked with TensorBoard using
tensorboard --logdir tensorboard
Additional visualizations are stored in logdir
- To test a trained model, run the following command
python test_model.py configs/example_config.py
The results are stored in logdir
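For orientation, the sketch below shows roughly what a config file could contain. Every field name here is hypothetical; consult configs/example_config.py for the actual options and their names.

```python
# Hypothetical sketch only -- the real options live in configs/example_config.py.
from types import SimpleNamespace

config = SimpleNamespace(
    dataset_dir="data/my-dataset",   # hypothetical: path or symlink to the generated dataset
    rotation_repr="6D",              # hypothetical: "6D" or "9D" rotation representation
    batch_size=32,                   # hypothetical training hyperparameters
    learning_rate=1e-4,
    num_epochs=100,
    logdir="logdir/example_run",     # hypothetical: where logs and visualizations are written
)
```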
The exact training process depends on the configuration set in the config files in the configs directory, but the overall pipeline is shown below
The general pipeline includes
- A renderer that produces two images of the same object, where the initial pose guess of the object is slightly off.
- These images are concatenated and used as the input to a convolutional neural network.
- The CNN estimates either a 6D or 9D representation of rotation, a pixel translation in the x and y directions, and a depth parameter vz.
- The output of the CNN is passed to a rotation representation function, which computes a valid rotation matrix.
- The pixel translation output of the CNN is converted to translation in Euclidean space.
- Together, the rotation matrix and translation form a transformation matrix delta_T, which updates the current estimate of T_CO with T_CO_new = delta_T*T_CO (a sketch of this update step is given after this list).
- A loss function loss(T_CO_new, T_CO_gt) computes a scalar that quantifies the deviation between T_CO_new and the ground truth T_CO_gt.
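A minimal sketch of the update step described above, assuming a 6D rotation representation mapped to a rotation matrix via Gram-Schmidt orthogonalization (the continuous representation of Zhou et al.) and a DeepIM-style mapping from (vx, vy, vz) to a metric translation. The function names and the exact (vx, vy, vz) convention are assumptions for illustration; the repository's own implementation is selected through the config.

```python
import torch
import torch.nn.functional as F

def rotation_matrix_from_6d(rot_6d):
    # Map a 6D rotation representation (B, 6) to a valid rotation matrix (B, 3, 3)
    # via Gram-Schmidt orthogonalization.
    a1, a2 = rot_6d[..., :3], rot_6d[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)

def apply_pose_update(T_CO, rot_6d, vxvyvz, K):
    # T_CO: (B, 4, 4) current pose estimate, rot_6d: (B, 6), vxvyvz: (B, 3),
    # K: (B, 3, 3) camera intrinsics.
    R_delta = rotation_matrix_from_6d(rot_6d)

    # Convert the pixel translation (vx, vy) and depth parameter vz to a metric
    # translation using the camera intrinsics. This convention is an assumption.
    vx, vy, vz = vxvyvz[:, 0], vxvyvz[:, 1], vxvyvz[:, 2]
    fx, fy = K[:, 0, 0], K[:, 1, 1]
    t_src = T_CO[:, :3, 3]
    z_new = t_src[:, 2] * torch.exp(vz)
    x_new = (vx / fx + t_src[:, 0] / t_src[:, 2]) * z_new
    y_new = (vy / fy + t_src[:, 1] / t_src[:, 2]) * z_new
    t_new = torch.stack([x_new, y_new, z_new], dim=-1)

    # Build delta_T so that T_CO_new = delta_T @ T_CO has rotation R_delta @ R_CO
    # and translation t_new.
    delta_T = torch.eye(4, dtype=T_CO.dtype, device=T_CO.device).repeat(T_CO.shape[0], 1, 1)
    delta_T[:, :3, :3] = R_delta
    delta_T[:, :3, 3] = t_new - torch.einsum('bij,bj->bi', R_delta, t_src)
    return delta_T @ T_CO
```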
The code in the repository uses shorthand notation for the transformation matrix describing the rotation and translation between frames. The image below shows the shorthand notations used, where T_CO is of particular importance.
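As a concrete example of the notation, and assuming the common convention that T_CO maps homogeneous points from the object frame O to the camera frame C:

```python
import numpy as np

# A point on the object, given in object coordinates (homogeneous).
p_O = np.array([0.1, 0.0, 0.0, 1.0])

# T_CO: pose of the object expressed in the camera frame
# (here the object is placed 0.5 m in front of the camera, unrotated).
T_CO = np.eye(4)
T_CO[:3, 3] = [0.0, 0.0, 0.5]

# The same point expressed in camera coordinates.
p_C = T_CO @ p_O
```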