Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos
C.I. Nwoye, T. Yu, C. Gonzalez, B. Seeliger, P. Mascagni, D. Mutter, J. Marescaux, and N. Padoy
This repository contains the implementation code, inference demo, and evaluation scripts.
Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as <instrument, verb, target> combinations, is highly challenging to identify accurately. Triplet components can be difficult to recognize individually; this task requires not only recognizing all three triplet components simultaneously, but also correctly establishing the data association between them.
To achieve this task, we introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention to capture individual action triplet components in a scene, called the Class Activation Guided Attention Mechanism (CAGAM). This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks, the Multi-Head of Mixed Attention (MHMA). This technique uses several cross- and self-attentions to effectively capture relationships between instruments, verbs, and targets.
We also introduce CholecT50 - a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
- [2022.04.01]: Demo code and pre-trained model released!
- [2022.04.12]: 45-video subset of CholecT50 released! Download access.
- [2022.05.03]: PyTorch implementation code released!
The RDV model is composed of:
- Feature Extraction layer: extracts high- and low-level features from an input image taken from a video
- Encoder: for triplet components encoding
- Weakly-Supervised Localization (WSL) Layer: for localizing the instruments
- Class Activation Guided Attention Mechanism (CAGAM): for detecting the verbs and targets leveraging attention resulting from instrument activations (channel and position spatial attentions are used here)
- Bottleneck layer: for collecting unfiltered features for initial scene understanding
- Decoder: for triplet association decoding over L successive layers
- Multi-Head of Mixed Attention (MHMA): for learning to associate instrument-verb-target using successive self- and cross-attention mechanisms
- Feed-forward layer: for triplet feature refinement
- Classifier: for final triplet classification
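The difference between the self- and cross-attention mentioned for the MHMA can be illustrated with plain scaled dot-product attention. The block below is only a minimal sketch in pure Python, not the RDV implementation: the function names and the toy feature vectors are hypothetical, and the learned projections and multi-head structure of the actual model are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors.

    queries: n_q vectors of size d; keys: n_k vectors of size d;
    values: n_k vectors of a common size. Returns n_q output vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Hypothetical toy features (2 tokens, 2 dimensions each); not real activations.
triplet_feats = [[1.0, 0.0], [0.0, 1.0]]
instrument_feats = [[2.0, 0.0], [0.0, 2.0]]

# Self-attention: queries, keys and values all come from the same features.
self_attended = attention(triplet_feats, triplet_feats, triplet_feats)
# Cross-attention: queries come from one source, keys/values from another.
cross_attended = attention(triplet_feats, instrument_feats, instrument_feats)
```

In this sketch the only thing that distinguishes the two cases is where the keys and values come from, which is the general idea behind mixing both attention types in one decoder layer.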
We hope this repo will help researchers and engineers in the development of surgical action recognition systems. For algorithm development, we provide training data, baseline models, and evaluation methods to create a level playing field. For application usage, we also provide a small video demo that takes raw videos as input without any bells and whistles.
Components AP (API, APV, APT) and association AP (APIV, APIT, APIVT):

API | APV | APT | APIV | APIT | APIVT |
---|---|---|---|---|---|
92.0 | 60.7 | 38.3 | 39.4 | 36.9 | 29.9 |
Available on YouTube.
The model depends on the following libraries:
- sklearn
- PIL
- Python >= 3.5
- ivtmetrics
- Developer's framework:
  - For TensorFlow version 1:
    - TF >= 1.10
  - For TensorFlow version 2:
    - TF >= 2.1
  - For PyTorch version:
    - PyTorch >= 1.10.1
    - TorchVision >= 0.11
The code has been tested on the Linux operating system. It runs on both CPU and GPU. Equivalents of basic OS commands such as unzip, cd, wget, etc. will be needed to run it on Windows or Mac OS.
- clone the git repository:
git clone https://github.com/CAMMA-public/rendezvous.git
- install all the required libraries according to your chosen framework.
- download the dataset
- download model's weights
- train
- evaluate
coming soon . . .
- CholecT45
- CholecT50
- Dataset splits
- All frames are resized to 256 x 448 during training and evaluation.
- Image data are mean normalized.
- The dataset variants are tagged in this code as follows:
- cholect50 = CholecT50 with split used in the original paper.
- cholect50-challenge = CholecT50 with split used in the CholecTriplet challenge.
- cholect45-crossval = CholecT45 with official cross-val split (currently publicly released).
- cholect50-crossval = CholecT50 with official cross-val split.
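As a quick reference, the variant tags listed above can be kept in a small lookup table and checked before launching a run. This is a hedged sketch only: `DATASET_VARIANTS`, `NUM_FOLDS`, and `check_variant` are hypothetical names that do not come from the repository, and the assumption of five folds is taken from the k1 to k5 entries in the model tables.

```python
# Tags copied from the list above; descriptions paraphrase that list.
DATASET_VARIANTS = {
    "cholect50": "CholecT50, split used in the original paper",
    "cholect50-challenge": "CholecT50, split used in the CholecTriplet challenge",
    "cholect45-crossval": "CholecT45, official cross-validation split",
    "cholect50-crossval": "CholecT50, official cross-validation split",
}

NUM_FOLDS = 5  # assumption: folds k1..k5, as listed in the model tables


def check_variant(variant, kfold=None):
    """Validate a variant tag and, for cross-val variants, the fold number.

    Hypothetical helper: returns (variant, fold), with fold set to None
    for variants that do not use cross-validation.
    """
    if variant not in DATASET_VARIANTS:
        raise ValueError(f"unknown dataset variant: {variant!r}")
    if variant.endswith("-crossval"):
        if kfold not in range(1, NUM_FOLDS + 1):
            raise ValueError(
                f"{variant} needs kfold in 1..{NUM_FOLDS}, got {kfold!r}"
            )
        return variant, kfold
    # Non cross-validation variants ignore any fold that was passed.
    return variant, None
```

For example, `check_variant("cholect45-crossval", 3)` passes, while calling it with a cross-val tag and no fold raises a `ValueError`.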
The ivtmetrics package computes AP for triplet recognition. It also supports evaluating the recognition of the individual triplet components.
pip install ivtmetrics
or
conda install -c nwoye ivtmetrics
Usage guide is found on pypi.org.
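To make the reported numbers easier to interpret, the sketch below shows one common way to compute a per-class average precision (AP) and the mean over classes (mAP) from frame-level labels and scores. It is an illustration in pure Python under stated assumptions (no tied scores, classes without positive frames are skipped), not the ivtmetrics implementation, and the function names are hypothetical.

```python
def average_precision(labels, scores):
    """AP for one class from binary labels and predicted confidences.

    Frames are ranked by decreasing score; AP is the mean of the precision
    values measured at the rank of each positive frame.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return None  # class never present: AP is undefined
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / n_pos


def mean_average_precision(label_rows, score_rows):
    """Mean of per-class APs; each row holds one frame, each column one class."""
    n_classes = len(label_rows[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in label_rows],
                               [row[c] for row in score_rows])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Hypothetical toy data: 4 frames, 2 classes.
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.6], [0.7, 0.3], [0.1, 0.1]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

On the toy data, the first class has positives at ranks 1 and 3 (AP = 5/6) and the second class has its single positive at rank 1 (AP = 1), so `toy_map` is 11/12.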
The code can be run in training mode (-t), testing mode (-e), or both (-t -e) if you want to evaluate at the end of training:
Simple training on CholecT50 dataset:
python run.py -t --data_dir="/path/to/dataset" --dataset_variant=cholect50 --version=1
You can include more details such as epoch, batch size, cross-validation and evaluation fold, weight initialization, learning rates for all subtasks, etc.:
python3 run.py -t -e --data_dir="/path/to/dataset" --dataset_variant=cholect45-crossval --kfold=1 --epochs=180 --batch=64 --version=2 -l 1e-2 1e-3 1e-4 --pretrain_dir='path/to/imagenet/weights'
All the flags can be seen in the run.py file.
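For readers who want to see how the flags in the commands above fit together, here is an illustrative argparse sketch. It is not the repository's run.py: only the flag names are taken from the commands shown, while the types, the `dest` names (`train`, `evaluate`, `learning_rates`), and the absence of defaults are assumptions made for this example.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch, not the repository's run.py"
    )
    # Mode switches: either or both may be given.
    parser.add_argument("-t", dest="train", action="store_true", help="run training")
    parser.add_argument("-e", dest="evaluate", action="store_true", help="run evaluation")
    # Data selection.
    parser.add_argument("--data_dir", type=str)
    parser.add_argument("--dataset_variant", type=str)
    parser.add_argument("--kfold", type=int)
    # Training options (types are assumptions).
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch", type=int)
    parser.add_argument("--version", type=int)
    parser.add_argument("-l", dest="learning_rates", type=float, nargs="+")
    parser.add_argument("--pretrain_dir", type=str)
    return parser


# Parse an explicit argument list mirroring the second training command.
args = build_parser().parse_args([
    "-t", "-e",
    "--data_dir=/path/to/dataset",
    "--dataset_variant=cholect45-crossval",
    "--kfold=1", "--epochs=180", "--batch=64", "--version=2",
    "-l", "1e-2", "1e-3", "1e-4",
])
```

Here the three values after -l are collected into one list, which matches the way the command passes one learning rate per subtask.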
The experimental setup of the published model is contained in the paper.
python3 run.py -e --dataset_variant=cholect45-crossval --kfold 3 --batch 32 --version=1 --test_ckpt="/path/to/model-k3/weights" --data_dir="/path/to/dataset"
Adding a custom dataset is quite simple. You need to:
- organize your annotation files in the same format as in the CholecT45 dataset.
- modify the final model layers to suit your task by changing the class sizes (num_tool_classes, num_verb_classes, num_target_classes, num_triplet_classes) in the argparse.
- N.B. Download links to models' weights will not be provided until after the CholecTriplet2022 challenge.
Network | Base | Resolution | Dataset | Data split | Link |
---|---|---|---|---|---|
Rendezvous | ResNet-18 | Low | CholecT50 | RDV | [Google] [Baidu] |
Rendezvous | ResNet-18 | High | CholecT50 | RDV | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | Challenge | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | crossval k1 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | crossval k2 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | crossval k3 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | crossval k4 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | crossval k5 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT45 | crossval k1 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT45 | crossval k2 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT45 | crossval k3 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT45 | crossval k4 | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT45 | crossval k5 | [Google] [Baidu] |
Network | Base | Resolution | Dataset | Data split | Link |
---|---|---|---|---|---|
Rendezvous | ResNet-18 | High | CholecT50 | RDV | [Google] [Baidu] |
Rendezvous | ResNet-18 | High | CholecT50 | Challenge | [Google] [Baidu] |
Network | Base | Resolution | Dataset | Data split | Link |
---|---|---|---|---|---|
Rendezvous | ResNet-18 | High | CholecT50 | RDV | [Google] [Baidu] |
Rendezvous | ResNet-18 | Low | CholecT50 | RDV | [Google] [Baidu] |
Rendezvous | ResNet-18 | High | CholecT50 | Challenge | [Google] [Baidu] |
TensorFlow v1
Model | Layer Size | Ablation Component | APIVT | Link |
---|---|---|---|---|
Rendezvous | 1 | Proposed | 24.6 | [Google] [Baidu] |
Rendezvous | 2 | Proposed | 27.0 | [Google] [Baidu] |
Rendezvous | 4 | Proposed | 27.3 | [Google] [Baidu] |
Rendezvous | 8 | Proposed | 29.9 | [Google] [Baidu] |
Rendezvous | 8 | Patch sequence | 24.1 | [Google] [Baidu] |
Rendezvous | 8 | Temporal sequence | --.-- | [Google] [Baidu] |
Rendezvous | 8 | Single Self Attention Head | 18.8 | [Google] [Baidu] |
Rendezvous | 8 | Multiple Self Attention Head | 26.1 | [Google] [Baidu] |
Rendezvous | 8 | CholecTriplet2021 Challenge Model | 32.7 | [Google] [Baidu] |
Model weights are released periodically because some training runs are still in progress.
This code, models, and datasets are available for non-commercial scientific research purposes under the CC BY-NC-SA 4.0 license, attached as the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party codes are subject to their respective licenses.
This work was supported by French state funds managed within the Investissements d'Avenir program by BPI France in the scope of ANR project CONDOR, ANR Labex CAMI, ANR DeepSurg, ANR IHU Strasbourg and ANR National AI Chair AI4ORSafety. We thank the research teams of IHU and IRCAD for their help in the initial annotation of the dataset during the CONDOR project.
- CholecT45 / CholecT50 Datasets
- Official Dataset Splits
- Tripnet
- Attention Tripnet
- CholecTriplet2021 Challenge
- CholecTriplet2022 Challenge
If you find this repo useful in your project or research, please consider citing the relevant publications:
- For the CholecT45/CholecT50 Dataset:
@article{nwoye2021rendezvous,
title={Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos},
author={Nwoye, Chinedu Innocent and Yu, Tong and Gonzalez, Cristians and Seeliger, Barbara and Mascagni, Pietro and Mutter, Didier and Marescaux, Jacques and Padoy, Nicolas},
journal={Medical Image Analysis},
volume={78},
pages={102433},
year={2022}
}
- For the CholecT45/CholecT50 Official Dataset Splits:
@article{nwoye2022data,
title={Data Splits and Metrics for Benchmarking Methods on Surgical Action Triplet Datasets},
author={Nwoye, Chinedu Innocent and Padoy, Nicolas},
journal={arXiv preprint arXiv:2204.05235},
year={2022}
}
- For the Rendezvous or Attention Tripnet Baseline Models or any snippet of code from this repo:
@article{nwoye2021rendezvous,
title={Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos},
author={Nwoye, Chinedu Innocent and Yu, Tong and Gonzalez, Cristians and Seeliger, Barbara and Mascagni, Pietro and Mutter, Didier and Marescaux, Jacques and Padoy, Nicolas},
journal={Medical Image Analysis},
volume={78},
pages={102433},
year={2022}
}
- For the Tripnet Baseline Model:
@inproceedings{nwoye2020recognition,
title={Recognition of instrument-tissue interactions in endoscopic videos via action triplets},
author={Nwoye, Chinedu Innocent and Gonzalez, Cristians and Yu, Tong and Mascagni, Pietro and Mutter, Didier and Marescaux, Jacques and Padoy, Nicolas},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
pages={364--374},
year={2020},
organization={Springer}
}
- For the models presented @ CholecTriplet2021 Challenge:
@article{nwoye2022cholectriplet2021,
title={CholecTriplet2021: a benchmark challenge for surgical action triplet recognition},
author={Nwoye, Chinedu Innocent and Alapatt, Deepak and Vardazaryan, Armine ... Gonzalez, Cristians and Padoy, Nicolas},
journal={arXiv preprint arXiv:2204.04746},
year={2022}
}
This repo is maintained by CAMMA. Comments and suggestions on models are welcome. Check this page for updates.