We present Multi-Scale Transformers for Surgical Phase Recognition (MuST), a two-stage Transformer-based architecture designed to enhance the modeling of short-, mid-, and long-term information within surgical phases. Our method employs a frame encoder that leverages multi-scale surgical context across different temporal dimensions. The frame encoder considers diverse time spans around the specific frame being encoded, which we call a keyframe. We construct temporal windows around this keyframe to provide the temporal context needed for accurate phase classification. The encoder generates rich embeddings that capture short- and mid-term dependencies. To further enhance long-term understanding, we employ a Temporal Consistency Module that establishes relationships among frame embeddings within an extensive temporal window, ensuring coherent phase recognition.
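To make the idea of temporal windows around a keyframe concrete, here is a minimal Python sketch of how frame indices at several temporal scales could be gathered. It is an illustration only, not the repository's sampling code: the function name `multiscale_windows`, the window size, and the stride values are all assumptions chosen for the example.

```python
from typing import Dict, List, Sequence


def multiscale_windows(
    keyframe: int,
    num_frames: int,
    window_size: int = 8,             # hypothetical number of frames per window
    strides: Sequence[int] = (1, 4, 16),  # hypothetical sampling strides (scales)
) -> Dict[int, List[int]]:
    """Return, for each stride, the frame indices of a window around `keyframe`.

    A larger stride covers a longer time span with the same number of frames.
    Indices that fall outside the video are clamped to the first or last frame.
    """
    # Offsets are laid out so that offset 0 (the keyframe itself) is always included.
    offsets = range(-(window_size // 2), window_size - window_size // 2)
    windows: Dict[int, List[int]] = {}
    for stride in strides:
        windows[stride] = [
            min(max(keyframe + offset * stride, 0), num_frames - 1)
            for offset in offsets
        ]
    return windows


if __name__ == "__main__":
    for stride, indices in multiscale_windows(100, 1000, window_size=4, strides=(1, 8)).items():
        print(f"stride {stride}: {indices}")
```

Each window has the same length, so the short-stride window captures fine local motion while the long-stride window spans a wider portion of the procedure around the same keyframe.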
- Conference paper in Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Proceedings available at Springer
- Preprint available at arXiv
- Winning solution of the 2024 PhaKIR Challenge
- You can also visit our Project Page
Please follow these steps to run MuST:
$ conda create --name must python=3.8 -y
$ conda activate must
$ conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
$ conda install av -c conda-forge
$ pip install -U iopath
$ pip install -U opencv-python
$ pip install -U pycocotools
$ pip install 'git+https://github.com/facebookresearch/fvcore'
$ pip install 'git+https://github.com/facebookresearch/fairscale'
$ python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
$ git clone https://github.com/BCV-Uniandes/MuST
$ cd MuST
$ pip install -r requirements.txt
The DATA_PREPARATION.md file contains detailed instructions for preparing the datasets used to validate our method, downloading pre-trained model weights, and setting up your own custom dataset.
Dataset | Test Metric (metric) | Config | Run File | Frames Features | Model |
---|---|---|---|---|---|
GraSP | 79.14 (mAP) | GraSP TCM Config | Run GraSP TCM | ./data/GraSP/frames_features | GraSP Weights |
MISAW | 98.08 (mAP) | MISAW TCM Config | Run MISAW TCM | ./data/misaw/frames_features | MISAW Weights |
HeiChole | 77.25 (F1-score) | HeiChole TCM Config | Run HeiChole TCM | ./data/heichole/frames_features | HeiChole Weights |
Cholec80 | 85.57 (F1-score) | Cholec80 TCM Config | Run Cholec80 TCM | ./data/cholec80/frames_features | Cholec80 Weights |
We provide bash scripts with default parameters to evaluate each dataset. First download our preprocessed data files and pretrained models as instructed earlier, then run the following commands to evaluate each dataset:
# Calculate features running the script corresponding to the desired dataset
$ sh run_files/extract_features/{dataset}_phases
# Run the script corresponding to the desired dataset to evaluate
$ sh run_files/tcm/{dataset}_phases
You can easily modify the bash scripts to train our models. Set `TRAIN.ENABLE True` in the desired script to enable training, and set `TEST.ENABLE False` to avoid testing before training. You might also want to set `TRAIN.CHECKPOINT_FILE_PATH` to the model weights you want to use as initialization. You can edit the config files or the bash scripts to change the architecture design, training schedule, video input design, etc. We provide documentation for each hyperparameter in the defaults script. For the Temporal Consistency Module (TCM), make sure temporal chunks are used by setting `TEMPORAL_MODULE.CHUNKS True`. For more details on training MuST, refer to TRAINING.md.
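The options above are written as dotted `KEY VALUE` pairs. As a rough illustration of how such pairs map onto a nested configuration, here is a small self-contained Python sketch. It is not the parser used in this repository: the helper names `parse_value` and `apply_overrides` and the default values in the example dictionary are assumptions made up for the example.

```python
from typing import Any, Dict, List


def parse_value(text: str) -> Any:
    """Convert a command-line string to bool, int, or float when possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text  # leave anything else (for example a file path) as a string


def apply_overrides(cfg: Dict[str, Any], opts: List[str]) -> Dict[str, Any]:
    """Apply ["A.B", "value", ...] pairs onto a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node.setdefault(name, {})  # walk or create nested sections
        node[leaf] = parse_value(raw)
    return cfg


# Hypothetical defaults, used only to show the effect of the overrides.
cfg = {
    "TRAIN": {"ENABLE": False, "CHECKPOINT_FILE_PATH": ""},
    "TEST": {"ENABLE": True},
    "TEMPORAL_MODULE": {"CHUNKS": False},
}
apply_overrides(
    cfg,
    ["TRAIN.ENABLE", "True", "TEST.ENABLE", "False", "TEMPORAL_MODULE.CHUNKS", "True"],
)
```

After this call the dictionary has training enabled, testing disabled, and temporal chunks switched on, which mirrors the settings described for training the TCM.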
If you find this repository helpful, please consider citing:
@inproceedings{perez2024must,
title={MuST: Multi-scale Transformers for Surgical Phase Recognition},
author={P{\'e}rez, Alejandra and Rodr{\'\i}guez, Santiago and Ayobi, Nicol{\'a}s and Aparicio, Nicol{\'a}s and Dessevres, Eug{\'e}nie and Arbel{\'a}ez, Pablo},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={422--432},
year={2024},
organization={Springer}
}