Mayan Leavitt • Edan Kinderman
- Summary
- The SGRAF Model
- Data
- Proposed Improvements
- Comparison
- Files and Usage
- How to Run the Code
- References and credits
Our project aimed to improve the matching of a study's two X-ray scans (frontal and lateral views) with the corresponding radiology report, using the SGRAF image-text matching model as a baseline. To achieve this, we tested various loss functions, architectures, and training methods.
Through our experimentation, we successfully incorporated the second X-ray scan into our models and achieved significantly better results. Our research provides insights into enhancing the accuracy of image-text matching, which can have important implications for medical diagnosis and treatment.
The model extracts features from the given image and text and learns vector-based similarity representations between different regions of the image and words of the text. A SAF (Similarity Attention Filtration) module then processes these alignment vectors with attention mechanisms, emphasizing significant alignments and suppressing less meaningful ones. The module outputs a matching score indicating the compatibility between the image and the text. For more details, see the original article [1].
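To make the filtration step concrete, here is a minimal PyTorch sketch of the idea; the layer sizes (e.g., `sim_dim=256`) and the exact normalization are illustrative and differ from the original SGRAF implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFiltration(nn.Module):
    """Illustrative SAF-style module: weight each alignment vector by a learned
    attention score, aggregate the weighted vectors, and map the result to a
    single image-text matching score. sim_dim=256 is an assumed size."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.attn_fc = nn.Linear(sim_dim, 1)   # scores the significance of each alignment
        self.score_fc = nn.Linear(sim_dim, 1)  # maps the aggregated vector to a score

    def forward(self, sim_vecs):
        # sim_vecs: (batch, n_alignments, sim_dim) similarity representations
        weights = torch.sigmoid(self.attn_fc(sim_vecs))            # (batch, n_alignments, 1)
        aggregated = F.normalize((weights * sim_vecs).sum(dim=1), dim=-1)
        return self.score_fc(aggregated).squeeze(-1)               # (batch,) matching scores

# Example: 32 image-text pairs, each with 40 alignment vectors
scores = AttentionFiltration()(torch.randn(32, 40, 256))
```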
We used the MIMIC-CXR dataset, in which each study contains a frontal image, a lateral image, and a radiology report. Existing image-text matching models often ignore the lateral image, even though it contains critical information.
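As a rough illustration of this structure, a study could be exposed to the models along these lines (field names and preprocessing here are hypothetical, not the actual ones used in `data_xray.py`):

```python
import torch
from torch.utils.data import Dataset

class MimicCxrStudies(Dataset):
    """Illustrative dataset: each study holds precomputed frontal features,
    lateral features, and a tokenized report (field names are hypothetical)."""
    def __init__(self, studies):
        self.studies = studies  # list of dicts prepared by a preprocessing step

    def __len__(self):
        return len(self.studies)

    def __getitem__(self, idx):
        s = self.studies[idx]
        return (torch.as_tensor(s["frontal_feats"]),   # (n_regions, feat_dim)
                torch.as_tensor(s["lateral_feats"]),   # (n_regions, feat_dim)
                torch.as_tensor(s["report_tokens"]))   # (n_words,)
```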
- Test different loss functions: bi-directional ranking loss, NT-Xent [2], and their weighted sum (a sketch is given after this list).
- Train two regular SGRAF models simultaneously, one for each viewpoint, and use learned weights to average their similarity scores.
- Concatenate the features of the two image types to obtain a single input.
- Use positional encoding to differentiate between the two viewpoints.
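Below is a minimal sketch of the candidate loss functions, computed over a batch similarity matrix whose diagonal holds the matched image-text pairs. The margin, temperature, and weighting are illustrative; the actual training code (e.g., possible hardest-negative mining) may differ.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(sim, margin=0.2):
    """Hinge ranking loss over a (batch, batch) image-text similarity matrix.
    Every mismatched pair is pushed below the matched pair (the diagonal)
    by at least `margin`, in both retrieval directions."""
    pos = sim.diag().view(-1, 1)
    cost_txt = (margin + sim - pos).clamp(min=0)       # image -> wrong texts
    cost_img = (margin + sim - pos.t()).clamp(min=0)   # text -> wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

def nt_xent_loss(sim, temperature=0.1):
    """NT-Xent [2]: treat each row/column as a classification over the batch,
    where the matching pair (the diagonal) is the correct class."""
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def combined_loss(sim, alpha=0.5):
    """Weighted sum of the two objectives; alpha is a tunable trade-off."""
    return alpha * bidirectional_ranking_loss(sim) + (1 - alpha) * nt_xent_loss(sim)

# `sim` would be the (batch, batch) matrix of SGRAF matching scores
loss = combined_loss(torch.randn(8, 8, requires_grad=True))
```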
Here is an evaluation of each model's ability to match an image with the correct text. A higher R@K value indicates better retrieval performance, meaning a better alignment between the image and the corresponding text.
Here is a comparison of the basic models, which were trained on only one image type (frontal or lateral). BRL denotes the bi-directional ranking loss, NT-Xent the contrastive loss from [2], and Sum their weighted sum.
Image type | Loss | R@1 | R@5 | R@10 |
---|---|---|---|---|
Frontal | BRL | 0.5 | 4.2 | 8.5 |
Lateral | BRL | 0.5 | 1.5 | 3.1 |
Frontal | NT-Xent | 6.6 | 18.6 | 27.2 |
Lateral | NT-Xent | 5.0 | 13.9 | 21.1 |
Frontal | Sum | 3.3 | 10.4 | 15.4 |
Lateral | Sum | 0.3 | 2.0 | 3.4 |
Here is a comparison of the "double" model family, which has two encoders, one for each image type (frontal and lateral). These models are trained on both image types; a sketch of the learned-weight score averaging appears after the table.
Model type | Learned weights | Shared text encoder | R@1 | R@5 | R@10 |
---|---|---|---|---|---|
Uniform Average | No | No | 8.1 | 21.3 | 29.3 |
Weighted Average | No | No | 8.2 | 21.2 | 29.5 |
Double Model | Yes | No | 6.7 | 21.1 | 30.4 |
Light Double Model | Yes | Yes | 8.5 | 22.5 | 31.5 |
Pretrained Model | Yes | No | 8.1 | 21.0 | 29.6 |
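For illustration, here is a sketch of the learned-weight score averaging used by the double models; the sub-model names and interfaces are hypothetical and may differ from `model_xray.py`.

```python
import torch
import torch.nn as nn

class LearnedScoreAverage(nn.Module):
    """Combines the similarity matrices of two viewpoint-specific models
    with a learned convex weighting (softmax over two learned scalars)."""
    def __init__(self, frontal_model, lateral_model):
        super().__init__()
        self.frontal_model = frontal_model   # SGRAF-style model for frontal images
        self.lateral_model = lateral_model   # SGRAF-style model for lateral images
        self.weight_logits = nn.Parameter(torch.zeros(2))  # starts as a uniform average

    def forward(self, frontal_imgs, lateral_imgs, captions):
        w = torch.softmax(self.weight_logits, dim=0)
        sim_frontal = self.frontal_model(frontal_imgs, captions)  # (batch, batch) scores
        sim_lateral = self.lateral_model(lateral_imgs, captions)  # (batch, batch) scores
        return w[0] * sim_frontal + w[1] * sim_lateral
```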
Here is a comparison of the "concatenation" model family, which takes as input a text and a concatenation of the frontal and lateral images. Some of these models were trained with positional encoding [4] added to the image features; a sketch follows the table.
Model type | Positional encoding | R@1 | R@5 | R@10 |
---|---|---|---|---|
Basic Concatenation | No | 7.4 | 20.2 | 29.9 |
Tagged Features | No | 6.6 | 20.1 | 27.0 |
Constant Positional Encoding | Yes | 7.4 | 18.8 | 27.2 |
Full Positional Encoding | Yes | 7.5 | 20.6 | 28.0 |
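The sketch below illustrates the full (sinusoidal) positional-encoding variant [4] applied to the concatenated region features; the feature dimensions and the exact encoding scheme in `model_xray.py` are assumptions.

```python
import math
import torch

def concat_with_positional_encoding(frontal_feats, lateral_feats):
    """Concatenates region features of the two views along the region axis and
    adds a fixed sinusoidal encoding over region positions, so the model can
    tell frontal regions from lateral ones. The "constant" variant would add
    one constant vector per viewpoint instead.
    Each input: (batch, n_regions, dim)."""
    feats = torch.cat([frontal_feats, lateral_feats], dim=1)      # (batch, 2*n_regions, dim)
    _, n_pos, dim = feats.shape
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)   # (2*n_regions, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(n_pos, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return feats + pe.unsqueeze(0)

# Example: 36 regions per view, 1024-dimensional features
out = concat_with_positional_encoding(torch.randn(4, 36, 1024), torch.randn(4, 36, 1024))
```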
We can see that using the lateral images improves results compared to using frontal data alone. In addition, training two models at once achieves the best performance, while concatenating the image features is a cheaper way to combine the two viewpoints.
File Name | Description |
---|---|
average_eval.py | Evaluates two trained models together (score averaging) |
data_xray.py | Data loading and batching |
evaluation_xray.py | Evaluates a single trained model |
model_xray.py | Implementation of the models |
opts_xray.py | Options for running experiments via scripts |
train_xray.py | Trains a model |
You can train a regular SGRAF model on the MIMIC-CXR dataset, using only frontal images, with this script:
```
python opts_xray.py --model_name '../checkpoint/<model_name>' --view 'frontal' --model_num <number> --model_type 'regular_model' --batch_size 64 --num_epochs 40
```
- Project supervisor: Gefen Dawidowicz. Some of the algorithms were implemented based on her code.
- [1] H. Diao et al., "Similarity Reasoning and Filtration for Image-Text Matching", AAAI Conference on Artificial Intelligence, 2021.
- [2] T. Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations", ICML (PMLR), pp. 1597-1607, 2020.
- [3] Z. Ji et al., "Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment", MLMI, pp. 110-119, 2021.
- [4] A. Vaswani et al., "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017.