Satellite Image Analysis using Vision Transformers: Achieving SOTA in Vision Classification
Manav Garg
Satellite imagery serves many purposes, including environmental monitoring, urban planning, and disaster response. However, analyzing these images accurately and efficiently remains challenging due to their scale and complexity. In this study, we address this challenge by employing Vision Transformers (ViTs) to achieve state-of-the-art (SOTA) performance in satellite image analysis. We explore the direct application of ViTs to satellite images, capturing global dependencies to make land cover and land use classification more accurate. This approach builds on recent advances in computer vision, particularly the success of ViTs in image recognition. We fine-tune pre-trained ViT models on the EuroSAT dataset, a collection of satellite images annotated with 10 land cover classes. By optimizing ViT performance on EuroSAT, we aim to surpass traditional methods in satellite image classification, offering more accurate insights into Earth's surface dynamics. This research advances satellite image analysis and enhances tools for environmental monitoring, urban planning, and land management. We evaluate the proposed approach using metrics such as classification accuracy, demonstrating its efficacy relative to existing methods. We also compare two Vision Transformer models: the base ViT and the Compact Convolutional Transformer (CCT).
Index Terms—Vision Transformers, PyTorch, Satellite Image Analysis, EuroSAT