To write an encoder that adapts to the SMP decoder, we need one that outputs the same format the decoder expects as input. The original MiT encoder outputs a list of 4D tensors, and we need to adhere to that format.
The encoder output shapes are defined as follows:
Index | Shape | Shape Breakdown | Description |
---|---|---|---|
0 | 16x3x224x224 | Batch × 3 Color Channels × Height × Width | Original Image Tensor |
1 | 16x0x112x112 | Batch × 0 Channels × Height/2 × Width/2 | Dummy Tensor |
2 | 16x64x56x56 | Batch × 64 Channels × Height/4 × Width/4 | 1st MiT Block Output |
3 | 16x128x28x28 | Batch × 128 Channels × Height/8 × Width/8 | 2nd MiT Block Output |
4 | 16x320x14x14 | Batch × 320 Channels × Height/16 × Width/16 | 3rd MiT Block Output |
5 | 16x512x7x7 | Batch × 512 Channels × Height/32 × Width/32 | 4th MiT Block Output |
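As a sanity check, here is a minimal sketch (assuming PyTorch and the batch size of 16 used above) that builds dummy tensors at exactly these shapes, in the list format the decoder consumes:

```python
import torch

batch = 16
image = torch.randn(batch, 3, 224, 224)

# The six feature maps the SMP decoder expects, indexed as in the table.
features = [
    image,                               # 0: original image tensor
    torch.empty(batch, 0, 112, 112),     # 1: dummy tensor (0 channels)
    torch.randn(batch, 64, 56, 56),      # 2: 1st MiT block output
    torch.randn(batch, 128, 28, 28),     # 3: 2nd MiT block output
    torch.randn(batch, 320, 14, 14),     # 4: 3rd MiT block output
    torch.randn(batch, 512, 7, 7),       # 5: 4th MiT block output
]
print([f.shape for f in features])
```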
Compared to a traditional ViT encoder, the blocks of MiT do not all accept the same tensor shape; each block accepts a different one. That means the output of each block is not upsampled or downsampled from a fixed-size tensor (such as 16x128x14x14) into the various output shapes. Instead, the transformation is done by a patch embedding layer placed before each MiT block, and each block takes the output of that patch embedding layer (whose shape has already been transformed) as input.
Hence, the layer-by-layer structure of the MiT encoder looks like this:
Layer | Input Shape | Output Shape | Description |
---|---|---|---|
0 | 16x3x224x224 | 16x3x224x224 | Original Image Input |
1 | 16x3x224x224 | 16x64x56x56 | Patch Embedding 0 |
2 | 16x64x56x56 | 16x64x56x56 | 1st MiT Block |
3 | 16x64x56x56 | 16x128x28x28 | Patch Embedding 1 |
4 | 16x128x28x28 | 16x128x28x28 | 2nd MiT Block |
5 | 16x128x28x28 | 16x320x14x14 | Patch Embedding 2 |
6 | 16x320x14x14 | 16x320x14x14 | 3rd MiT Block |
7 | 16x320x14x14 | 16x512x7x7 | Patch Embedding 3 |
8 | 16x512x7x7 | 16x512x7x7 | 4th MiT Block |
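To make the shape changes concrete, here is a minimal sketch of one such patch embedding, implemented as a strided convolution (MiT actually uses overlapping patch embeddings, i.e. kernel_size > stride; the exact hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

# Patch Embedding 1 from the table above: 16x64x56x56 -> 16x128x28x28.
# A stride-2 convolution halves the spatial size while raising the
# channel count from 64 to 128.
patch_embed_1 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

x = torch.randn(16, 64, 56, 56)   # output of the 1st MiT block
y = patch_embed_1(x)
print(y.shape)                    # torch.Size([16, 128, 28, 28])
```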
But since we are using a traditional ViT encoder, we need to adapt its output to the format the SMP decoder expects (i.e., the MiT output format above). We can do this by upsampling or downsampling each block's output to the desired spatial size and projecting the channel count to match (e.g., with a 1x1 convolution). Our ViT encoder can then have the following structure:
Layer | Input Shape | Output Shape | Description |
---|---|---|---|
0 | 16x3x224x224 | 16x3x224x224 | Original Image Input |
1 | 16x3x224x224 | 16x320x14x14 | Patch Embedding |
2 | 16x320x14x14 | 16x320x14x14 | 1st ViT Block |
2-1 | 16x320x14x14 | 16x64x56x56 | 4x Upsampling + Projection Output |
3 | 16x320x14x14 | 16x320x14x14 | 2nd ViT Block |
3-1 | 16x320x14x14 | 16x128x28x28 | 2x Upsampling + Projection Output |
4 | 16x320x14x14 | 16x320x14x14 | 3rd ViT Block |
4-1 | 16x320x14x14 | 16x320x14x14 | Identity Output |
5 | 16x320x14x14 | 16x320x14x14 | 4th ViT Block |
5-1 | 16x320x14x14 | 16x512x7x7 | 2x Downsampling + Projection Output |
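Here is a minimal sketch of such an adapter in PyTorch. Everything below (the `ViTToSMPAdapter` name, the 1x1 convolutions for channel projection, and bilinear interpolation for resizing) is an assumption for illustration, not a verified implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTToSMPAdapter(nn.Module):
    """Hypothetical adapter: maps the four 16x320x14x14 ViT block outputs
    to the feature list format the SMP decoder expects."""

    def __init__(self, dim=320):
        super().__init__()
        # 1x1 convolutions change the channel counts; interpolation
        # (or a strided conv) changes the spatial size.
        self.proj0 = nn.Conv2d(dim, 64, kernel_size=1)             # -> 64 channels
        self.proj1 = nn.Conv2d(dim, 128, kernel_size=1)            # -> 128 channels
        self.proj2 = nn.Identity()                                 # 320 channels kept
        self.proj3 = nn.Conv2d(dim, 512, kernel_size=2, stride=2)  # -> 512 ch, 2x down

    def forward(self, image, block_outputs):
        b, _, h, w = image.shape
        f1, f2, f3, f4 = block_outputs  # each 16x320x14x14
        up4 = F.interpolate(self.proj0(f1), scale_factor=4,
                            mode="bilinear", align_corners=False)  # 16x64x56x56
        up2 = F.interpolate(self.proj1(f2), scale_factor=2,
                            mode="bilinear", align_corners=False)  # 16x128x28x28
        return [
            image,                                                   # 16x3x224x224
            torch.empty(b, 0, h // 2, w // 2, device=image.device),  # dummy tensor
            up4, up2,
            self.proj2(f3),                                          # 16x320x14x14
            self.proj3(f4),                                          # 16x512x7x7
        ]

# Usage: feed the image and the four ViT block outputs through the adapter.
adapter = ViTToSMPAdapter()
image = torch.randn(16, 3, 224, 224)
blocks = [torch.randn(16, 320, 14, 14) for _ in range(4)]
print([f.shape for f in adapter(image, blocks)])
```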
However, we have not verified the performance of such a model structure; it might not be as effective as the original MiT encoder. Still, I believe the performance will be acceptable for a segmentation model.