## Solution 1

To write an encoder that adapts to the SMP decoder, we need an encoder that outputs the same format the decoder expects as input. The original MiT encoder outputs a list of 4D tensors, and we need to adhere to that format.

The encoder output shapes are defined as follows:

| Index | Shape | Shape Breakdown | Description |
| --- | --- | --- | --- |
| 0 | 16x3x224x224 | Batch_Size * 3_Color_Channels * Height * Width | Original Image Tensor |
| 1 | 16x0x112x112 | Batch_Size * 0_Channels * Height/2 * Width/2 | Dummy Tensor |
| 2 | 16x64x56x56 | Batch_Size * 64_Channels * Height/4 * Width/4 | 1st ViT Block Output |
| 3 | 16x128x28x28 | Batch_Size * 128_Channels * Height/8 * Width/8 | 2nd ViT Block Output |
| 4 | 16x320x14x14 | Batch_Size * 320_Channels * Height/16 * Width/16 | 3rd ViT Block Output |
| 5 | 16x512x7x7 | Batch_Size * 512_Channels * Height/32 * Width/32 | 4th ViT Block Output |
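
For concreteness, here is a minimal sketch of that output format in PyTorch. The variable names and the use of random data are illustrative only; the shapes follow the table above for a batch of 16 images at 224x224.

```python
import torch

batch, h, w = 16, 224, 224

# The list of 4D feature tensors the SMP decoder expects from the encoder,
# in the order given by the table above.
features = [
    torch.randn(batch, 3, h, w),                # 0: original image tensor
    torch.empty(batch, 0, h // 2, w // 2),      # 1: dummy tensor (0 channels)
    torch.randn(batch, 64, h // 4, w // 4),     # 2: 1st ViT block output
    torch.randn(batch, 128, h // 8, w // 8),    # 3: 2nd ViT block output
    torch.randn(batch, 320, h // 16, w // 16),  # 4: 3rd ViT block output
    torch.randn(batch, 512, h // 32, w // 32),  # 5: 4th ViT block output
]

for index, feature in enumerate(features):
    print(index, tuple(feature.shape))
```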

Compared to a traditional ViT encoder, the blocks of MiT do not all accept tensors of the same shape; each block accepts a tensor of a different shape. That means the output of each block is not upsampled or downsampled from a fixed-size tensor (e.g. 16x128x14x14) to a different output shape. Instead, the transformation is done by a patch embedding layer placed before each ViT block, and each block takes as input the output of its patch embedding layer, whose shape has already been transformed.

Hence, the MiT encoder's structure looks like this, layer by layer:

| Layer | Input Shape | Output Shape | Description |
| --- | --- | --- | --- |
| 0 | 16x3x224x224 | 16x3x224x224 | Original Image Input |
| 1 | 16x3x224x224 | 16x64x56x56 | Patch Embedding 0 |
| 2 | 16x64x56x56 | 16x64x56x56 | 1st ViT Block Output |
| 3 | 16x64x56x56 | 16x128x28x28 | Patch Embedding 1 |
| 4 | 16x128x28x28 | 16x128x28x28 | 2nd ViT Block Output |
| 5 | 16x128x28x28 | 16x320x14x14 | Patch Embedding 2 |
| 6 | 16x320x14x14 | 16x320x14x14 | 3rd ViT Block Output |
| 7 | 16x320x14x14 | 16x512x7x7 | Patch Embedding 3 |
| 8 | 16x512x7x7 | 16x512x7x7 | 4th ViT Block Output |
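
To illustrate how the patch embeddings, rather than explicit resampling, carry the shape changes, here is a shape-only sketch of that stage layout. The strided convolutions and the `nn.Identity` stand-ins for the transformer blocks are assumptions made purely for illustration; the real MiT uses overlapping patch embeddings and efficient self-attention blocks.

```python
import torch
import torch.nn as nn

class MiTStageSketch(nn.Module):
    """One MiT-style stage: a patch embedding that changes channels and
    resolution, followed by a shape-preserving block (placeholder here)."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        # Patch embedding: the only place where the tensor shape changes.
        self.patch_embed = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                     stride=stride, padding=1)
        # Stand-in for the ViT block; keeps the shape it receives.
        self.block = nn.Identity()

    def forward(self, x):
        return self.block(self.patch_embed(x))

stages = nn.Sequential(
    MiTStageSketch(3, 64, stride=4),     # layers 1-2: 224x224 -> 56x56
    MiTStageSketch(64, 128, stride=2),   # layers 3-4: 56x56 -> 28x28
    MiTStageSketch(128, 320, stride=2),  # layers 5-6: 28x28 -> 14x14
    MiTStageSketch(320, 512, stride=2),  # layers 7-8: 14x14 -> 7x7
)

x = torch.randn(16, 3, 224, 224)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))  # 16x64x56x56, 16x128x28x28, 16x320x14x14, 16x512x7x7
```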

But since we are using a traditional ViT encoder, we need to adapt its outputs to the input format the SMP decoder expects (i.e., the MiT output format above). We can do this by upsampling or downsampling each block's output to the desired dimensions. For our models, the ViT encoder can have the following structure (a code sketch follows the table):

| Layer | Input Shape | Output Shape | Description |
| --- | --- | --- | --- |
| 0 | 16x3x224x224 | 16x3x224x224 | Original Image Input |
| 1 | 16x3x224x224 | 16x320x14x14 | Patch Embedding |
| 2 | 16x320x14x14 | 16x320x14x14 | 1st ViT Block |
| 2-1 | 16x320x14x14 | 16x64x56x56 | Upsampling 4x Output |
| 3 | 16x320x14x14 | 16x320x14x14 | 2nd ViT Block |
| 3-1 | 16x320x14x14 | 16x128x28x28 | Upsampling 2x Output |
| 4 | 16x320x14x14 | 16x320x14x14 | 3rd ViT Block |
| 4-1 | 16x320x14x14 | 16x320x14x14 | Identity Output |
| 5 | 16x320x14x14 | 16x320x14x14 | 4th ViT Block |
| 5-1 | 16x320x14x14 | 16x512x7x7 | Downsampling 2x Output |
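
Below is a minimal PyTorch sketch of this adapter structure. The ViT blocks are placeholders (`nn.Identity`), and the 1x1 convolutions used to change the channel counts are an assumption on our part, since the table only specifies the target shapes; bilinear interpolation handles the spatial up- and downsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTEncoderForSMP(nn.Module):
    """Sketch of the adapter: one patch embedding to 320x14x14, four
    shape-preserving ViT blocks (placeholders), and per-block heads that
    resize each block's output to the shape the SMP decoder expects."""

    def __init__(self, embed_dim=320):
        super().__init__()
        # Layer 1: 16x3x224x224 -> 16x320x14x14 (patch size 16).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Layers 2-5: stand-ins for the actual ViT blocks.
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(4)])
        # Layers 2-1 to 5-1: channel projection (assumed 1x1 convs) + resize.
        self.heads = nn.ModuleList([
            nn.Conv2d(embed_dim, 64, 1),   # 2-1: -> 16x64x56x56  (4x up)
            nn.Conv2d(embed_dim, 128, 1),  # 3-1: -> 16x128x28x28 (2x up)
            nn.Identity(),                 # 4-1: -> 16x320x14x14 (identity)
            nn.Conv2d(embed_dim, 512, 1),  # 5-1: -> 16x512x7x7   (2x down)
        ])
        self.scales = [4.0, 2.0, 1.0, 0.5]

    def forward(self, x):
        # Indices 0 and 1 of the decoder input: the image and the dummy tensor.
        features = [x, torch.empty(x.size(0), 0, x.size(2) // 2, x.size(3) // 2,
                                   device=x.device)]
        h = self.patch_embed(x)
        for block, head, scale in zip(self.blocks, self.heads, self.scales):
            h = block(h)   # stays 16x320x14x14
            f = head(h)    # project channels for this output branch
            if scale != 1.0:
                f = F.interpolate(f, scale_factor=scale,
                                  mode="bilinear", align_corners=False)
            features.append(f)
        return features

encoder = ViTEncoderForSMP()
for feature in encoder(torch.randn(16, 3, 224, 224)):
    print(tuple(feature.shape))
```

Keeping the channel projections in the encoder heads (rather than modifying the decoder) is a design choice of this sketch; it lets the SMP decoder stay unmodified.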

However, we have not verified the performance of such a model structure; it might not be as effective as the original MiT encoder. Still, I believe the performance will be acceptable for a segmentation model.