Enhancing underwater images using a truncated U-Net architecture for real-time color correction applications
"The shore is an ancient world… Yet it is a world that keeps alive the sense of continuing creation and of the relentless drive of life.” from The Edge of the Sea by Rachel Carson
Truncated U-Net architecture inspired by UWGAN: Underwater GAN for Real-world Underwater Color Restoration and Dehazing.
Usage of U-Net for image-to-image translation inspired by Image-to-Image Translation with Conditional Adversarial Networks.
I love diving and filming, but my underwater photos come out with distorted colors (greenish, bluish, distorted due to refraction, etc.). Thanks to Deep Sea-NN I don't have to learn Photoshop to edit my favorite underwater pics. YaAY!
-
Focused on two factors:
- Training / Inference Speed ...since I don't have good GPUs :'(
- Image Quality (especially crispness)
-
Deep Sea-NN does NOT require the following:
- $$ GPUs to train & run
- Depth map for training images
- a lot of images to train (apparently)
"Underwater environment...distortion is extremely non-linear in nature, and is affected by a large number of factors, such as the amount of light present(overcast versus sunny, operational depth), amount of particles in the water, time of day, and the camera being used." from Enhancing Underwater Imagery using Generative Adversarial Networks.
-
Underwater image correction is a task that has been worked on for years, especially with the advent of AUVs. Underwater images with correct color schemes are necessary for effective object identification/segmentation tasks. Traditional methods relied on physics or statistics (tldr: used an energy minimization formulation based on Markov Random Fields to learn color correction patterns; pretty cool, but loses a lot of detail since single images are divided into patches during training).
-
Even rather recent models relied partly on physics to deal with attenuation, scattering, and color correction (tldr: complex and hard to generalize to different types of water due to each ocean's different lighting conditions + requires a depth map for training images).
-
GAN-based models have been shown to be effective too, but they didn't fit my goal of iterating through multiple tweaks of the same model in ablation experiments. Plus, I don't have GPUs capable of running through GAN training at the fast pace that CNN training allows.
-
I needed a model with the potential for real-time inference on underwater footage, so it could be used within an AUV system while performing at a good level for color correction.
Dataset from U of Minnesota's EUVP(Enhancing Underwater Visual Perception) dataset.
- 3700 Paired Underwater ImageNet + 1270 for validation
- 2185 Paired Underwater Scenes + 130 for validation
- Total: 5885 Paired Underwater Image Sets for training + 1400 for validation
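For reference, a minimal sketch of loading the paired sets in PyTorch, assuming EUVP's trainA (distorted) / trainB (ground truth) folder layout. The folder names and image size are my assumptions; adjust to the actual download:

```python
# Hedged sketch: a paired underwater image Dataset, assuming an EUVP-style
# trainA/trainB layout where matching filenames form a (distorted, clean) pair.
import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class PairedUnderwaterDataset(Dataset):
    def __init__(self, root, size=256):
        self.dir_a = os.path.join(root, "trainA")  # distorted inputs
        self.dir_b = os.path.join(root, "trainB")  # color-corrected targets
        self.names = sorted(os.listdir(self.dir_a))
        self.tf = T.Compose([T.Resize((size, size)), T.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        a = Image.open(os.path.join(self.dir_a, self.names[i])).convert("RGB")
        b = Image.open(os.path.join(self.dir_b, self.names[i])).convert("RGB")
        return self.tf(a), self.tf(b)
```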
Comparison with U of Minnesota's UGAN.
- UGAN outputs tend to lose image detail upon correction and include checkerboard artifacts in the background. This could be a problem for tasks such as color correction for marine biological research, where exact details of an organism (e.g. the size of spots on the porcupine puffer) must be preserved.
- This is not the case with Deep Sea-NN where original content detail is preserved while correcting distorted colors.
Shark footage above is from YouTube.
- The media above is 20 FPS, and the upper limit for real-time enhancement with Deep Sea-NN is media of ~20 FPS.
- For a stabler application, the FPS would have to be decreased to ~15 FPS or lower.
-
Inspired by Shallow-UWnet's unbelievably simple architecture (with really good results considering its simplicity) of just 3 densely connected convolutional blocks with skip connections, I decided to stick with CNNs with skip connections. It's just that...
- Shallow-UWnet created blurry outputs and lost some details when testing. It used the VGG perceptual loss + MSE loss, which the NVIDIA/MIT Media Lab paper says can create blotchy results.
- It did not work well with resized images larger than 256x256 (the size it was trained on), making it unappealing for people like me who want semi-quality personal diving photos (inference image size should at least be over 256x256... it's 2022...). Maybe this was due to the rather shallow architecture not being able to capture enough color feature patterns to generalize to larger input image sizes.
- This led me to look for a CNN architecture with skip connections (for content preservation) and a deep architecture (for generalization) that captures features well... which naturally led to a U-Net. Its flexibility in working with various input image sizes during inference was a big plus.
-
I realized I didn't need the traditionally deep U-Net, since even Shallow-UWnet performs so well, so I took out one max-pool depth level and worked with a shallower U-Net to accelerate training. A minimal sketch of the idea is below.
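The sketch below is an illustration of a 3-level (one pool level fewer than the classic 4-level) U-Net in PyTorch, my assumed framework; it is not the exact Deep Sea-NN code:

```python
# Hedged sketch of a truncated U-Net: 3 encoder levels instead of the
# classic 4, keeping the skip connections that preserve image content.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convs with ReLU, the standard U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TruncatedUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 4, base * 8)
        self.up3 = nn.ConvTranspose2d(base * 8, base * 4, 2, stride=2)
        self.dec3 = conv_block(base * 8, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        # Input height/width should be divisible by 8 (three 2x downsamples).
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))
```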
- MS-SSIM+L1 (right) captures fine-grained details better than VGG+MSE (left).
- MS-SSIM shows distinct anemone tentacles (yellow box) while VGG+MSE shows a blob.
- The effectiveness of MS-SSIM in darker regions (green box) is less apparent, but the output still looks a bit crisper (I would have to compare SSIM, PSNR, or UIQM metrics (UIQM for images with no ground truth, i.e. images that I took on my own) for exact quantitative results... which I am too lazy to do for this project).
- There are two choices of loss function when it comes to the SSIM family: the classic SSIM or the beefed-up MS-SSIM.
- SSIM loss functions are sensitive to the specific Gaussian sigma value they use. A smaller Gaussian sigma works well for correcting edges in an image while leaving splotchy artifacts in flat areas, while a higher Gaussian sigma does the exact opposite (see results below from this paper).
- Enter the beefed-up MS-SSIM. Instead of going through the work of fine-tuning your Gaussian sigma value, MS-SSIM calculates SSIM scores at different scales of the image, working with multiple Gaussian sigma values. The authors of the original paper set the Gaussian sigmas to [0.5, 1, 2, 4, 8].
MS-SSIM will outperform SSIM in most common scenarios, but the paper also mentions...
"Thanks to its multi-scale nature, MS-SSIM solves the issue of noise around edges, but does not solve the problem of the change colors in flat areas, in particular when at least one channel is strong, as is the case with the sky." - from Loss Functions for Image Restoration with Neural Networks
MS-SSIM is, after all, a better generalization of SSIM that lets us avoid the hassle of finding the optimal Gaussian sigma value, but it is only a convenient generalization.
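For reference, a hedged sketch of an MS-SSIM + L1 training loss using the third-party pytorch-msssim package (pip install pytorch-msssim). alpha = 0.84 follows the NVIDIA/MIT Media Lab paper, though that paper additionally Gaussian-weights the L1 term, which this simplified version omits:

```python
# Hedged sketch of a combined MS-SSIM + L1 loss for image restoration.
import torch.nn as nn
from pytorch_msssim import MS_SSIM

class MSSSIML1Loss(nn.Module):
    def __init__(self, alpha=0.84):
        super().__init__()
        self.alpha = alpha
        self.ms_ssim = MS_SSIM(data_range=1.0, channel=3)
        self.l1 = nn.L1Loss()

    def forward(self, pred, target):
        # MS-SSIM is a similarity score in [0, 1]; 1 - MS-SSIM turns it into a loss.
        ms_ssim_loss = 1.0 - self.ms_ssim(pred, target)
        return self.alpha * ms_ssim_loss + (1.0 - self.alpha) * self.l1(pred, target)
```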
- BUT what if we know the exact distribution of the images our model will see when it is deployed? (In other words, what if we know whether our model will see images consisting mostly of edges instead of flat areas, or vice versa?)
I believe we can answer this question when working with underwater environments.
Unlike the surface, where a picture of a street will include both edges (pedestrians, signs, etc.) and flat areas (open skies), underwater images will most likely consist only of either edges or flat areas.
-
If, for example, a model is meant to take a lot of pictures of marine organisms, then it will process images that are bound to include a lot of edges: the fins of a particular fish, the tentacles of an anemone, geometric camouflage patterns, and so on. With such a distribution, we could directly train the model with an SSIM loss using a small Gaussian sigma.
-
On the contrary, the opposite example would be a model exposed to images which are mostly flat areas. In the underwater environment, this would be open-water images (like the one below), where ~80% of the frame is filled with different shades of a single color. For this case, it would be appropriate to train the model with an SSIM loss using a high Gaussian sigma, reducing splotchy artifacts in the processed output images.
- A potential application for this kind of model could be an AUV following a human diver, which would need to consistently process images like the one below to track the diver's current location.
This method would be a good fine-tuning practice to tailor your model specifically to the data it will mostly see upon deployment. A sketch of the idea follows.
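The sketch below assumes the pytorch-msssim API, whose SSIM class exposes a win_sigma parameter; the sigma and window values here are illustrative examples, not tuned numbers:

```python
# Hedged sketch: choose the SSIM Gaussian sigma from the expected
# deployment distribution (edge-heavy vs. flat open-water imagery).
from pytorch_msssim import SSIM

def make_ssim_loss(deployment="edges"):
    if deployment == "edges":
        # Edge-heavy imagery (fish fins, anemone tentacles): small sigma
        # keeps edges crisp.
        win_size, win_sigma = 11, 0.5
    else:
        # Mostly flat open water: larger sigma (with a window wide enough
        # to contain it) suppresses splotchy artifacts in flat regions.
        win_size, win_sigma = 33, 4.0
    ssim = SSIM(data_range=1.0, win_size=win_size, win_sigma=win_sigma, channel=3)
    return lambda pred, target: 1.0 - ssim(pred, target)
```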
- Utilize Average-Pooling instead of Max-Pooling between U-Net Layers
"Each transparent particle...are multiple, small-sized, independent, and occupying 20 to 35 pixels, ordinary deep U-net model leads to certain small-sized particles being lost. As the particles are all homogeneous beads...average pooling can properly remain the particle shape." - from Short U-net model with average pooling based on in-line digital holography for simultaneous restoration of multiple particles
-
TLDR for the paper above: classic U-Nets with max pooling tend to miss detections of small-sized particles. The authors replaced max pooling with average pooling to prevent these small particles from being dropped, and achieved successful results.
-
Applying this study, we can build U-Net models with average pooling if the model must capture minute details (see the sketch below). In the context of underwater vision systems, such details might be individual coral polyps or the tentacles of an anemone.
-
Another reason for trying out the switch is...
"For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and one obtains slightly more appealing reults." - from A Neural Algorithm of Artistic Style
- Using Attention_Res-UNet or Res-UNet for model architecture
- Although skip connections in the classic U-Net architecture transfer content very well, the skip connections from shallower levels of the net also transfer data with low feature information. It is in the deeper parts of the net that better features are learned.
- With the help of attention gates, the model could filter skip connections containing less-learned features (see the sketch after this list).
- BUT training would take ~2 or ~3 times longer than the classic U-Net due to the increased parameters and training steps. It would be an interesting ablation study to see how well attention gates help the model generalize to different distortions.
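For reference, a hedged sketch of an additive attention gate in the style of Attention U-Net (Oktay et al., 2018); this is an illustration, not Deep Sea-NN code:

```python
# Hedged sketch: additive attention gate that re-weights skip-connection
# features using a gating signal from a deeper (more feature-rich) layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(skip_ch, inter_ch, 1)  # project skip features
        self.w_g = nn.Conv2d(gate_ch, inter_ch, 1)  # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, 1)        # collapse to one attention map

    def forward(self, x, g):
        # x: skip-connection features; g: gating signal from a deeper layer,
        # resized to x's spatial size before the additive attention.
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.w_x(x) + self.w_g(g))))
        return x * attn  # suppress low-information skip features before concatenation
```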
- Generalize Dataset beyond EUVP
- Each ocean presents a different type of distortion due to differences in density, salinity, turbidity, etc.
- It is crucial to find data specific to the waters you will be testing in to optimize your outputs.
I would also like to train GAN models with unpaired datasets. The CNN results are great, but there is the limitation of having to acquire paired training data :(
With such a model, I could possibly generalize better to the waters of Korea, where paired datasets barely exist and conditions can be very different from much of the EUVP data I used to train Deep Sea-NN.