# RGBD image generation

Since we work in a 3D scene, we would like to create 3D images, such as RGBD images, where D stands for depth. The idea is that text-to-3D models are not great yet, while text-to-image and image-to-RGBD models work well.

## Direct RGBD image creation

As we will work in a 3D scene, direct RGBD image generation would be ideal. Yet I know of no model that achieves it well, so I created a small pipeline chaining image generation and depth-from-image generation to work around this (sketched in code further below). In the future, using a real text-to-RGBD model would be more efficient.

## Depth from an image

I tried several models and finally chose Marigold. The advantage of this model is that its results are very stable: the depth map does not change much when regenerated, and the model behaves the same way from one image to another. Its most interesting characteristic is that it always represents the sky with a depth that increases from top to bottom, which greatly simplifies the segmentation task.

## Segmentation

For the segmentation task, we want to classify voxels into three categories: skybox, terrain, and 3D objects (if any). An RGBD image provides much better information for this segmentation than direct semantic segmentation of an AI-generated RGB image.

### Skybox

To segment the skybox, we use the following definition: the skybox represents the background. So we only need to select the distant voxels, for instance the voxels with D > 80% of the maximum depth. Below is an illustration of this selection.

![Binary mask applied to isolate the skybox]()

### Terrain

This segment is more complex, as some objects have the same depth as the terrain.

![Terrain depth map](media/segmentation/depth_map.png)

The main idea here is that the terrain has a continuous gradient, while objects are mostly "flat". The problem with using a direct vertical gradient is that our data are very noisy, which prevents any segmentation.

To reduce the noise, we first apply a "brushing" of the terrain's depth: from bottom to top, the depth can only increase. This generates a "monotonous" depth map:

![Monotonous depth map](media/segmentation/monotonous_depth.png)

Then, we compute the vertical gradient:

![Gradient depth map](media/segmentation/gradient.png)

Here, we can start to see the shape of the terrain, along with wide areas where the vertical gradient is positive or null. For each horizontal line, we replace the areas of gradient >= 0 with the mean of the other pixels of that line, i.e. those with a strictly negative gradient. This gives us a new "completed" depth map, which we call the "corrupted" depth:

![Corrupted depth map](media/segmentation/corrupted_depth.png)

The white area at the top is a zone where the gradient is null everywhere, which is expected for the sky. We don't have to fill this area with depth, so it's not an issue.

We now have an expected depth reconstruction of the terrain, but not yet a segmentation of the terrain itself. This last step is easy: we just subtract the "corrupted" depth map (without objects) from the "monotonous" depth map (which contains the objects' depth). To get an absolute difference (not one relative to the distance), we also multiply each horizontal line by its average depth. This gives us an "adherence" map, where the terrain is clearly isolated:

![Terrain adherence map](media/segmentation/adherence_map.png)

We then only need to apply a threshold to this map to create a binary mask:

![Terrain binary mask]()

Terrain segmentation is achieved!
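To make the two generation steps above concrete, here is a minimal sketch of the image-then-depth pipeline. It assumes the Marigold pipeline shipped with Hugging Face `diffusers` (>= 0.28) and uses SDXL-Turbo as a stand-in text-to-image model; the checkpoints, the prompt, and the exact output shapes are illustrative assumptions, not a fixed part of this project.

```python
import diffusers
import numpy as np
import torch

device = "cuda"

# Text-to-image: any good diffusion model works; SDXL-Turbo is just an example.
t2i = diffusers.AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to(device)
image = t2i(
    "a rocky valley under a clear sky", num_inference_steps=1, guidance_scale=0.0
).images[0]

# Image-to-depth with Marigold, via the pipeline bundled in diffusers.
depth_pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16, variant="fp16"
).to(device)
# `prediction` holds affine-invariant depth in [0, 1]; squeeze the batch axis.
depth = np.squeeze(depth_pipe(image).prediction)  # (H, W), larger = farther

# Stack colour and depth into a single RGBD array.
rgb = np.asarray(image, dtype=np.float32) / 255.0  # (H, W, 3)
rgbd = np.dstack([rgb, depth])                     # (H, W, 4)
```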
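And here is a sketch of the skybox and terrain segmentation described above, assuming the depth map is an `(H, W)` NumPy array where larger values are farther away. The 80% skybox ratio comes from the text; the function name `segment_depth` and the `terrain_tau` threshold are hypothetical and would need tuning.

```python
import numpy as np

def segment_depth(depth: np.ndarray,
                  sky_ratio: float = 0.8,
                  terrain_tau: float = 0.05):
    """Return (sky, terrain) boolean masks for an (H, W) depth map."""
    # Skybox: everything beyond 80% of the maximum depth.
    sky = depth > sky_ratio * depth.max()

    # "Brushing": scanning from the bottom row upward, depth may only increase.
    # Implemented as a running maximum over vertically flipped rows.
    mono = np.flip(np.maximum.accumulate(np.flip(depth, axis=0), axis=0), axis=0)

    # Vertical gradient. Rows are indexed top to bottom, so on the terrain
    # (whose depth shrinks toward the bottom) the gradient is strictly
    # negative, while flat objects and the sky give a null gradient.
    grad = np.gradient(mono, axis=0)

    # "Corrupted" depth: per row, replace the gradient >= 0 pixels with the
    # mean of that row's strictly-negative-gradient pixels. Rows with no
    # negative gradient at all (the sky) are left untouched.
    corrupted = mono.copy()
    for y in range(mono.shape[0]):
        neg = grad[y] < 0
        if neg.any():
            corrupted[y, ~neg] = mono[y, neg].mean()

    # "Adherence": monotonous minus corrupted, scaled by each row's mean
    # depth so that the difference is absolute rather than relative.
    adherence = np.abs(mono - corrupted) * mono.mean(axis=1, keepdims=True)

    # Terrain: pixels adhering to the reconstructed terrain surface
    # (the sky is excluded explicitly, since its adherence is also ~0).
    terrain = (adherence < terrain_tau) & ~sky
    return sky, terrain
```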
### 3D elements

Once the skybox and the terrain are segmented, the 3D elements are simply what remains.
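In mask terms, and reusing the hypothetical `segment_depth` sketch above, this is just the complement of the two other masks:

```python
sky, terrain = segment_depth(depth)
objects = ~(sky | terrain)  # 3D elements: neither skybox nor terrain
```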