Add support for Apple's Depth-Pro #34583

Open · geetu040 wants to merge 73 commits into base: main
Conversation

@geetu040 commented Nov 3, 2024

What does this PR do?

Fixes #34020

This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the input image is downsampled at several scales, and at each scale it is split into patches that are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patch features are merged into feature maps, upsampled, and fused via a DPT decoder.
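For intuition, here is a minimal sketch of the downsample-then-patch step described above. It is illustrative only: the function name, scales, and patch size are assumptions (roughly the paper's 1x / 0.5x / 0.25x scheme with 384-pixel ViT patches), not this PR's actual code, and the real implementation differs in details such as patch overlap.

import torch
import torch.nn.functional as F

def make_multiscale_patches(image: torch.Tensor, scales=(1.0, 0.5, 0.25), patch_size=384):
    # image: (batch, 3, H, W); H and W are assumed divisible by patch_size at every scale
    all_patches = []
    for scale in scales:
        scaled = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
        # split each scaled image into non-overlapping patch_size x patch_size patches
        patches = scaled.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        b, c, nh, nw, ph, pw = patches.shape
        all_patches.append(patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, ph, pw))
    # every patch is then processed by the same (weight-shared) patch encoder
    return torch.cat(all_patches, dim=0)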

Relevant Links

Before submitting

Who can review?

@amyeroberts, @qubvel

@geetu040 (Author) commented Nov 3, 2024

I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output.

Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:

DepthProForDepthEstimation
├── depth_pro: DepthProModel
│   ├── encoder: DepthProEncoder
│   │   ├── patch_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   │   ├── image_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   ├── decoder: DepthProDecoder
│   └── fov_model: DepthProFOVModel
│       ├── encoder: DepthProViT
│       │   ├── embeddings: DepthProViTEmbeddings
│       │   └── encoder: DepthProViTEncoder
└── head: DepthProDepthEstimationHead

I have a couple of questions:

  1. The encoder: DepthProEncoder outputs features processed at various scales, including hidden states from the intermediate layers of ViTEncoder. Currently, I use BaseModelOutput, returning all features in the last_hidden_state argument. Should I create a dedicated ModelOutput class for DepthProEncoder? If so, it should reside in the same file as the DepthPro classes since it is specific to them.

  2. For handling the FOV (Field of View) output, would it be appropriate to create a class named DepthEstimatorOutputWithFOV in transformers.modeling_outputs, or should it also remain within the DepthPro context?

@Rocketknight1 (Member)

cc @pcuenca as well!

@qubvel (Member) commented Nov 5, 2024

Hi @geetu040! Thanks for working on this model!

Regarding model outputs: a new output class should be written only if you want to add a new argument or write better docs. For intermediate outputs, you can store them in BaseModelOutput.hidden_states; for example, mllama sets output_hidden_states=True by default and then selects the required hidden states from the vision transformer.
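A minimal sketch of that pattern (the names backbone, pixel_values, and intermediate_layer_ids are placeholders, not this PR's code):

def select_intermediate_states(backbone, pixel_values, intermediate_layer_ids):
    # force hidden states on, then pick the intermediate layers that are needed
    outputs = backbone(pixel_values, output_hidden_states=True, return_dict=True)
    # outputs.hidden_states is a tuple: (embedding_output, layer_1, ..., layer_N)
    return [outputs.hidden_states[i] for i in intermediate_layer_ids]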

@geetu040 (Author)

@qubvel @pcuenca Thanks, I have updated the code for hidden_states.

I still need an opinion on fov (field of view).
DepthPro returns the predicted_depth as well as the fov, which is a scalar value.

The existing DepthEstimatorOutput class in transformers/src/transformers/modeling_outputs.py looks like this:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Do I create a new class DepthEstimatorOutputWithFOV or update the existing class?
Q2: The user should be given the option to turn the FOV on or off, because calculating the FOV requires extra processing. In this case, should this parameter be part of the model initialization, e.g. DepthProForDepthEstimation(config, return_fov=True), or should it be kept inside the config?

@qubvel (Member) commented Nov 11, 2024

Thanks @geetu040

Q1:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases: fov=None and not None.

Q2:

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method that overrides the config parameter (similar to output_hidden_states), roughly as in the sketch below.
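A generic sketch of that config-default / forward-override pattern (the names output_fov and config.output_fov are hypothetical, not this PR's final API):

def forward(self, pixel_values, output_fov=None, output_hidden_states=None):
    # fall back to the config defaults when the arguments are not passed explicitly
    output_fov = output_fov if output_fov is not None else self.config.output_fov
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    ...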

Please, let me know if you have more questions!

@geetu040 (Author)

> Yeah, this can be a parameter of the config, but also should be an argument in forward method to override the config parameter (similar to output_hidden_states)

This needs to be done during __init__, because it requires fov_model (another vision transformer) to be initialized.
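A hypothetical sketch of why this has to happen in __init__ (the config flag name and the base class are assumptions; the submodule class names follow the hierarchy shown earlier):

class DepthProForDepthEstimation(DepthProPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.depth_pro = DepthProModel(config)
        self.head = DepthProDepthEstimationHead(config)
        # the FOV head is a whole extra ViT, so it must be built (or skipped) at init time,
        # not toggled per forward call
        self.fov_model = DepthProFOVModel(config) if config.use_fov_model else None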

@qubvel (Member) commented Nov 15, 2024

OK, got it! Then it should be done with the config! And anyone can just load a model as follows:

model = DepthProForDepthEstimation.from_pretrained(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation.from_pretrained(checkpoint, fov_model=False)

With such initialization, the fov_model param will be overridden in the config.

@geetu040 (Author)

  • Currently an image is down-scaled to medium resolution (high / 2) and low resolution (high / 4).
  • Then patches are created from the high, medium and low resolutions and concatenated.

I was wondering whether we can also give users the option to decide which scales to use; for example, a user specifies custom scales in the config, image_scales=[0.6, 0.4, 0.3]:

  • Now an image will be downscaled to these 3 scales.
  • Then patches are created from the high-resolution image and the scaled images and concatenated.

@qubvel I have looked into how this can be implemented in the code; it is doable (roughly as sketched below) and I can easily make this option available, and I would prefer that. But I have to ask you as well: do you think this option should be given to users?
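A tiny sketch of what the configurable downscaling could look like (the field name image_scales is hypothetical, not the PR's actual API):

import torch.nn.functional as F

def downscale_to_ratios(image, image_scales=(0.6, 0.4, 0.3)):
    # image: (batch, 3, H, W); returns one scaled tensor per requested ratio
    return [
        F.interpolate(image, scale_factor=r, mode="bilinear", align_corners=False)
        for r in image_scales
    ]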

@qubvel (Member) commented Nov 18, 2024

Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.

@geetu040 (Author) commented Nov 25, 2024

Hi @qubvel, I have a question about the image processor.

The source code from apple/depth-pro preprocesses the image in the sequence normalize -> resize; however, the conventional image processors for ViT and DPT use the sequence resize -> normalize.

This causes the two outputs to differ slightly.

Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not sure how to do the latter: the original resize function raises an error if it is simply moved above the normalization code, and using torch.nn.functional.interpolate is also not very optimal, since it requires data conversions. (The two orders are sketched below.)
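For reference, a minimal sketch of the two preprocessing orders being compared. This is illustrative only: the function names and mean/std values are placeholders, and the PR's image processor presumably uses the library's own resize/normalize utilities rather than this code.

import torch.nn.functional as F

def resize_then_normalize(image, size=(1536, 1536), mean=0.5, std=0.5):
    # conventional ViT/DPT order; image: (batch, 3, H, W) float tensor
    image = F.interpolate(image, size=size, mode="bilinear", align_corners=False)
    return (image - mean) / std

def normalize_then_resize(image, size=(1536, 1536), mean=0.5, std=0.5):
    # order used by the original apple/depth-pro preprocessing
    image = (image - mean) / std
    return F.interpolate(image, size=size, mode="bilinear", align_corners=False)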

Here are the outputs

Difference in Outputs

There is a slight difference; this happens because of how the image is preprocessed before being given to the model.

Source code results

ic| depth: tensor([[0.9604, 0.9329, 0.8837,  ..., 3.0123, 2.9720, 2.9517],
                   [0.9210, 0.8995, 0.8605,  ..., 3.0148, 3.0120, 3.0106],
                   [0.8811, 0.8655, 0.8366,  ..., 3.0245, 3.0473, 3.0592],
                   ...,
                   [1.2283, 1.2263, 1.2225,  ..., 1.2698, 1.2818, 1.2881],
                   [1.2228, 1.2241, 1.2266,  ..., 1.2679, 1.2806, 1.2872],
                   [1.2167, 1.2223, 1.2333,  ..., 1.2655, 1.2757, 1.2810]])
ic| depth.shape: torch.Size([2268, 3024])
ic| focallength_px: tensor(3362.0200)

HF code results

ic| predicted_depth: [tensor([[0.9727, 0.9443, 0.8937,  ..., 3.0023, 2.9608, 2.9399],
                             [0.9320, 0.9097, 0.8693,  ..., 3.0045, 3.0006, 2.9987],
                             [0.8899, 0.8737, 0.8439,  ..., 3.0129, 3.0352, 3.0469],
                             ...,
                             [1.2393, 1.2373, 1.2334,  ..., 1.2805, 1.2934, 1.3001],
                             [1.2344, 1.2356, 1.2379,  ..., 1.2802, 1.2935, 1.3004],
                             [1.2286, 1.2341, 1.2447,  ..., 1.2788, 1.2892, 1.2947]])]
ic| fov: [tensor(3383.9839)]

Difference in Output Images

Visually, there is no difference between the two images.

Input Image: example
Source code results: Figure_1
HF code results: Figure_2

@geetu040 (Author)

Also, how does the weight conversion work?

I have created the script for weight conversion, but when and by whom are the converted weights uploaded to the Hugging Face Hub? I will need these converted weights for the examples in the docstrings.

@qubvel (Member) commented Dec 5, 2024

Hey @geetu040, I believe it's better to override this method in model-specific tests
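As a generic illustration of overriding a common test in a model-specific test file (the file path, class name, and test name here are placeholders, not this PR's actual tests):

# e.g. in tests/models/depth_pro/test_modeling_depth_pro.py
import unittest

class DepthProModelTest(unittest.TestCase):
    def test_some_common_behavior(self):
        # re-implement (or skip) the generic check with DepthPro-specific expectations
        self.skipTest("DepthPro requires a model-specific version of this test")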

@geetu040 (Author) commented Dec 6, 2024

@qubvel some of these tests are failing; I think they are not related to this PR. Can you please confirm?

ci/circleci: tests_generate

FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_prompt_lookup_decoding_matches_greedy_search - RuntimeError: The size of tensor a (2) must match the size of tensor b (0) at non-singleton dimension 1

ci/circleci: tests_non_model

FAILED tests/utils/test_modeling_utils.py::ModelUtilsTest::test_from_pretrained_low_cpu_mem_usage_equal - AssertionError: 423.62109375 != 425.75 within 2 delta (2.12890625 difference) : using `low_cpu_mem_usage` should incur the same memory usage in both cases.

@qubvel (Member) commented Dec 6, 2024

@geetu040, yes it seems they are unrelated to this PR

@geetu040 marked this pull request as ready for review on December 6, 2024 12:21
@geetu040 (Author) commented Dec 6, 2024

@qubvel Thanks for all the help!

This PR is complete and ready for review.

Failing tests are unrelated to this PR.

@geetu040 requested a review from qubvel on December 6, 2024 12:23
@geetu040 (Author)

Some minor fixes I realised:

  • image_encoder and fov_encoder use the image scaled to patch_size instead of scaled_images_ratios[0], as suggested in the paper
  • DepthProFeatureFusionStage now returns fused_hidden_states for each hidden_state, just like DPT, since someone may want to use the intermediate fused_hidden_state
  • fixed the output shape in the example for DepthProModel

Should be ready for review now; the failing test seems unrelated to this PR.

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon used by the layer normalization layers.
image_size (`int`, *optional*, defaults to 1536):
Member:

Does the model work with square images only? Otherwise, let's make it a list of two integers.

Author:

Although the pre-trained model works with a square image of 1536x1536, the model architecture works with images of any size or aspect ratio and does not use the image_size from the config. I'll simply remove this from the config.
However, the image_processor requires a size; I'll set that to 1536x1536 in the Hugging Face repo.

@qubvel (Member) left a comment

Hi @geetu040! Great work, the code looks very good. Thank you for following standards and leaving comments explaining the code 🙌

I did an initial review, but I got lost somewhere in the middle of the main forward pass 😄. I'll do a more thorough review next time. The main comments right now are:

  • Check if it is possible to avoid the patch/batch conversion in the encoder layer and perform it outside, to preserve the existing ViT modules.
  • Try to avoid the double loop for merging patches; perhaps we can pad/unpad the whole tensor and then split it?

Thanks for your work and sorry for the delay on the review 🤗

output_attentions=output_attentions,
# required for intermediate features
output_hidden_states=self.n_intermediate_hooks or output_hidden_states,
return_dict=True,
Member:

We should have both options working; I suppose this comes from torch.jit.trace / script.

Comment on lines +966 to +970
last_hidden_state = patch_encodings.last_hidden_state
last_hidden_state = batch_to_patch(last_hidden_state)
scaled_images_last_hidden_state = torch.split_with_sizes(last_hidden_state, scaled_images_num_patches[::-1])
scaled_images_last_hidden_state = scaled_images_last_hidden_state[::-1]
# -1 as patch encoder expects high res patches first
Member:

Let's have more comments regarding the shape transformations here.

Comment on lines +735 to +738
"""
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)
"""
Member:

Suggested change
"""
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)
"""
# merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
# padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)

- fix download spell
- add push_to_hub option
- fix Optional type hinting
- apply single loop for DepthProImageProcessor.preprocess
- capitalize start of docstring
- use ignore copy
- fix examples
- move docstring templates and custom output classes to top
- remove "-> None" typehinting from __init__
- type hinting for forward passes
- fix docstrings for custom output classes