Add support for Apple's Depth-Pro #34583

Open · geetu040 wants to merge 73 commits into base: main
Conversation

@geetu040 commented Nov 3, 2024

What does this PR do?

Fixes #34020

This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the input image is downsampled at several scales, and at each scale it is split into patches that are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patch features are merged into feature maps, upsampled, and fused via a DPT decoder.
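For intuition, here is a minimal sketch of the downsample-then-patch step described above. It is illustrative only: the function name, scales, and patch size are assumptions (roughly the paper's 1x / 0.5x / 0.25x scheme with 384-pixel ViT patches), not this PR's actual code, and the real implementation differs in details such as patch overlap.

import torch
import torch.nn.functional as F

def make_multiscale_patches(image: torch.Tensor, scales=(1.0, 0.5, 0.25), patch_size=384):
    # image: (batch, 3, H, W); H and W are assumed divisible by patch_size at every scale
    all_patches = []
    for scale in scales:
        scaled = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
        # split each scaled image into non-overlapping patch_size x patch_size patches
        patches = scaled.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        b, c, nh, nw, ph, pw = patches.shape
        all_patches.append(patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, ph, pw))
    # every patch is then processed by the same (weight-shared) patch encoder
    return torch.cat(all_patches, dim=0)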

Relevant Links

Before submitting

Who can review?

@amyeroberts, @qubvel

@geetu040 (Author) commented Nov 3, 2024

I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output.

Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:

DepthProForDepthEstimation
├── depth_pro: DepthProModel
│   ├── encoder: DepthProEncoder
│   │   ├── patch_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   │   ├── image_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   ├── decoder: DepthProDecoder
│   └── fov_model: DepthProFOVModel
│       ├── encoder: DepthProViT
│       │   ├── embeddings: DepthProViTEmbeddings
│       │   └── encoder: DepthProViTEncoder
└── head: DepthProDepthEstimationHead

I have a couple of questions:

  1. The encoder: DepthProEncoder outputs features processed at various scales, including hidden states from the intermediate layers of ViTEncoder. Currently, I use BaseModelOutput, returning all features in the last_hidden_state argument. Should I create a dedicated ModelOutput class for DepthProEncoder? If so, it should reside in the same file as the DepthPro classes since it is specific to them.

  2. For handling the FOV (Field of View) output, would it be appropriate to create a class named DepthEstimatorOutputWithFOV in transformers.modeling_outputs, or should it also remain within the DepthPro context?

@Rocketknight1 (Member)

cc @pcuenca as well!

@qubvel (Member) commented Nov 5, 2024

Hi @geetu040! Thanks for working on this model!

Regarding model outputs: a new output class should be written only if you want to add a new argument or write better docs. For intermediate outputs, you can store them in BaseModelOutput.hidden_states; for example, mllama sets output_hidden_states=True by default and then selects the required hidden states from the vision transformer.
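A minimal sketch of that pattern (the names backbone, pixel_values, and intermediate_layer_ids are placeholders, not this PR's code):

def select_intermediate_states(backbone, pixel_values, intermediate_layer_ids):
    # force hidden states on, then pick the intermediate layers that are needed
    outputs = backbone(pixel_values, output_hidden_states=True, return_dict=True)
    # outputs.hidden_states is a tuple: (embedding_output, layer_1, ..., layer_N)
    return [outputs.hidden_states[i] for i in intermediate_layer_ids]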

@geetu040 (Author)

@qubvel @pcuenca Thanks, I have updated the code for hidden_states.

I still need an opinion on fov (field of view).
DepthPro returns the predicted_depth as well as the fov, which is a scalar value.

The existing DepthEstimatorOutput class in transformers/src/transformers/modeling_outputs.py looks like this:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Do I create a new class DepthEstimatorOutputWithFOV or update the existing class?
Q2: The user should be given the option to turn the FOV on or off, because calculating the FOV requires extra processing. In this case, should this parameter be part of the model initialization, e.g. DepthProForDepthEstimation(config, return_fov=True), or should it be kept inside the config?

@qubvel (Member) commented Nov 11, 2024

Thanks @geetu040

Q1:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases: fov=None and not None.

Q2:

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method that overrides the config parameter (similar to output_hidden_states), roughly as in the sketch below.
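A generic sketch of that config-default / forward-override pattern (the names output_fov and config.output_fov are hypothetical, not this PR's final API):

def forward(self, pixel_values, output_fov=None, output_hidden_states=None):
    # fall back to the config defaults when the arguments are not passed explicitly
    output_fov = output_fov if output_fov is not None else self.config.output_fov
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    ...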

Please, let me know if you have more questions!

@geetu040 (Author)

> Yeah, this can be a parameter of the config, but also should be an argument in forward method to override the config parameter (similar to output_hidden_states)

This needs to be done during __init__, because it requires fov_model (another vision transformer) to be initialized.
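A hypothetical sketch of why this has to happen in __init__ (the config flag name and the base class are assumptions; the submodule class names follow the hierarchy shown earlier):

class DepthProForDepthEstimation(DepthProPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.depth_pro = DepthProModel(config)
        self.head = DepthProDepthEstimationHead(config)
        # the FOV head is a whole extra ViT, so it must be built (or skipped) at init time,
        # not toggled per forward call
        self.fov_model = DepthProFOVModel(config) if config.use_fov_model else None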

@qubvel (Member) commented Nov 15, 2024

OK, got it! Then it should be done with the config! And anyone can just load a model as follows:

model = DepthProForDepthEstimation.from_pretrained(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation.from_pretrained(checkpoint, fov_model=False)

With such initialization, the fov_model param will be overridden in the config.

@geetu040 (Author)

  • Currently an image is down-scaled to medium resolution (high / 2) and low resolution (high / 4).
  • Then patches are created from the high, medium and low resolutions and concatenated.

I was wondering whether we can also give users the option to decide which scales to use; for example, a user specifies custom scales in the config, image_scales=[0.6, 0.4, 0.3]:

  • Now an image will be downscaled to these 3 scales.
  • Then patches are created from the high-resolution image and the scaled images and concatenated.

@qubvel I have looked into how this can be implemented in the code; it is doable (roughly as sketched below) and I can easily make this option available, and I would prefer that. But I have to ask you as well: do you think this option should be given to users?
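A tiny sketch of what the configurable downscaling could look like (the field name image_scales is hypothetical, not the PR's actual API):

import torch.nn.functional as F

def downscale_to_ratios(image, image_scales=(0.6, 0.4, 0.3)):
    # image: (batch, 3, H, W); returns one scaled tensor per requested ratio
    return [
        F.interpolate(image, scale_factor=r, mode="bilinear", align_corners=False)
        for r in image_scales
    ]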

@qubvel (Member) commented Nov 18, 2024

Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.

@geetu040 (Author) commented Nov 25, 2024

Hi @qubvel, I have a question about the image processor.

The source code from apple/depth-pro preprocesses the image in the sequence normalize -> resize; however, the conventional image processors for ViT and DPT use the sequence resize -> normalize.

This causes the two outputs to differ slightly.

Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not sure how to do the latter: the original resize function raises an error if it is simply moved above the normalization code, and using torch.nn.functional.interpolate is also not very optimal, since it requires data conversions. (The two orders are sketched below.)
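For reference, a minimal sketch of the two preprocessing orders being compared. This is illustrative only: the function names and mean/std values are placeholders, and the PR's image processor presumably uses the library's own resize/normalize utilities rather than this code.

import torch.nn.functional as F

def resize_then_normalize(image, size=(1536, 1536), mean=0.5, std=0.5):
    # conventional ViT/DPT order; image: (batch, 3, H, W) float tensor
    image = F.interpolate(image, size=size, mode="bilinear", align_corners=False)
    return (image - mean) / std

def normalize_then_resize(image, size=(1536, 1536), mean=0.5, std=0.5):
    # order used by the original apple/depth-pro preprocessing
    image = (image - mean) / std
    return F.interpolate(image, size=size, mode="bilinear", align_corners=False)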

Here are the outputs

Difference in Outputs

There is a slight difference; this happens because of how the image is preprocessed before being given to the model.

Source code results

ic| depth: tensor([[0.9604, 0.9329, 0.8837,  ..., 3.0123, 2.9720, 2.9517],
                   [0.9210, 0.8995, 0.8605,  ..., 3.0148, 3.0120, 3.0106],
                   [0.8811, 0.8655, 0.8366,  ..., 3.0245, 3.0473, 3.0592],
                   ...,
                   [1.2283, 1.2263, 1.2225,  ..., 1.2698, 1.2818, 1.2881],
                   [1.2228, 1.2241, 1.2266,  ..., 1.2679, 1.2806, 1.2872],
                   [1.2167, 1.2223, 1.2333,  ..., 1.2655, 1.2757, 1.2810]])
ic| depth.shape: torch.Size([2268, 3024])
ic| focallength_px: tensor(3362.0200)

HF code results

ic| predicted_depth: [tensor([[0.9727, 0.9443, 0.8937,  ..., 3.0023, 2.9608, 2.9399],
                             [0.9320, 0.9097, 0.8693,  ..., 3.0045, 3.0006, 2.9987],
                             [0.8899, 0.8737, 0.8439,  ..., 3.0129, 3.0352, 3.0469],
                             ...,
                             [1.2393, 1.2373, 1.2334,  ..., 1.2805, 1.2934, 1.3001],
                             [1.2344, 1.2356, 1.2379,  ..., 1.2802, 1.2935, 1.3004],
                             [1.2286, 1.2341, 1.2447,  ..., 1.2788, 1.2892, 1.2947]])]
ic| fov: [tensor(3383.9839)]

Difference in Output Images

Visually, there is no difference between the two images.

Input Image: example
Source code results: Figure_1
HF code results: Figure_2

@geetu040 (Author)

Also, how does the weight conversion work?

I have created the script for weight conversion, but when and by whom are the converted weights uploaded to the Hugging Face Hub? I will need these converted weights for the examples in the docstrings.

@qubvel (Member) commented Dec 5, 2024

Hey @geetu040, I believe it's better to override this method in model-specific tests
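As a generic illustration of overriding a common test in a model-specific test file (the file path, class name, and test name here are placeholders, not this PR's actual tests):

# e.g. in tests/models/depth_pro/test_modeling_depth_pro.py
import unittest

class DepthProModelTest(unittest.TestCase):
    def test_some_common_behavior(self):
        # re-implement (or skip) the generic check with DepthPro-specific expectations
        self.skipTest("DepthPro requires a model-specific version of this test")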

@geetu040 (Author) commented Dec 6, 2024

@qubvel some of these tests are failing; I think they are not related to this PR. Can you please confirm?

ci/circleci: tests_generate

FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_prompt_lookup_decoding_matches_greedy_search - RuntimeError: The size of tensor a (2) must match the size of tensor b (0) at non-singleton dimension 1

ci/circleci: tests_non_model

FAILED tests/utils/test_modeling_utils.py::ModelUtilsTest::test_from_pretrained_low_cpu_mem_usage_equal - AssertionError: 423.62109375 != 425.75 within 2 delta (2.12890625 difference) : using `low_cpu_mem_usage` should incur the same memory usage in both cases.

@qubvel (Member) commented Dec 6, 2024

@geetu040, yes it seems they are unrelated to this PR

@geetu040 marked this pull request as ready for review on December 6, 2024 12:21
@geetu040 (Author) commented Dec 6, 2024

@qubvel Thanks for all the help!

This PR is complete and ready for review.

Failing tests are unrelated to this PR.

@geetu040 requested a review from qubvel on December 6, 2024 12:23
@geetu040 (Author)

Some minor fixes I realised:

  • image_encoder and fov_encoder use the image scaled to patch_size instead of scaled_images_ratios[0], as suggested in the paper
  • DepthProFeatureFusionStage now returns fused_hidden_states for each hidden_state, just like DPT, since someone may want to use the intermediate fused_hidden_state
  • fixed the output shape in the example for DepthProModel

Should be ready for review now; the failing test seems unrelated to this PR.

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon used by the layer normalization layers.
image_size (`int`, *optional*, defaults to 1536):
Member:

Does the model work with square images only? Otherwise, let's make it a list of two integers.

Author:

Although the pre-trained model works with a square image of 1536x1536, the model architecture works with images of any size or aspect ratio and does not use the image_size from the config. I'll simply remove this from the config.
However, the image_processor requires a size; I'll set that to 1536x1536 in the Hugging Face repo.

@qubvel (Member) left a comment

Hi @geetu040! Great work, the code looks very good. Thank you for following standards and leaving comments explaining the code 🙌

I did an initial review, but I got lost somewhere in the middle of the main forward pass 😄. I'll do a more thorough review next time. The main comments right now are:

  • Check if it is possible to avoid the patch/batch conversion in the encoder layer and perform it outside, to preserve the existing ViT modules.
  • Try to avoid the double loop for merging patches; perhaps we can pad/unpad the whole tensor and then split it?

Thanks for your work and sorry for the delay on the review 🤗

output_attentions=output_attentions,
# required for intermediate features
output_hidden_states=self.n_intermediate_hooks or output_hidden_states,
return_dict=True,
Member:

We should have both options working; I suppose this comes from torch.jit.trace / script.

Comment on lines +966 to +970
last_hidden_state = patch_encodings.last_hidden_state
last_hidden_state = batch_to_patch(last_hidden_state)
scaled_images_last_hidden_state = torch.split_with_sizes(last_hidden_state, scaled_images_num_patches[::-1])
scaled_images_last_hidden_state = scaled_images_last_hidden_state[::-1]
# -1 as patch encoder expects high res patches first
Member:

Let's have more comments regarding the shape transformations here.

Comment on lines +735 to +738
"""
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)
"""
Member:

Suggested change
"""
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)
"""
# merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding)
# padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size)

- fix download spell
- add push_to_hub option
- fix Optional type hinting
- apply single loop for DepthProImageProcessor.preprocess
- capitalize start of docstring
- use ignore copy
- fix examples
- move docstring templates and custom output classes to top
- remove "-> None" typehinting from __init__
- type hinting for forward passes
- fix docstrings for custom output classes