Add support for Apple's Depth-Pro #34583
base: main
Conversation
I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output. Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:
I have a couple of questions:
cc @pcuenca as well!
Hi @geetu040! Thanks for working on this model! Regarding model outputs, a model-specific output class should only be written if you want to add a new argument or write better docs. In case of intermediate outputs, you can store them in `hidden_states`.
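(For context, a minimal sketch of pulling intermediate features from the encoder's `hidden_states` instead of registering hooks; the function and variable names below are illustrative, not from this PR:)

```python
def gather_intermediate_features(encoder, pixel_values, hook_indices):
    # hidden_states is a tuple: (embedding_output, layer_1, ..., layer_N)
    outputs = encoder(pixel_values, output_hidden_states=True, return_dict=True)
    return [outputs.hidden_states[i] for i in hook_indices]
```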
@qubvel @pcuenca Thanks, I have updated the code for hidden_states. I still need an opinion on the output class. The existing class is:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Do I create a new model-specific output class, or add the extra field to the existing `DepthEstimatorOutput`?
Thanks @geetu040

Q1: Create a model-specific subclass:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases.

Q2: Yeah, this can be a parameter of the config, but it should also be an argument in …

Please, let me know if you have more questions!
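(A self-contained sketch of how that subclass would typically be declared in transformers; the docstring and import path here are assumptions, not the final code:)

```python
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.modeling_outputs import DepthEstimatorOutput


@dataclass
class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    """Same fields as `DepthEstimatorOutput`, plus the estimated field of view."""

    fov: Optional[torch.FloatTensor] = None
```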
This needs to be done during …
OK, got it! Then it should be done with config! And anyone can just load a model as follows:

model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)

With such initialization …
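(For reference, a hedged sketch of how such a switch usually looks once it lives in the config; the parameter name `use_fov_model` and the checkpoint id are assumptions, not necessarily the final API:)

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# build a randomly initialized model without the FOV head
config = DepthProConfig(use_fov_model=False)
model = DepthProForDepthEstimation(config)

# or override the stored config when loading pretrained weights
model = DepthProForDepthEstimation.from_pretrained(
    "apple/DepthPro-hf",  # placeholder checkpoint id
    use_fov_model=False,
)
```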
I was wondering whether we could also give users the option to decide which scales to use; for example, a user could specify custom scales in the config.
@qubvel I have looked into how this can be implemented. It is doable, I can easily make this option available, and I would prefer that. But I have to ask you as well: do you think this option should be given to the users?
Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.
Hi @qubvel, I have a question about the image processor. The original source code preprocesses the image differently, which causes the two outputs to be slightly different from each other. Do you suggest I stay with the existing convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not very sure how to do this, because the original …

Here are the outputs.

Difference in outputs: there is a slight difference; this happens because of how the image is pre-processed before being given to the model.
- Source code results: [output image]
- HF code results: [output image]

Difference in output image: visually there is no difference between the two images.
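(A quick, illustrative way to quantify such a preprocessing-induced gap; variable names are assumed, this is not code from the PR:)

```python
import torch

def compare_depths(depth_original: torch.Tensor, depth_hf: torch.Tensor) -> None:
    abs_diff = (depth_original - depth_hf).abs()
    print(f"max abs diff:  {abs_diff.max().item():.6f}")
    print(f"mean abs diff: {abs_diff.mean().item():.6f}")
    print("allclose(atol=1e-3):", torch.allclose(depth_original, depth_hf, atol=1e-3))
```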
Also, how does the weight conversion work? I have created the script for weight conversion, but when and by whom are the converted weights uploaded to the Hugging Face Hub? I would need these converted weights for the examples in the docstrings.
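(As a rough illustration of what such a conversion script usually does; the key-renaming stub and repo id below are placeholders, not the actual script added in this PR:)

```python
import torch

from transformers import DepthProConfig, DepthProForDepthEstimation


def rename_key(key: str) -> str:
    # placeholder: the real script maps original parameter names to the HF naming scheme
    return key


def convert_checkpoint(original_ckpt_path: str, pytorch_dump_path: str, push_to_hub: bool = False):
    state_dict = torch.load(original_ckpt_path, map_location="cpu")
    hf_state_dict = {rename_key(k): v for k, v in state_dict.items()}

    model = DepthProForDepthEstimation(DepthProConfig())
    model.load_state_dict(hf_state_dict)

    model.save_pretrained(pytorch_dump_path)
    if push_to_hub:
        model.push_to_hub("your-username/DepthPro")  # placeholder repo id
```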
Hey @geetu040, I believe it's better to override this method in model-specific tests.
@qubvel some of these tests are failing; I think they are not related to this PR. Can you please confirm?
@geetu040, yes it seems they are unrelated to this PR.
@qubvel Thanks for all the help! This PR is complete and ready for review. Failing tests are unrelated to this PR.
some minor fixes I realised
Should be ready for review now; the failing test seems unrelated to this PR.
    The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-06):
    The epsilon used by the layer normalization layers.
image_size (`int`, *optional*, defaults to 1536):
Does the model work with square images only? Otherwise, let's make it a list of two integers.
Although the pre-trained model works with square images of 1536x1536, the model architecture works with images of any size or aspect ratio and does not use `image_size` from the config, so I'll simply remove it from the config. However, the `image_processor` requires a `size`; I'll set that to 1536x1536 in the Hugging Face repo.
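(A small sketch of what that looks like on the processor side; the size-dict format mirrors other HF image processors, and the class name and local path are assumptions:)

```python
from transformers import DepthProImageProcessor

# fix the resize target in the image processor config shipped with the Hub repo
image_processor = DepthProImageProcessor(size={"height": 1536, "width": 1536})
image_processor.save_pretrained("path/to/local/repo")  # pushed to the Hub later
```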
Hi @geetu040! Great work, the code looks very good. Thank you for following standards and leaving comments explaining the code 🙌
I did an initial review, but I got lost somewhere in the middle of the main forward pass 😄. I'll do a more thorough review next time. The main comments right now are:
- Check if it is possible to avoid patch/batch conversion in the encoder layer and perform it outside to preserve existing ViT modules.
- Try to avoid the double loop when merging patches; perhaps we can pad/unpad the whole tensor and then split it?
Thanks for your work and sorry for the delay on the review 🤗
src/transformers/models/depth_pro/convert_depth_pro_weights_to_hf.py
src/transformers/models/depth_pro/image_processing_depth_pro.py
output_attentions=output_attentions,
# required for intermediate features
output_hidden_states=self.n_intermediate_hooks or output_hidden_states,
return_dict=True,
We should have both options working, I suppose this comes from torch.jit.trace / script
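(A minimal sketch of keeping both code paths alive; names are illustrative and the actual modeling code may structure this differently:)

```python
def encode_patches(patch_encoder, patches, output_hidden_states=False, return_dict=True):
    encodings = patch_encoder(
        patches,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    # ModelOutput when return_dict=True, plain tuple otherwise
    last_hidden_state = encodings.last_hidden_state if return_dict else encodings[0]
    return last_hidden_state
```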
last_hidden_state = patch_encodings.last_hidden_state
last_hidden_state = batch_to_patch(last_hidden_state)
scaled_images_last_hidden_state = torch.split_with_sizes(last_hidden_state, scaled_images_num_patches[::-1])
scaled_images_last_hidden_state = scaled_images_last_hidden_state[::-1]
# -1 as patch encoder expects high res patches first
Let's have more comments regarding the shape transformations here.
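(To illustrate the bookkeeping such comments would describe, here is a toy example of splitting the concatenated per-scale patches back apart; all sizes are made up:)

```python
import torch

num_patches_per_scale = [1, 4, 16]          # low-res -> high-res, for example
seq_len, hidden_size = 577, 1024

# all patches from every scale were stacked along the batch dim before the shared
# patch encoder, with the high-res patches first
last_hidden_state = torch.randn(sum(num_patches_per_scale), seq_len, hidden_size)

# split with reversed sizes because the tensor is ordered high-res first ...
per_scale = torch.split_with_sizes(last_hidden_state, num_patches_per_scale[::-1])
# ... then flip the result back to low-res -> high-res ordering
per_scale = per_scale[::-1]

for n, hidden in zip(num_patches_per_scale, per_scale):
    print(n, hidden.shape)                  # (n, seq_len, hidden_size)
```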
""" | ||
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding) | ||
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size) | ||
""" |
""" | |
merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding) | |
padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size) | |
""" | |
# merge_out_size = (box_size - 2) * (out_size - 2 * padding) + (2) * (out_size - padding) | |
# padding = (merge_out_size - box_size * out_size) / (6 - 2 * box_size) |
- fix download spell
- add push_to_hub option
- fix Optional type hinting
- apply single loop for DepthProImageProcessor.preprocess
- capitalize start of docstring
- use ignore copy
- fix examples
- move docstring templates and custom output classes to top
- remove "-> None" typehinting from __init__
- type hinting for forward passes
- fix docstrings for custom output classes
What does this PR do?
Fixes #34020
This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the input image is downsampled at several scales, and at each scale it is split into patches that are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patches are then merged into feature maps, upsampled, and fused via a DPT decoder.
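(A hedged end-to-end usage sketch; the checkpoint id is a placeholder, since the converted weights were not yet on the Hub when this PR was opened:)

```python
import requests
import torch
from PIL import Image

from transformers import DepthProImageProcessor, DepthProForDepthEstimation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = DepthProImageProcessor.from_pretrained("apple/DepthPro-hf")  # placeholder
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf")        # placeholder

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_depth = outputs.predicted_depth  # metric depth map
print(predicted_depth.shape)
```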
Relevant Links
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@amyeroberts, @qubvel