
Adding grounding dino #26087

Merged
merged 274 commits into from
Apr 11, 2024

EduardoPach
Contributor

@EduardoPach EduardoPach commented Sep 11, 2023

What does this PR do?

This PR adds Grounding DINO

Fixes #25423

To-Do's:

  • Port vision backbone
  • Port Text Backbone
  • Port Encoder
  • Port Decoder
  • Port tokenizer
  • Port Image processing
  • Validate results
  • Check documentation

@EduardoPach EduardoPach changed the title from "Adding grounding dino" to "[WIP] Adding grounding dino" on Sep 11, 2023
@EduardoPach EduardoPach mentioned this pull request Sep 11, 2023
4 tasks
@amyeroberts
Collaborator

@EduardoPach Thanks for opening this model PR! From next week, I'll be away for a few weeks. If you need a review in that time please ping @rafaelpadilla.

@EduardoPach
Contributor Author

@rafaelpadilla hey, so I've finished implementing the model and validated with the original implementation. Still have to clean up some things and make sure the documentation is correct.

My main question is about pushing the model to the hub. The authors have already uploaded the checkpoints they made available for the model (two checkpoints in the same repo), but the repo is under a user account instead of their org (IDEA-Research). What is usually done in this case?

@rafaelpadilla
Contributor

Hi @EduardoPach ,

Are you referring to groundingdino_swinb_cogcoor.pth and groundingdino_swint_ogc.pth, placed here, right?

In this case, let's consult @ArthurZucker and @younesbelkada.

@EduardoPach
Contributor Author

Precisely. I'm asking because I had the impression that model repos contain only one checkpoint. Also, the IDEA-Research group has other models that we could add to the transformers library later on, so it might be helpful if there were an account for this org.

@rafaelpadilla
Contributor

For now, you can upload weights to the hub under your own personal profile and use them until this PR is ready to merge.
Afterwards, we'll move the weights under the organization on the hub, and update all the paths to point to those.

@younesbelkada
Contributor

Hi @EduardoPach
I second what @rafaelpadilla said. For groundingdino_swinb_cogcoor.pth and groundingdino_swint_ogc.pth you can create two different repositories under your personal namespace with distinguishable names (e.g. yournamespace/groundingdino-swinb-cogcoor and yournamespace/groundingdino-swint-ogc), and make sure the files have been renamed to pytorch_model.bin.

@EduardoPach
Contributor Author

@rafaelpadilla Hey! Could you help me out with these questions?

  1. The ImageProcessor from the original implementation is exactly the same as we have in DeformableDetr. Should I copy the ImageProcessor and just remove the segmentation-related things? (Since GroundingDINO is used only for Object Detection)
  2. Their tokenizer is the same as Bert with a few extra steps after tokenizing so I copied and added this step, but I'm unsure how to push the pre-trained tokenizer to the hub
  3. My implementation of GroundingDINOConfig has an attribute called text_backbone_config which is a GroundingDINOTextPrenetConfig which is just a copy of Bert config. However, after pushing the model to the hub when I try to instantiate the model with .from_pretrained I get an error saying:
ValueError: Parameter config in `GroundingDINOTextPrenet(config)` should be an instance of class `PretrainedConfig`. To create a model from a pretrained model use `model = GroundingDINOTextPrenet.from_pretrained(PRETRAINED_MODEL_NAME)`

and when I do AutoConfig.from_pretrained("EduardoPacheco/grounding-dino-base").text_backbone_config I get {'model_type': 'grounding-dino-text-prenet'}. Is there anything different that I need to do to have a config as an attribute? I've tried to look at CLIP's configuration to get some idea of how to do it, but I'm unsure why I am not getting the full GroundingDINOTextPrenetConfig after pushing the model to the hub.

@rafaelpadilla
Contributor

Hi @EduardoPach ,

If your ImageProcessor is an exact copy from another model, you must include the # Copied from comment. If your ImageProcessor only reuses parts of other code, it would be good to have a # Modified from comment.

If I understood correctly, you have already generated the tokens using the new extra steps, right? For pushing your tokens to the hub you could use the hub API. See an example here

I'm not sure if the problem is regarding AutoConfig, as it could not load correctly your GroundingDINOConfig. Have you tried loading it directly with GroundingDINOTextPrenet.from_pretrained("EduardoPacheco/grounding-dino-base")?

@EduardoPach
Contributor Author

EduardoPach commented Oct 6, 2023

@rafaelpadilla the ImageProcessor is precisely the same, but the DeformableDetr one works for object detection and segmentation. Right now I've copied the processor and just removed the segmentation stuff, is that okay?

Also, about the config, sorry I had forgotten to push the modifications I've done to the configuration_grounding_dino.py file

EDIT

I figured out what the issue was, haha, it was somewhat dumb. I wasn't aware that when we push the config to the hub, the config class is converted to a config.json and any nested configuration is converted to a dictionary. So I only had to change my GroundingDINOConfig implementation a bit when creating the text_backbone_config attribute.
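
For readers hitting the same problem, here is a minimal, library-free sketch of the pattern described above: a nested config becomes a plain dictionary once it is written to JSON, so the parent config has to rebuild it on load. The classes `TextConfig` and `ModelConfig` and their fields are hypothetical stand-ins, not the actual transformers classes or the code from this PR.

```python
import json


class TextConfig:
    """Hypothetical stand-in for a nested text-backbone config."""

    def __init__(self, hidden_size=768, num_layers=12):
        self.hidden_size = hidden_size
        self.num_layers = num_layers

    def to_dict(self):
        return {"hidden_size": self.hidden_size, "num_layers": self.num_layers}


class ModelConfig:
    """Hypothetical stand-in for a top-level config that owns a nested one."""

    def __init__(self, text_backbone_config=None, d_model=256):
        # After a JSON round trip the nested config arrives as a plain dict,
        # so rebuild the config object instead of storing the dict as-is.
        if text_backbone_config is None:
            text_backbone_config = TextConfig()
        elif isinstance(text_backbone_config, dict):
            text_backbone_config = TextConfig(**text_backbone_config)
        self.text_backbone_config = text_backbone_config
        self.d_model = d_model

    def to_dict(self):
        return {
            "d_model": self.d_model,
            "text_backbone_config": self.text_backbone_config.to_dict(),
        }


# Serialize to JSON (as a config.json would be) and load it back.
original = ModelConfig(TextConfig(hidden_size=512, num_layers=6))
serialized = json.dumps(original.to_dict())
restored = ModelConfig(**json.loads(serialized))
```

Without the `isinstance(..., dict)` branch, `restored.text_backbone_config` would stay a dictionary, and any code expecting a config object would fail.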

@NielsRogge
Contributor

Hi @EduardoPach do you need any help in finishing this PR? Really great to see you're leveraging Copied from for the text encoder and all parts taken from Deformable DETR. Also, if the image processor is exactly the same as Deformable DETR, then we typically don't add a new image processor to the library, but rather just add a line in image_processing_auto, which will allow people to do:

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("sensetime/grounding-dino-base")

this will then automatically create a DeformableDetrImageProcessor.

@EduardoPach
Contributor Author

EduardoPach commented Oct 13, 2023

Writing here just for the record:

As we discussed on Discord, I'll do that, do the same for the tokenizer, and just create a GroundingDINOProcessor.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for all the work on this - looks great!

Only thing is removing the Bert implementation and using AutoModel instead and some nits. Otherwise we're good to merge 🤗

@jiangtann

@EduardoPach https://github.com/EduardoPach/transformers/blob/6f13fbb5f46a8c949a02c5c087de104fdf254f67/src/transformers/models/grounding_dino/modeling_grounding_dino.py#L1436 Can you modify the GroundingDinoMultiheadAttention to support text cross-attention mask?

Any particular reason for this? In the original implementation, they didn't use the cross-attention mask

If a text cross-attention mask is not used, we can only train and run inference with batch_size == 1, which makes inefficient use of GPU memory.

In a batch, multiple texts will be padded to a fixed length, so we need a text cross-attention mask.
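
To make the padding point concrete, here is a small, library-free sketch of how texts of different lengths end up padded in a batch and why a mask is needed to tell real tokens from padding. The helper `pad_and_mask`, the `pad_id` value, and the token ids are made up for illustration. This is not the model's actual preprocessing code.

```python
def pad_and_mask(token_id_lists, pad_id=0):
    """Pad token-id lists to a common length and return a matching mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded, mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


# Two prompts of different lengths (token ids are made up for illustration).
batch = [[101, 4937, 1012, 102], [101, 6556, 2491, 1012, 3899, 1012, 102]]
padded, mask = pad_and_mask(batch)
```

If cross-attention ignores `mask`, the image features attend to the padding positions of the shorter prompt as if they were real text, which is why the same prompt can give different results alone and inside a batch.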

@EduardoPach
Contributor Author

Yeah, I was under the assumption that one would always use the same labels when doing inference, but thinking more about it I can see some cases where that wouldn't hold, and training would be one of them.

I do think though, that we could fix this in a different PR as this PR is quite old and having a working version of the model in the main repo would be beneficial IMO and then I can work on adding the cross-attention masks as well 🤗.

Are you using the implementation for a project?

@jiangtann

Yes, I'm currently working on the development of extra-visual-module-based MLLM (e.g. LISA, GLaMM, etc.). And I'm using your code with nn.MultiheadAttention instead of GroundingDinoMultiheadAttention for training and inference.

Due to consistent Transformers code style, your code is more suitable to use with LLM (e.g. LlamaForCausalLM). Thanks for your work!

@EduardoPach EduardoPach requested a review from amyeroberts April 9, 2024 12:54
@EduardoPach
Contributor Author

Adding here a screenshot of running the tests with RUN_SLOW=1 pytest tests/models/grounding_dino/ -vv, where we can see that the tests are passing and that the GPU-specific test test_inference_object_detection_head_equivalence_cpu_gpu is green.

cc @amyeroberts
[screenshot of the test run]

@rb-synth
Contributor

rb-synth commented Apr 9, 2024

I just tried out the model readme, and think it might be slightly outdated. Here are some changes I had to make:

  1. the processor outputs need to be put on the device: inputs = {k: v.to(device) for k, v in inputs.items()}.
  2. inputs is a dictionary, so needs to be inputs["input_ids"] not inputs.input_ids
  3. bbox_threshold needs to be box_threshold.

Otherwise it seems to work well for me, thanks for the hard work!

@EduardoPach
Contributor Author

Hey, thanks for the heads up. The output of GroundingDinoProcessor is of type BatchEncoding, so for your first point you can simply do inputs = inputs.to(device) (added that to the model readme). For your second point, if you didn't convert inputs to a dict it should be a BatchEncoding, so no problem there either. For your third point, I fixed that 😄

@rb-synth
Contributor

Good points, thanks! Last point, have you checked with a list of text prompts? The type hinting implies this should be possible (text: List[TextInput]), but I haven't succeeded. I tried both with and without padding=True:

import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = torch.device("cuda")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Check for cats and remote controls
text = ["cat", "remote control"]

inputs = processor(images=image, text=text, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]]
)

If padding is not given, error is:

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

If padding=True is passed, the error is:

The expanded size of the tensor (8) must match the existing size (4) at non-singleton dimension 2.  Target sizes: [4, 17821, 8].  Tensor sizes: [8, 1, 4]

@rb-synth
Contributor

I see that in your example, you separate with full stops, so a list of text could be converted with ". ".join(texts). I tried this on an image with some items of clothing, so the prompt was 'shirt. dress. blouse. jacket. jumper. sweater. undershirt. t-shirt. tie'. I would expect each detected object to be one of the full-stop enclosed phrases, but the results were pretty mangled:

jumper
shirt -
shirt blouse undershirt t - shirt
shirt blouse jacket sweater
blouse undershirt t - shirt tie
shirt blouse sweater -
blouse sweater undershirt t - shirt
blouse jacket jumper sweater t - shirt
shirt blouse jacket sweater undershirt t - shirt
undershirt
##shirt
t - shirt
t - shirt

Is this to be expected?

@EduardoPach
Contributor Author

It should also end with a period, so it would be something like ". ".join(texts) + ".". Fixed that in the example in the model readme as well!

And about your outputs, they are indeed a bit weird haha. It's unexpected to have outputs like shirt blouse jacket sweater if you separated the classes with a period in your text input. Can you share your example?
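
The prompt format discussed in this exchange (labels separated by periods, with a trailing period) can be sketched as a small helper. The function name `build_prompt` and its whitespace and period cleanup are assumptions added for illustration. Only the `". ".join(texts) + "."` form comes from the thread.

```python
def build_prompt(labels):
    """Join class labels into one period-separated, period-terminated prompt."""
    # Drop surrounding whitespace and any trailing period the caller added,
    # so the separator is not doubled.
    cleaned = [label.strip().rstrip(".").strip() for label in labels]
    cleaned = [label for label in cleaned if label]
    if not cleaned:
        return ""
    return ". ".join(cleaned) + "."


prompt = build_prompt(["shirt", "dress", "t-shirt", "tie"])
```

The resulting string can then be passed as the single `text` argument to the processor instead of a list of labels.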

@EduardoPach EduardoPach requested a review from amyeroberts April 10, 2024 17:18
Collaborator

@amyeroberts amyeroberts left a comment

Huge piece of work - thanks for adding this model!

@EduardoPach
Contributor Author

Thank you for reviewing and for your patience as well 😅. I'll open an issue to add the cross-attention mask that @jiangtann mentioned as soon as this gets merged.

@amyeroberts amyeroberts merged commit b752ad3 into huggingface:main Apr 11, 2024
22 checks passed
ArthurZucker pushed a commit that referenced this pull request Apr 22, 2024
* Fixed typo when converting weigths to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Copied deformable detr

* First commit

* Added bert to model

* Bert validated

* Created Text and Fusion layers for Encoder

* Adapted Encoder layer

* Fixed typos

* Adjusted Encoder

* Converted encoder to hf

* Modified Decoder Layer

* Modified main decoder class

* Removed copy comments

* Fixed forward from GroundingDINOModel and GroundingDINODecoder

* Added all necessary layers, configurations and forward logic up to GroundingDINOModel

* Added all layers to convertion

* Fixed outputs for GroundingDINOModel and GroundingDINOForObjectDetection

* Fixed mask input to encoders and fixed nn.MultiheadAttention batch first and attn output

* Fixed forward from GroundingDINOTextEnhancerLayer

* Fixed output bug with GroundingDINODeformableLayer

* Fixed bugs that prevent GroundingDINOForObjectDetection to run forward method

* Fixed attentions to be passed correctly

* Passing temperature arg when creating Sine position embedding

* Removed copy comments

* Added temperature argument for position embedding

* Fixed typo when converting weigths to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Fix style

* Improve fixup

* Improve conversion script

* Improve conversion script

* Add GroundingDINOProcessor

* More improvements

* Return token type ids

* something

* Fix more tests

* More improvements

* More cleanup

* More improvements

* Fixed tests, improved modeling and config

* More improvements and fixing tests

* Improved tests and modeling

* Improved tests and added image processor

* Improved tests inference

* More improvements

* More test improvements

* Fixed last test

* Improved docstrings and comments

* Fix style

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Better naming

* Better naming

* Added Copied statement

* Added Copied statement

* Moved param init from GroundingDINOBiMultiHeadAttention

* Better naming

* Fixing clamp style

* Better naming

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: NielsRogge <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: NielsRogge <[email protected]>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Improving conversion script

* Improved config

* Improved naming

* Improved naming again

* Improved grouding-dino.md

* Moved grounding dino to multimodal

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py

Co-authored-by: Rafael Padilla <[email protected]>

* Fixed docstrings and style

* Fix docstrings

* Remove timm attributes

* Reorder imports

* More improvements

* Add Grounding DINO to pipeline

* Remove model from check_repo

* Added grounded post_process to GroundingDINOProcessor

* Fixed style

* Fixed GroundingDINOTextPrenetConfig docstrings

* Aligned inputs.keys() when both image and text are passed with model_input_names

* Added tests for GroundingDINOImageProcessor and GroundingDINOProcessor

* Testing post_process_grounded_object_detection from GroundingDINOProcessor at test_inference_object_detection_head

* Fixed order

* Marked test with require_torch

* Temporarily changed repo_id

* More improvements

* Fix style

* Final improvements

* Improve annotators

* Fix style

* Add is_torch_available

* Remove type hints

* vocab_tokens as one liner

* Removed print statements

* Renamed GroundingDINOTextPrenetConfig to GroundingDINOTextConfig

* remove unnecessary comments

* Removed unnecessary tests on conversion script

* Renamed GroundingDINO to camel case GroundingDino

* Fixed GroundingDinoProcessor docstrings

* loading MSDA kernels in the modeling file

* Fix copies

* Replace nn.multiheadattention

* Replace nn.multiheadattention

* Fixed inputs for GroundingDinoMultiheadAttention & order of modules

* Fixed processing to avoid messing with inputs

* Added more tips for GroundingDino

* Make style

* Chaning name to align with SAM

* Replace final nn.multiheadattention

* Fix model tests

* Update year, remove GenerationTesterMixin

* Address comments

* Address more comments

* Rename TextPrenet to TextModel

* Rename hidden_states

* Address more comments

* Address more comments

* Address comment

* Address more comments

* Address merge

* Address comment

* Address comment

* Address comment

* Make style

* Added layer norm eps to layer norms

* Address more comments

* More fixes

* Fixed equivalence

* Make fixup

* Remove print statements

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Add comment

* Address comment

* Remove overwriting of test

* Fix bbox_embed

* Improve decoder_bbox_embed_share

* Simplify outputs

* Updated post_process_grounded_object_detection

* Renamed sources to feature_maps

* Improved tests for Grounding Dino ImageProcessor and Processor

* Fixed test requirements and imports

* Fixed image_processing

* Fixed processor tests

* Fixed imports for image processing tests

* Fix copies

* Updated modeling

* Fix style

* Moved functions to correct position

* Fixed copy issues

* Update src/transformers/models/deformable_detr/modeling_deformable_detr.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Keeping consistency custom cuda kernels for MSDA

* Make GroundingDinoProcessor logic clearer

* Updated Grounding DINO checkpoints

* Changed tests to correct structure

* Updated gpu-cpu equivalence test

* fix copies

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Fixed erros and style

* Fix copies

* Removed inheritance from PreTrainedModel from GroundingDinoTextModel

* Fixed GroundingDinoTextModel

* Fixed type of default backbone config

* Fixed missing methods for GroundingDinoTextModel and Added timm support for GroundingDinoConvEncoder

* Addressed comments

* Addressed batched image processing tests

* Addressed zero shot test comment

* Addressed tip comment

* Removed GroundingDinoTextModel from check_repo

* Removed inplace masking

* Addressed comments

* Addressed comments

* Addressed comments

* Fix copies

* Fixing timm test

* Fixed batching equivalence test

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Addressed more comments

* Added a new comment

* Reduced image size

* Addressed more comments

* Nits

* Nits

* Changed the way text_config is initialized

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

---------

Co-authored-by: Niels <[email protected]>
Co-authored-by: Rafael Padilla <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Eduardo Pacheco <[email protected]>
Co-authored-by: Sangbum Daniel Choi <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Tianqi Xu <[email protected]>
itazap pushed a commit that referenced this pull request May 14, 2024
* Fixed typo when converting weigths to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Copied deformable detr

* First commit

* Added bert to model

* Bert validated

* Created Text and Fusion layers for Encoder

* Adapted Encoder layer

* Fixed typos

* Adjusted Encoder

* Converted encoder to hf

* Modified Decoder Layer

* Modified main decoder class

* Removed copy comments

* Fixed forward from GroundingDINOModel and GroundingDINODecoder

* Added all necessary layers, configurations and forward logic up to GroundingDINOModel

* Added all layers to convertion

* Fixed outputs for GroundingDINOModel and GroundingDINOForObjectDetection

* Fixed mask input to encoders and fixed nn.MultiheadAttention batch first and attn output

* Fixed forward from GroundingDINOTextEnhancerLayer

* Fixed output bug with GroundingDINODeformableLayer

* Fixed bugs that prevent GroundingDINOForObjectDetection to run forward method

* Fixed attentions to be passed correctly

* Passing temperature arg when creating Sine position embedding

* Removed copy comments

* Added temperature argument for position embedding

* Fixed typo when converting weigths to GroundingDINO vision backbone

* Final modifications on modeling

* Removed unnecessary class

* Fixed convert structure

* Added image processing

* make fixup partially completed

* Now text_backbone_config has its own class

* Modified convert script

* Removed unnecessary config attribute

* Added new function to generate sub sentence mask

* Renamed parameters with gamma in the name as it's currently not allowed

* Removed tokenization and image_processing scripts since we'll map from existing models

* Fixed some issues with configuration

* Just some modifications on conversion script

* Other modifications

* Fix style

* Improve fixup

* Improve conversion script

* Improve conversion script

* Add GroundingDINOProcessor

* More improvements

* Return token type ids

* something

* Fix more tests

* More improvements

* More cleanup

* More improvements

* Fixed tests, improved modeling and config

* More improvements and fixing tests

* Improved tests and modeling

* Improved tests and added image processor

* Improved tests inference

* More improvements

* More test improvements

* Fixed last test

* Improved docstrings and comments

* Fix style

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Better naming

* Better naming

* Added Copied statement

* Added Copied statement

* Moved param init from GroundingDINOBiMultiHeadAttention

* Better naming

* Fixing clamp style

* Better naming

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: NielsRogge <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: NielsRogge <[email protected]>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py

Co-authored-by: Rafael Padilla <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Rafael Padilla <[email protected]>

* Improving conversion script

* Improved config

* Improved naming

* Improved naming again

* Improved grounding-dino.md

* Moved grounding dino to multimodal

* Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py

Co-authored-by: Rafael Padilla <[email protected]>

* Fixed docstrings and style

* Fix docstrings

* Remove timm attributes

* Reorder imports

* More improvements

* Add Grounding DINO to pipeline

* Remove model from check_repo

* Added grounded post_process to GroundingDINOProcessor

* Fixed style

* Fixed GroundingDINOTextPrenetConfig docstrings

* Aligned inputs.keys() when both image and text are passed with model_input_names

* Added tests for GroundingDINOImageProcessor and GroundingDINOProcessor

* Testing post_process_grounded_object_detection from GroundingDINOProcessor at test_inference_object_detection_head

* Fixed order

* Marked test with require_torch

* Temporarily changed repo_id

* More improvements

* Fix style

* Final improvements

* Improve annotators

* Fix style

* Add is_torch_available

* Remove type hints

* vocab_tokens as one-liner

* Removed print statements

* Renamed GroundingDINOTextPrenetConfig to GroundingDINOTextConfig

* remove unnecessary comments

* Removed unnecessary tests on conversion script

* Renamed GroundingDINO to camel case GroundingDino

* Fixed GroundingDinoProcessor docstrings

* loading MSDA kernels in the modeling file

* Fix copies

* Replace nn.multiheadattention

* Replace nn.multiheadattention

* Fixed inputs for GroundingDinoMultiheadAttention & order of modules

* Fixed processing to avoid messing with inputs

* Added more tips for GroundingDino

* Make style

* Changing name to align with SAM

* Replace final nn.multiheadattention

* Fix model tests

* Update year, remove GenerationTesterMixin

* Address comments

* Address more comments

* Rename TextPrenet to TextModel

* Rename hidden_states

* Address more comments

* Address more comments

* Address comment

* Address more comments

* Address merge

* Address comment

* Address comment

* Address comment

* Make style

* Added layer norm eps to layer norms

* Address more comments

* More fixes

* Fixed equivalence

* Make fixup

* Remove print statements

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Address comments

* Add comment

* Address comment

* Remove overwriting of test

* Fix bbox_embed

* Improve decoder_bbox_embed_share

* Simplify outputs

* Updated post_process_grounded_object_detection

* Renamed sources to feature_maps

* Improved tests for Grounding Dino ImageProcessor and Processor

* Fixed test requirements and imports

* Fixed image_processing

* Fixed processor tests

* Fixed imports for image processing tests

* Fix copies

* Updated modeling

* Fix style

* Moved functions to correct position

* Fixed copy issues

* Update src/transformers/models/deformable_detr/modeling_deformable_detr.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: Sangbum Daniel Choi <[email protected]>

* Keeping consistency custom cuda kernels for MSDA

* Make GroundingDinoProcessor logic clearer

* Updated Grounding DINO checkpoints

* Changed tests to correct structure

* Updated gpu-cpu equivalence test

* fix copies

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/modeling_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/grounding_dino/configuration_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

* Fixed errors and style

* Fix copies

* Removed inheritance from PreTrainedModel from GroundingDinoTextModel

* Fixed GroundingDinoTextModel

* Fixed type of default backbone config

* Fixed missing methods for GroundingDinoTextModel and added timm support for GroundingDinoConvEncoder

* Addressed comments

* Addressed batched image processing tests

* Addressed zero shot test comment

* Addressed tip comment

* Removed GroundingDinoTextModel from check_repo

* Removed inplace masking

* Addressed comments

* Addressed comments

* Addressed comments

* Fix copies

* Fixing timm test

* Fixed batching equivalence test

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Update docs/source/en/model_doc/grounding-dino.md

Co-authored-by: Tianqi Xu <[email protected]>

* Addressed more comments

* Added a new comment

* Reduced image size

* Addressed more comments

* Nits

* Nits

* Changed the way text_config is initialized

* Update src/transformers/models/grounding_dino/processing_grounding_dino.py

Co-authored-by: amyeroberts <[email protected]>

---------

Co-authored-by: Niels <[email protected]>
Co-authored-by: Rafael Padilla <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Eduardo Pacheco <[email protected]>
Co-authored-by: Sangbum Daniel Choi <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Tianqi Xu <[email protected]>
Development

Successfully merging this pull request may close these issues.

Add Grounding DINO