-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding grounding dino #26087
Adding grounding dino #26087
Conversation
@EduardoPach Thanks for opening this model PR! From next week, I'll be away for a few weeks. If you need a review in that time please ping @rafaelpadilla. |
@rafaelpadilla hey, so I've finished implementing the model and validated with the original implementation. Still have to clean up some things and make sure the documentation is correct. My main question is about pushing the model to the hub, because the authors uploaded already the checkpoints (two checkpoints in the same repo) they made available to the model, but it's under an user instead of their org (IDEA-Research), what is usually done in this case? |
Hi @EduardoPach , Are you referring to In this case, let's consult @ArthurZucker and @younesbelkada. |
Precisely, I'm asking this because I had the impression that model repos contain only one checkpoint and also the IDEA-Research group has other models that we've could add to the transformers library later on so it might be helpful if there was an account for this org. |
For now, you can upload weights to the hub under your own personal profile and use them until this PR is ready to merge. |
Hi @EduardoPach |
@rafaelpadilla Hey! Could you help me out with these questions?
and when I do |
Hi @EduardoPach , If your If I understood correctly, you have already generated the tokens using the newly extra steps, right? For pushing your tokens to the hub you could could use the hub api. See an example here I'm not sure if the problem is regarding |
@rafaelpadilla the Also, about the config, sorry I had forgotten to push the modifications I've done to the EDIT I figured out what the issue was haha it was somewhat dumb. Either way, I wasn't aware that when we push the config to the hub the config class is then converted to a config.json, and any Nested configuration is also modified to a dictionary so I only had to change my |
Hi @EduardoPach do you need any help in finishing this PR? Really great to see you're leveraging
this will then automatically create a |
Writing here just for record As we discussed through Discord I'll make that and will do the same for the |
…m existing models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this - looks great!
Only thing is removing the Bert implementation and using AutoModel instead and some nits. Otherwise we're good to merge 🤗
src/transformers/models/grounding_dino/modeling_grounding_dino.py
Outdated
Show resolved
Hide resolved
If text cross-attention mask is not used, we can only train and inference with batch_size == 1, which is inefficient in utilizing GPU memory. In a batch, multiple texts will be padding to a fixed length, so we need text cross-attention mask. |
Yeah, I was under the assumption that one would use always the same labels when doing inference, but thinking more about it I can see some cases where that wouldn't be the case and training would be one as well. I do think though, that we could fix this in a different PR as this PR is quite old and having a working version of the model in the main repo would be beneficial IMO and then I can work on adding the cross-attention masks as well 🤗. Are you using the implementation for a project? |
Yes, I'm currently working on the development of extra-visual-module-based MLLM (e.g. LISA, GLaMM, etc.). And I'm using your code with nn.MultiheadAttention instead of GroundingDinoMultiheadAttention for training and inference. Due to consistent Transformers code style, your code is more suitable to use with LLM (e.g. LlamaForCausalLM). Thanks for your work! |
Adding here a screenshot of running the tests c..c @amyeroberts |
I just tried out the model readme, and think it might be slightly outdated. Here are some changes I had to make:
Otherwise it seems to work well for me, thanks for the hard work! |
Hey, thanks for the heads up. The output of |
Good points, thanks! Last point, have you checked with a list of text prompts? The type hinting implies this should be possible ( import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
model_id = "IDEA-Research/grounding-dino-tiny"
device = torch.device("cuda")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Check for cats and remote controls
text = ["cat", "remote control"]
inputs = processor(images=image, text=text, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_grounded_object_detection(
outputs,
inputs.input_ids,
box_threshold=0.4,
text_threshold=0.3,
target_sizes=[image.size[::-1]]
) If
If
|
I see that in your example, you separate with full stops, so a list of text could be converted with
Is this to be expected? |
It should also end with And about your outputs, they are indeed a bit weird haha. It's unexpected to have outputs like |
src/transformers/models/grounding_dino/configuration_grounding_dino.py
Outdated
Show resolved
Hide resolved
src/transformers/models/grounding_dino/processing_grounding_dino.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge piece of work - thanks for adding this model!
…no.py Co-authored-by: amyeroberts <[email protected]>
Thank you for reviewing and your patience as well 😅. I'll open an issue to add the |
* Fixed typo when converting weigths to GroundingDINO vision backbone * Final modifications on modeling * Removed unnecessary class * Fixed convert structure * Added image processing * make fixup partially completed * Now text_backbone_config has its own class * Modified convert script * Removed unnecessary config attribute * Added new function to generate sub sentence mask * Renamed parameters with gamma in the name as it's currently not allowed * Removed tokenization and image_processing scripts since we'll map from existing models * Fixed some issues with configuration * Just some modifications on conversion script * Other modifications * Copied deformable detr * First commit * Added bert to model * Bert validated * Created Text and Fusion layers for Encoder * Adapted Encoder layer * Fixed typos * Adjusted Encoder * Converted encoder to hf * Modified Decoder Layer * Modified main decoder class * Removed copy comments * Fixed forward from GroundingDINOModel and GroundingDINODecoder * Added all necessary layers, configurations and forward logic up to GroundingDINOModel * Added all layers to convertion * Fixed outputs for GroundingDINOModel and GroundingDINOForObjectDetection * Fixed mask input to encoders and fixed nn.MultiheadAttention batch first and attn output * Fixed forward from GroundingDINOTextEnhancerLayer * Fixed output bug with GroundingDINODeformableLayer * Fixed bugs that prevent GroundingDINOForObjectDetection to run forward method * Fixed attentions to be passed correctly * Passing temperature arg when creating Sine position embedding * Removed copy comments * Added temperature argument for position embedding * Fixed typo when converting weigths to GroundingDINO vision backbone * Final modifications on modeling * Removed unnecessary class * Fixed convert structure * Added image processing * make fixup partially completed * Now text_backbone_config has its own class * Modified convert script * Removed unnecessary config attribute * Added new function to generate sub sentence mask * Renamed parameters with gamma in the name as it's currently not allowed * Removed tokenization and image_processing scripts since we'll map from existing models * Fixed some issues with configuration * Just some modifications on conversion script * Other modifications * Fix style * Improve fixup * Improve conversion script * Improve conversion script * Add GroundingDINOProcessor * More improvements * Return token type ids * something * Fix more tests * More improvements * More cleanup * More improvements * Fixed tests, improved modeling and config * More improvements and fixing tests * Improved tests and modeling * Improved tests and added image processor * Improved tests inference * More improvements * More test improvements * Fixed last test * Improved docstrings and comments * Fix style * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Better naming * Better naming * Added Copied statement * Added Copied statement * Moved param init from GroundingDINOBiMultiHeadAttention * Better naming * Fixing clamp style * Better naming * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: NielsRogge <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: NielsRogge <[email protected]> * Update src/transformers/models/grounding_dino/configuration_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Improving conversion script * Improved config * Improved naming * Improved naming again * Improved grouding-dino.md * Moved grounding dino to multimodal * Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py Co-authored-by: Rafael Padilla <[email protected]> * Fixed docstrings and style * Fix docstrings * Remove timm attributes * Reorder imports * More improvements * Add Grounding DINO to pipeline * Remove model from check_repo * Added grounded post_process to GroundingDINOProcessor * Fixed style * Fixed GroundingDINOTextPrenetConfig docstrings * Aligned inputs.keys() when both image and text are passed with model_input_names * Added tests for GroundingDINOImageProcessor and GroundingDINOProcessor * Testing post_process_grounded_object_detection from GroundingDINOProcessor at test_inference_object_detection_head * Fixed order * Marked test with require_torch * Temporarily changed repo_id * More improvements * Fix style * Final improvements * Improve annotators * Fix style * Add is_torch_available * Remove type hints * vocab_tokens as one liner * Removed print statements * Renamed GroundingDINOTextPrenetConfig to GroundingDINOTextConfig * remove unnecessary comments * Removed unnecessary tests on conversion script * Renamed GroundingDINO to camel case GroundingDino * Fixed GroundingDinoProcessor docstrings * loading MSDA kernels in the modeling file * Fix copies * Replace nn.multiheadattention * Replace nn.multiheadattention * Fixed inputs for GroundingDinoMultiheadAttention & order of modules * Fixed processing to avoid messing with inputs * Added more tips for GroundingDino * Make style * Chaning name to align with SAM * Replace final nn.multiheadattention * Fix model tests * Update year, remove GenerationTesterMixin * Address comments * Address more comments * Rename TextPrenet to TextModel * Rename hidden_states * Address more comments * Address more comments * Address comment * Address more comments * Address merge * Address comment * Address comment * Address comment * Make style * Added layer norm eps to layer norms * Address more comments * More fixes * Fixed equivalence * Make fixup * Remove print statements * Address comments * Address comments * Address comments * Address comments * Address comments * Address comments * Add comment * Address comment * Remove overwriting of test * Fix bbox_embed * Improve decoder_bbox_embed_share * Simplify outputs * Updated post_process_grounded_object_detection * Renamed sources to feature_maps * Improved tests for Grounding Dino ImageProcessor and Processor * Fixed test requirements and imports * Fixed image_processing * Fixed processor tests * Fixed imports for image processing tests * Fix copies * Updated modeling * Fix style * Moved functions to correct position * Fixed copy issues * Update src/transformers/models/deformable_detr/modeling_deformable_detr.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Keeping consistency custom cuda kernels for MSDA * Make GroundingDinoProcessor logic clearer * Updated Grounding DINO checkpoints * Changed tests to correct structure * Updated gpu-cpu equivalence test * fix copies * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/configuration_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Fixed erros and style * Fix copies * Removed inheritance from PreTrainedModel from GroundingDinoTextModel * Fixed GroundingDinoTextModel * Fixed type of default backbone config * Fixed missing methods for GroundingDinoTextModel and Added timm support for GroundingDinoConvEncoder * Addressed comments * Addressed batched image processing tests * Addressed zero shot test comment * Addressed tip comment * Removed GroundingDinoTextModel from check_repo * Removed inplace masking * Addressed comments * Addressed comments * Addressed comments * Fix copies * Fixing timm test * Fixed batching equivalence test * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Addressed more comments * Added a new comment * Reduced image size * Addressed more comments * Nits * Nits * Changed the way text_config is initialized * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> --------- Co-authored-by: Niels <[email protected]> Co-authored-by: Rafael Padilla <[email protected]> Co-authored-by: NielsRogge <[email protected]> Co-authored-by: Eduardo Pacheco <[email protected]> Co-authored-by: Sangbum Daniel Choi <[email protected]> Co-authored-by: amyeroberts <[email protected]> Co-authored-by: Tianqi Xu <[email protected]>
* Fixed typo when converting weigths to GroundingDINO vision backbone * Final modifications on modeling * Removed unnecessary class * Fixed convert structure * Added image processing * make fixup partially completed * Now text_backbone_config has its own class * Modified convert script * Removed unnecessary config attribute * Added new function to generate sub sentence mask * Renamed parameters with gamma in the name as it's currently not allowed * Removed tokenization and image_processing scripts since we'll map from existing models * Fixed some issues with configuration * Just some modifications on conversion script * Other modifications * Copied deformable detr * First commit * Added bert to model * Bert validated * Created Text and Fusion layers for Encoder * Adapted Encoder layer * Fixed typos * Adjusted Encoder * Converted encoder to hf * Modified Decoder Layer * Modified main decoder class * Removed copy comments * Fixed forward from GroundingDINOModel and GroundingDINODecoder * Added all necessary layers, configurations and forward logic up to GroundingDINOModel * Added all layers to convertion * Fixed outputs for GroundingDINOModel and GroundingDINOForObjectDetection * Fixed mask input to encoders and fixed nn.MultiheadAttention batch first and attn output * Fixed forward from GroundingDINOTextEnhancerLayer * Fixed output bug with GroundingDINODeformableLayer * Fixed bugs that prevent GroundingDINOForObjectDetection to run forward method * Fixed attentions to be passed correctly * Passing temperature arg when creating Sine position embedding * Removed copy comments * Added temperature argument for position embedding * Fixed typo when converting weigths to GroundingDINO vision backbone * Final modifications on modeling * Removed unnecessary class * Fixed convert structure * Added image processing * make fixup partially completed * Now text_backbone_config has its own class * Modified convert script * Removed unnecessary config attribute * Added new function to generate sub sentence mask * Renamed parameters with gamma in the name as it's currently not allowed * Removed tokenization and image_processing scripts since we'll map from existing models * Fixed some issues with configuration * Just some modifications on conversion script * Other modifications * Fix style * Improve fixup * Improve conversion script * Improve conversion script * Add GroundingDINOProcessor * More improvements * Return token type ids * something * Fix more tests * More improvements * More cleanup * More improvements * Fixed tests, improved modeling and config * More improvements and fixing tests * Improved tests and modeling * Improved tests and added image processor * Improved tests inference * More improvements * More test improvements * Fixed last test * Improved docstrings and comments * Fix style * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Better naming * Better naming * Added Copied statement * Added Copied statement * Moved param init from GroundingDINOBiMultiHeadAttention * Better naming * Fixing clamp style * Better naming * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: NielsRogge <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: NielsRogge <[email protected]> * Update src/transformers/models/grounding_dino/configuration_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py Co-authored-by: Rafael Padilla <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Rafael Padilla <[email protected]> * Improving conversion script * Improved config * Improved naming * Improved naming again * Improved grouding-dino.md * Moved grounding dino to multimodal * Update src/transformers/models/grounding_dino/convert_grounding_dino_to_hf.py Co-authored-by: Rafael Padilla <[email protected]> * Fixed docstrings and style * Fix docstrings * Remove timm attributes * Reorder imports * More improvements * Add Grounding DINO to pipeline * Remove model from check_repo * Added grounded post_process to GroundingDINOProcessor * Fixed style * Fixed GroundingDINOTextPrenetConfig docstrings * Aligned inputs.keys() when both image and text are passed with model_input_names * Added tests for GroundingDINOImageProcessor and GroundingDINOProcessor * Testing post_process_grounded_object_detection from GroundingDINOProcessor at test_inference_object_detection_head * Fixed order * Marked test with require_torch * Temporarily changed repo_id * More improvements * Fix style * Final improvements * Improve annotators * Fix style * Add is_torch_available * Remove type hints * vocab_tokens as one liner * Removed print statements * Renamed GroundingDINOTextPrenetConfig to GroundingDINOTextConfig * remove unnecessary comments * Removed unnecessary tests on conversion script * Renamed GroundingDINO to camel case GroundingDino * Fixed GroundingDinoProcessor docstrings * loading MSDA kernels in the modeling file * Fix copies * Replace nn.multiheadattention * Replace nn.multiheadattention * Fixed inputs for GroundingDinoMultiheadAttention & order of modules * Fixed processing to avoid messing with inputs * Added more tips for GroundingDino * Make style * Chaning name to align with SAM * Replace final nn.multiheadattention * Fix model tests * Update year, remove GenerationTesterMixin * Address comments * Address more comments * Rename TextPrenet to TextModel * Rename hidden_states * Address more comments * Address more comments * Address comment * Address more comments * Address merge * Address comment * Address comment * Address comment * Make style * Added layer norm eps to layer norms * Address more comments * More fixes * Fixed equivalence * Make fixup * Remove print statements * Address comments * Address comments * Address comments * Address comments * Address comments * Address comments * Add comment * Address comment * Remove overwriting of test * Fix bbox_embed * Improve decoder_bbox_embed_share * Simplify outputs * Updated post_process_grounded_object_detection * Renamed sources to feature_maps * Improved tests for Grounding Dino ImageProcessor and Processor * Fixed test requirements and imports * Fixed image_processing * Fixed processor tests * Fixed imports for image processing tests * Fix copies * Updated modeling * Fix style * Moved functions to correct position * Fixed copy issues * Update src/transformers/models/deformable_detr/modeling_deformable_detr.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: Sangbum Daniel Choi <[email protected]> * Keeping consistency custom cuda kernels for MSDA * Make GroundingDinoProcessor logic clearer * Updated Grounding DINO checkpoints * Changed tests to correct structure * Updated gpu-cpu equivalence test * fix copies * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/modeling_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/grounding_dino/configuration_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> * Fixed erros and style * Fix copies * Removed inheritance from PreTrainedModel from GroundingDinoTextModel * Fixed GroundingDinoTextModel * Fixed type of default backbone config * Fixed missing methods for GroundingDinoTextModel and Added timm support for GroundingDinoConvEncoder * Addressed comments * Addressed batched image processing tests * Addressed zero shot test comment * Addressed tip comment * Removed GroundingDinoTextModel from check_repo * Removed inplace masking * Addressed comments * Addressed comments * Addressed comments * Fix copies * Fixing timm test * Fixed batching equivalence test * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Update docs/source/en/model_doc/grounding-dino.md Co-authored-by: Tianqi Xu <[email protected]> * Addressed more comments * Added a new comment * Reduced image size * Addressed more comments * Nits * Nits * Changed the way text_config is initialized * Update src/transformers/models/grounding_dino/processing_grounding_dino.py Co-authored-by: amyeroberts <[email protected]> --------- Co-authored-by: Niels <[email protected]> Co-authored-by: Rafael Padilla <[email protected]> Co-authored-by: NielsRogge <[email protected]> Co-authored-by: Eduardo Pacheco <[email protected]> Co-authored-by: Sangbum Daniel Choi <[email protected]> Co-authored-by: amyeroberts <[email protected]> Co-authored-by: Tianqi Xu <[email protected]>
What does this PR do?
This PR adds Grounding DINO
Fixes #25423
To-Do's: