Create dataset loader for TotalDefMeme #355

Closed
SamuelCahyawijaya opened this issue Jan 22, 2024 · 7 comments · Fixed by #602
Labels: bonus +1, help wanted, pr-ready

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: total_defense_meme/total_defense_meme.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?total_defense_meme

Dataset: total_defense_meme
Description: This is a large-scale multimodal and multi-attribute dataset containing memes about Singapore's Total Defence policy from different social media platforms. The type (Singaporean or generic), pillars (military, civil, economic, social, psychological, digital, others), topics and stances (against, neutral, supportive) of each meme are manually identified by annotators.
Subsets: -
Languages: eng
Tasks: Topic Classification, Stance Detection, Optical Character Recognition
License: Unknown (unknown)
Homepage: Image: https://drive.google.com/file/d/1oJIh4QQS3Idff2g6bZORstS5uBROjUUz/view, Annotations: https://gitlab.com/bottle_shop/meme/TotalDefMemes/-/tree/main
HF URL: -
Paper URL: https://arxiv.org/pdf/2305.17911.pdf
SamuelCahyawijaya converted this from a draft issue Jan 22, 2024
sabilmakbar added the "help wanted" label Jan 30, 2024
@TysonYu
Collaborator

TysonYu commented Feb 2, 2024

#self-assign


Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

holylovenia added the "bonus +1" and "top-priority" labels Mar 12, 2024
@akhdanfadh
Collaborator

akhdanfadh commented Mar 30, 2024

My question is at the end.

Based on the paper, they obtained a dataset of 7200 images. Then, they filtered it into 5301 memes with 2893 SG-related and 2408 non-SG. Among those SG-related memes, they annotated 2513 images. Quoting them,

Pillars, Topics & Stances Annotation ..... The annotators will first assign the memes’ defence pillars: military, civil, economic, social, psychological, digital, or others. ..... Next, they annotate the relevant topic tags associated with the meme (i.e., nouns, pronouns, and phrases) in a free-text format. ..... Lastly, the annotators annotate the meme’s stances towards the assigned pillars: support, against, or neutral.

So, for example, an image will have annotations like this:

"Pillar_Stances": [
    {
        "img_4120.jpg": [
            [
                "Economic Defence",
                [
                    "Neutral",
                    "Neutral"
                ]
            ],
            [
                "Psychological Defence",
                [
                    "Against",
                    "Against"
                ]
            ]
        ]
    },
    ...
],
"Tags": [
    {
        "img_4120.jpg": [
            "Government",
            "HDB",
            "Gone",
            "Lease End",
            "Sad",
            "Disappear",
            "99-Years",
            "e-scooter law"
        ]
    },
    ...
]

Furthermore, they also provide OCR-extracted text for almost all images (7012, to be exact), as follows:

"Text": [
    {
        "img_4120.jpg": "When a HDB flat finishes it's 99-year lease: This is so sad can we hit a pedestrian with escooters?"
    },
    {
        "img_1712.jpg": "News; The mystery Chinese Virus can only spread through human interaction Engineering Students:"
    },
    ...
]
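
To make the layout above concrete, here is a minimal sketch of merging the three fields into one record per image. It assumes the annotation file has already been loaded into a Python dict with the `Pillar_Stances`, `Tags`, and `Text` keys shown above; the function name `merge_annotations` and the output key names are hypothetical and not part of the dataset or of SEACrowd.

```python
from collections import defaultdict


def merge_annotations(raw):
    """Merge the per-field lists of {filename: value} dicts into one record per image.

    `raw` is assumed to look like the excerpts above: each top-level key maps
    to a list of single-key dicts keyed by image filename.
    """
    # Every image starts with empty annotations, since not all images
    # appear under every field.
    records = defaultdict(lambda: {"pillar_stances": [], "tags": [], "text": None})

    for entry in raw.get("Pillar_Stances", []):
        for image, pairs in entry.items():
            # Each pair is [pillar_name, [stance, stance, ...]].
            records[image]["pillar_stances"] = [
                (pillar, list(votes)) for pillar, votes in pairs
            ]

    for entry in raw.get("Tags", []):
        for image, tags in entry.items():
            records[image]["tags"] = list(tags)

    for entry in raw.get("Text", []):
        for image, text in entry.items():
            records[image]["text"] = text

    return dict(records)
```

Images that only have OCR text (and no pillar or tag annotations) still get a record, with empty lists for the missing fields.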

I think this one will use general image_text seacrowd schema. My question is should I just implement the text OCR field and ignore the pillars? If so, then I can pass tags for metadata/context in the schema. If not, I am not sure how to proceed with the stance labeling.

@akhdanfadh
Collaborator

#self-assign

akhdanfadh added the "question" label Mar 30, 2024
@holylovenia
Contributor

holylovenia commented Apr 1, 2024

I think this one will use general image_text seacrowd schema. My question is should I just implement the text OCR field and ignore the pillars? If so, then I can pass tags for metadata/context in the schema. If not, I am not sure how to proceed with the stance labeling.

I agree with you, @akhdanfadh.

  • The OCR subsets will use the [image_text](https://github.com/SEACrowd/seacrowd-datahub/blob/7bdfb4b461d6449b8200950938b13ef7614bc4f6/seacrowd/utils/schemas/image_text.py) schema. The pillars and tags info can simply be stored inside meta.
  • The topic classification subsets and the vision-language stance labeling subsets could use a new schema for image classification. We don't have one right now. Could you please make a separate PR for this new schema? According to our running name convention, the schema probably should be named image (though it sounds kind of weird...)

I hope this clears up things. What do you think?
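
As an illustration of what the stance subsets might feed into an image classification schema, the sketch below flattens one image's pillar annotations into one row per (image, pillar) pair. It assumes the inner list holds one stance per annotator and resolves it by majority vote, with ties going to the stance listed first; both are assumptions, and `to_stance_rows` and the row keys are hypothetical names, not the schema that was eventually merged.

```python
from collections import Counter


def to_stance_rows(image, pillar_stances):
    """Turn [[pillar, [stance, ...]], ...] into one classification row per pillar."""
    rows = []
    for pillar, votes in pillar_stances:
        if not votes:
            # Skip pillars without any stance annotation.
            continue
        # most_common(1) returns the most frequent stance; on a tie it
        # returns the one that was encountered first.
        label, _count = Counter(votes).most_common(1)[0]
        rows.append({"image": image, "pillar": pillar, "stance": label})
    return rows
```

Keeping the pillar as a separate field means a single image can contribute several rows, which matches the example above where one meme is neutral on one pillar and against another.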

@akhdanfadh
Collaborator

The OCR subsets will use the image_text schema. The pillars and tags info can simply be stored inside meta.

The metadata in the schema is organized as shown below. The tags can be passed into context, but I'm not sure about the pillars. Is it okay if I add another key to the metadata schema, for example with feature['metadata']['stances'] = ...? See an implementation here for reference.

"metadata": {
    "context": datasets.Value("string"),
    "labels": datasets.Sequence(datasets.ClassLabel(names=label_names)),
}
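
For illustration only, this is roughly what the per-example metadata values could look like if such a `stances` key were added. It is a plain-Python sketch, not the SEACrowd schema itself: `build_metadata` is a hypothetical helper, the "pillar: stance" string encoding is an assumption, and the `labels` field is left out because the thread does not say what it would hold.

```python
def build_metadata(tags, pillar_stances):
    """Build per-example metadata values with an extra, hypothetical `stances` key."""
    # One "pillar: stance" string per annotator vote, so the field stays a
    # flat list of strings.
    stances = [
        f"{pillar}: {vote}"
        for pillar, votes in pillar_stances
        for vote in votes
    ]
    return {
        # Free-text topic tags go into the existing `context` field.
        "context": ", ".join(tags),
        # Assumed extra key, mirroring feature['metadata']['stances'].
        "stances": stances,
    }
```

A flat list of strings is only one possible encoding; a nested structure would also work but would need a matching nested feature definition in the schema.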

The topic classification subsets and the vision-language stance labeling subsets could use a new schema for image classification. We don't have one right now. Could you please make a separate PR for this new schema? According to our running name convention, the schema probably should be named image (though it sounds kind of weird...)

Hmm, I don't mind actually. Though, implementing image classification schema would mean this SEACrowd project is not entirely NLP-oriented anymore haha 😄. I'll create a PR by this weekend.

@holylovenia
Contributor

holylovenia commented Apr 1, 2024

Hmm, I don't mind actually. Though, implementing image classification schema would mean this SEACrowd project is not entirely NLP-oriented anymore haha 😄. I'll create a PR by this weekend.

Yes definitely. 👍 We're also consolidating every VL and speech and other datasets we can get our hands on.

Thanks @akhdanfadh!! Just let me know if you need anything.

akhdanfadh removed the "question" label Apr 1, 2024
akhdanfadh added the "pr-ready" label Apr 2, 2024
holylovenia removed the "top-priority" label Apr 11, 2024
holylovenia pushed a commit that referenced this issue May 1, 2024
* add image classification schema

* add dataloader

* change source feature, modify comment