Davide Locatelli · Greta Damo · Debora Nozza
This repository contains the data and code used in the paper A Cross-Lingual Study of Homotransphobia on Twitter.
In accordance with Twitter's policy, we provide tweet IDs rather than the tweets themselves. There are seven files, one for each language: English, Italian, German, French, Spanish, Portuguese, and Norwegian.
The code consists of three files:
- data.py: to process the data
- topics.py: to run the contextualized topic modeling analysis
- sentiment.py: to run the sentiment analysis
To reproduce our study:
- Retrieve the tweets. To do this, you will need Twitter API keys. Once you have those, you can use the twarc library as follows:
twarc hydrate data/LANG.txt > LANG.jsonl
- Preprocess the data:
python data.py -l LANG
- Run topic modeling analysis:
python topics.py -l LANG
- Run sentiment analysis:
python sentiment.py -l LANG
Where LANG is an ISO 639-1 language code. For example, for Norwegian it's NO.
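The steps above can be chained for all seven languages. The following is a minimal sketch, not part of the repository. It assumes that the ID files use uppercase two-letter codes (only NO is confirmed above; the other six are guesses from ISO 639-1), that twarc is already configured with API keys, and that everything is run from the repository root. The names build_commands, hydrate, and run_all are hypothetical helpers.

```python
import subprocess
import sys

# Assumed codes: only "NO" appears in this README; the rest are ISO 639-1
# guesses for the seven languages listed above.
LANGS = ["EN", "IT", "DE", "FR", "ES", "PT", "NO"]


def build_commands(lang):
    """Return the three analysis commands for one language, in order."""
    # sys.executable stands in for "python" so the same interpreter is reused.
    return [
        [sys.executable, script, "-l", lang]
        for script in ("data.py", "topics.py", "sentiment.py")
    ]


def hydrate(lang):
    """Equivalent of: twarc hydrate data/LANG.txt > LANG.jsonl"""
    with open(f"{lang}.jsonl", "w", encoding="utf-8") as out:
        subprocess.run(
            ["twarc", "hydrate", f"data/{lang}.txt"],
            stdout=out,
            check=True,
        )


def run_all(langs=LANGS):
    """Hydrate, preprocess, and analyse each language; stop on first failure."""
    for lang in langs:
        hydrate(lang)
        for cmd in build_commands(lang):
            subprocess.run(cmd, check=True)
```

Calling run_all() processes every language in turn, while run_all(["NO"]) restricts the run to Norwegian. Because check=True is set, a failing step raises an error instead of silently continuing to the next script.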
The following pre-trained models are used for the analysis:
- CTM: distiluse-base-multilingual-cased-v1, distiluse-base-multilingual-cased-v2
- Sentiment analysis: twitter-xlm-roberta-base-sentiment
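To try the sentiment model outside the pipeline, a sketch along the following lines may help. It assumes the model is the one published on the Hugging Face Hub as cardiffnlp/twitter-xlm-roberta-base-sentiment (this README gives only the short name) and that the transformers package is installed; the tokenizer may also require sentencepiece. Masking mentions and links as @user and http follows that model's card; whether sentiment.py does the same is not stated here. The names normalize_tweet and classify are hypothetical.

```python
# Assumed Hub ID; this README lists only "twitter-xlm-roberta-base-sentiment".
MODEL_ID = "cardiffnlp/twitter-xlm-roberta-base-sentiment"


def normalize_tweet(text):
    """Mask user mentions and links with the placeholders @user and http."""
    tokens = []
    for tok in text.split(" "):
        if tok.startswith("@") and len(tok) > 1:
            tok = "@user"
        elif tok.startswith("http"):
            tok = "http"
        tokens.append(tok)
    return " ".join(tokens)


def classify(tweets):
    """Return one lowercase sentiment label per tweet.

    The import is local so that normalize_tweet works without transformers;
    the first call downloads the model weights.
    """
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=MODEL_ID)
    results = classifier([normalize_tweet(t) for t in tweets])
    return [r["label"].lower() for r in results]
```

The labels are lowercased because their capitalization depends on the model configuration, so downstream code should not rely on a particular casing.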
The results of the analysis will be stored in the results folder. There will be three files per language:
- LANG_topics.txt: contains the results of the topic modeling analysis with the top words for 5, 10, 15, and 20 topics
- LANG_topics.csv: contains the results of the topic modeling analysis with each tweet assigned to a topic
- LANG_sentiment.csv: contains the results of the sentiment analysis with each tweet assigned to a sentiment class
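Once the pipeline has run, the per-language sentiment file can be summarized with a few lines of Python. This is a sketch only: this README does not specify the CSV columns, so the column name sentiment below is a placeholder to replace with the actual header, the sample rows are made up, and sentiment_counts is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def sentiment_counts(csv_file, column="sentiment"):
    """Count how many rows fall into each sentiment class.

    `column` is a placeholder name; check the header of the real file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for a LANG_sentiment.csv file (made-up rows).
sample = io.StringIO(
    "tweet_id,sentiment\n"
    "1,negative\n"
    "2,neutral\n"
    "3,negative\n"
)
counts = sentiment_counts(sample)
print(counts.most_common())
```

For a real file, pass an open handle instead of the sample, for example open("results/NO_sentiment.csv", newline="", encoding="utf-8"), and adjust the column argument to match the header.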
If you use the data or code, please cite the following paper:
@inproceedings{locatelli-etal-2023-cross,
title = "A Cross-Lingual Study of Homotransphobia on {T}witter",
author = "Locatelli, Davide and
Damo, Greta and
Nozza, Debora",
booktitle = "Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.c3nlp-1.3",
pages = "16--24",
abstract = "We present a cross-lingual study of homotransphobia on Twitter, examining the prevalence and forms of homotransphobic content in tweets related to LGBT issues in seven languages. Our findings reveal that homotransphobia is a global problem that takes on distinct cultural expressions, influenced by factors such as misinformation, cultural prejudices, and religious beliefs. To aid the detection of hate speech, we also devise a taxonomy that classifies public discourse around LGBT issues. By contributing to the growing body of research on online hate speech, our study provides valuable insights for creating effective strategies to combat homotransphobia on social media.",
}