This repository contains the data and scripts necessary to reproduce the experiments in
Cristina España-Bonet and Alberto Barrón-Cedeño. 2022. The (Undesired) Attenuation of Human Biases by Multilinguality. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2056–2077, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
(bibtex at the bottom)
Word embedding association tests (WEAT) consist of lists of target items and attributes related to the concepts used in implicit association tests (IAT) in social psychology. In this work we focus on (universal) non-social tests: IAT1 (flowers and insects vs. pleasant and unpleasant attributes) and IAT2 (musical instruments and weapons vs. pleasant and unpleasant attributes). IAT results show positive human biases towards flowers and musical instruments.
CA-WEAT is the culture-aware version of the English WEAT lists: the lists are generated from scratch for every new language by native speakers, thereby preserving cultural differences across languages.
The folder `data` in this repo contains the tsv files with the culture-aware lists for WEAT1 and WEAT2 in 26 languages: ar, bg, bn, ca, de, el, en, es, fa, fr, hr, id, it, ko, lb, mr, nl, no, pl, pt, ro, ru, tr, uk, vi, zh. Different variants are included in the dataset (e.g., Spanish from Mexico, Bolivia, Spain...). The geographical distribution is represented in the following map:
If you see your home in grey, yellow, orange... we would highly appreciate your contribution to the ca-weat.v2 dataset. If your country is coloured in purple, don't be shy; we appreciate the data anyway :-). You will find the form to submit new CA-WEAT lists and the instructions in several languages: Catalan, English, French, German, Italian and Spanish. Whichever form you choose, you need to add the words in your mother tongue. If you are here, you know about NLP... so please don't use named entities or ambiguous words, especially if they coincide with a stop word!
The calculation of the statistic and effect size has been adapted from Lauscher and Glavaš (*SEM 2019).
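As a reference for what is being computed, here is a minimal Python sketch of the WEAT test statistic and effect size following the standard definitions (Caliskan et al., 2017), not the repo's exact code. Word vectors are assumed to come from an embedding model; the sketch works on any lists of equal-length vectors:

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus mean similarity to B."""
    return (sum(cosine(w, a) for a in A) / len(A)
            - sum(cosine(w, b) for b in B) / len(B))

def weat(X, Y, A, B):
    """WEAT test statistic and effect size for target sets X, Y and attribute sets A, B."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    statistic = sum(sx) - sum(sy)
    effect = (statistics.mean(sx) - statistics.mean(sy)) / statistics.stdev(sx + sy)
    return statistic, effect
```

A positive effect size indicates that the first target set (e.g., flowers) is more strongly associated with the first attribute set (e.g., pleasant) than the second target set is.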
The script `runCaweat.sh` can be used to specify the languages to consider (the LANG column in the CA-WEAT file) and the embedding model. Feel free to change the number of permutations used to calculate p-values or the number of bootstraps used for confidence intervals. Use the flag `--lower` if the embedding model has a lowercased vocabulary.
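The permutation test and bootstrap mentioned above can be sketched as follows (a simplified illustration, not the repo's actual implementation): the p-value is the fraction of random equal-size re-partitions of the combined target lists whose statistic is at least the observed one, and the confidence interval comes from resampling each list with replacement:

```python
import random

def permutation_pvalue(stat_fn, X, Y, n_perm=1000, seed=0):
    """One-sided p-value: fraction of random equal-size re-partitions
    of X + Y whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = stat_fn(X, Y)
    pool = list(X) + list(Y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if stat_fn(pool[:len(X)], pool[len(X):]) >= observed:
            hits += 1
    return hits / n_perm

def bootstrap_ci(stat_fn, X, Y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the statistic."""
    rng = random.Random(seed)
    stats = sorted(
        stat_fn([rng.choice(X) for _ in X], [rng.choice(Y) for _ in Y])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Here `stat_fn` is any function of two target lists, e.g. a WEAT statistic closed over fixed attribute sets. Increasing `n_perm` and `n_boot` (as the script's parameters allow) makes both estimates more stable at the cost of runtime.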
The results for the 16 embedding models and the 91 lists reported in the paper are collected in `plots/collectedData.csv`. The script `plots/plotCollectedData.py` can be used to generate the plots and tables in a straightforward manner.
Please use the following bibtex entry when citing this work:
```bibtex
@inproceedings{espana-bonet-barron-cedeno-2022-undesired,
    title = "The (Undesired) Attenuation of Human Biases by Multilinguality",
    author = "Espa{\~n}a-Bonet, Cristina and Barr{\'o}n-Cede{\~n}o, Alberto",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.133",
    pages = "2056--2077"
}
```