Year | Reference | Task | Dataset | Has indiv.¹ / Attributes² | # annots / instance | # rows | Score | Metric |
---|---|---|---|---|---|---|---|---|
2022 | ArMIS - The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements (Dina Almanea and Massimo Poesio) | Hate speech identification | ArMIS | 👨‍👩‍👧‍👦 | 3 | 964 | 0.525 | Fleiss' Kappa |
2021 | Whose Opinions Matter? Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection (Sohail Akhtar, Valerio Basile, Viviana Patti) | Hate speech identification | HS-Brexit | 👨‍👩‍👧‍👦 | 6 | 1120 | 0.35 | Fleiss' Kappa² |
2021 | ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI (Amanda Cercas Curry, Gavin Abercrombie, Verena Rieser) | Hate speech identification | ConvAbuse | 👨‍👩‍👧‍👦 | 3-8 | 4185 | 0.69 | Alpha |
2021 | Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement (Elisa Leonardelli, Stefano Menini, Alessio Palmero Aprosio, Marco Guerini, Sara Tonelli) | Hate speech identification | MD-Agreement | 👨‍👩‍👧‍👦 | 5 | 10K | 71.172 | Percent agreement⁴ |
2021 | Designing Toxic Content Classification for a Diversity of Perspectives (Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, Michael Bailey) (jury learning) | Hate speech identification | Dataset | 👨‍👩‍👧‍👦 | 5 | 107,620 | 65.2-90% | Percent agreement² |
2021 | Did they answer? Subjective acts and intents in conversational discourse (Elisa Ferracane, Greg Durrett, Junyi Jessy Li, Katrin Erk) | Sentiment Analysis + Intent Classification | Dataset | 👨‍👩‍👧‍👦 | 3-7 | 1K | Overall (0.494), conversation act (0.652), intent (0.376) | Alpha |
2020 | On Faithfulness and Factuality in Abstractive Summarization (Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald) | Hallucination classification | Dataset | 👨‍👩‍👧‍👦 | 3 | | 0.61-0.80 | Fleiss' Kappa² |
2020 | On Faithfulness and Factuality in Abstractive Summarization (Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald) | Factuality classification | Dataset | 👨‍👩‍👧‍👦 | 3 | | 0.81-1.00 | Fleiss' Kappa² |
2019 | Understanding Discourse on Work and Job-Related Well-Being in Public Social Media (Tong Liu, Christopher Homan, Cecilia Ovesdotter Alm, Megan Lytle, Ann Marie White, Henry Kautz) | | Dataset | 👨‍👩‍👧‍👦 | | | | |
2019 | Learning to Predict Population-Level Label Distributions (Tong Liu, Akash Venkatachalam, Pratik Sanjay Bongale, Christopher M. Homan) | | Dataset | 👨‍👩‍👧‍👦 | | | | |
2018 | Introducing the Gab Hate Corpus: Defining and Applying Hate-Based Rhetoric to Social Media Posts at Scale (Brendan Kennedy, Mohammad Atari, Aida M. Davani, Leigh Yeh, Ali Omrani, Yehsong Kim, Kris Coombs, et al.) | Hate speech identification | Dataset | 👨‍👩‍👧‍👦 | | | | |
2018 | Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization (Shashi Narayan, Shay B. Cohen, Mirella Lapata) | | Dataset | 👨‍👩‍👧‍👦 | | | | |
2018 | Quantifying Qualitative Data for Understanding Controversial Issues () | |||||||
2018 | Addressing Age-Related Bias in Sentiment Analysis (Mark Diaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle) | Sentiment Analysis | Dataset | |||||
2018 | A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (Adina Williams, Nikita Nangia, Samuel Bowman) | Multi-genre NLI | Dataset | 👨‍👩‍👧‍👦 | | | | |
2015 | A large annotated corpus for learning natural language inference (Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning) | NLI | Dataset | 👨‍👩‍👧‍👦 | | | | |
2014 | Lexical Acquisition for Opinion Inference: A Sense-Level Lexicon of Benefactive and Malefactive Events (Yoonjung Choi, Lingjia Deng, and Janyce Wiebe) | WSD for sentiment analysis | Dataset | | | | 0.84 | Percent agreement |
" | " | " | " | " | " | " | 0.75 | Kappa |
- The order of datasets within each year is random
1: If there is no 👨‍👩‍👧‍👦 emoji, the dataset contains only aggregate-level annotations; if there is, it contains annotator-level data as well.
2:
- 📋 = contains annotator instructions
- 👨‍👩‍👧‍👦 = has annotator-level data
- 💻 = data is crowdsourced as well (not just annotations). For example, having MTurkers write pairs of sentences as opposed to scraping the web.
3: Refer to the paper for more details. The interrater agreement is reported per subset and/or using several metrics.
4: Calculated from information given in the paper or from the released dataset (see the sketch below).
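
As a rough, hypothetical illustration of what such a calculation can look like (not the exact procedure used for any particular row), the Python sketch below computes pairwise percent agreement and Fleiss' Kappa from a toy items × annotators label matrix; the labels and category names are made up.

```python
from collections import Counter
from itertools import combinations

# Toy annotator-level data: one row per item, one label per annotator.
# (Hypothetical labels; the datasets in the table use their own label schemes.)
labels = [
    ["hate", "hate", "not"],
    ["not",  "not",  "not"],
    ["hate", "not",  "not"],
    ["hate", "hate", "hate"],
]

def percent_agreement(rows):
    """Average pairwise agreement across items (the 'Percent agreement' metric)."""
    agree, total = 0, 0
    for row in rows:
        for a, b in combinations(row, 2):
            agree += (a == b)
            total += 1
    return agree / total

def fleiss_kappa(rows):
    """Fleiss' Kappa for a fixed number of annotators per item."""
    n = len(rows[0])  # annotators per item
    categories = sorted({lab for row in rows for lab in row})
    counts = [Counter(row) for row in rows]  # n_ij: annotators putting item i in category j
    # Per-item observed agreement P_i, then its mean P_bar
    P_i = [(sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1)) for c in counts]
    P_bar = sum(P_i) / len(rows)
    # Chance agreement from the marginal category proportions
    p_j = [sum(c[cat] for c in counts) / (len(rows) * n) for cat in categories]
    P_e = sum(p ** 2 for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

print(f"percent agreement: {percent_agreement(labels):.3f}")
print(f"Fleiss' Kappa:     {fleiss_kappa(labels):.3f}")
```

Note that Fleiss' Kappa as written assumes the same number of annotators for every item; for rows with a variable number of annotators per instance, Krippendorff's Alpha (sketched further below) is the usual choice.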
Papers on agreement calculation methods (TODO: organize into a table)
- J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.
- https://aclanthology.org/Q14-1025/
- https://pubmed.ncbi.nlm.nih.gov/18482474/
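
Several rows above report Alpha (Krippendorff's Alpha), which, unlike Fleiss' Kappa, copes with a variable number of annotators per item (e.g., 3-8 for ConvAbuse, 3-7 for the Ferracane et al. dataset). A minimal sketch, assuming NLTK is installed and using its nltk.metrics.agreement module with made-up labels (the papers may compute Alpha with their own tooling):

```python
# Krippendorff's Alpha over (annotator, item, label) triples, so items may have
# different numbers of annotators. Toy data only; the default distance is binary.
from nltk.metrics.agreement import AnnotationTask

triples = [
    ("a1", "ex1", "abusive"), ("a2", "ex1", "abusive"), ("a3", "ex1", "not"),
    ("a1", "ex2", "not"),     ("a2", "ex2", "not"),     ("a3", "ex2", "not"),
    ("a1", "ex3", "abusive"), ("a2", "ex3", "not"),     # only two annotators here
]

task = AnnotationTask(data=triples)
print(f"Krippendorff's Alpha: {task.alpha():.3f}")
```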
Datasets that do not contain label annotations but do contain input from individual annotators (TODO: organize into a table)