The resources provided here are only a sample. They are included just to get you started quickly; for a thorough analysis, the whole /res directory should be replaced with the full version available for download (or simply for browsing) at this link.
Below is a description of each group of resources used in my analysis and comparisons.
Each dataset directory has a minimum of 8 files (all having the same names), comprising the tokens, part-of-speech tags and labels for both the train and test sets, as well as the original data files.
- ghosh: the project's main dataset, collected by Aniruddha Ghosh and Tony Veale and also available on their repository; this directory contains additional files for quick experiments (train_sample and test_sample) together with their corresponding tokens, POS tags and labels.
- sarcasmdetection: collected by Mathieu Cliche and named in this rather vague way because the dataset was used to build the thesarcasmdetector website, accessible here; other papers refer to it under the same name.
- riloff: collected by Ellen Riloff, available for download on her publications page; this is also the project's oldest and smallest dataset, so the tweets are somewhat outdated and many of them have been deleted since 2013, when the tweet IDs were originally collected.
- hercig: collected by Tomáš Hercig (previously named Ptáček) and Ivan Habernal, obtained from their resources page with their consent. This is also the dataset on which the current state of the art has been achieved.
- demo: just a subset of Ghosh's dataset that I used for the demo.
| Corpus | Train (Sarcastic) | Train (Non-sarcastic) | Test (Sarcastic) | Test (Non-sarcastic) |
|---|---|---|---|---|
| Ghosh | 24,453 | 26,736 | 1,419 | 2,323 |
| Riloff | 215 | 1,153 | 93 | 495 |
| SarcasmDetector | 10,000 | 10,000 | 2,000 | 2,000 |
| Ptacek | 9,200 | 5,140 | 2,300 | 1,285 |
Note: For some of the above datasets, only the tweet IDs have been made publicly available, not the actual text bodies. A considerable number of those tweets have been deleted since the datasets were created, leaving us with subsets of the originals.
Important: these datasets are uploaded for convenience purposes only. I do not claim any rights over them, so use them at your own risk. Make sure that you respect their licences and cite the original papers and the authors who so kindly made them available to the research community.
This directory is based entirely on MIT's DeepMoji project. I adapted the code in their repo to produce .csv dataframe files for Ghosh's train and test sets. Each row of a dataframe contains the following information (a loading sketch follows this list):
- the text of the actual tweet
- overall confidence in the prediction made (a number between 0 and 1)
- indices of the top 5 predicted deepmojis (according to emoji/wanted_emojis.txt)
- a confidence score for each of the 5 predicted deepmojis (how suitable a predicted deepmoji is for the current tweet)
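For a quick look at one of these dataframes, a hedged sketch using pandas (the file path is hypothetical and the column names depend on the adapted script, so they are not guaranteed):

```python
# Inspect one of the generated DeepMoji dataframes; path and columns are assumptions.
import pandas as pd

df = pd.read_csv("res/deepmoji/train_deepmoji.csv")   # hypothetical path
print(df.columns)   # expect: tweet text, overall confidence, top-5 emoji indices and scores
print(df.head())
```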
- emoji_frequencies.txt contains the most popular emojis, sorted by their number of occurrences
- emoji_list.txt contains all the emojis and their description
- emoji_negative_samples.txt and emoji_positive_samples.txt contain the samples used to train the emoji2vec model
- emoji_sentiment_raw.txt and emoji_sentiment_dictionary.txt contain the sentiment probabilities associated with each emoji (positive, negative and neutral, calculated from occurrences in context)
- wanted_emojis.txt contains the 64 deepmojis mapped to their index as proposed by Bjarke Felbo et al. in their paper (original depiction here)
Contains the global vector representations for words gathered from the original GloVe page.
Alternatively, one can download them directly:
! wget -q http://nlp.stanford.edu/data/glove.6B.zip
! unzip -q -o glove.6B.zip
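Once downloaded, the vectors can be loaded into memory with a few lines of Python; a minimal sketch, using the 100-dimensional file from the archive above as an example:

```python
# Minimal GloVe loader: each line is a word followed by its space-separated vector components.
import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(embeddings["sarcasm"][:5])   # first 5 dimensions of an example word
```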
Two directories obtained by collecting, separately, the tokens and the part-of-speech tags of the train/test data (for Ghosh's dataset, which is also the default dataset), produced by applying the CMU Tweet POS Tagger. Several experiments were conducted on the data to draw some useful conclusions about the particularities of Twitter sarcasm:
- original_train.txt - Ghosh's original dataset, no pre-processing applied
- clean_original_train.txt - on original_train.txt, perform the following (a rough sketch follows this item):
- split around special characters
- all #sarca* are removed
- URLs are removed
- all user mentions are replaced with @user
- hashtags are split and a # sign is appended to every word in the split
Note that any #sarcasm tags obtained after the hashtag splitting process should not be removed.
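A rough, hypothetical re-implementation of these cleaning steps; the regexes and their ordering are illustrative, not the project's actual code:

```python
import re

def clean_tweet(tweet):
    """Hypothetical cleaning pass mirroring the steps listed above."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)     # remove URLs
    tweet = re.sub(r"#sarca\w*", "", tweet, flags=re.I)     # drop #sarcasm / #sarcastic labels
    tweet = re.sub(r"@\w+", "@user", tweet)                 # normalise user mentions

    def split_hashtag(match):
        # split the hashtag body on CamelCase / digits and re-attach a # to every piece
        words = re.findall(r"[A-Z]?[a-z]+|\d+", match.group(1))
        return " ".join("# " + w for w in words) if words else match.group(0)

    tweet = re.sub(r"#(\w+)", split_hashtag, tweet)
    tweet = re.sub(r'([!?.,;:()"])', r" \1 ", tweet)        # split around special characters
    return re.sub(r"\s+", " ", tweet).strip()

print(clean_tweet("Loving this #MondayMorning commute @john http://t.co/abc #sarcasm"))
# -> 'Loving this # Monday # Morning commute @user'
```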
- filtered_clean_original_train.txt - on clean_original_train.txt, perform the following (see the sketch after this item):
- lower-case and lemmatize every word
- stopwords are removed
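A minimal sketch of this filtering step, assuming NLTK's WordNet lemmatizer and a stop-word list such as stopwords.txt (described further below); the path is hypothetical:

```python
# Requires: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
with open("res/stopwords.txt", encoding="utf-8") as f:       # hypothetical path
    stopwords = {line.strip() for line in f if line.strip()}

def filter_tokens(tokens):
    lemmas = [lemmatizer.lemmatize(tok.lower()) for tok in tokens]   # lower-case + lemmatize
    return [lemma for lemma in lemmas if lemma not in stopwords]     # drop stopwords

print(filter_tokens(["The", "buses", "are", "running", "late", "again"]))
```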
- grammatical_train.txt - on clean_original_train.txt, perform the following (a partial sketch follows this item):
- the # sign before every hashtag is removed (not the hashtag itself)
- case is preserved, but words are lemmatized
- all hyphens at the beginning of a word are removed
- all contracted forms are expanded
- repeated characters are removed and the resulting word is checked against a dictionary
- emojis are left as they are
- emoticons are translated to emojis
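A partial sketch of some of these steps (contraction expansion, hyphen stripping, elongation reduction and emoticon translation); the lookup tables are tiny illustrative stand-ins, and lemmatization is omitted:

```python
import re

# Tiny illustrative lookup tables; the real maps and dictionary are much larger.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is", "i'm": "i am"}
EMOTICON_TO_EMOJI = {":)": "🙂", ":(": "🙁", ":D": "😄", ";)": "😉"}
DICTIONARY = {"so", "cool", "really"}          # in practice, loaded from a word list

def reduce_elongation(word):
    # collapse runs of 3+ identical characters to 2, then to 1 if the dictionary confirms it
    reduced = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if reduced.lower() in DICTIONARY:
        return reduced
    once = re.sub(r"(.)\1+", r"\1", reduced)
    return once if once.lower() in DICTIONARY else reduced

def normalise(tokens):
    out = []
    for tok in tokens:
        tok = EMOTICON_TO_EMOJI.get(tok, tok)          # emoticons -> emojis
        tok = CONTRACTIONS.get(tok.lower(), tok)       # expand contracted forms
        tok = tok.lstrip("-")                          # strip leading hyphens
        out.append(reduce_elongation(tok))
    return out

print(normalise(["That", "was", "sooooo", "cool", ":)", "-really"]))
# -> ['That', 'was', 'so', 'cool', '🙂', 'really']
```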
- finest_grammatical.txt - on clean_original_train.txt, perform the following (a partial sketch follows this item):
- the # sign before each hashtag is removed (not the hashtag itself)
- everything is lower-cased, words are lemmatized
- all hyphens at the beginning of a word are removed
- all contracted forms are expanded
- repeated characters are removed and the resulting word is checked against a dictionary
- emojis are translated to their descriptions
- emoticons are translated to their descriptions
- slang is corrected, abbreviations are expanded
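A small sketch of the additional finest_grammatical steps, using the third-party emoji package to obtain emoji descriptions; the emoticon and slang maps are illustrative stand-ins for the project's actual resources:

```python
import emoji

EMOTICON_DESCRIPTIONS = {":)": "smiling face", ":(": "sad face"}
SLANG = {"lol": "laughing out loud", "omg": "oh my god", "gr8": "great"}

def describe(tokens):
    out = []
    for tok in tokens:
        tok = EMOTICON_DESCRIPTIONS.get(tok, tok)            # emoticons -> descriptions
        tok = SLANG.get(tok.lower(), tok)                    # expand slang / abbreviations
        tok = emoji.demojize(tok, delimiters=(" ", " "))     # emojis -> descriptions
        out.append(tok.replace("_", " ").strip().lower())
    return out

print(describe(["omg", "this", "is", "gr8", "😂"]))
# -> ['oh my god', 'this', 'is', 'great', 'face with tears of joy']
```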
- strict_train.txt - the original dataset is cleared completely of hashtags, emojis, URLs and user mentions (a rough sketch follows this item).
Note that some of the lines might be empty after this clearing process.
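A hedged sketch of this strict clearing, assuming the emoji package (>= 2.0) for emoji removal; the regexes are illustrative:

```python
import re
import emoji   # emoji >= 2.0 provides replace_emoji

def strict_clean(tweet):
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)   # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)                     # remove user mentions
    tweet = re.sub(r"#\w+", "", tweet)                     # remove hashtags
    tweet = emoji.replace_emoji(tweet, replace="")         # remove emojis
    return re.sub(r"\s+", " ", tweet).strip()

print(repr(strict_clean("@john #sarcasm 😂 http://t.co/abc")))   # -> '' (an empty line)
```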
Contains the corpora and dictionaries used to train multiple LDA models in the topic-analysis phase (i.e., models with various numbers of topics/passes and different degrees of restrictiveness on the words allowed for topic training).
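As an illustration, such corpora and dictionaries can be built and consumed with gensim roughly as follows; the toy documents and the topic/pass counts are arbitrary, not the project's settings:

```python
from gensim import corpora, models

# Toy tokenised tweets; in the project these would be the filtered training tokens.
texts = [["traffic", "monday", "love"],
         ["rain", "lovely", "weather"],
         ["monday", "rain", "traffic"]]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]    # bag-of-words corpus

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())
```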
Contains multiple vocabularies built on Ghosh's training data with various degrees of filtering to accomplish different tasks.
- the MPQA subjectivity lexicon used to extract sentiment features
- word_list.txt and word_list_freq.txt are dictionary-like word lists used to split hashtags more accurately (see the sketch after this list)
- stopwords.txt and stopwords_loose.txt are two lists of commonly occurring words that are used to achieve different levels of filtering over the corpus
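A hedged sketch of dictionary-based hashtag splitting on top of a word list like word_list.txt; the greedy longest-match strategy here is an illustrative choice, not necessarily the project's algorithm:

```python
def load_words(path="res/word_list.txt"):      # hypothetical path
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def split_hashtag(tag, words):
    tag, result = tag.lstrip("#").lower(), []
    while tag:
        for end in range(len(tag), 0, -1):     # greedy: take the longest known prefix
            if tag[:end] in words or end == 1:
                result.append(tag[:end])
                tag = tag[end:]
                break
    return result

print(split_hashtag("#mondaymotivation", {"monday", "motivation", "moti"}))
# -> ['monday', 'motivation']
```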