Spooky Author Identification (GloVe + LSTM)
Suppose that we are given a specific text and we only know that the author of the text is one among Edgar Allan Poe, H. P. Lovecraft and Mary Wollstonecraft Shelley. Can we identify which of the three wrote it? In this work, we have a large dataset of texts labeled with the true author, who is one among these three writers, and we use it to build a model that predicts the author of a given text.
We use this problem to illustrate two relevant techniques: the GloVe model for word vectorization and a long short-term memory (LSTM) neural network for model building. The steps in this notebook towards this objective are as follows (illustrative code sketches for the main steps are given after the list):
- We define the `multiclass_log_loss` function, which takes in a matrix of binarized true target classes `y_true_binarized`, a matrix of predicted class probabilities `y_pred_probabilities` and a clipping parameter `epsilon`, and produces the multiclass version of the log loss metric between `y_true_binarized` and `y_pred_probabilities`. To use this function as the loss in model compilation, we write it with TensorFlow and Keras backend functions instead of the standard NumPy functions.
- We split the data in an $80:20$ ratio (the training set consisting of $80\%$ of the data, and the validation set consisting of the rest). We stratify the split using the labels, so that the proportion of each label remains roughly the same in the training set and the validation set.
- We encode the labels $\text{EAP}$, $\text{HPL}$ and $\text{MWS}$ using a dictionary that maps them to the integer values $0$, $1$ and $2$, respectively, and convert the integer label vectors to binary class matrices, each row of which is a one-hot vector corresponding to an integer component of the label vector.
- We fit the Keras tokenizer on the combined list of texts from the training set and the validation set; obtain the word indices through the `word_index` attribute; convert the texts to sequences of integers using the `texts_to_sequences` method; and use the `pad_sequences` function of Keras to pad the sequences to a maximum length equal to the smallest integer greater than $m + 2s$, where $m$ and $s$ respectively denote the mean and the standard deviation of the text lengths over the combined set of texts from the training set and the validation set. We construct a matrix of vector representations of the words found in the training set and the validation set by mapping the words to a $100$-dimensional vector space through GloVe embeddings.
- We build a sequential model consisting of an embedding layer whose weights are given by the matrix of word vectors constructed previously; a SpatialDropout1D layer; an LSTM layer with the number of units equal to the length of the GloVe vectors; two dense hidden layers with the ReLU activation function, each followed by a dropout layer; and an output layer of three neurons with the softmax activation function, giving the three probabilities for the three authors. The model is compiled with the manually defined `multiclass_log_loss` function as the loss and the Adam optimizer with an initial learning rate of $0.001$, which is then regulated by a manually defined schedule function `scheduler_modified_exponential` through a learning rate scheduler callback that updates the learning rate of the optimizer at each epoch.
- We fit the model on the padded sequences generated from the training texts and the binary class matrix generated from the training labels for a set number of epochs. The training loss and the validation loss are monitored at each epoch, and we stop the training, via an early stopping callback, once the validation loss stops improving. We produce a plot depicting how the training loss and the validation loss evolve over the epochs, giving an overall picture of the model building procedure.
- We employ the trained model to predict the probabilities of the texts, in both the training set and the validation set, having been written by each of the three authors, and obtain a training log loss of $0.391$ and a validation log loss of $0.581$. The predicted probabilities are then converted to labels by picking the class with the highest predicted probability, and we get a training accuracy of $0.846$ and a validation accuracy of $0.764$. Finally, a complete picture of the performance of the trained model on the validation set, in the context of classifying the texts as written by one of the three authors, is provided through a confusion matrix.
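The following is a minimal sketch of how the `multiclass_log_loss` function can be written with Keras backend functions so that it can be passed directly to `model.compile`; the default value of `epsilon` and other details are assumptions and may differ from the notebook's implementation.

```python
# Minimal sketch (assumed details): multiclass log loss written with Keras
# backend functions so it can serve as a compilable loss.
import tensorflow.keras.backend as K

def multiclass_log_loss(y_true_binarized, y_pred_probabilities, epsilon=1e-15):
    """Mean multiclass log loss between one-hot targets and predicted probabilities."""
    # Clip the probabilities away from 0 and 1 to avoid taking log(0).
    y_pred = K.clip(y_pred_probabilities, epsilon, 1.0 - epsilon)
    # Renormalize each row so the clipped probabilities still sum to one.
    y_pred = y_pred / K.sum(y_pred, axis=-1, keepdims=True)
    # Average over samples of -sum_k y_true_k * log(y_pred_k).
    return -K.mean(K.sum(y_true_binarized * K.log(y_pred), axis=-1))
```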
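A sketch of the stratified $80:20$ split and the label encoding, assuming the raw data is read into a pandas DataFrame with `text` and `author` columns (the file path and variable names here are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Competition training file with "text" and "author" columns (assumed path).
train_df = pd.read_csv("train.csv")

# Map the author labels to integers with a dictionary.
author_to_int = {"EAP": 0, "HPL": 1, "MWS": 2}
labels = train_df["author"].map(author_to_int).values

# Stratified 80:20 split so label proportions stay roughly the same in both sets.
texts_train, texts_val, y_train, y_val = train_test_split(
    train_df["text"].values, labels,
    test_size=0.20, stratify=labels, random_state=42)

# Convert integer labels to binary class (one-hot) matrices.
y_train_binarized = to_categorical(y_train, num_classes=3)
y_val_binarized = to_categorical(y_val, num_classes=3)
```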
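The tokenization, padding and GloVe embedding-matrix construction could look roughly as follows; the GloVe file name and the choice of measuring text length in words are assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the combined training and validation texts.
all_texts = list(texts_train) + list(texts_val)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
word_index = tokenizer.word_index

# Maximum length: smallest integer greater than mean + 2 * std of the text lengths.
lengths = np.array([len(text.split()) for text in all_texts])
max_len = int(np.floor(lengths.mean() + 2 * lengths.std())) + 1

# Convert the texts to integer sequences and pad them to max_len.
X_train = pad_sequences(tokenizer.texts_to_sequences(texts_train), maxlen=max_len)
X_val = pad_sequences(tokenizer.texts_to_sequences(texts_val), maxlen=max_len)

# Build the embedding matrix from 100-dimensional GloVe vectors (assumed file name).
embedding_dim = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```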
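A sketch of the model architecture, compilation and training; the dropout rates, dense layer widths, batch size, early stopping patience and the exact form of `scheduler_modified_exponential` are illustrative placeholders rather than the notebook's actual hyperparameters:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Dropout
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping

model = Sequential([
    # Embedding layer initialized with the GloVe embedding matrix.
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_dim,
              embeddings_initializer=Constant(embedding_matrix), trainable=False),
    SpatialDropout1D(0.2),
    LSTM(embedding_dim),              # number of units equal to the GloVe vector length
    Dense(64, activation="relu"),     # hidden layer widths are placeholders
    Dropout(0.3),
    Dense(32, activation="relu"),
    Dropout(0.3),
    Dense(3, activation="softmax"),   # one probability per author
])

def scheduler_modified_exponential(epoch, lr):
    # Placeholder schedule: hold the initial rate briefly, then decay exponentially.
    return lr if epoch < 3 else lr * 0.9

# Compile with the manually defined loss and Adam at an initial learning rate of 0.001.
model.compile(loss=multiclass_log_loss, optimizer=Adam(learning_rate=0.001))

history = model.fit(
    X_train, y_train_binarized,
    validation_data=(X_val, y_val_binarized),
    epochs=50, batch_size=64,
    callbacks=[LearningRateScheduler(scheduler_modified_exponential),
               EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)])

# Plot how the training and validation loss evolved over the epochs.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```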
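Finally, a sketch of the evaluation step that produces the log losses, accuracies and confusion matrix reported above:

```python
import numpy as np
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix

# Predicted class probabilities for both sets.
train_probs = model.predict(X_train)
val_probs = model.predict(X_val)

print("training log loss:  ", log_loss(y_train, train_probs))
print("validation log loss:", log_loss(y_val, val_probs))

# Convert probabilities to labels by picking the most probable class.
train_preds = np.argmax(train_probs, axis=1)
val_preds = np.argmax(val_probs, axis=1)

print("training accuracy:  ", accuracy_score(y_train, train_preds))
print("validation accuracy:", accuracy_score(y_val, val_preds))

# Rows are true authors (EAP, HPL, MWS) and columns are predicted authors.
print(confusion_matrix(y_val, val_preds))
```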
- Dataset provided in the Kaggle competition Spooky Author Identification
- GloVe: Global Vectors for Word Representation dataset by Rachael Tatman
- How to Choose a Learning Rate Scheduler for Neural Networks by Yi Li
- Practical Recommendations for Gradient-Based Training of Deep Architectures by Yoshua Bengio
- GloVe: Global Vectors for Word Representation by Jeffrey Pennington, Richard Socher and Christopher D. Manning
- GloVe Research Paper Clearly Explained by Meesala Lokesh
- Long Short-Term Memory by Sepp Hochreiter and Jürgen Schmidhuber
- The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
- Understanding LSTM Networks by Christopher Olah