A MILNET implementation using the TensorFlow Keras API: a machine learning model for sentiment classification.
The model is MILNET, described in Stefanos Angelidis and Mirella Lapata, "Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis". It has the following features.
- Multiple-instance learning at the sentence level or elementary discourse unit (EDU) level
- Segment features extracted by convolutional layers with different kernel sizes, or by a sentence encoder
- Attention weights generated by a GRU layer
- Segment-level classifications merged by the attention weights
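The last two points, merging segment-level predictions with attention weights, can be sketched in plain NumPy. This is only an illustration of the idea, not the actual Keras layers; the function and variable names are made up for this example:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def merge_by_attention(segment_logits, attention_scores):
    """Combine per-segment class predictions into one document prediction.

    segment_logits:   (max_seg, n_classes) raw class scores per segment
    attention_scores: (max_seg,) raw attention score per segment
    """
    seg_probs = softmax(segment_logits)   # segment-level classification
    weights = softmax(attention_scores)   # attention weights, summing to 1
    return weights @ seg_probs            # weighted sum -> (n_classes,)

seg_logits = np.array([[2.0, 0.1, 0.1],   # segment 1 leans toward class 0
                       [0.1, 0.1, 2.0]])  # segment 2 leans toward class 2
doc_probs = merge_by_attention(seg_logits, np.array([0.0, 2.0]))
```

Because segment 2 receives the larger attention score, the document-level prediction follows its class distribution more closely.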
The processing pipeline is as follows.
The model uses Amazon product data as training data. The downloaded JSON file is reformatted to look like the following; the dataset should also be balanced during this reformatting step.
###5.0
Its a splitter that allows me to not be tortured by my children fighting over a toy. Anything that makes my life easier is a win-win! Its cheap and I will never travel without it!
###1.0
This product didn't even work. I plugged it into my Mac and plugged the other end into my tv. Nothing.
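A small parser for this format could look like the following sketch (the `###<rating>` marker starts a new review; the actual reformatting code lives in the repo, and `parse_reviews` is just an illustrative name):

```python
def parse_reviews(lines):
    """Parse the reformatted review file: a '###<rating>' line starts a new
    review; the lines until the next marker are its text."""
    reviews = []
    rating, text = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('###'):
            if rating is not None:
                reviews.append((rating, ' '.join(text)))
            rating, text = float(line[3:]), []
        elif line:
            text.append(line)
    if rating is not None:
        reviews.append((rating, ' '.join(text)))
    return reviews

sample = ["###5.0", "Its a splitter that allows me ...",
          "###1.0", "This product didn't even work."]
reviews = parse_reviews(sample)
```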
The RST parser generates EDU_BREAK annotations in the raw data file, which indicate where to segment the text into EDUs. The EDU-annotated raw data file looks like this.
###5.0
Its a splitter EDU_BREAK that allows me to not be tortured by my children fighting over a toy .
Anything EDU_BREAK that makes my life easier EDU_BREAK is a win-win !
Its cheap and I will never travel without it !
###1.0
This product did n't even work .
I plugged it into my Mac EDU_BREAK and plugged the other end into my tv .
Nothing
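Splitting a tokenized sentence on the EDU_BREAK marker then yields the individual EDUs. A minimal sketch (`split_edus` is an illustrative name, not a function from the repo):

```python
def split_edus(sentence):
    """Split one tokenized sentence on the EDU_BREAK marker inserted by the
    RST parser; each non-empty piece is one elementary discourse unit."""
    edus = [e.strip() for e in sentence.split('EDU_BREAK')]
    return [e for e in edus if e]

line = "Anything EDU_BREAK that makes my life easier EDU_BREAK is a win-win !"
edus = split_edus(line)
# -> ['Anything', 'that makes my life easier', 'is a win-win !']
```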
The preprocessing extracts the word list from the raw data file, lemmatizes the words, then assigns each word an index (starting from 1; 0 stands for padding). Running preprocessing.ipynb generates three files: an hdf5 file containing the review features and labels, a pkl file containing the dictionary mapping words to indices, and a npy file containing the embedding vectors of the words. The hdf5 file is the data source for the model, and the npy file is used as the non-trainable weights of the model's embedding layer.
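The indexing and padding step can be sketched as follows (the real implementation is in preprocessing.ipynb; these function names and the lemmatization step being omitted are assumptions of this sketch):

```python
import numpy as np

def build_vocab(segments):
    """Assign each word an index starting from 1; 0 is reserved for padding."""
    vocab = {}
    for seg in segments:
        for word in seg:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_review(segments, vocab, max_seg, max_word):
    """Encode one review as a (max_seg, max_word) int array, zero-padded."""
    arr = np.zeros((max_seg, max_word), dtype=np.int32)
    for i, seg in enumerate(segments[:max_seg]):
        for j, word in enumerate(seg[:max_word]):
            arr[i, j] = vocab.get(word, 0)
    return arr

segs = [["its", "cheap"], ["nothing"]]
vocab = build_vocab(segs)
features = encode_review(segs, vocab, max_seg=4, max_word=5)
```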
The hdf5 file has the following structure. The document and label groups contain batches of samples, stored as numpy arrays. The numpy array holding each sample's features has shape max_seg by max_word, with 0 used for padding.
+-- electronics.hdf5
| +-- document
| | +-- 1
| | +-- 2
| | +-- 3
| | +-- ...
| +-- label
| | +-- 1
| | +-- 2
| | +-- 3
| | +-- ...
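A file with this layout can be written and read with h5py, for example (a toy sketch; the dataset names, shapes, and label values here are illustrative, not the real preprocessing output):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "electronics.hdf5")
max_seg, max_word = 3, 4

# Write one numbered batch into each of the 'document' and 'label' groups.
with h5py.File(path, "w") as f:
    doc_grp = f.create_group("document")
    lab_grp = f.create_group("label")
    batch = np.zeros((2, max_seg, max_word), dtype=np.int32)  # 2 padded samples
    doc_grp.create_dataset("1", data=batch)
    lab_grp.create_dataset("1", data=np.array([4, 0]))  # e.g. star classes

# Read the batch back, as the model's data source would.
with h5py.File(path, "r") as f:
    docs = f["document"]["1"][()]
    labels = f["label"]["1"][()]
```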
milnet.ipynb uses convolutional layers to encode the documents, while milnet_xling.ipynb uses XLING, a sentence-level encoder. Data in the hdf5 file is read by data_generator, which shuffles the data, truncates it (since the max_seg and max_word defined in the preprocessing step are much larger than necessary), and reformats it into a suitable batch size. __balance_data is also called in this method; it removes 2-star and 4-star reviews, since we found this improves classification performance.
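The shuffling, truncating, balancing, and batching described above can be sketched like this (an illustrative version, not the notebook's actual data_generator or __balance_data; the 0-based class mapping for star ratings is an assumption):

```python
import numpy as np

def balance_data(features, labels, drop_classes=(1, 3)):
    """Drop 2-star and 4-star reviews (classes 1 and 3, assuming 0-based
    labels), as the notebook's __balance_data is described to do."""
    keep = ~np.isin(labels, drop_classes)
    return features[keep], labels[keep]

def data_generator(features, labels, batch_size, max_seg, max_word, seed=0):
    """Shuffle, truncate to a tighter (max_seg, max_word), and yield batches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    features, labels = features[order], labels[order]
    for start in range(0, len(labels) - batch_size + 1, batch_size):
        batch = features[start:start + batch_size, :max_seg, :max_word]
        yield batch, labels[start:start + batch_size]

# Toy data: 10 reviews padded to 8 segments x 12 words, labels 0..4.
X = np.ones((10, 8, 12), dtype=np.int32)
y = np.array([0, 1, 2, 3, 4] * 2)
Xb, yb = balance_data(X, y)
first_batch, first_labels = next(
    data_generator(Xb, yb, batch_size=2, max_seg=5, max_word=6))
```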
The method performance_judge helps evaluate the results. In our test runs, the result depends strongly on the dataset: test accuracy is 66.6% on the electronics dataset but 76.4% on the food dataset. We also found that using the XLING embedding, segmenting the input at the EDU level, and calling __balance_data all improve the results. The results on the food dataset with sentence-level segmentation and word2vec embeddings are as follows.
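The per-class precision, recall, and F1 scores reported below can be computed as in this sketch (an illustrative stand-in; the actual performance_judge lives in the notebooks):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Return overall accuracy plus (precision, recall, F1) per class."""
    results = {}
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[c] = (precision, recall, f1)
    accuracy = np.mean(y_true == y_pred)
    return accuracy, results

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc, metrics = per_class_metrics(y_true, y_pred, 3)
```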
########## Training Error ##########
Accuracy: 0.82
_____ Class 0 _____
Precision 0.811
Recall 0.742
F1 Score 0.774
_____ Class 1 _____
Precision 0.821
Recall 0.777
F1 Score 0.798
_____ Class 2 _____
Precision 0.827
Recall 0.93
F1 Score 0.875
############ Test Error ############
Accuracy: 0.77
_____ Class 0 _____
Precision 0.75
Recall 0.678
F1 Score 0.711
_____ Class 1 _____
Precision 0.776
Recall 0.73
F1 Score 0.752
_____ Class 2 _____
Precision 0.777
Recall 0.886
F1 Score 0.827
The project report will be updated soon.
This is the course project of Master Praktikum - Machine Learning and Natural Language Processing for Opinion Mining (IN2106, IN4249), Technical University of Munich. I'd like to thank my teammates, Tim Pfeifle and Hendrick Pauthner, our tutor, Hagerer Gerhard, and all members of the other teams.