A MILNET implementation using the TensorFlow Keras API: a machine learning model for sentiment classification.
The model is MILNET, described in Stefanos Angelidis and Mirella Lapata, "Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis". It has the following features.
- Multiple-instance learning at the sentence level or elementary discourse unit (EDU) level
- Segment features extracted by convolutional layers with different kernel sizes, or by a sentence encoder
- Attention weights generated by a GRU layer
- Segment-level classifications merged by the attention weights
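The last two points, merging segment-level predictions with attention weights, can be sketched in plain NumPy. This is only an illustration of the idea, not the actual Keras layers; the function and variable names are made up for this example:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def merge_by_attention(segment_logits, attention_scores):
    """Combine per-segment class predictions into one document prediction.

    segment_logits:   (max_seg, n_classes) raw class scores per segment
    attention_scores: (max_seg,) raw attention score per segment
    """
    seg_probs = softmax(segment_logits)   # segment-level classification
    weights = softmax(attention_scores)   # attention weights, summing to 1
    return weights @ seg_probs            # weighted sum -> (n_classes,)

seg_logits = np.array([[2.0, 0.1, 0.1],   # segment 1 leans toward class 0
                       [0.1, 0.1, 2.0]])  # segment 2 leans toward class 2
doc_probs = merge_by_attention(seg_logits, np.array([0.0, 2.0]))
```

Because segment 2 receives the larger attention score, the document-level prediction follows its class distribution more closely.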
The processing pipeline is as follows.
The model uses Amazon product data as training data. The downloaded JSON file is reformatted to look like the following; the dataset should also be balanced during this reformatting step.
###5.0
Its a splitter that allows me to not be tortured by my children fighting over a toy. Anything that makes my life easier is a win-win! Its cheap and I will never travel without it!
###1.0
This product didn't even work. I plugged it into my Mac and plugged the other end into my tv. Nothing.
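A small parser for this format could look like the following sketch (the `###<rating>` marker starts a new review; the actual reformatting code lives in the repo, and `parse_reviews` is just an illustrative name):

```python
def parse_reviews(lines):
    """Parse the reformatted review file: a '###<rating>' line starts a new
    review; the lines until the next marker are its text."""
    reviews = []
    rating, text = None, []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('###'):
            if rating is not None:
                reviews.append((rating, ' '.join(text)))
            rating, text = float(line[3:]), []
        elif line:
            text.append(line)
    if rating is not None:
        reviews.append((rating, ' '.join(text)))
    return reviews

sample = ["###5.0", "Its a splitter that allows me ...",
          "###1.0", "This product didn't even work."]
reviews = parse_reviews(sample)
```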
The RST parser generates EDU_BREAK annotations in the raw data file, which indicate where to segment the text into EDUs. The EDU-annotated raw data file looks like this.
###5.0
Its a splitter EDU_BREAK that allows me to not be tortured by my children fighting over a toy .
Anything EDU_BREAK that makes my life easier EDU_BREAK is a win-win !
Its cheap and I will never travel without it !
###1.0
This product did n't even work .
I plugged it into my Mac EDU_BREAK and plugged the other end into my tv .
Nothing
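Splitting a tokenized sentence on the EDU_BREAK marker then yields the individual EDUs. A minimal sketch (`split_edus` is an illustrative name, not a function from the repo):

```python
def split_edus(sentence):
    """Split one tokenized sentence on the EDU_BREAK marker inserted by the
    RST parser; each non-empty piece is one elementary discourse unit."""
    edus = [e.strip() for e in sentence.split('EDU_BREAK')]
    return [e for e in edus if e]

line = "Anything EDU_BREAK that makes my life easier EDU_BREAK is a win-win !"
edus = split_edus(line)
# -> ['Anything', 'that makes my life easier', 'is a win-win !']
```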
The preprocessing extracts the word list from the raw data file, lemmatizes the words, then assigns each word an index (starting from 1; 0 stands for padding). Running preprocessing.ipynb generates three files: an hdf5 file containing the review features and labels, a pkl file containing the dictionary mapping words to indices, and a npy file containing the embedding vectors of the words. The hdf5 file is the data source for the model, and the npy file is used as the non-trainable weights of the model's embedding layer.
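The indexing and padding step can be sketched as follows (the real implementation is in preprocessing.ipynb; these function names and the lemmatization step being omitted are assumptions of this sketch):

```python
import numpy as np

def build_vocab(segments):
    """Assign each word an index starting from 1; 0 is reserved for padding."""
    vocab = {}
    for seg in segments:
        for word in seg:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_review(segments, vocab, max_seg, max_word):
    """Encode one review as a (max_seg, max_word) int array, zero-padded."""
    arr = np.zeros((max_seg, max_word), dtype=np.int32)
    for i, seg in enumerate(segments[:max_seg]):
        for j, word in enumerate(seg[:max_word]):
            arr[i, j] = vocab.get(word, 0)
    return arr

segs = [["its", "cheap"], ["nothing"]]
vocab = build_vocab(segs)
features = encode_review(segs, vocab, max_seg=4, max_word=5)
```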
The hdf5 file has the following structure. The document and label groups contain batches of samples, stored as numpy arrays. The numpy array holding each sample's features has shape max_seg by max_word, with 0 used for padding.
+-- electronics.hdf5
| +-- document
| | +-- 1
| | +-- 2
| | +-- 3
| | +-- ...
| +-- label
| | +-- 1
| | +-- 2
| | +-- 3
| | +-- ...
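A file with this layout can be written and read with h5py, for example (a toy sketch; the dataset names, shapes, and label values here are illustrative, not the real preprocessing output):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "electronics.hdf5")
max_seg, max_word = 3, 4

# Write one numbered batch into each of the 'document' and 'label' groups.
with h5py.File(path, "w") as f:
    doc_grp = f.create_group("document")
    lab_grp = f.create_group("label")
    batch = np.zeros((2, max_seg, max_word), dtype=np.int32)  # 2 padded samples
    doc_grp.create_dataset("1", data=batch)
    lab_grp.create_dataset("1", data=np.array([4, 0]))  # e.g. star classes

# Read the batch back, as the model's data source would.
with h5py.File(path, "r") as f:
    docs = f["document"]["1"][()]
    labels = f["label"]["1"][()]
```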
milnet.ipynb uses convolutional layers to encode the documents, while milnet_xling.ipynb uses XLING, a sentence-level encoder. Data in the hdf5 file is read by data_generator, which shuffles the data, truncates it (since the max_seg and max_word defined in the preprocessing step are much larger than necessary), and reformats it into a suitable batch size. __balance_data is also called in this method; it removes 2-star and 4-star reviews, since we found this improves classification performance.
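The shuffling, truncating, balancing, and batching described above can be sketched like this (an illustrative version, not the notebook's actual data_generator or __balance_data; the 0-based class mapping for star ratings is an assumption):

```python
import numpy as np

def balance_data(features, labels, drop_classes=(1, 3)):
    """Drop 2-star and 4-star reviews (classes 1 and 3, assuming 0-based
    labels), as the notebook's __balance_data is described to do."""
    keep = ~np.isin(labels, drop_classes)
    return features[keep], labels[keep]

def data_generator(features, labels, batch_size, max_seg, max_word, seed=0):
    """Shuffle, truncate to a tighter (max_seg, max_word), and yield batches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    features, labels = features[order], labels[order]
    for start in range(0, len(labels) - batch_size + 1, batch_size):
        batch = features[start:start + batch_size, :max_seg, :max_word]
        yield batch, labels[start:start + batch_size]

# Toy data: 10 reviews padded to 8 segments x 12 words, labels 0..4.
X = np.ones((10, 8, 12), dtype=np.int32)
y = np.array([0, 1, 2, 3, 4] * 2)
Xb, yb = balance_data(X, y)
first_batch, first_labels = next(
    data_generator(Xb, yb, batch_size=2, max_seg=5, max_word=6))
```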
The method performance_judge helps evaluate the results. In our test runs, the result depends strongly on the dataset: test accuracy is 66.6% on the electronics dataset but 76.4% on the food dataset. We also found that using the XLING embedding, segmenting the input at the EDU level, and calling __balance_data all improve the results. The results on the food dataset with sentence-level segmentation and word2vec embeddings are as follows.
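The per-class precision, recall, and F1 scores reported below can be computed as in this sketch (an illustrative stand-in; the actual performance_judge lives in the notebooks):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Return overall accuracy plus (precision, recall, F1) per class."""
    results = {}
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[c] = (precision, recall, f1)
    accuracy = np.mean(y_true == y_pred)
    return accuracy, results

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc, metrics = per_class_metrics(y_true, y_pred, 3)
```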
########## Training Error ##########
Accuracy: 0.82
_____ Class 0 _____
Precision 0.811
Recall 0.742
F1 Score 0.774
_____ Class 1 _____
Precision 0.821
Recall 0.777
F1 Score 0.798
_____ Class 2 _____
Precision 0.827
Recall 0.93
F1 Score 0.875
############ Test Error ############
Accuracy: 0.77
_____ Class 0 _____
Precision 0.75
Recall 0.678
F1 Score 0.711
_____ Class 1 _____
Precision 0.776
Recall 0.73
F1 Score 0.752
_____ Class 2 _____
Precision 0.777
Recall 0.886
F1 Score 0.827
The project report will be updated soon.
This is the course project of Master Praktikum - Machine Learning and Natural Language Processing for Opinion Mining (IN2106, IN4249), Technical University of Munich. I'd like to thank my teammates, Tim Pfeifle and Hendrick Pauthner, our tutor, Hagerer Gerhard, and all members of the other teams.