Frost-Lee/milnet_tf_keras

MILNET in TensorFlow Keras

A MILNET implementation using the TensorFlow Keras API. MILNET is a machine learning model for sentiment classification.

Model

The model is MILNET, described in Stefanos Angelidis and Mirella Lapata, "Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis". Its main features are:

  • Multiple instance learning at the sentence level or the elementary discourse unit (EDU) level
  • Segment features extracted by convolutional layers with different kernel sizes, or by a sentence encoder
  • Attention weights generated by a GRU layer
  • Segment-level classification, with the segment predictions merged by the attention weights
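
The last point can be illustrated with a small NumPy sketch. It is not the code from the notebooks: the function name, the softmax normalisation of the attention scores, and the padding mask are assumptions made for illustration. It only shows how per-segment class distributions could be merged into one document-level distribution.

```python
import numpy as np

def merge_segment_predictions(segment_probs, attention_scores, mask):
    """Merge per-segment class distributions into one document distribution.

    segment_probs:    (num_segments, num_classes), each row sums to 1
    attention_scores: (num_segments,) unnormalised scores (assumed to come
                      from the GRU-based attention branch)
    mask:             (num_segments,) 1 for real segments, 0 for padding
    Assumes at least one segment is not padding.
    """
    segment_probs = np.asarray(segment_probs, dtype=float)
    attention_scores = np.asarray(attention_scores, dtype=float)
    mask = np.asarray(mask).astype(bool)

    # Padded segments get a score of -inf so their softmax weight is exactly 0.
    scores = np.where(mask, attention_scores, -np.inf)
    scores = scores - scores.max()          # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()       # softmax over the real segments

    # Attention-weighted average of the segment-level predictions.
    return weights @ segment_probs
```

Because the weights sum to 1 and every row of segment_probs is a distribution, the merged output is again a valid distribution over the classes.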

Pipeline

The processing pipeline is as follows.

Raw Data

The model is trained on Amazon product data. The downloaded JSON file is reformatted to look like the following. The dataset should be balanced during the reformatting step.

###5.0
Its a splitter that allows me to not be tortured by my children fighting over a toy. Anything that makes my life easier is a win-win! Its cheap and I will never travel without it!

###1.0
This product didn't even work. I plugged it into my Mac and plugged the other end into my tv. Nothing.
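
A file in this format could be read back with a few lines of Python. The sketch below is an assumption about how such a parser might look, not the code used in preprocessing.ipynb; the function name parse_reviews is hypothetical.

```python
def parse_reviews(text):
    """Split reformatted raw data into (rating, review_text) pairs.

    A line starting with '###' opens a new review and carries its star rating.
    The non-empty lines that follow, up to the next header, form the review body.
    """
    reviews = []
    rating, body = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("###"):
            # Close the previous review before starting a new one.
            if rating is not None:
                reviews.append((rating, "\n".join(body)))
            rating, body = float(line[3:]), []
        elif line and rating is not None:
            body.append(line)
    # Do not forget the last review in the file.
    if rating is not None:
        reviews.append((rating, "\n".join(body)))
    return reviews
```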

The RST parser inserts EDU_BREAK annotations into the raw data file, which indicate where to segment the text into EDUs. The EDU-annotated raw data file looks like this.

###5.0
Its a splitter EDU_BREAK that allows me to not be tortured by my children fighting over a toy .
Anything EDU_BREAK that makes my life easier EDU_BREAK is a win-win !
Its cheap and I will never travel without it !

###1.0
This product did n't even work .
I plugged it into my Mac EDU_BREAK and plugged the other end into my tv .
Nothing
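
Given a review body in this annotated form, the segments could be recovered roughly as follows. This is a sketch under two assumptions: each line holds one sentence, as in the example above, and tokens are separated by whitespace. The function name split_segments and its level parameter are hypothetical.

```python
def split_segments(review_body, level="edu"):
    """Turn an EDU-annotated review body into a list of token lists.

    level="edu":      split every sentence at each EDU_BREAK marker
    level="sentence": keep one segment per line and drop the markers
    """
    segments = []
    for sentence in review_body.splitlines():
        if level == "edu":
            parts = sentence.split("EDU_BREAK")
        else:
            parts = [sentence.replace("EDU_BREAK", " ")]
        for part in parts:
            tokens = part.split()
            if tokens:  # skip empty pieces, e.g. from blank lines
                segments.append(tokens)
    return segments
```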

Preprocessing

The preprocessing step extracts the word list from the raw data file, lemmatizes the words, and then assigns each word an index (starting from 1; 0 stands for padding). Running preprocessing.ipynb generates three files: an hdf5 file containing the review features and labels, a pkl file containing the dictionary that maps words to indices, and an npy file containing the embedding vectors of the words. The hdf5 file is the data source for the model, and the npy file provides the non-trainable weights of the model's embedding layer.
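
The indexing and padding scheme can be sketched as follows. This is an illustration, not the notebook's code: the function names are hypothetical, lemmatization is assumed to have been applied to the tokens already, and dropping words that are missing from the dictionary is an assumption.

```python
import numpy as np

def build_vocab(documents):
    """Assign each distinct word an index starting at 1; 0 is reserved for padding.

    documents: list of documents, each a list of segments (lists of tokens).
    """
    vocab = {}
    for segments in documents:
        for tokens in segments:
            for word in tokens:
                vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_document(segments, vocab, max_seg, max_word):
    """Return a (max_seg, max_word) integer array, truncated and zero-padded."""
    encoded = np.zeros((max_seg, max_word), dtype=np.int32)
    for i, tokens in enumerate(segments[:max_seg]):
        # Unknown words are dropped here (an assumption of this sketch).
        ids = [vocab[w] for w in tokens[:max_word] if w in vocab]
        encoded[i, :len(ids)] = ids
    return encoded
```

Reserving index 0 for padding means row 0 of the embedding matrix can stay a zero vector, so padded positions contribute nothing to the segment features.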

The hdf5 file has the following structure. The document and label groups contain batches of samples, stored as numpy arrays. Each sample's feature array has shape max_seg by max_word, with 0 used as padding.

+-- electronics.hdf5
|	+-- document
|	|	+-- 1
|	|	+-- 2
|	|	+-- 3
|	|	+-- ...
|	+-- label
|	|	+-- 1
|	|	+-- 2
|	|	+-- 3
|	|	+-- ...
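
A file with this layout could be traversed with h5py along the following lines. The sketch assumes that each numbered entry is a dataset holding one batch and that the document and label groups use matching keys; the function name iter_batches is hypothetical.

```python
import h5py

def iter_batches(h5_file):
    """Yield (documents, labels) for each batch under 'document' and 'label'.

    h5_file: an open h5py.File laid out as shown above.
    """
    # Batch keys are strings ("1", "2", ...), so sort them numerically.
    for key in sorted(h5_file["document"].keys(), key=int):
        documents = h5_file["document"][key][()]  # read the whole dataset as a numpy array
        labels = h5_file["label"][key][()]
        yield documents, labels
```

For example, `with h5py.File("electronics.hdf5", "r") as f:` followed by a loop over `iter_batches(f)` would visit the batches in numeric order.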

Training & Testing

milnet.ipynb uses convolutional layers to encode the documents, while milnet_xling.ipynb uses XLING, a sentence-level encoder. Data in the hdf5 file is read by data_generator, which shuffles the data, truncates it (since the max_seg and max_word defined in the preprocessing step are much larger than the proper values), and regroups it into batches of a suitable size. This method also calls __balance_data, which removes the 2-star and 4-star reviews, since we found that this improves classification performance.
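
The steps performed while reading the data can be sketched in NumPy as below. This is not the data_generator from the notebooks: the function names are hypothetical, the mapping of 1, 3 and 5 stars to classes 0, 1 and 2 is an assumption, and the sketch makes a single pass over the data instead of looping indefinitely.

```python
import numpy as np

def drop_ambiguous_stars(documents, stars):
    """Remove 2- and 4-star reviews and map the remaining stars to class ids."""
    star_to_class = {1: 0, 3: 1, 5: 2}  # assumed mapping, not taken from the notebooks
    documents, stars = np.asarray(documents), np.asarray(stars)
    keep = np.isin(stars, list(star_to_class))
    labels = np.array([star_to_class[int(s)] for s in stars[keep]], dtype=np.int64)
    return documents[keep], labels

def batch_generator(documents, stars, batch_size, max_seg, max_word, seed=0):
    """Filter, truncate, shuffle, and yield (documents, labels) batches once."""
    documents, labels = drop_ambiguous_stars(documents, stars)
    # Truncate the padded (samples, segments, words) array to smaller limits.
    documents = documents[:, :max_seg, :max_word]
    order = np.random.default_rng(seed).permutation(len(documents))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield documents[idx], labels[idx]
```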

Evaluate

The performance_judge method helps evaluate the result. In our test runs, the result depended strongly on the dataset: the test accuracy was 66.6% on the electronics dataset and 76.4% on the food dataset. We also found that using the XLING embedding, segmenting the input at the EDU level, and using __balance_data all helped improve the result. The result on the food dataset with sentence-level segmentation and word2vec embedding is as follows.

########## Training Error ##########
Accuracy: 0.82
_____ Class 0 _____
Precision	 0.811
Recall		 0.742
F1 Score	 0.774
_____ Class 1 _____
Precision	 0.821
Recall		 0.777
F1 Score	 0.798
_____ Class 2 _____
Precision	 0.827
Recall		 0.93
F1 Score	 0.875

############ Test Error ############
Accuracy: 0.77
_____ Class 0 _____
Precision	 0.75
Recall		 0.678
F1 Score	 0.711
_____ Class 1 _____
Precision	 0.776
Recall		 0.73
F1 Score	 0.752
_____ Class 2 _____
Precision	 0.777
Recall		 0.886
F1 Score	 0.827
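
For reference, the per-class figures in such a report follow the usual definitions of precision, recall and F1. The sketch below shows one way to compute them with NumPy; it is not the performance_judge implementation, and the function name per_class_metrics is hypothetical.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy plus precision, recall and F1 for each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    report = {"accuracy": float(np.mean(y_true == y_pred))}
    for c in range(num_classes):
        true_pos = int(np.sum((y_pred == c) & (y_true == c)))
        predicted = int(np.sum(y_pred == c))  # samples predicted as class c
        actual = int(np.sum(y_true == c))     # samples that really are class c
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```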

Project Report

The project report will be updated soon.

Special Thanks

This is the course project of Master Praktikum - Machine Learning and Natural Language Processing for Opinion Mining (IN2106, IN4249), Technical University of Munich. I'd like to thank my teammates, Tim Pfeifle and Hendrick Pauthner, our tutor, Hagerer Gerhard, and all members of the other teams.
