Method 1: Support Vector Classifier, Naive Bayes and Random Forest Classifier using the scikit-learn library
The pipeline consists of three main steps (a code sketch follows the list):
- Vectorize the text using TF-IDF scores (including stop-word removal)
- Fit the model on the training split and validate it on the validation split
- Obtain predictions for the test data using the fitted model
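A minimal sketch of this pipeline is shown below. The actual dataset, label scheme, and hyperparameters are not specified here, so placeholder texts, binary labels, and default classifier settings are assumed.

```python
# Sketch of the Method 1 pipeline (assumptions: placeholder data, binary
# labels, default hyperparameters; none of these are given in the write-up).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Placeholder corpus; replace with the real texts and labels.
texts = ["great movie", "terrible plot", "loved it", "boring film",
         "fantastic acting", "awful script"]
labels = [1, 0, 1, 0, 1, 0]

# 80/20 train/validation split, as described below.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

classifiers = {
    "SVC": SVC(),
    "NaiveBayes": MultinomialNB(),
    "RandomForest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    # Step 1: TF-IDF vectorization with English stop-word removal,
    # followed by the classifier.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", clf),
    ])
    # Step 2: fit on the training split and validate.
    pipe.fit(X_train, y_train)
    val_preds = pipe.predict(X_val)
    print(name, accuracy_score(y_val, val_preds), f1_score(y_val, val_preds))
    # Step 3: predictions for the unseen test data would use pipe.predict(...).
```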
Method 2: Convolutional Neural Network (CNN)
The pipeline consists of the following steps (a code sketch follows the list):
- Build the vocabulary from the training data and vectorize the text using it
- Train the CNN model on the training data for 'n' epochs and validate on the test data
- Obtain predictions for the test data using the trained model
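A minimal sketch of this pipeline is shown below, assuming a Keras/TensorFlow implementation with a simple 1D-CNN and binary labels; the actual framework, architecture, vocabulary size, and hyperparameters are not specified here, so the values used are illustrative.

```python
# Sketch of the Method 2 pipeline (assumptions: Keras/TensorFlow, binary
# labels, illustrative vocabulary size, sequence length, and architecture).
import numpy as np
from tensorflow.keras import layers, models

MAX_VOCAB = 10000   # assumed vocabulary size
MAX_LEN = 200       # assumed maximum sequence length
N_EPOCHS = 10       # the 'n' epochs from the description

# Placeholder corpus; replace with the real texts and labels.
train_texts = np.array(["great movie", "terrible plot", "loved it", "boring film"])
train_labels = np.array([1, 0, 1, 0])
test_texts = np.array(["fantastic acting"])

# Step 1: build the vocabulary from the training data and vectorize with it.
vectorizer = layers.TextVectorization(max_tokens=MAX_VOCAB,
                                      output_sequence_length=MAX_LEN)
vectorizer.adapt(train_texts)
X_train = vectorizer(train_texts).numpy()
X_test = vectorizer(test_texts).numpy()

# Step 2: a simple 1D-CNN classifier trained for 'n' epochs with an
# 80/20 train/validation split.
model = models.Sequential([
    layers.Embedding(MAX_VOCAB, 128),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, train_labels, epochs=N_EPOCHS, validation_split=0.2)

# Step 3: predictions for the test data using the trained model.
test_preds = (model.predict(X_test) > 0.5).astype(int)
```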
For both methods, the data was split into training and validation sets in a ratio of 0.8 to 0.2.
The evaluation measures used are accuracy and F1 score.
Observations:
- With Method 1, accuracy averaged 53-55% and the F1 score was 55%
- With Method 2, accuracy averaged 65-68% and the F1 score was 64%
The accuracy of both methods is relatively low because the number of training samples is small, and no pretrained model is used to compensate for the limited training data. Method 1 performs worse than Method 2 because its models have less capacity to capture the complexity of the data.