Data: 102582 train sentiments, 34194 test sentiments, target: int in [1, 5]. Scoring: categorical_accuracy.
- Delete nltk.corpus.stopwords.
- Filter word frequencies: delete words whose frequency in train and test is less than 2.
- Delete all non-alphanumeric words.
- Encode all train and test sentiments with keras.preprocessing.text.Tokenizer.
- Pad the left side of the encoded sentiments (a minimal sketch of the encoding and padding steps follows this list).
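A minimal sketch of the encoding and left-padding steps, assuming `train_texts` / `test_texts` hold the already-cleaned sentiment strings; `MAX_SEQUENCE_LENGTH` and `MAX_NB_WORDS` are assumed names and values, not taken from the original code:

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 50   # assumed padding length
MAX_NB_WORDS = 20000       # assumed vocabulary cap

# One tokenizer fitted on train + test so both share the same word index.
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)   # `num_words` in Keras 2
tokenizer.fit_on_texts(train_texts + test_texts)

# Encode sentiments as integer sequences.
train_seq = tokenizer.texts_to_sequences(train_texts)
test_seq = tokenizer.texts_to_sequences(test_texts)

# Pad on the left side (padding='pre' is the Keras default).
x_train = pad_sequences(train_seq, maxlen=MAX_SEQUENCE_LENGTH, padding='pre')
x_test = pad_sequences(test_seq, maxlen=MAX_SEQUENCE_LENGTH, padding='pre')
```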
Random forest:

train_size | test_size | n_estimators | score on test | training time |
---|---|---|---|---|
51272 | 51272 | 50 | 0.33 | < 1 min |
51272 | 51272 | 400 | 0.35 | ~ 10-20 min |
SVM:

train_size | test_size | kernel | score on test | training time |
---|---|---|---|---|
51272 | 51272 | rbf | ? | > 3 h |
51272 | 51272 | linear | 0.24 | < 1 min |
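The two baselines above were presumably fit directly on the padded token-id matrices; a hedged scikit-learn sketch with the hyperparameters from the tables (the feature representation, the split, and all variable names are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 51272 / 51272 split of the encoded train set (names assumed).
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.5, random_state=0)

# Random forest baseline on the padded integer sequences (assumed features).
rf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
rf.fit(x_tr, y_tr)
print('RF accuracy:', accuracy_score(y_val, rf.predict(x_val)))

# SVM baseline; per the table the linear kernel finishes quickly, rbf did not within 3 h.
svm = SVC(kernel='linear')
svm.fit(x_tr, y_tr)
print('SVM accuracy:', accuracy_score(y_val, svm.predict(x_val)))
```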
Pretrained GloVe dictionary (http://nlp.stanford.edu/projects/glove/): 6B tokens; dim=100; 400k different words. Neural network architecture (a runnable sketch follows the results table):
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = Flatten()(x)
x = Dense(300, activation='relu')(x)
x = Dense(128, activation='relu')(x)
out = Dense(5, activation='softmax')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
```
train_size | test_size | batch_size | nb_epoch | score | training time |
---|---|---|---|---|---|
51272 | 51272 | 128 | 2 | 0.41 | ~ 20 min |
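For completeness, a runnable version of the model above including the GloVe embedding matrix; the file path, `y_tr` / `y_val`, and the `tokenizer` / `x_tr` / `x_val` variables from the earlier sketches are assumptions:

```python
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model
from keras.utils.np_utils import to_categorical

EMBEDDING_DIM = 100                  # glove.6B.100d
word_index = tokenizer.word_index    # tokenizer from the preprocessing sketch

# Build the embedding matrix from the GloVe file (path assumed).
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Embedding layer initialised with the pretrained vectors and kept frozen.
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = Flatten()(x)
x = Dense(300, activation='relu')(x)
x = Dense(128, activation='relu')(x)
out = Dense(5, activation='softmax')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

# Labels in [1, 5] -> one-hot over 5 classes; x_tr / x_val are the 51272/51272 split.
model.fit(x_tr, to_categorical(y_tr - 1), batch_size=128, nb_epoch=2,   # `epochs` in Keras 2
          validation_data=(x_val, to_categorical(y_val - 1)))
```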
Pretrained GloVe dictionary: 6B tokens; dim=100; 400k different words. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(50)(x)
out = Dense(5, activation='softmax')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
```
train_size | test_size | batch_size | nb_epoch | score | training time |
---|---|---|---|---|---|
51272 | 51272 | 128 | 2 | 0.47 | ~ 60 min |
Pretrained GloVe dictionary: 6B tokens; dim=100; 400k different words. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(50, return_sequences=True)(x)   # first LSTM returns the full sequence for the second one
x = LSTM(50, W_regularizer='l2')(x)
out = Dense(5, activation='softmax')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
```
train_size | test_size | batch_size | nb_epoch | score | training time |
---|---|---|---|---|---|
51272 | 51272 | 128 | 2 | 0.42 | ~ 2h 30min |
Pretrained GloVe dictionary: 840B tokens; dim=300; 2.2M different words. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(150, W_regularizer='l2')(x)
x = Dropout(0.25)(x)
x = Dense(30, activation='relu', W_regularizer='l2')(x)
out = Dense(5, activation='softmax', W_regularizer='l2')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```
Total params: 275285
train_size | test_size | batch_size | nb_epoch | score | training + test time |
---|---|---|---|---|---|
51272 | 51272 | 128 | 1 | 0.4499 | ~ 90 min |
51272 | 51272 | 128 | 2 | 0.5035 | ~ 90 min |
51272 | 51272 | 128 | 3 | 0.5170 | ~ 90 min |
Pretrained GloVe dictionary: 840B tokens; dim=300; 2.2M different words. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(50, W_regularizer='l2')(x)
x = Dropout(0.25)(x)
x = Dense(25, activation='relu', W_regularizer='l2')(x)
out = Dense(5, activation='softmax', W_regularizer='l2')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```
Total params: 71605
train_size | test_size | batch_size | nb_epoch | score | training + test time |
---|---|---|---|---|---|
51272 | 51272 | 128 | 1 | 0.4927 | ~ 25 min |
51272 | 51272 | 128 | 2 | 0.4929 | ~ 25 min |
51272 | 51272 | 128 | 3 | 0.5261 | ~ 25 min |
Pretrained GloVe dictionary: 840B tokens; dim=300; 2.2M different words. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(25, W_regularizer='l2')(x)
x = Dropout(0.25)(x)
x = Dense(30, activation='relu', W_regularizer='l2')(x)
out = Dense(5, activation='softmax', W_regularizer='l2')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```
Total params: 33535
train_size | batch_size | nb_epoch | public leaderboard score | training time |
---|---|---|---|---|
102582 | 128 | 7 | 0.54056 | ~ 3 h |
Grid search over the mixture coefficient with 51272 train and 51272 test examples. After that, train on all the train data: RF with 400 trees, NN with batch_size = 128, nb_epoch = 15. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(25, W_regularizer='l2')(x)
x = Dropout(0.25)(x)
x = Dense(40, activation='relu', W_regularizer='l2')(x)
out = Dense(5, activation='softmax', W_regularizer='l2')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```
Best mixture is 0.959 * NN + (1 - 0.959) * RF.
train_size | public leaderboard score | private leaderboard score | training time |
---|---|---|---|
102582 | 0.55132 | 0.55513 | ~ 7 h |
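A hedged sketch of the probability mixture above: `alpha` is the grid-searched coefficient (0.959 here), `nn_model` and `rf` are the trained Keras model and random forest, and all variable names are assumptions:

```python
import numpy as np

def blend_predict(nn_model, rf, x, alpha=0.959):
    """Mix class probabilities: alpha * NN + (1 - alpha) * RF."""
    p_nn = nn_model.predict(x)          # shape (n_samples, 5)
    p_rf = rf.predict_proba(x)          # shape (n_samples, 5), classes sorted 1..5
    p = alpha * p_nn + (1.0 - alpha) * p_rf
    return p.argmax(axis=1) + 1         # back to labels in [1, 5]

# The coefficient itself can be grid-searched on the held-out 51272 examples:
# best_alpha = max(np.linspace(0.0, 1.0, 101),
#                  key=lambda a: (blend_predict(nn_model, rf, x_val, a) == y_val).mean())
```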
The same neural network trained on its own (without the RF mixture) on all the train data. Neural network architecture:
```python
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = LSTM(25, W_regularizer='l2')(x)
x = Dropout(0.25)(x)
x = Dense(40, activation='relu', W_regularizer='l2')(x)
out = Dense(5, activation='softmax', W_regularizer='l2')(x)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
```
Total params: 33845
train_size | batch_size | nb_epoch | public leaderboard score | private leaderboard score | training time |
---|---|---|---|---|---|
102582 | 128 | 7 | 0.55472 | 0.55559 | ~ 7 h |