Replication on Quora dataset? #2

andra-pumnea · 2018-06-16T21:52:24Z

Hi! I am trying to run an experiment on the Quora dataset. I am using the dataset split provided by https://github.com/zhiguowang/BiMPM and created a quora.w2v file in the same way as askubuntu.w2v and meta.w2v. I got the following error:

Using Theano backend.
INFO:Reading training sentence pairs from data/quora/train.tsv:
/ 298204 Elapsed Time: 0:10:34
/home/andrada.pumnea/anaconda3/lib/python3.6/site-packages/bs4/__init__.py:219: UserWarning: "b'.'" looks like a filename, not markup. You shouldprobably open this file and pass the filehandle intoBeautiful Soup.
'Beautiful Soup.' % markup)
| 384347 Elapsed Time: 0:13:40
INFO:...read 384348 pairs in 820.31 seconds.
INFO:...class distribution: 0 = 245042 (63.8%) | 1 = 139306 (36.2%)
INFO:Reading validation sentence pairs from data/quora/dev.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.21 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Reading testing sentence pairs from data/quora/test.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.26 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Vectorizing data:
INFO:...fitted tokenizer in 14.60 seconds;
INFO:...found 103831 unique tokens;
INFO:Load embeddings from models/quora2.w2v:
INFO:...read 36111 word embeddings in 2.82 seconds;
INFO:...created embedding matrix with shape (103832, 200);
INFO:...cached matrix in file models/quora2.w2v.min.cache.npy.
INFO:Creating CNN model:
INFO:...model created.
INFO:Compiling model:
INFO:...model 0105d13fe81945018824e64905d8f7ad compiled with optimizer: <keras.optimizers.SGD object at 0x7fd9dd23cef0>, lr (sgd-only): 0.005, loss: mse.
Model summary:

Layer (type)                     Output Shape          Param #     Connected to

input_1 (InputLayer)             (None, None)          0

input_2 (InputLayer)             (None, None)          0

embedding_1 (Embedding)          (None, None, 200)     20766400    input_1[0][0]
                                                                   input_2[0][0]

convolution1d_1 (Convolution1D)  (None, None, 300)     180300      embedding_1[0][0]
                                                                   embedding_1[1][0]

globalmaxpooling1d_1 (GlobalMaxPo(None, 300)           0           convolution1d_1[0][0]
                                                                   convolution1d_1[1][0]

activation_1 (Activation)        (None, 300)           0           globalmaxpooling1d_1[0][0]
                                                                   globalmaxpooling1d_1[1][0]

merge_1 (Merge)                  (None, 1)             0           activation_1[0][0]
                                                                   activation_1[1][0]

Total params: 20946700

INFO:Train on 384348 samples, validate on 10000 samples
INFO:Epoch 1/1
2% (11127 of 384348) |### | Elapsed Time: 0:23:50 ETA: 13:16:51
Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

I am using Ubuntu 16.04.3.

Any idea why it happened and how it can be fixed?

joaoantonioverdade · 2018-06-18T10:12:46Z

At first sight, I would say it is a memory problem: with such a large dataset, the computer may not have enough memory.
Watch the memory usage while training, or try an incremental data approach.
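
Both suggestions can be sketched roughly as follows, assuming Python 3 on Linux. `iter_batches` and `peak_memory_mb` are hypothetical helper names, not part of this repository. The first yields fixed-size chunks of sentence pairs so that the full training set does not have to be held in memory at once; the second reports the peak resident memory of the process through the standard `resource` module.

```python
import itertools
import resource


def iter_batches(rows, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def peak_memory_mb():
    """Peak resident set size of this process in MiB (ru_maxrss is in KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


if __name__ == "__main__":
    # Stand-in data; in practice the rows would be read lazily from the training file.
    pairs = [("question %d a" % i, "question %d b" % i) for i in range(10)]
    for batch in iter_batches(pairs, 4):
        print(len(batch), "pairs, peak memory %.1f MiB" % peak_memory_mb())
```

Each batch could then be vectorized and passed to something like Keras's `train_on_batch`; how that would fit into this repository's training loop is not covered by the sketch.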

andra-pumnea · 2018-07-01T16:29:28Z

I tried with a smaller sample of the Quora dataset (24k/6k/1k) and it still crashes with the same error:

Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

joaoantonioverdade · 2018-07-02T08:04:15Z

Does it run with the datasets we provide?
Are the required libraries installed (requirements.txt)?

andra-pumnea · 2018-07-02T14:09:40Z

It runs with the provided datasets. I also installed the requirements. These are the packages installed in my environment:

beautifulsoup4==4.5.3
certifi==2018.4.16
chardet==3.0.4
h5py==2.7.1
idna==2.7
Keras==1.1.0
nltk==3.2
numpy==1.12.0
progressbar2==3.12.0
pymystem3==0.1.5
python-utils==2.3.0
PyYAML==3.12
requests==2.19.1
scipy==1.1.0
six==1.11.0
Theano==0.8.2
urllib3==1.23

And this is the dataset I'm trying to run it on: https://drive.google.com/open?id=1-TV22E2ZY-NqGHIYiFa5r1eF6bWOs1ar

I generated my own quora.w2v with the following command:
./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3

Any clue on why I am getting the error?

joaoantonioverdade · 2018-07-02T15:44:00Z

Using a different dataset and embeddings falls outside the scope of this repository and our work.

Nevertheless, if it runs with the provided dataset and embeddings I would say the problem must be the new dataset or the new embeddings.

There are some issues with the train.tsv you provided:

line 4200, the pair of sentences is missing
line 16535, one sentence is missing
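
Rows like these can be found automatically. The following is a minimal sketch, assuming each row is tab-separated with a label followed by two sentences; that column layout is an assumption, and `find_malformed_rows` is a hypothetical helper, not part of this repository.

```python
def find_malformed_rows(lines, expected_fields=3):
    """Return (line_number, reason) for rows with missing or empty fields.

    expected_fields=3 is an assumption: label, sentence 1, sentence 2.
    Extra trailing columns (for example an id) are ignored.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < expected_fields:
            reason = "expected %d fields, found %d" % (expected_fields, len(fields))
            problems.append((number, reason))
        elif any(not field.strip() for field in fields[:expected_fields]):
            problems.append((number, "empty field"))
    return problems


if __name__ == "__main__":
    # Stand-in rows; a real check would iterate over the opened train.tsv file.
    sample = [
        "0\tHow do I learn Python?\tWhat is the capital of France?\n",
        "1\t\t\n",
        "1\tIs this a duplicate?\n",
    ]
    for number, reason in find_malformed_rows(sample):
        print("line %d: %s" % (number, reason))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the file is checked without loading it fully into memory.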

I managed to train with 1000 samples from the dataset you provided by using the meta.w2v embeddings and changing the code to accept a smaller vocabulary.

Check whether all the vocabulary from the dataset is represented in the embeddings.
Check that there are no encoding problems.
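
Both checks can be sketched as below. The training log earlier in this thread reports 103831 unique tokens but only 36111 word embeddings, so coverage is worth measuring. The sketch assumes the embeddings are in word2vec's text format (`-binary 0`): one header line with the vocabulary size and dimension, then one word per line followed by its vector. The helper names are hypothetical and not part of this repository.

```python
def read_w2v_vocab(lines):
    """Collect the words from word2vec text output (header line, then 'word v1 v2 ...')."""
    iterator = iter(lines)
    next(iterator, None)  # skip the "<vocab_size> <dimensions>" header
    vocab = set()
    for line in iterator:
        parts = line.rstrip().split(" ")
        if parts and parts[0]:
            vocab.add(parts[0])
    return vocab


def coverage(tokens, vocab):
    """Return (fraction of distinct tokens present in vocab, sorted missing tokens)."""
    distinct = set(tokens)
    missing = sorted(distinct - vocab)
    found = len(distinct) - len(missing)
    return (found / len(distinct) if distinct else 1.0), missing


def undecodable_lines(raw_lines, encoding="utf-8"):
    """Return 1-based numbers of byte lines that fail to decode with the given encoding."""
    bad = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Stand-in data; real input would come from the .w2v file and the tokenized dataset.
    embeddings = ["2 3\n", "how 0.1 0.2 0.3\n", "python 0.4 0.5 0.6\n"]
    ratio, missing = coverage(["how", "python", "quora"], read_w2v_vocab(embeddings))
    print("coverage: %.2f, missing: %s" % (ratio, missing))
    print("undecodable lines:", undecodable_lines([b"ok\n", b"\xff\xfe\n"]))
```

For the encoding check, the files would need to be opened in binary mode (`"rb"`) so that each line arrives as bytes. UTF-8 as the expected encoding is also an assumption.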
