Replication on Quora dataset? #2

Open
andra-pumnea opened this issue Jun 16, 2018 · 5 comments
Comments

@andra-pumnea

Hi! I am trying to run an experiment on the Quora dataset. I am using the dataset split provided by https://github.com/zhiguowang/BiMPM and created a quora.w2v file in the same way as askubuntu.w2v and meta.w2v. I got the following error:

Using Theano backend.
INFO:Reading training sentence pairs from data/quora/train.tsv:
/ 298204 Elapsed Time: 0:10:34
/home/andrada.pumnea/anaconda3/lib/python3.6/site-packages/bs4/__init__.py:219: UserWarning: "b'.'" looks like a filename, not markup. You shouldprobably open this file and pass the filehandle intoBeautiful Soup.
'Beautiful Soup.' % markup)
| 384347 Elapsed Time: 0:13:40
INFO:...read 384348 pairs in 820.31 seconds.
INFO:...class distribution: 0 = 245042 (63.8%) | 1 = 139306 (36.2%)
INFO:Reading validation sentence pairs from data/quora/dev.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.21 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Reading testing sentence pairs from data/quora/test.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.26 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Vectorizing data:
INFO:...fitted tokenizer in 14.60 seconds;
INFO:...found 103831 unique tokens;
INFO:Load embeddings from models/quora2.w2v:
INFO:...read 36111 word embeddings in 2.82 seconds;
INFO:...created embedding matrix with shape (103832, 200);
INFO:...cached matrix in file models/quora2.w2v.min.cache.npy.
INFO:Creating CNN model:
INFO:...model created.
INFO:Compiling model:
INFO:...model 0105d13fe81945018824e64905d8f7ad compiled with optimizer: <keras.optimizers.SGD object at 0x7fd9dd23cef0>, lr (sgd-only): 0.005, loss: mse.
Model summary:

Layer (type)                     Output Shape          Param #     Connected to
input_1 (InputLayer)             (None, None)          0
input_2 (InputLayer)             (None, None)          0
embedding_1 (Embedding)          (None, None, 200)     20766400    input_1[0][0]
                                                                   input_2[0][0]
convolution1d_1 (Convolution1D)  (None, None, 300)     180300      embedding_1[0][0]
                                                                   embedding_1[1][0]
globalmaxpooling1d_1 (GlobalMaxPo(None, 300)           0           convolution1d_1[0][0]
                                                                   convolution1d_1[1][0]
activation_1 (Activation)        (None, 300)           0           globalmaxpooling1d_1[0][0]
                                                                   globalmaxpooling1d_1[1][0]
merge_1 (Merge)                  (None, 1)             0           activation_1[0][0]
                                                                   activation_1[1][0]

Total params: 20946700


INFO:Train on 384348 samples, validate on 10000 samples
INFO:Epoch 1/1
2% (11127 of 384348) |### | Elapsed Time: 0:23:50 ETA: 13:16:51
Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

I am using Ubuntu 16.04.3.

Any idea why it happened and how it can be fixed?

@joaoantonioverdade
Contributor

At first sight, I would say it is a memory problem: for such a large dataset there may not be enough memory on the machine.
Monitor memory usage while training, or try an incremental data approach.
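
One reading of "an incremental data approach" is to stream the training file in fixed-size batches rather than loading every pair at once. The following is a minimal sketch under assumptions: the function name `iter_pair_batches` is hypothetical, and the column order (label, first sentence, second sentence) is assumed rather than taken from this repository.

```python
import csv
from itertools import islice


def iter_pair_batches(lines, batch_size=1000):
    """Yield lists of (label, sentence1, sentence2) tuples, batch_size rows at a time.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumption: tab-separated rows with label, sentence1, sentence2 first.
    Rows with fewer than three columns are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = (tuple(r[:3]) for r in reader if len(r) >= 3)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be vectorized and fed to the model separately, so that only one batch of sentence pairs is held in memory at a time.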

@andra-pumnea
Author

I tried with a smaller sample of the Quora dataset (24k/6k/1k) and it still crashes with the same error:

Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

@joaoantonioverdade
Contributor

Does it run with the datasets we provide?
Are the required libraries (requirements.txt) installed?

@andra-pumnea
Author

andra-pumnea commented Jul 2, 2018

It runs with the provided datasets. I also installed the requirements. These are the packages installed in my env:

beautifulsoup4==4.5.3
certifi==2018.4.16
chardet==3.0.4
h5py==2.7.1
idna==2.7
Keras==1.1.0
nltk==3.2
numpy==1.12.0
progressbar2==3.12.0
pymystem3==0.1.5
python-utils==2.3.0
PyYAML==3.12
requests==2.19.1
scipy==1.1.0
six==1.11.0
Theano==0.8.2
urllib3==1.23

And this is the dataset I'm trying to run it on: https://drive.google.com/open?id=1-TV22E2ZY-NqGHIYiFa5r1eF6bWOs1ar

I generated my own quora.w2v with the following command:
./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3
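
With `-binary 0`, word2vec writes a text file whose first line holds the vocabulary size and vector dimension, followed by one word and its values per line. A quick sanity check of that file can rule out malformed rows before training. This is a minimal sketch; `inspect_w2v_text` is a hypothetical helper, not part of this repository.

```python
def inspect_w2v_text(lines):
    """Check a word2vec text-format file (written with -binary 0).

    `lines` is an iterable of text lines, for example an open file handle.
    Returns (declared_vocab, dim, rows_read, bad_line_numbers).
    """
    it = iter(lines)
    header = next(it).split()
    declared_vocab, dim = int(header[0]), int(header[1])
    rows_read = 0
    bad = []
    for lineno, line in enumerate(it, start=2):
        parts = line.rstrip("\n").split(" ")
        # word2vec writes a trailing space after the last value
        if parts and parts[-1] == "":
            parts.pop()
        if len(parts) != dim + 1:
            bad.append(lineno)
            continue
        try:
            [float(x) for x in parts[1:]]
        except ValueError:
            bad.append(lineno)
            continue
        rows_read += 1
    return declared_vocab, dim, rows_read, bad
```

For the command above, `dim` should come back as 200, and `rows_read` should match the declared vocabulary size with an empty list of bad lines.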

Any clue on why I am getting the error?

@joaoantonioverdade
Contributor

Using a different dataset and embeddings falls outside the scope of this repository and our work.

Nevertheless, if it runs with the provided dataset and embeddings I would say the problem must be the new dataset or the new embeddings.

There are some issues with the train.tsv you provided:

  • line 4200, the pair of sentences is missing
  • line 16535, one sentence is missing
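
Rows like these can be located automatically before training. The sketch below assumes tab-separated rows with the label first and the two sentences in the second and third columns; that column order and the name `find_incomplete_rows` are assumptions for illustration.

```python
def find_incomplete_rows(lines, min_fields=3):
    """Return (line_number, reason) for rows missing a field or a sentence.

    Assumption: column 0 is the label, columns 1 and 2 are the sentences.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < min_fields:
            reason = "expected at least %d fields, got %d" % (min_fields, len(fields))
            problems.append((lineno, reason))
        elif not fields[1].strip() or not fields[2].strip():
            problems.append((lineno, "empty sentence"))
    return problems
```

Running it over an open `train.tsv` handle lists the offending line numbers, which can then be fixed or dropped.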

I managed to train with 1000 samples from the dataset you provided by using the meta.w2v embeddings and changing the code to accept a smaller vocabulary.

Check that all the vocabulary from the dataset is represented in the embeddings.
Check that there are no encoding problems.
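
Both checks can be scripted. The log earlier in this thread reports 103831 unique tokens but only 36111 embeddings read, so measuring coverage seems worthwhile. The sketch below is illustrative only: both function names are hypothetical, and the lowercase whitespace tokenization is an assumption that may differ from the tokenizer this repository uses.

```python
def embedding_coverage(sentences, embedding_words):
    """Return (coverage_ratio, missing_tokens).

    Assumption: tokens are lowercased and split on whitespace.
    """
    vocab = set()
    for s in sentences:
        vocab.update(s.lower().split())
    known = set(embedding_words)
    missing = sorted(vocab - known)
    coverage = 1.0 - len(missing) / len(vocab) if vocab else 0.0
    return coverage, missing


def undecodable_lines(raw_lines, encoding="utf-8"):
    """Return the 1-based numbers of byte lines that fail to decode."""
    bad = []
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append(lineno)
    return bad
```

For the encoding check, open the file in binary mode (`open(path, "rb")`) so that each line arrives as bytes.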
