WMT-2017 transformer example: OOM error #3

Closed
MaksymDel opened this issue Mar 29, 2018 · 4 comments

@MaksymDel

Part of my stdout output:

[2018-03-29 18:23:41] Starting epoch 9
[2018-03-29 18:23:41] Training finished
[2018-03-29 18:23:46] Saving model to model/ens1/model.npz.best-ce-mean-words.npz
[2018-03-29 18:23:50] [valid] 16 : ce-mean-words : 7.50067 : new best
[2018-03-29 18:23:55] Saving model to model/ens1/model.npz.best-perplexity.npz
[2018-03-29 18:23:59] [valid] 16 : perplexity : 1809.25 : new best
tcmalloc: large alloc 1073741824 bytes == 0x2ea1a000 @ 
tcmalloc: large alloc 1610612736 bytes == 0x6ea1a000 @ 
tcmalloc: large alloc 2147483648 bytes == 0xf794000 @ 
tcmalloc: large alloc 2684354560 bytes == 0xf794000 @ 
tcmalloc: large alloc 3221225472 bytes == 0xcea1a000 @ 
tcmalloc: large alloc 3758096384 bytes == 0xfebe000 @ 
tcmalloc: large alloc 4294967296 bytes == 0x108ae000 @ 
tcmalloc: large alloc 4831838208 bytes == 0x10776000 @ 
tcmalloc: large alloc 5368709120 bytes == 0x10e64000 @ 
[2018-03-29 18:25:38] Error: out of memory - /storage/software/marian/src/marian/src/tensors/gpu/device.cu:30
./run-me.sh: line 108:  6273 Aborted

After that, the script continues.

I am using a 16 GB GPU to train the model. Any ideas on this?
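
For scale, the tcmalloc sizes in the log above can be converted to GiB. This is only an arithmetic sketch over the byte counts copied from the log; the names `ALLOC_BYTES` and `to_gib` are made up for illustration and are not part of Marian or tcmalloc.

```python
# Byte counts copied from the "tcmalloc: large alloc" lines in the log above.
ALLOC_BYTES = [
    1073741824, 1610612736, 2147483648, 2684354560, 3221225472,
    3758096384, 4294967296, 4831838208, 5368709120,
]

GIB = 1024 ** 3  # bytes per GiB


def to_gib(n_bytes):
    """Convert a byte count to GiB (hypothetical helper)."""
    return n_bytes / GIB


# Size of each request in GiB, and the growth between consecutive requests.
sizes_gib = [to_gib(b) for b in ALLOC_BYTES]
steps_gib = [b - a for a, b in zip(sizes_gib, sizes_gib[1:])]

if __name__ == "__main__":
    print(sizes_gib)
    print(steps_gib)
```

The requests grow from 1 GiB to 5 GiB in steps of 0.5 GiB before the out-of-memory error is reported.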

@MaksymDel
Author

Resolved by removing the entire model folder, regenerating the data, and re-running the script from scratch.

By default, Marian resumes training when it finds that the model folder is not empty, right?

@emjotde
Member

emjotde commented Mar 30, 2018

Yes, it does. With the example it is still a bit wacky, as the smoothed models (--exponential-smoothing) should not be the ones used for resuming, but it does not seem to do any harm either. We are currently working on making this fully correct.
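
The workaround from this thread (emptying the model folder so that there is nothing left to resume from) could be sketched as a small pre-flight step. This is a minimal sketch, not something Marian or the example script provides: `has_previous_run` and `reset_model_dir` are hypothetical helpers, and the path `model/ens1` is taken from the log above only as an example.

```python
import shutil
from pathlib import Path


def has_previous_run(model_dir):
    """Return True if model_dir exists and contains any files or folders."""
    path = Path(model_dir)
    return path.is_dir() and any(path.iterdir())


def reset_model_dir(model_dir):
    """Delete model_dir with everything in it, then recreate it empty."""
    path = Path(model_dir)
    if path.is_dir():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    # Example path from the log above; adjust to the actual setup.
    model_dir = "model/ens1"
    if has_previous_run(model_dir):
        print(f"{model_dir} is not empty; clearing it to train from scratch")
        reset_model_dir(model_dir)
```

Note that this deletes every checkpoint in the folder, so it only makes sense when a fresh start is really intended.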

@emjotde
Member

emjotde commented Mar 30, 2018

BTW, these are the line counts for the files in the data folder:

   19122526 data/all.bpe.de
   19122526 data/all.bpe.en
    4561263 data/corpus.bpe.de
    4561263 data/corpus.bpe.en
    4590101 data/corpus.de
    4590101 data/corpus.en
    4561263 data/corpus.tc.de
    4561263 data/corpus.tc.en
     157788 data/corpus.tok.de
    4590101 data/corpus.tok.en
    4590101 data/corpus.tok.uncleaned.de
    4590101 data/corpus.tok.uncleaned.en
   10000000 data/news.2016.bpe.de
   10000000 data/news.2016.bpe.en
   10000000 data/news.2016.de
   10000000 data/news.2016.tc.de
   10000000 data/news.2016.tok.de
       2737 data/test2014.bpe.en
       2737 data/test2014.en
       2737 data/test2014.tc.en
       2737 data/test2014.tok.en
       2169 data/test2015.bpe.en
       2169 data/test2015.en
       2169 data/test2015.tc.en
       2169 data/test2015.tok.en
       2999 data/test2016.bpe.en
       2999 data/test2016.en
       2999 data/test2016.tc.en
       2999 data/test2016.tok.en
       3004 data/test2017.bpe.en
       3004 data/test2017.en
       3004 data/test2017.tc.en
       3004 data/test2017.tok.en
       2999 data/valid.bpe.de
       2999 data/valid.bpe.en
       2999 data/valid.de
       2999 data/valid.en
       2999 data/valid.tc.de
       2999 data/valid.tc.en
       2999 data/valid.tok.de
       2999 data/valid.tok.en
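
A listing like the one above can be compared against a local run by checking that the two sides of each parallel file have the same number of lines. The following is a rough sketch under the assumption that the language is the last dot-separated part of the file name; `mismatched_pairs` and `SAMPLE` are illustrative names, and `SAMPLE` holds only a few entries copied from the listing.

```python
from collections import defaultdict


def mismatched_pairs(counts, langs=("de", "en")):
    """Return {stem: {lang: line_count}} for stems whose language sides differ.

    counts maps a file path such as "data/corpus.bpe.de" to its line count.
    Stems that do not have a file for every language in langs are skipped.
    """
    by_stem = defaultdict(dict)
    for path, n_lines in counts.items():
        stem, _, lang = path.rpartition(".")
        if lang in langs:
            by_stem[stem][lang] = n_lines
    return {
        stem: per_lang
        for stem, per_lang in by_stem.items()
        if len(per_lang) == len(langs) and len(set(per_lang.values())) > 1
    }


# A few entries copied from the listing above.
SAMPLE = {
    "data/corpus.bpe.de": 4561263,
    "data/corpus.bpe.en": 4561263,
    "data/corpus.tok.de": 157788,
    "data/corpus.tok.en": 4590101,
    "data/valid.bpe.de": 2999,
    "data/valid.bpe.en": 2999,
    "data/test2014.bpe.en": 2737,
}

if __name__ == "__main__":
    print(mismatched_pairs(SAMPLE))
```

On this sample the only pair reported is `data/corpus.tok`; the thread does not say why `data/corpus.tok.de` is shorter in the listing.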

@MaksymDel
Author

MaksymDel commented Mar 30, 2018

Thanks!

Closing for now.
