python 3 create_data script #4

Open
wants to merge 3 commits into master

Conversation

dpressel

Thanks for the paper and for making this source code available! The rest of the repository is all Python 3, so I thought it would be easier if the create_data script also worked as a Python 3 script. After converting the script, I was able to run all of the other steps and execute training without modification.

@renll
Owner

renll commented Mar 21, 2020

Thanks a lot for your contribution! Could you please confirm that the model trained with the python3 script reaches exactly the same performance as claimed in the paper? I previously used another version of a python3 script but ran into a weird reproducibility problem, so I want to make sure that this one doesn't have the same issue.

A pretrained model is also provided, so you could first use that one for a quick check instead of training from scratch.

@dpressel
Author

I got a chance to compare your checkpoint against the one I trained, and they are indeed different:

Your pretrained checkpoint:

slot_acc = 0.9552117263843648
joint_ds_acc = 0.5580890336590663
joint_all_acc = 0.48792073832790445

My checkpoint trained locally with python3 preproc:

slot_acc = 0.9535830618892508
joint_ds_acc = 0.5485884907709012
joint_all_acc = 0.46796959826275786

I will try running it again after reprocessing with the python2 preproc and see if I can figure out what is happening.

@dpressel
Author

dpressel commented Mar 29, 2020

To follow up on this PR, I don't think that Python 3 is the source of the issue -- it appears that there is no fixed random seed in default training, so it wouldn't be expected to reach the same performance on each training run. I did a hard reset to your master and ran the python2 pre-processing and also got differing results (and worse results than running with my Python 3 preprocessing):

Python 2 pre-processing:

slot_acc = 0.9413680781758957
joint_ds_acc = 0.5256514657980456
joint_all_acc = 0.4599619978284473

Short of running it 10x each way, which seems like it would take at least 20 GPU days on a 1080ti, I'm not sure how to validate this.

@renll
Owner

renll commented Mar 30, 2020

No, that's not true. I am sure the random seed is fixed, and I get exactly the same result on every training run with python2 preprocessing. Something else must be going wrong (e.g. environment setup).

@dpressel
Author

dpressel commented Mar 30, 2020

Ah, you’re right, of course, sorry. I didn’t look closely at the training script CLI. Yes, I see that there is a fixed seed set in training, so I’m not sure what the reason is, but I am seeing a different result from yours using the master branch and python 2 preprocessing. Would it be possible to get a bit of early logging info from one of your runs, so I can compare dev results without running the whole training and confirm that those match?
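For context, this is roughly the seed-fixing pattern I was checking for in the training script; the `--seed` flag and its default below are just illustrative, not necessarily what this repo actually uses:

```python
import argparse
import random

import numpy as np
import torch

# Hypothetical --seed flag, for illustration only; the repo's training script
# may name or default this differently.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=1234)
args = parser.parse_args()

# Fix every relevant RNG so repeated runs produce identical parameter updates.
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed_all(args.seed)

# Some cuDNN kernels are non-deterministic unless these are set, which can
# still cause small run-to-run differences on GPU even with fixed seeds.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```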

@renll
Owner

renll commented Mar 30, 2020

Sure, I have uploaded the log file for reference.

@dpressel
Author

dpressel commented Apr 8, 2020

Hi, thanks for uploading the log, and sorry for the late response. As you probably suspected, the training loss differs at the first update:

time: 348.183, epoch:   1, updates:       20, train loss: 7.65903, train sloss: 8.03430, train vloss: 7.75720

vs the gold:

time: 429.001, epoch:   1, updates:       20, train loss: 7.84684, train sloss: 8.18349, train vloss: 7.86175

I tried to make a list of the requirements that I'm seeing in the code, and I was wondering how my versions compare to yours (these are retrieved from listing my conda envs). I didn't see a requirements.txt, but maybe you have something handy regarding which versions I should be running?
Here are select packages from my local conda env:

For Python3:

numpy                     1.16.2           py37h7e9f1db_0  
numpy-base                1.16.2           py37hde5b4d6_0  
python                    3.7.6                h0371630_
pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
pytorch-pretrained-bert   0.6.1                    pypi_0    pypi
yaml                      0.1.7                had09818_2  

For Python2:

numpy                     1.16.5           py27h7e9f1db_0
numpy-base                1.16.5           py27hde5b4d6_0    
python                    2.7.16               h9bab390_7  
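If it helps the comparison, here is a small, generic snippet for dumping the relevant installed versions from inside an environment, so we can check them line by line (nothing repo-specific here):

```python
import pkg_resources

# Print installed versions of the packages that matter for preprocessing and
# training, using the pip distribution names.
for pkg in ("numpy", "torch", "pytorch-pretrained-bert"):
    print(pkg, pkg_resources.get_distribution(pkg).version)
```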

@dpressel
Author

dpressel commented Apr 8, 2020

One more quick question: would you mind sharing (maybe via gdrive) your {train, dev, test}_dials.json with me? This might help troubleshoot rather quickly!

dpressel added 2 commits April 8, 2020 15:40
If you use python2, results match previous, but also works with python3
@dpressel
Author

dpressel commented Apr 8, 2020

Hello, I managed to get exactly the same behavior in Python 2 and Python 3 by replacing all dict creation (= {}) with an OrderedDict. To test this, I modified your create_data.py in this manner as well as mine, and the results are identical.

I am not sure what you want to do about this change, since it might affect your overall score, but I think it proves that the score difference (at least between Python 2 and 3) is due to the difference in dictionary ordering between the two versions. Unfortunately, I am not sure how to replicate Python 2's dictionary output with Python 3:

https://stackoverflow.com/questions/51769239/why-json-dumps-in-python-3-return-a-different-value-of-python-2#51769379

from this post: "There is no trivial way to make Python 3 dictionaries use a Python 2 ordering, so moving both 2 and 3 code bases to use OrderedDict is a more maintainable solution."

I have updated my PR with OrderedDict(), which should make it easy to validate on your side that it matches, in case you are interested. If I remove the OrderedDict() and go back to plain dictionaries, my script produces output identical to yours under Python 2 but different output under Python 3. If you want, I can go ahead and check the OrderedDict() performance to see whether the results are close to or higher than the published results.
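To make the change concrete, here is a minimal illustration of the pattern; the keys and structure below are made up for the example and are not the actual create_data.py code:

```python
from collections import OrderedDict
import json

# Before: plain dict literals, whose iteration order differs between Python 2
# (hash order) and Python 3 (insertion order), so the dumped JSON differs too.
# dialogue = {}

# After: OrderedDict iterates in insertion order on both interpreters, so the
# serialized *_dials.json comes out identical under Python 2 and Python 3.
dialogue = OrderedDict()
dialogue["dialogue_idx"] = 0
dialogue["turns"] = []

print(json.dumps(dialogue))
```

Note that this only makes Python 2 and Python 3 agree with each other; it does not reproduce the old Python 2 hash ordering, which is why the scores can shift relative to the original preprocessing.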

@renll
Owner

renll commented Apr 9, 2020

Hi, I have uploaded the requirements file for the python==2.7.3 environment; you can first try this environment for preprocessing to see if you can reproduce the result. If you still run into problems, you could try pytorch==1.1.0 in the Python 3 environment.

If all of the above fails (unlikely, I believe), I can provide the {train, dev, test}_dials.json files, though that requires some extra downloading and uploading on my side.

OrderedDict is a nice catch, but I don't think the OrderedDict ordering will be the same as that of the original Python 2 dictionaries. You can try this ordering, and if it gives similar or better results, that would be best.

Thanks a lot for all of your work on this. You are really amazing!
