python 3 create_data script #4

Open
wants to merge 3 commits into master

Conversation

dpressel

Thanks for the paper and for making this source code available! The rest of the repository is all Python 3, so I thought it would be easier if the create_data script also worked as a Python 3 script. After converting the script, I was able to run all of the other steps and execute training without modification.

@renll
Owner

renll commented Mar 21, 2020

Thanks a lot for your contribution! Could you please confirm that the model trained with the python3 script reaches exactly the same performance as claimed in the paper? I previously used another version of a python3 script but ran into a weird reproducibility problem, so I want to make sure that this one doesn't have the same issue.

A pretrained model is also provided, so you could first use that one for a quick check instead of training from scratch.

@dpressel
Author

I got a chance to compare your checkpoint against the one I trained, and they are indeed different:

Your pretrained checkpoint:

slot_acc = 0.9552117263843648
joint_ds_acc = 0.5580890336590663
joint_all_acc = 0.48792073832790445

My checkpoint trained locally with python3 preproc:

slot_acc = 0.9535830618892508
joint_ds_acc = 0.5485884907709012
joint_all_acc = 0.46796959826275786

I will try running it again after reprocessing with the python2 preproc and see if I can figure out what is happening.

@dpressel
Author

dpressel commented Mar 29, 2020

To follow up on this PR, I don't think that Python 3 is the source of the issue -- it appears that there is no fixed random seed in default training, so it wouldn't be expected to reach the same performance on each training run. I did a hard reset to your master and ran the python2 pre-processing and also got differing results (and worse results than running with my Python 3 preprocessing):

Python 2 pre-processing:

slot_acc = 0.9413680781758957
joint_ds_acc = 0.5256514657980456
joint_all_acc = 0.4599619978284473

Short of running it 10x each way, which seems like it would take at least 20 GPU days on a 1080ti, I'm not sure how to validate this.

@renll
Owner

renll commented Mar 30, 2020

No, that's not true. I am sure the random seed is fixed, and I get exactly the same result on every training run with python2 preprocessing. Something else must be going wrong (e.g. environment setup).

@dpressel
Author

dpressel commented Mar 30, 2020

Ah, you’re right, of course, sorry. I didn’t look closely at the training script CLI. Yes, I see that there is a fixed seed set in training, so I’m not sure what the reason is, but I am seeing a different result from yours using the master branch and python 2 preprocessing. Would it be possible to get a bit of early logging info from one of your runs, so I can compare dev results without running the whole training and confirm that those match?
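For context, this is roughly the seed-fixing pattern I was checking for in the training script; the `--seed` flag and its default below are just illustrative, not necessarily what this repo actually uses:

```python
import argparse
import random

import numpy as np
import torch

# Hypothetical --seed flag, for illustration only; the repo's training script
# may name or default this differently.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=1234)
args = parser.parse_args()

# Fix every relevant RNG so repeated runs produce identical parameter updates.
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed_all(args.seed)

# Some cuDNN kernels are non-deterministic unless these are set, which can
# still cause small run-to-run differences on GPU even with fixed seeds.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```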

@renll
Owner

renll commented Mar 30, 2020

Sure, I have uploaded the log file for reference.

@dpressel
Author

dpressel commented Apr 8, 2020

Hi, thanks for uploading the log, and sorry for the late response. As you probably suspected, the training loss differs at the first update:

time: 348.183, epoch:   1, updates:       20, train loss: 7.65903, train sloss: 8.03430, train vloss: 7.75720

vs the gold:

time: 429.001, epoch:   1, updates:       20, train loss: 7.84684, train sloss: 8.18349, train vloss: 7.86175

I tried to make a list of the requirements that I'm seeing in the code, and I was wondering how my versions compare to yours (these are retrieved from listing my conda envs). I didn't see a requirements.txt, but maybe you have something handy regarding which versions I should be running?
Here are select packages from my local conda env:

For Python3:

numpy                     1.16.2           py37h7e9f1db_0  
numpy-base                1.16.2           py37hde5b4d6_0  
python                    3.7.6                h0371630_
pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
pytorch-pretrained-bert   0.6.1                    pypi_0    pypi
yaml                      0.1.7                had09818_2  

For Python2:

numpy                     1.16.5           py27h7e9f1db_0
numpy-base                1.16.5           py27hde5b4d6_0    
python                    2.7.16               h9bab390_7  
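If it helps the comparison, here is a small, generic snippet for dumping the relevant installed versions from inside an environment, so we can check them line by line (nothing repo-specific here):

```python
import pkg_resources

# Print installed versions of the packages that matter for preprocessing and
# training, using the pip distribution names.
for pkg in ("numpy", "torch", "pytorch-pretrained-bert"):
    print(pkg, pkg_resources.get_distribution(pkg).version)
```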

@dpressel
Author

dpressel commented Apr 8, 2020

One more quick question: would you mind sharing (maybe via gdrive) your {train, dev, test}_dials.json with me? This might help troubleshoot rather quickly!

dpressel added 2 commits April 8, 2020 15:40
If you use python2, results match previous, but also works with python3
@dpressel
Author

dpressel commented Apr 8, 2020

Hello, I managed to get exactly the same behavior in Python 2 and Python 3 by replacing all dict creation (= {}) with an OrderedDict. To test this, I modified your create_data.py in this manner as well as mine, and the results are identical.

I am not sure what you want to do about this change, since it might affect your overall score, but I think it proves that the score difference (at least between Python 2 and 3) is due to the difference in dictionary ordering between the two versions. Unfortunately, I am not sure how to replicate Python 2's dictionary output with Python 3:

https://stackoverflow.com/questions/51769239/why-json-dumps-in-python-3-return-a-different-value-of-python-2#51769379

from this post: "There is no trivial way to make Python 3 dictionaries use a Python 2 ordering, so moving both 2 and 3 code bases to use OrderedDict is a more maintainable solution."

I have updated my PR with OrderedDict(), which should make it easy to validate on your side that it matches, in case you are interested. If I remove the OrderedDict() and go back to plain dictionaries, my script produces output identical to yours under Python 2 but different output under Python 3. If you want, I can go ahead and check the OrderedDict() performance to see whether the results are close to or higher than the published results.
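To make the change concrete, here is a minimal illustration of the pattern; the keys and structure below are made up for the example and are not the actual create_data.py code:

```python
from collections import OrderedDict
import json

# Before: plain dict literals, whose iteration order differs between Python 2
# (hash order) and Python 3 (insertion order), so the dumped JSON differs too.
# dialogue = {}

# After: OrderedDict iterates in insertion order on both interpreters, so the
# serialized *_dials.json comes out identical under Python 2 and Python 3.
dialogue = OrderedDict()
dialogue["dialogue_idx"] = 0
dialogue["turns"] = []

print(json.dumps(dialogue))
```

Note that this only makes Python 2 and Python 3 agree with each other; it does not reproduce the old Python 2 hash ordering, which is why the scores can shift relative to the original preprocessing.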

@renll
Owner

renll commented Apr 9, 2020

Hi, I have uploaded the requirements file for the python==2.7.3 environment; you can first try this environment for preprocessing to see if you can reproduce the result. If you still run into problems, you could try pytorch==1.1.0 in the Python 3 environment.

If all of the above fails (unlikely, I believe), I can provide the {train, dev, test}_dials.json files, though that requires some extra downloading and uploading on my side.

OrderedDict is a nice catch, but I don't think the OrderedDict ordering will be the same as that of the original Python 2 dictionaries. You can try this ordering, and if it gives similar or better results, that would be best.

Thanks a lot for all of your work on this. You are really amazing!
