python 3 create_data script #4
Conversation
Thanks a lot for your contribution! Could you please confirm that a model trained with the Python 3 script reaches exactly the same performance as claimed in the paper? I previously used another Python 3 version of the script but ran into a weird reproducibility problem, so I want to make sure this one doesn't have the same issue. The pretrained model is also provided, so you could first use the pretrained one for a quick check instead of training from scratch.
I got a chance to compare your checkpoint against what I trained, and yes, they are different. Your pretrained checkpoint:
My checkpoint trained locally with Python 3 preprocessing:
I will try running it again after reprocessing with the Python 2 preprocessing and see if I can figure out what is happening.
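Rather than eyeballing final scores, one quick way to check whether two runs produced the same model is to diff the checkpoints parameter by parameter. The sketch below is an assumption, not code from this repo: with PyTorch you would obtain the two state dicts via `torch.load(path)` and flatten each tensor to a list; here the checkpoints are modeled as plain dicts mapping parameter names to flat lists of floats.

```python
def checkpoints_match(a, b, tol=1e-6):
    """Compare two state dicts (name -> flat list of floats) within a tolerance."""
    if set(a) != set(b):            # parameter names must agree
        return False
    for name in a:
        pa, pb = a[name], b[name]
        if len(pa) != len(pb):      # shapes must agree
            return False
        if any(abs(x - y) > tol for x, y in zip(pa, pb)):
            return False
    return True

# Toy usage with hand-made "checkpoints":
same = checkpoints_match({"w": [1.0, 2.0]}, {"w": [1.0, 2.0]})  # True
```

This catches divergence immediately after the first update instead of after a full training run.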
To follow up on this PR, I don't think that Python 3 is the source of the issue: it appears that there is no fixed random seed in default training, so it wouldn't be expected to reach the same performance on each training run. I did a hard reset to your master and ran the Python 2 pre-processing, and I also got differing results (and worse results than with my Python 3 preprocessing): Python 2 pre-processing
Short of running it 10x each way, which seems like it would take at least 20 GPU-days on a 1080 Ti, I'm not sure how to validate this.
No, that's not true. I am sure the random seed is fixed, and I get exactly the same result from every training run with Python 2 preprocessing. Something else must be going wrong (e.g., environment setup).
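For reference, here is a minimal sketch of seed fixing using only the standard library. In the actual training script one would additionally call `torch.manual_seed(seed)` (and `numpy.random.seed(seed)` if NumPy randomness is involved) for full determinism; that is an assumption about what determinism requires, not a description of this repo's code.

```python
import random

def seeded_samples(seed=42, n=3):
    """Draw n pseudo-random numbers after re-seeding, so runs are repeatable."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

# Two runs with the same seed produce identical draws:
assert seeded_samples(7) == seeded_samples(7)
```

If any randomness source is left unseeded, per-run scores will differ even when preprocessing is identical.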
Ah, you're right, of course, sorry; I didn't look closely at the training script CLI. I see that a fixed seed is set in training, so I'm not sure what the reason is, but I am seeing a different result from yours using the master branch and Python 2 preprocessing. Would it be possible to get a bit of early logging info from one of your runs, so I can compare dev results without running the whole training and confirm that those match?
Sure, I have uploaded the log file for reference. |
Hi, thanks for uploading the log, and sorry for the late response. As you probably suspected, the training loss differs at the first update:
vs the gold:
I tried to make a list of the requirements that I'm seeing in the code, and I was wondering how my versions compare to yours (these are retrieved from listing my conda envs). I didn't see a requirements.txt, but maybe you have something handy regarding which versions I should be running? For Python 3:
For Python 2:
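In the absence of a requirements.txt, one way to compare environments quickly is to print the interpreter and package versions on both sides and diff the output. This is a generic sketch (the package names passed in are just examples), not a utility from this repository.

```python
import sys

def env_report(packages=("torch", "numpy")):
    """Return the Python version plus the version of each named package
    (None if not installed, 'unknown' if it exposes no __version__)."""
    report = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            mod = __import__(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report

print(env_report())  # run on both machines and diff the output
```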
One more quick question: would you mind sharing (maybe via Google Drive) your {train, dev, test}_dials.json files with me? That might help troubleshoot quickly!
If you use Python 2, results match previous runs, and it now also works with Python 3
Hello, I managed to get exactly the same behavior in Python 2 and Python 3 by replacing all plain dicts with OrderedDict. I am not sure what you want to do about this change, since it might affect your overall score, but I think it proves that the score difference (at least between 2 and 3) is due to the difference in dictionary iteration order between Python 2 and Python 3. Unfortunately, I am not sure how to replicate the Python 2 dictionary ordering in Python 3; from this post: "There is no trivial way to make Python 3 dictionaries use a Python 2 ordering, so moving both 2 and 3 code bases to use OrderedDict is a more maintainable solution." I have updated my PR accordingly.
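The portable way to get identical iteration order on both interpreters is to stop relying on dict ordering at all and load the JSON with an `object_pairs_hook`, so keys are iterated in file order everywhere. A minimal sketch (the JSON string is a toy stand-in for the real *_dials.json files):

```python
import json
from collections import OrderedDict

raw = '{"hotel": 1, "attraction": 2, "train": 3}'  # toy stand-in
dials = json.loads(raw, object_pairs_hook=OrderedDict)

# Key order now matches the file on Python 2 and Python 3 alike,
# instead of Python 2's hash-based order.
keys = list(dials.keys())  # ["hotel", "attraction", "train"]
```

On Python 3.7+ plain dicts already preserve insertion order, but `OrderedDict` makes the guarantee explicit and keeps the two code bases behaving identically.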
Hi, I have uploaded the requirements file for the python == 2.7.3 environment; you can first try this environment for preprocessing to see if you can reproduce the result. If you still run into problems, try pytorch == 1.1.0 in the Python 3 environment. If all of the above fails (unlikely, I believe), I can provide the {train, dev, test}_dials.json files, which would need extra downloading and uploading. OrderedDict is a nice catch, but I don't think the order of OrderedDict will be the same as the original Python 2 dictionaries. You can try this order, and if it gives similar or better results, that would be the best outcome. Thanks a lot for all of your work on this. You are really amazing!
Thanks for the paper and for making this source code available! The rest of the repository is all Python 3, so I thought it would be easier if the create_data script also worked as a Python 3 script. After converting the script, I was able to run all of the other steps and execute training without modifications.