Pytorch DDP - Timestamp must be non-decreasing for series attribute [Neptune fairseq integration] #1327
@harishankar-gopalan could you share a minimal reproducible example of your code?
OK, so I am integrating a progress_bar implementation for Neptune into Fairseq. All metrics logged from within the application are getting logged fine. Only the monitoring parameters are throwing an error, as seen below:
I am attaching my implementation of the progress_bar below:
I do have minor changes in the Fairseq configs and in the Fairseq train CLI to accommodate the required command-line params and wire them up to initialize the Neptune run.
For tracking DDP jobs with Neptune, you can refer to the guide here. The main idea is to create the Neptune run and log from rank 0 only.
Or, if you want to log from all ranks, make sure you use a separate hardware-monitoring namespace for each rank; see the sketch below.
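A minimal sketch of both options, assuming neptune>=1.0 (`neptune.init_run`, series `append()`) and that the launcher exports the global rank via the `RANK` environment variable; the project name and the `log_metric` helper are placeholders, not fairseq's actual wiring:

```python
import os
import neptune

rank = int(os.environ.get("RANK", "0"))  # assumption: launcher exports RANK

# Option 1: create the run and log from rank 0 only.
run = neptune.init_run(project="my-workspace/my-project") if rank == 0 else None

def log_metric(name, value, step):
    # Non-zero ranks simply skip logging, so no two processes
    # ever write to the same series field.
    if run is not None:
        run[name].append(value, step=step)

# Option 2: log from every rank, but give each rank its own
# hardware-monitoring namespace so ranks never share a series:
# run = neptune.init_run(
#     project="my-workspace/my-project",
#     monitoring_namespace=f"monitoring/rank-{rank}",
# )
```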
Let me know if this helps you!
Just checking in to see if you still need help with this :)
@Blaizzy Apologies for the delay, I haven't had a look at the resource you shared yet. Give me this week to go through it. I will update this thread on whether I need further help or not.
No problem, looking forward to it.
Hi @Blaizzy, I am already using the recommended method of logging from the rank 0 (master) process. Also, for multiple epochs I am caching the run ID and reusing it so that I can log to the same run, as recommended. I am attaching the relevant code where I instantiate the Neptune project only for the master process, similar to how it is already done for other logging vendors like WandB, AzureML, and the like, as can be seen here in the fairseq repository.
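Roughly, the pattern described above looks like the sketch below (the helper name and cache location are illustrative, not the actual fairseq progress_bar code): the run is created only on the master process, its ID is cached, and later epochs reattach to the same run via `with_id`:

```python
import os
import neptune

RUN_ID_FILE = ".neptune_run_id"  # hypothetical place to cache the run ID

def get_master_run(project):
    """Create the run once on the master process and reattach to it on later epochs."""
    if os.path.exists(RUN_ID_FILE):
        with open(RUN_ID_FILE) as f:
            cached_id = f.read().strip()
        return neptune.init_run(project=project, with_id=cached_id)
    run = neptune.init_run(project=project)
    with open(RUN_ID_FILE, "w") as f:
        f.write(run["sys/id"].fetch())
    return run
```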
Could you share the error you are getting now? The error you were getting before was due to initializing the run in all processes. Now, in the current case, I believe you might be instantiating the run multiple times in the master process. This error occurs because all the writers want to log into the same field at the same time, which causes a race condition.
Try to initialize the run once, as described here: https://docs.neptune.ai/tutorials/running_distributed_training/#tracking-a-single-node-multi-gpu-job
Will check. As far as I know, the run init is called once to create the run, and subsequent calls set the "with_id" parameter to that previously initialized run, as I want to log the details of all the epochs to the same run. Apart from that, the run init is only called once. I will check once more if there are any other loose ends where the run init gets called without the "with_id" parameter.
When you call it again with "with_id", you are re-instantiating the run. That could be the culprit. Could you send me a minimal reproducible example? It can be just how you set up and use Neptune in your code.
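A minimal sketch of that suggestion, assuming a single long-lived handle is enough for the whole training job (the singleton helper below is illustrative, not fairseq's or Neptune's API): initialize the run exactly once per process and reuse the same object across epochs, instead of calling `init_run(..., with_id=...)` repeatedly:

```python
import neptune

_RUN = None  # module-level singleton for the current process

def get_run(project):
    """Initialize the Neptune run once and reuse the same handle afterwards."""
    global _RUN
    if _RUN is None:
        _RUN = neptune.init_run(project=project)
    return _RUN

# Every epoch / every logger call then reuses the same object, e.g.:
# get_run("my-workspace/my-project")["train/loss"].append(loss_value, step=step)
```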
Closing this issue as it's stale |
I am also facing the same issue. I have run the job on only one GPU rank and am still getting the logs as stated above (reproduced below).
Where exactly is this an issue? As in, is there some way I can make changes to prevent this issue from occurring, some kind of sync/wait primitive available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.
Originally posted by @harishankar-gopalan in #733 (comment)