Pytorch DDP - Timestamp must be non-decreasing for series attribute [Neptune fairseq integration] #1327
@harishankar-gopalan could you share a minimal reproducible example of your code?
OK, so I am integrating a progress_bar implementation for Neptune into Fairseq. All metrics logged from within the application are getting logged fine. Only the monitoring parameters are throwing an error, as seen below:
I am attaching my implementation of the progress_bar below:
I do have minor changes in the Fairseq configs and in the Fairseq train CLI to accommodate the required command-line params and wire them up to initialize the Neptune run.
For tracking DDP jobs with Neptune, you can refer to the guide here. The main idea is to create the Neptune run and log from rank 0 only.
Or, if you want to log from all ranks, make sure you use a separate hardware-monitoring namespace for each rank; see the sketch below.
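A minimal sketch of both options, assuming neptune>=1.0 (`neptune.init_run`, series `append()`) and that the launcher exports the global rank via the `RANK` environment variable; the project name and the `log_metric` helper are placeholders, not fairseq's actual wiring:

```python
import os
import neptune

rank = int(os.environ.get("RANK", "0"))  # assumption: launcher exports RANK

# Option 1: create the run and log from rank 0 only.
run = neptune.init_run(project="my-workspace/my-project") if rank == 0 else None

def log_metric(name, value, step):
    # Non-zero ranks simply skip logging, so no two processes
    # ever write to the same series field.
    if run is not None:
        run[name].append(value, step=step)

# Option 2: log from every rank, but give each rank its own
# hardware-monitoring namespace so ranks never share a series:
# run = neptune.init_run(
#     project="my-workspace/my-project",
#     monitoring_namespace=f"monitoring/rank-{rank}",
# )
```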
Let me know if this helps you!
Just checking in to see if you still need help with this :)
@Blaizzy Apologies for the delay, I haven't had a look at the resource you shared yet. Give me this week to go through it. I will update this thread on whether I need further help or not.
No problem, looking forward to it.
Hi @Blaizzy, I am already using the recommended method of logging from the rank 0 (master) process. Also, for multiple epochs I am caching the run ID and reusing it so that I can log to the same run, as recommended. I am attaching the relevant code where I instantiate the Neptune project only for the master process, similar to how it is already done for other logging vendors like WandB, AzureML, and the like, as can be seen here in the fairseq repository.
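Roughly, the pattern described above looks like the sketch below (the helper name and cache location are illustrative, not the actual fairseq progress_bar code): the run is created only on the master process, its ID is cached, and later epochs reattach to the same run via `with_id`:

```python
import os
import neptune

RUN_ID_FILE = ".neptune_run_id"  # hypothetical place to cache the run ID

def get_master_run(project):
    """Create the run once on the master process and reattach to it on later epochs."""
    if os.path.exists(RUN_ID_FILE):
        with open(RUN_ID_FILE) as f:
            cached_id = f.read().strip()
        return neptune.init_run(project=project, with_id=cached_id)
    run = neptune.init_run(project=project)
    with open(RUN_ID_FILE, "w") as f:
        f.write(run["sys/id"].fetch())
    return run
```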
Could you share the error you are getting now? The error you were getting before was due to initializing the run in all processes. Now, in the current case, I believe you might be instantiating the run multiple times in the master process. This error occurs because all the writers want to log into the same field at the same time, which causes a race condition.
Try to initialize the run once, as described here: https://docs.neptune.ai/tutorials/running_distributed_training/#tracking-a-single-node-multi-gpu-job
Will check. As far as I know, the run init is called once to create the run, and subsequent calls set the "with_id" parameter to that previously initialized run, as I want to log the details of all the epochs to the same run. Apart from that, the run init is only called once. I will check once more if there are any other loose ends where the run init gets called without the "with_id" parameter.
When you call it again with "with_id", you are re-instantiating the run. That could be the culprit. Could you send me a minimal reproducible example? It can be just how you set up and use Neptune in your code.
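A minimal sketch of that suggestion, assuming a single long-lived handle is enough for the whole training job (the singleton helper below is illustrative, not fairseq's or Neptune's API): initialize the run exactly once per process and reuse the same object across epochs, instead of calling `init_run(..., with_id=...)` repeatedly:

```python
import neptune

_RUN = None  # module-level singleton for the current process

def get_run(project):
    """Initialize the Neptune run once and reuse the same handle afterwards."""
    global _RUN
    if _RUN is None:
        _RUN = neptune.init_run(project=project)
    return _RUN

# Every epoch / every logger call then reuses the same object, e.g.:
# get_run("my-workspace/my-project")["train/loss"].append(loss_value, step=step)
```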
Closing this issue as it's stale |
I am also facing the same issue. I have run the job on only one GPU rank and am still getting the logs as stated above (reproduced below).
Where exactly is this an issue? As in, is there some way I can make changes to prevent this issue from occurring, some kind of sync/wait primitive available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.
Originally posted by @harishankar-gopalan in #733 (comment)