BUG: Error msg during training - Timestamp must be non-decreasing for series attribute #733

Open
stonelazy opened this issue Oct 15, 2021 · 22 comments

Comments

@stonelazy

Describe the bug

When running the Neptune logger in PyTorch Lightning with DDP on more than 1 GPU, there are continuous errors reading "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute".
If the Neptune logger is offline, or if the Neptune logger is removed, this error isn't logged.
There are so many errors that even the training progress bar is difficult to make out.

Reproduction

I was able to reproduce this when running with 4 GPUs.
https://colab.research.google.com/drive/1TOadmpet63eSXz6LMHVvdM-D6Gy0LDxe?usp=sharing

Expected behavior

If this is a valid error message, then there is no hint of what action needs to be taken. If these errors are harmless/not valid, kindly suggest a way to suppress them.

Traceback

Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z

Environment

Environment details:
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.11 (default, Aug 3 2021, 15:09:35) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000

Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] neptune-pytorch-lightning==0.9.7
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.9.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.4.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.0 pypi_0 pypi
[conda] neptune-contrib 0.27.3 pypi_0 pypi
[conda] neptune-pytorch-lightning 0.9.7 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch-lightning 1.4.9 pypi_0 pypi
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchmetrics 0.4.1 pypi_0 pypi

Additional context

[Screenshot of the repeated error messages]

@kamil-kaczmarek kamil-kaczmarek self-assigned this Oct 20, 2021
@kamil-kaczmarek
Contributor

Hey @stonelazy,

Thanks for this report.

You have more than one process writing asynchronously to the same field in Neptune (monitoring/gpu_3, for example). In Neptune, for a given field, we expect the step to be strictly increasing (x, x+1, x+1.0001, x+4, x+5, ...). When that is not the case, you see the error that you reported here. As a consequence, the offending point is dropped and your run progresses uninterrupted.

In common usage like run["x"].log(2), Neptune assigns consecutive integer steps starting at 0. This is what you observe on the x-axis of your loss charts.
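
As an illustration of the two logging styles described above, here is a minimal sketch (the project name, field names, and values are placeholders, and NEPTUNE_API_TOKEN is assumed to be set in the environment):

import neptune.new as neptune

run = neptune.init(project="my-workspace/my-project")  # placeholder project

# Default: Neptune assigns consecutive integer steps 0, 1, 2, ...
for loss in [0.9, 0.7, 0.5]:
    run["train/loss"].log(loss)

# Explicit: the caller supplies the step, which must be strictly increasing
# for this field across all processes writing to the same run.
run["train/accuracy"].log(value=0.81, step=1)
run["train/accuracy"].log(value=0.84, step=2)
run["train/accuracy"].log(value=0.83, step=2)  # repeated step -> point dropped with the reported error

run.stop()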

So what is happening in your case is the following (all processes log to the same run, async):

process A: run["x"].log(value=7, step=1) (ok)
process B: run["x"].log(value=3, step=2) (ok)
process A: run["x"].log(value=1, step=2) (error)
process A: run["x"].log(value=4, step=3) (ok)
process B: run["x"].log(value=0, step=3) (error)
process A: run["x"].log(value=9, step=4) (ok)

It is not an error that will crash your run; at the same time, I see that it can be annoying.

A quick workaround is to disable stdout/stderr capture and hardware monitoring, as these are what I see giving you the error messages (docs).

neptune.init(
    capture_stdout=False,
    capture_stderr=False,
    capture_hardware_metrics=False,
    ...,
)

You will not get these error messages, at the expense of a less granular view into your runs.

At Neptune we see this use case, and we are working on better handling of such needs.
I also wanted to ask if you would be open to a short online meeting with me. I'm preparing some best practices for our users and, at the same time, passing insights to the product team. Your insights/needs would help us a lot in getting it right! https://calendly.com/kamil-neptune/30min

@stonelazy
Author

stonelazy commented Oct 21, 2021

Thanks for the details. Yeah, I had figured that out: after disabling those 3 flags I didn't get the error, and I have been running my training like that for a few days now. But the workaround comes at the cost of data loss, since we can no longer monitor CPU/GPU metrics, so it isn't convincing.
And the continuous error messages are totally annoying. This issue feels like a showstopper for me.

@kamil-kaczmarek
Contributor

@stonelazy,

I appreciate your feedback. I will get back here with more details soon.

@kamil-kaczmarek
Contributor

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

I'm not sure if it will work for GPU metrics as well, but it's worth trying. At the same time, please use it wisely and only as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think.

@kamil-kaczmarek
Contributor

Hi @stonelazy,

I'm wondering what your current status/thinking is when it comes to the above issue?

@stonelazy
Author

Hi @kamil-kaczmarek, I am not having any issues now with the suggested workaround.

@kamil-kaczmarek
Contributor

@stonelazy ok, good.

I recommend transitioning to PTL 1.5 and using our new logger.

@stonelazy
Author

If we move to PTL 1.5, we won't have this error?

@kamil-kaczmarek
Contributor

@stonelazy unfortunately not, I'm sorry.

I referred to the new logger because of improvements in other dimensions, in particular that it uses our new API under the hood.

@qchenevier

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

For people stumbling on this issue: you can also "cleanly" silence Neptune logs in Kedro using the logging.yml configuration file. To do so, add these lines to the loggers section of conf/base/logging.yml:

loggers:

    # [ OTHER LOGGERS CONFIGURATION ...]

    neptune.new.internal.operation_processors.async_operation_processor:
        level: CRITICAL
        handlers: [console, info_file_handler, error_file_handler]
        propagate: no

@Blaizzy
Contributor

Blaizzy commented Dec 24, 2021

Hi @stonelazy,

Prince Canuma here, a Data Scientist at Neptune.ai,

I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳

All you need to do is upgrade the library to the latest release 👍

Happy Christmas and a prosperous New Year in advance!

@Blaizzy Blaizzy closed this as completed Dec 24, 2021
@stevenmanton

I'm still seeing this error with pytorch-lightning==1.9.1 and neptune-client==0.16.17.

@Blaizzy
Contributor

Blaizzy commented Feb 15, 2023

Hi @stevenmanton

Could you share minimal reproducible code and your setup (i.e., number of GPUs and strategy)?

@mrembalski

The issue is still present. Just logging the hardware metrics with more than 1 process results in a lot of spam. I don't know why this issue is closed. Please reopen it.

@Blaizzy
Contributor

Blaizzy commented Mar 16, 2023

Hi @mrembalski

Bummer that you're experiencing some issues. No worries, we're here to help!

I have a couple of questions that'll help me understand the situation better:

  • Are you using PyTorch Lightning or just pure PyTorch?
  • Could you share a minimal reproducible code snippet and your current setup (like the number of GPUs and the strategy you're using)?

Once I have this info, we'll dive right in and get things sorted out for you!

By the way, if you are using PyTorch, we have since released a guide in our docs on how to track DDP jobs. Read more here:
https://docs.neptune.ai/tutorials/running_distributed_training/
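
For illustration, here is a minimal sketch of one common pattern for DDP runs: create the run only on the main process so that a single writer owns the monitoring series. This is a sketch, not an excerpt from the guide; it assumes a torchrun-style launcher that sets the RANK environment variable, and the project name is a placeholder:

import os

import neptune.new as neptune

# Only the main process (global rank 0) creates the Neptune run; other
# ranks skip logging, so no two processes write to the same series.
rank = int(os.environ.get("RANK", "0"))
run = neptune.init(project="my-workspace/my-project") if rank == 0 else None  # placeholder project

if run is not None:
    run["train/loss"].log(0.42)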

@mrembalski

I'm using PyTorch Lightning. I'll try to create a snippet tomorrow.

@Blaizzy
Contributor

Blaizzy commented Mar 17, 2023

Perfect!

I am looking forward to it :)

@harishankar-gopalan

Hi @Blaizzy, I am trying to use the Neptune logger in pure PyTorch through Fairseq, i.e., I am trying to integrate a Neptune logger into the Fairseq package, which already supports many of your competitors such as WANDB, AzureML, Aim, etc.

I am also facing the same issue. I have run the job on only one GPU rank and am still getting the logs stated above (reproduced below).

2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.659Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.985Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.987Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.370Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.371Z

Where exactly is this an issue? That is, is there some way I can make changes to prevent it from occurring, perhaps some sync/wait primitives available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.

@kbruegge

kbruegge commented Oct 6, 2023

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

I'm not sure if it will work for GPU metrics as well, but it's worth trying. At the same time, please use it wisely and only as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think.

Hey, the import changed (the "new" is gone from the path). Here is how I ignore these messages for now:

logging.getLogger("neptune.internal.operation_processors.async_operation_processor").setLevel(
    logging.CRITICAL
)

@danieldanciu

This issue is still very much not resolved, despite having been marked as resolved. Our logs are absolutely illegible, because this message appears millions and millions of times:
[Screenshot: training logs flooded with the "Timestamp must be non-decreasing" error]

Note that this is training on a single machine, so this cannot be caused by multiple loggers with slightly out-of-sync clocks. As far as I can tell, Neptune is actually trying to log messages with identical timestamps as keys, and this is what causes the error. The following three messages all have the same timestamp at millisecond precision. Unless Neptune uses microsecond precision for the logs, I think this must be the issue:

2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
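
As a quick illustration of this hypothesis (not of Neptune's actual internals), two points captured back-to-back can easily truncate to the same millisecond:

from datetime import datetime, timezone

# Two consecutive captures often land within the same millisecond, so
# millisecond-precision timestamps can collide for successive points.
a = datetime.now(timezone.utc)
b = datetime.now(timezone.utc)
print(a.isoformat(timespec="milliseconds"))  # e.g. 2024-09-17T09:29:24.366+00:00
print(b.isoformat(timespec="milliseconds"))  # frequently identical to the line above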

@danieldanciu

danieldanciu commented Sep 17, 2024

On a different note, a better way to filter out these spurious error messages without affecting actual legitimate error logs is to do something like:

import logging

import neptune.internal.operation_processors.async_operation_processor
from typing_extensions import override  # on Python 3.12+, typing.override also works


class _FilterCallback(logging.Filterer):
    @override
    def filter(self, record: logging.LogRecord):
        # Keep every record except Neptune's "Timestamp must be non-decreasing" errors.
        return not (
            record.name == "neptune"
            and record.getMessage().startswith(
                "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute"
            )
        )

...
neptune.internal.operation_processors.async_operation_processor.logger.addFilter(_FilterCallback())

@SiddhantSadangi
Member

Hey @danieldanciu 👋
Thanks for pointing this out. While not a bug (more of a design issue), it should not have been marked as resolved.

We are working on an updated client version that will address this issue. Meanwhile, the workaround you've suggested to filter out such logs works like a charm, and we'll add it to our docs ✅
