BUG: Error msg during training - Timestamp must be non-decreasing for series attribute #733

Open
stonelazy opened this issue Oct 15, 2021 · 22 comments

Comments

@stonelazy

Describe the bug

When running the Neptune logger in PyTorch Lightning with DDP on more than 1 GPU, there are continuous errors reading "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute".
If the Neptune logger is offline, or if the Neptune logger is removed, this error isn't logged.
There are so many errors that even the training progress bar is difficult to make out.

Reproduction

I was able to reproduce this when running with 4 GPUs.
https://colab.research.google.com/drive/1TOadmpet63eSXz6LMHVvdM-D6Gy0LDxe?usp=sharing

Expected behavior

If this is a valid error message, then there is no hint of what action needs to be taken. If these errors are harmless/not valid, kindly suggest a way to suppress them.

Traceback

Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z

Environment

Environment details:
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.11 (default, Aug 3 2021, 15:09:35) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000

Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] neptune-pytorch-lightning==0.9.7
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.9.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.4.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.0 pypi_0 pypi
[conda] neptune-contrib 0.27.3 pypi_0 pypi
[conda] neptune-pytorch-lightning 0.9.7 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch-lightning 1.4.9 pypi_0 pypi
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchmetrics 0.4.1 pypi_0 pypi

Additional context

[Screenshot of the repeated error messages]

@kamil-kaczmarek kamil-kaczmarek self-assigned this Oct 20, 2021
@kamil-kaczmarek
Contributor

Hey @stonelazy,

Thanks for this report.

You have more than one process writing asynchronously to the same field in Neptune (monitoring/gpu_3, for example). In Neptune, for a given field, we expect the step to be strictly increasing (x, x+1, x+1.0001, x+4, x+5, ...). When that is not the case, you see the error that you reported here. As a consequence, the offending point is dropped and your run progresses uninterrupted.

In common usage like run["x"].log(2), Neptune assigns consecutive integer steps starting at 0. This is what you observe on the x-axis of your loss charts.
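
As an illustration of the two logging styles described above, here is a minimal sketch (the project name, field names, and values are placeholders, and NEPTUNE_API_TOKEN is assumed to be set in the environment):

import neptune.new as neptune

run = neptune.init(project="my-workspace/my-project")  # placeholder project

# Default: Neptune assigns consecutive integer steps 0, 1, 2, ...
for loss in [0.9, 0.7, 0.5]:
    run["train/loss"].log(loss)

# Explicit: the caller supplies the step, which must be strictly increasing
# for this field across all processes writing to the same run.
run["train/accuracy"].log(value=0.81, step=1)
run["train/accuracy"].log(value=0.84, step=2)
run["train/accuracy"].log(value=0.83, step=2)  # repeated step -> point dropped with the reported error

run.stop()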

So what is happening in your case is the following (all processes log to the same run, async):

process A: run["x"].log(value=7, step=1) (ok)
process B: run["x"].log(value=3, step=2) (ok)
process A: run["x"].log(value=1, step=2) (error)
process A: run["x"].log(value=4, step=3) (ok)
process B: run["x"].log(value=0, step=3) (error)
process A: run["x"].log(value=9, step=4) (ok)

It is not an error that will crash your run; at the same time, I see that it can be annoying.

A quick workaround is to disable stdout/stderr capture and hardware monitoring, as these are what I see giving you the error messages (docs).

neptune.init(
    capture_stdout=False,
    capture_stderr=False,
    capture_hardware_metrics=False,
    ...,
)

You will not get these error messages, at the expense of a less granular view into your runs.

At Neptune we see this use case, and we are working on better handling of such needs.
I also wanted to ask if you would be open to a short online meeting with me. I'm preparing some best practices for our users and, at the same time, passing insights to the product team. Your insights/needs would help us a lot in getting it right! https://calendly.com/kamil-neptune/30min

@stonelazy
Author

stonelazy commented Oct 21, 2021

Thanks for the details. Yeah, I had figured that out: after disabling those 3 flags I didn't get the error, and I have been running my training like that for a few days now. But the workaround comes at the cost of data loss, since we can no longer monitor CPU/GPU metrics, so it isn't convincing.
And the continuous error messages are totally annoying. This issue feels like a showstopper for me.

@kamil-kaczmarek
Contributor

@stonelazy,

I appreciate your feedback. I will get back here with more details soon.

@kamil-kaczmarek
Contributor

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

I'm not sure if it will work for GPU metrics as well, but it's worth trying. At the same time, please use it wisely and only as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think.

@kamil-kaczmarek
Contributor

Hi @stonelazy,

I'm wondering what your current status/thinking is when it comes to the above issue?

@stonelazy
Author

Hi @kamil-kaczmarek, I am not having any issues now with the suggested workaround.

@kamil-kaczmarek
Contributor

@stonelazy ok, good.

I recommend transitioning to PTL 1.5 and using our new logger.

@stonelazy
Author

If we move to PTL 1.5, we won't have this error?

@kamil-kaczmarek
Contributor

@stonelazy unfortunately not, I'm sorry.

I referred to the new logger because of improvements in other dimensions, in particular that it uses our new API under the hood.

@qchenevier

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

For people stumbling on this issue: you can also "cleanly" silence Neptune logs in Kedro using the logging.yml configuration file. To do so, add these lines to the loggers section of conf/base/logging.yml:

loggers:

    # [ OTHER LOGGERS CONFIGURATION ...]

    neptune.new.internal.operation_processors.async_operation_processor:
        level: CRITICAL
        handlers: [console, info_file_handler, error_file_handler]
        propagate: no

@Blaizzy
Contributor

Blaizzy commented Dec 24, 2021

Hi @stonelazy,

Prince Canuma here, a Data Scientist at Neptune.ai,

I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳

All you need to do is upgrade the library to the latest release 👍

Happy Christmas and a prosperous New Year in advance!

@Blaizzy Blaizzy closed this as completed Dec 24, 2021
@stevenmanton

I'm still seeing this error with pytorch-lightning==1.9.1 and neptune-client==0.16.17.

@Blaizzy
Contributor

Blaizzy commented Feb 15, 2023

Hi @stevenmanton

Could you share minimal reproducible code and your setup (i.e., number of GPUs and strategy)?

@mrembalski

The issue is still present. Just logging the hardware metrics with more than 1 process results in a lot of spam. I don't know why this issue is closed. Please reopen it.

@Blaizzy
Contributor

Blaizzy commented Mar 16, 2023

Hi @mrembalski

Bummer that you're experiencing some issues. No worries, we're here to help!

I have a couple of questions that'll help me understand the situation better:

  • Are you using PyTorch Lightning or just pure PyTorch?
  • Could you share a minimal reproducible code snippet and your current setup (like the number of GPUs and the strategy you're using)?

Once I have this info, we'll dive right in and get things sorted out for you!

By the way, if you are using PyTorch, we have since released a guide in our docs on how to track DDP jobs. Read more here:
https://docs.neptune.ai/tutorials/running_distributed_training/
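
For illustration, here is a minimal sketch of one common pattern for DDP runs: create the run only on the main process so that a single writer owns the monitoring series. This is a sketch, not an excerpt from the guide; it assumes a torchrun-style launcher that sets the RANK environment variable, and the project name is a placeholder:

import os

import neptune.new as neptune

# Only the main process (global rank 0) creates the Neptune run; other
# ranks skip logging, so no two processes write to the same series.
rank = int(os.environ.get("RANK", "0"))
run = neptune.init(project="my-workspace/my-project") if rank == 0 else None  # placeholder project

if run is not None:
    run["train/loss"].log(0.42)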

@mrembalski

I'm using PyTorch Lightning. I'll try to create a snippet tomorrow.

@Blaizzy
Contributor

Blaizzy commented Mar 17, 2023

Perfect!

I am looking forward to it :)

@harishankar-gopalan

Hi @Blaizzy, I am trying to use the Neptune logger in pure PyTorch through Fairseq, i.e., I am trying to integrate a Neptune logger into the Fairseq package, which already supports many of your competitors such as WANDB, AzureML, Aim, etc.

I am also facing the same issue. I have run the job on only one GPU rank and am still getting the logs stated above (reproduced below).

2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.659Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.985Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.987Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.370Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.371Z

Where exactly is this an issue? That is, is there some way I can make changes to prevent it from occurring, perhaps some sync/wait primitives available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.

@kbruegge

kbruegge commented Oct 6, 2023

@stonelazy,

For now I have a temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

I'm not sure if it will work for GPU metrics as well, but it's worth trying. At the same time, please use it wisely and only as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think.

Hey, the import changed (the "new" is gone from the path). Here is how I ignore these messages for now:

logging.getLogger("neptune.internal.operation_processors.async_operation_processor").setLevel(
    logging.CRITICAL
)

@danieldanciu

This issue is still very much not resolved, despite having been marked as resolved. Our logs are absolutely illegible, because this message appears millions and millions of times:
[Screenshot: training logs flooded with the "Timestamp must be non-decreasing" error]

Note that this is training on a single machine, so this cannot be caused by multiple loggers with slightly out-of-sync clocks. As far as I can tell, Neptune is actually trying to log messages with identical timestamps as keys, and this is what causes the error. The following three messages all have the same timestamp at millisecond precision. Unless Neptune uses microsecond precision for the logs, I think this must be the issue:

2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
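
As a quick illustration of this hypothesis (not of Neptune's actual internals), two points captured back-to-back can easily truncate to the same millisecond:

from datetime import datetime, timezone

# Two consecutive captures often land within the same millisecond, so
# millisecond-precision timestamps can collide for successive points.
a = datetime.now(timezone.utc)
b = datetime.now(timezone.utc)
print(a.isoformat(timespec="milliseconds"))  # e.g. 2024-09-17T09:29:24.366+00:00
print(b.isoformat(timespec="milliseconds"))  # frequently identical to the line above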

@danieldanciu

danieldanciu commented Sep 17, 2024

On a different note, a better way to filter out these spurious error messages without affecting actual legitimate error logs is to do something like:

import logging

import neptune.internal.operation_processors.async_operation_processor
from typing_extensions import override  # on Python 3.12+, typing.override also works


class _FilterCallback(logging.Filterer):
    @override
    def filter(self, record: logging.LogRecord):
        # Keep every record except Neptune's "Timestamp must be non-decreasing" errors.
        return not (
            record.name == "neptune"
            and record.getMessage().startswith(
                "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute"
            )
        )

...
neptune.internal.operation_processors.async_operation_processor.logger.addFilter(_FilterCallback())

@SiddhantSadangi
Member

Hey @danieldanciu 👋
Thanks for pointing this out. While not a bug (more of a design issue), it should not have been marked as resolved.

We are working on an updated client version that will address this issue. Meanwhile, the workaround you've suggested to filter out such logs works like a charm, and we'll add it to our docs ✅
