BUG: Error msg during training - Timestamp must be non-decreasing for series attribute #733
Comments
Hey @stonelazy, thanks for this report. You have more than one process writing asynchronously to the same field in Neptune. So what is happening in your case is that all processes log to the same run asynchronously, and their writes to the same series can arrive with out-of-order timestamps.
It is not an error that will crash your run, but I can see that it can be annoying. A quick workaround is to disable stdout/stderr capture and hardware monitoring, as these are what I see giving you the error messages (docs).
You will not get these error messages, at the expense of a less granular view into your runs. At Neptune we see this use case, and we are working on better handling of such needs.
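For reference, a minimal sketch of what disabling those three flags can look like with the neptune.new client; the project name below is a placeholder:
import neptune.new as neptune

# Sketch only: "workspace/project" is a placeholder; the API token is read
# from the NEPTUNE_API_TOKEN environment variable.
# Turning off these three capture flags stops the background streaming of
# stdout/stderr and hardware metrics, which is where the
# "Timestamp must be non-decreasing" messages come from in multi-process runs.
run = neptune.init(
    project="workspace/project",
    capture_stdout=False,
    capture_stderr=False,
    capture_hardware_metrics=False,
)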
Thanks for the details. Yeah, I had figured that out: after disabling those 3 flags I didn't get the error, and I have been running my training like that for a few days now. But the workaround comes with data loss; we can no longer monitor CPU/GPU metrics. This workaround doesn't seem convincing.
I appreciate your feedback. I will get back here with more details soon.
For now I have some temporary workaround. You can try to suppress these errors like this:
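A sketch of this kind of suppression, assuming the logger path the client used at the time (a later comment notes that "new" was subsequently dropped from the path):
import logging

# Raises the threshold so only CRITICAL records from the async operation
# processor get through, hiding the "Timestamp must be non-decreasing" errors.
logging.getLogger(
    "neptune.new.internal.operation_processors.async_operation_processor"
).setLevel(logging.CRITICAL)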
Not sure if it will work for GPUs as well, but it's worth trying. At the same time, please use it wisely and only as a temporary workaround, as it disables all errors from async processing. I have already passed info about your case to the product team for further consideration. Let me know what you think.
Hi @stonelazy, I'm wondering what your current status / thinking is when it comes to the above issues?
Hi @kamil-kaczmarek, I am not having any issues now with the suggested workaround.
@stonelazy OK, good. I recommend transitioning to PTL 1.5 and using our new logger:
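A minimal sketch of wiring that logger up, assuming PTL 1.5's built-in NeptuneLogger; the API token and project name are placeholders:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

# Sketch only: replace the token and project with your own values.
neptune_logger = NeptuneLogger(
    api_key="YOUR_API_TOKEN",
    project="workspace/project",
)
trainer = Trainer(logger=neptune_logger, gpus=4, strategy="ddp")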
If we move to PTL 1.5, we won't have this error?
@stonelazy unfortunately not, I'm sorry. I referred to the new logger because of improvements in other dimensions, especially that it uses our new API under the hood.
For people stumbling on this issue: you can also "cleanly" silence Neptune logs in Kedro using the loggers configuration:
# [ OTHER LOGGERS CONFIGURATION ...]
neptune.new.internal.operation_processors.async_operation_processor:
  level: CRITICAL
  handlers: [console, info_file_handler, error_file_handler]
  propagate: no
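(In a standard Kedro project, an entry like this presumably lives under the loggers section of the project's logging configuration, e.g. conf/base/logging.yml; the console and *_file_handler handlers referenced above are assumed to already be defined there.)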
Hi @stonelazy, Prince Canuma here, a Data Scientist at Neptune.ai. I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch Lightning, v1.5.7 🎊 🥳 All you need to do is upgrade the library to the latest release 👍 Happy Christmas and a prosperous New Year in advance!
I'm still seeing this error with
Could you share minimal reproducible code and your setup (i.e., number of GPUs and strategy)?
The issue is still present. Just logging the hardware metrics with more than 1 process results in a lot of spam. I don't know why this issue is closed. Please reopen it.
Hi @mrembalski, bummer that you're experiencing some issues. No worries, we're here to help! I have a couple of questions that'll help me understand the situation better:
Once I have this info, we'll dive right in and get things sorted out for you! By the way, if you are using PyTorch, we have since released a guide in our docs on how to track DDP jobs. Read more here:
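The linked guide is not reproduced here, but one way to have several DDP worker processes write to a single run is to share a custom run ID between them; a minimal sketch, assuming neptune.init_run and a hypothetical MY_RUN_ID environment variable set by the launcher:
import os

import neptune

# Sketch only: every worker that passes the same custom_run_id logs to the
# same Neptune run instead of creating its own.
run = neptune.init_run(
    project="workspace/project",            # placeholder
    custom_run_id=os.environ["MY_RUN_ID"],  # hypothetical, set before launch
)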
I'm using PyTorch Lightning. I'll try to create a snippet tomorrow.
Perfect! I am looking forward to it :)
Hi @Blaizzy, I am trying to use the Neptune logger in pure PyTorch through Fairseq, i.e. I am trying to integrate a Neptune logger into the Fairseq package, which already supports lots of your competitors such as WandB, AzureML, Aim, etc. I am facing the same issue. I have run the job on only one GPU rank and am still getting the logs stated above (reproduced below).
Where exactly is this an issue? As in, is there some way I can make changes to prevent it from occurring, some kind of sync/wait primitives available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.
Hey, the import changed (the "new" is gone from the path). Here is how I ignore these messages for now:
import logging

logging.getLogger("neptune.internal.operation_processors.async_operation_processor").setLevel(
    logging.CRITICAL
)
On a different note, a better way to filter out these spurious error messages, without affecting actual legitimate error logs, is to do something like:
import logging

from typing_extensions import override


class _FilterCallback(logging.Filterer):
    @override
    def filter(self, record: logging.LogRecord):
        return not (
            record.name == "neptune"
            and record.getMessage().startswith(
                "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute"
            )
        )

...
neptune.internal.operation_processors.async_operation_processor.logger.addFilter(_FilterCallback())
Hey @danieldanciu 👋 We are working on an updated client version that will address this issue. Meanwhile, the workaround you've suggested to filter out such logs works like a charm, and we'll add it to our docs ✅
Describe the bug
When running the Neptune logger in PyTorch Lightning with DDP on more than 1 GPU, there are continuous errors reading:
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute
If the Neptune logger is offline, or if the Neptune logger is removed, this error isn't logged.
There are so many errors that even the training progress bar is difficult to identify.
Reproduction
When I ran with 4 GPUs, I was able to reproduce this.
https://colab.research.google.com/drive/1TOadmpet63eSXz6LMHVvdM-D6Gy0LDxe?usp=sharing
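A rough sketch of the kind of setup that triggers the messages, assuming the neptune.new PyTorch Lightning integration current at the time; the project and token are placeholders, and the actual reproduction is in the linked Colab notebook:
import neptune.new as neptune
from neptune.new.integrations.pytorch_lightning import NeptuneLogger
from pytorch_lightning import Trainer

# Sketch only: placeholders for project and token.
run = neptune.init(project="workspace/project", api_token="YOUR_API_TOKEN")
logger = NeptuneLogger(run=run)

trainer = Trainer(gpus=4, accelerator="ddp", logger=logger, max_epochs=1)
# trainer.fit(model)  # any LightningModule reproduces the log spam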
Expected behavior
If this is a valid error message, there is no hint of what action needs to be taken. If it is harmless/not valid, kindly suggest a way to suppress this output.
Traceback
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Environment
The output of pip list:
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31
Python version: 3.8.11 (default, Aug 3 2021, 15:09:35) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] neptune-pytorch-lightning==0.9.7
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.9.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.4.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.0 pypi_0 pypi
[conda] neptune-contrib 0.27.3 pypi_0 pypi
[conda] neptune-pytorch-lightning 0.9.7 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch-lightning 1.4.9 pypi_0 pypi
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchmetrics 0.4.1 pypi_0 pypi
Additional context