NPT-14719: Offline mode messes up plots #1676
Comments
Hey @wouterzwerink 👋 I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots. Could you share a minimal code sample that would help me reproduce this?
Thanks for looking into this. I can try to create a minimal example later.
Oh strange! How are you syncing the offline run?
I was doing it manually from the terminal, but let me try your approach as well.
I'll take some time tomorrow to try to isolate the issue. Thanks again for looking into this.
Same results:

```python
import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)
run = neptune.init_run(mode="offline", flush_period=900)
os.chdir(original_cwd)

# Logging to a namespace handler
neptune_run = run["prefix"]
for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
run.stop()
subprocess.call(f"neptune sync --path {path} --offline-only", shell=True)  # shell=True so the command string is parsed
```
Hi @SiddhantSadangi! I have a script for you that reproduces the bug on my end:

```python
import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)

NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)

for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )
    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")
```
Hey @wouterzwerink, I am, however, still not able to reproduce the issue. I ran your script as-is in offline mode, and once with the default async mode. Is anyone else on your team also facing the same issue?
@SiddhantSadangi interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 with the latest neptune and lightning.
I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too.
Would it be possible for someone else on your team to try running in offline mode? It'll help us know if it's something specific to your setup, or something to do with the client in general.
@SiddhantSadangi I'll ask someone else to run it too. I found something interesting: adding the following fixes the issue for me:

```python
def on_train_epoch_start(self) -> None:
    root_obj = self.logger.experiment.get_root_object()
    root_obj.wait(disk_only=True)
```

This fix works not only in the script, but also in my actual code. Perhaps this is some race condition.
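If editing the LightningModule itself is inconvenient, the same wait could presumably be wrapped in a standalone Lightning callback. A sketch based on the workaround above; the callback class name is made up:

```python
from pytorch_lightning import Callback
from pytorch_lightning.loggers import NeptuneLogger


class NeptuneDiskWaitCallback(Callback):
    """Hypothetical helper: at each epoch start, block until Neptune's
    in-memory queue has been written to disk (same call as the workaround)."""

    def on_train_epoch_start(self, trainer, pl_module):
        for logger in trainer.loggers:
            if isinstance(logger, NeptuneLogger):
                logger.experiment.get_root_object().wait(disk_only=True)


# Usage with the reproduction script above:
# trainer = Trainer(logger=logger, max_epochs=4, callbacks=[NeptuneDiskWaitCallback()])
```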
So after struggling with AWS' security groups, I tried running your code on EC2 with a mounted EFS volume that served as the `.neptune` folder, but I still could not reproduce the issue. I will still have the engineers take a look in case the lag between writing to memory and flushing to disk might be causing some weird issues.
@wouterzwerink - could you mail us the contents of the `.neptune` folder?
Also, would it be possible for you to run
@wouterzwerink - Could you also share this?
Describe the bug
Here's a comparison of the same plot from two different runs, where the only difference is that I used offline mode and `neptune sync` afterwards for the latter. Both are the result of `LearningRateMonitor` for 4 epochs.

This also happens in other plots, like loss. It seems to duplicate values at the start and end, but sometimes also messes up in between.

It does not matter if I use `NeptuneLogger` or log to the Neptune run directly (for loss or metrics; the callback here always uses the `NeptuneLogger`): the offline version is always messed up.
Reproduction
Use offline mode!
Expected behavior
Same plots in neptune regardless of mode
Traceback
If applicable, add traceback or log output/screenshots to help explain your problem.
Environment
The output of `pip list`: Tried neptune 1.8.6 and 1.9.1, same results.
The operating system you're using:
Linux
The output of `python --version`: 3.9