NPT-14719: Offline mode messes up plots #1676

Open
wouterzwerink opened this issue Mar 5, 2024 · 19 comments
@wouterzwerink

Describe the bug

Here's a comparison of the same plot from two different runs, where the only difference is that for the latter I used offline mode and ran neptune sync afterwards. Both are the result of LearningRateMonitor for 4 epochs.
[screenshots of the two plots]

This also happens in other plots, like loss. It seems to duplicate values at the start and end, but sometimes also messes things up in between:
[screenshot]

It does not matter whether I use NeptuneLogger or log to the neptune run directly for losses and metrics (the callback here always goes through the NeptuneLogger); the offline version is always messed up.

Reproduction

Use offline mode!
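
For concreteness, a minimal sketch of the flow meant here; the metric name and values are placeholders, not from my actual setup:

import neptune

# Start a run in offline mode: everything is written to the local .neptune
# folder instead of being sent to the server.
run = neptune.init_run(mode="offline")

for step in range(10):
    run["metrics/loss"].append(1.0 / (step + 1))  # placeholder values

run.stop()

# Afterwards, upload the locally stored runs from the terminal:
#   neptune sync --offline-only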

Expected behavior

Same plots in neptune regardless of mode

Environment

The output of pip list:
Tried neptune 1.8.6 and 1.9.1, same results

The operating system you're using:
Linux

The output of python --version:
3.9

@SiddhantSadangi SiddhantSadangi self-assigned this Mar 5, 2024
@SiddhantSadangi
Member

Hey @wouterzwerink 👋

I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots:
[screenshot of the overlapping plots]

Could you share a minimal code sample that would help me reproduce this?

@SiddhantSadangi SiddhantSadangi added the pending Waiting for a response label Mar 5, 2024
@wouterzwerink
Author

Thanks for looking into this. I can try to create a minimal example later.
Off the top of my head, there are a couple of things we do that may be needed to reproduce:

  • Changing the .neptune folder location by temporarily changing directory when initializing the run
  • Using run["some_prefix"] instead of the run object. So we'd assign neptune_run = run["prefix"] and then log like neptune_run["value"].append(score)
  • Using a high flush period (900), though that should not affect offline runs

@SiddhantSadangi
Member

Still no luck, unfortunately 😞
[screenshot]

Here's the code I used:

import os

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
original_cwd = os.getcwd()
os.chdir("temp_folder")

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

@wouterzwerink
Author


Oh strange! How are you syncing the offline run?
We call run.stop() followed by a subprocess call to neptune sync --path {path} --project {project} --offline-only

path points to the changed directory, so temp_folder in your case
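
Roughly, that sequence looks like this (a sketch; run, path, and project stand for the placeholders in the command above):

import subprocess

# `run` is the offline neptune run created earlier; `path` points to the
# directory holding the .neptune folder, `project` to the target project.
run.stop()
subprocess.call(
    ["neptune", "sync", "--path", path, "--project", project, "--offline-only"]
)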

@SiddhantSadangi
Member

I was doing it manually from the terminal, but let me try your approach as well

@wouterzwerink
Author

I'll take some time tomorrow to try to isolate the issue, thanks again for looking into this

@SiddhantSadangi
Member

Same results

import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)

run = neptune.init_run(mode="offline", flush_period=900)

os.chdir(original_cwd)

# Logging to namespace_handler
neptune_run = run["prefix"]

for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
run.stop()

subprocess.call(f"neptune sync --path {path} --offline-only")

@wouterzwerink
Author

Hi @SiddhantSadangi ! I have a script for you that reproduces the bug on my end:

import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)


NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)


for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )

    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")

@SiddhantSadangi
Member

Hey @wouterzwerink,
Thanks for sharing the script!

I am, however, still not able to reproduce the issue. I ran your script as-is in offline mode, and once more with the default async mode, and got perfectly overlapping charts:
[screenshot of the overlapping charts]

Is anyone else in your team also facing the same issue?

@wouterzwerink
Author

@SiddhantSadangi interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 and the latest neptune and lightning.
I don't think anyone else is trying to use offline mode right now.

@SiddhantSadangi
Member

I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too

@SiddhantSadangi
Member

Same results with Python 3.9.18, neptune 1.9.1, and lightning 2.2.1

[screenshot]

@SiddhantSadangi
Member

Would it be possible for someone else on your team to try running in offline mode? It would help us know whether it's something specific to your setup or something to do with the client in general.

@wouterzwerink
Author

@SiddhantSadangi I'll ask someone else to run it too.

I found something interesting. Adding the following fixes the issue for me:

    def on_train_epoch_start(self) -> None:
        root_obj = self.logger.experiment.get_root_object()
        root_obj.wait(disk_only=True)

This fix works not only in the script but also in my actual code.

Perhaps this is some race condition. My .neptune folder is on AWS EFS, a network filesystem, so writing to disk may be slower on my side, which could explain why it's not reproducing on yours.
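
For reference, a sketch of how this slots into the repro script above (LitModelWithFlush is just a name for this sketch; it assumes the LitModel and NeptuneLogger setup from that script):

class LitModelWithFlush(LitModel):
    def on_train_epoch_start(self) -> None:
        # self.logger is the NeptuneLogger; logger.experiment is the run[prefix]
        # namespace handler, so get_root_object() returns the underlying Run.
        root_obj = self.logger.experiment.get_root_object()
        # Block until everything queued so far has been written to disk.
        root_obj.wait(disk_only=True)

# Used in place of `model = LitModel()` inside the loop:
# model = LitModelWithFlush()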

@SiddhantSadangi
Member

So after struggling with AWS security groups, I tried running your code on EC2 with a mounted EFS volume serving as the NEPTUNE_DATA_DIRECTORY, but I was still unable to reproduce the issue 😔

[screenshot]

I will still have the engineers take a look in case the lag between writing to memory and flushing to disk might be causing some weird issues.

@SiddhantSadangi SiddhantSadangi added help wanted and removed pending Waiting for a response labels Mar 11, 2024
@SiddhantSadangi
Member

@wouterzwerink - could you mail us the contents of the .neptune folder as a ZIP archive to [email protected]?

@SiddhantSadangi SiddhantSadangi added the pending Waiting for a response label Mar 12, 2024
@SiddhantSadangi
Member

Also, would it be possible for you to run neptune sync after the script has terminated? Maybe from the terminal or something?

@wouterzwerink
Author

> Also, would it be possible for you to run neptune sync after the script has terminated? Maybe from the terminal or something?

Sure thing! I just ran neptune clear, then the script with the os.system call removed, and then synced manually with neptune sync. The results are the same:
[screenshot]

@SiddhantSadangi SiddhantSadangi changed the title BUG: Offline mode messes up plots NPT-14719: Offline mode messes up plots Mar 12, 2024
@SiddhantSadangi
Member

> @wouterzwerink - could you mail us the contents of the .neptune folder as a ZIP archive to [email protected]?

@wouterzwerink - Could you also share this?

@SiddhantSadangi SiddhantSadangi removed the pending Waiting for a response label Mar 12, 2024