NPT-14719: Offline mode messes up plots #1676
Comments
Hey @wouterzwerink 👋 I am not able to reproduce this. Comparing the plots for async and offline runs gives me perfectly overlapping plots. Could you share a minimal code sample that would help me reproduce this?
Thanks for looking into this. I can try to create a minimal example later.
Oh strange! How are you syncing the offline run?
I was doing it manually from the terminal, but let me try your approach as well.
I'll take some time tomorrow to try to isolate the issue. Thanks again for looking into this.
Same results:

```python
import os
import subprocess

import neptune
import numpy as np

np.random.seed(42)

# Changing `.neptune` folder
path = "temp_folder"
original_cwd = os.getcwd()
os.chdir(path)
run = neptune.init_run(mode="offline", flush_period=900)
os.chdir(original_cwd)

# Logging to a namespace handler
neptune_run = run["prefix"]
for _ in range(100):
    neptune_run["values"].append(np.random.rand())

# Stop and sync manually
run.stop()
subprocess.call(f"neptune sync --path {path} --offline-only", shell=True)  # shell=True so the command string is parsed
```
Hi @SiddhantSadangi! I have a script for you that reproduces the bug on my end:

```python
import os

import neptune
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import NeptuneLogger
from torch.utils.data import DataLoader, TensorDataset

PROJECT = "project-name"


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


seed_everything(42)

NUM_SAMPLES = 10_000  # Bug does not happen if this is too small, e.g. 1000
x = torch.randn(NUM_SAMPLES, 32)
y = torch.randint(0, 2, (NUM_SAMPLES,))
dataset = TensorDataset(x, y)


def get_dataloader():
    # Bug does not happen if num_workers=0
    return DataLoader(dataset, batch_size=16, num_workers=4)


run = neptune.Run(
    project=PROJECT,
    mode="offline",  # Bug only happens in offline mode, not with sync or async
    flush_period=900,
    capture_hardware_metrics=False,  # Bug does not happen if these are enabled
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
)

for prefix in ("prefix1", "prefix2"):
    logger = NeptuneLogger(
        run=run[prefix],
        log_model_checkpoints=False,
    )
    model = LitModel()
    dataloader = get_dataloader()
    trainer = Trainer(
        logger=logger,
        max_epochs=4,
    )
    trainer.fit(model, dataloader)

# Stop and sync manually
run.stop()
os.system(f"neptune sync --project {PROJECT} --offline-only")
```
Hey @wouterzwerink, I am, however, still not able to reproduce the issue. I ran your script as-is in offline mode, and once with the default async mode. Is anyone else on your team also facing the same issue?
@SiddhantSadangi interesting, I can't seem to find what's causing this! What Python version are you using? I'm on 3.9.18 with the latest neptune and lightning.
I switched to WSL to use multiple dataloaders and forgot it was on Python 3.11.5. Let me try 3.9.18 too.
Would it be possible for someone else on your team to try running in offline mode? It'll help us know if it's something specific to your setup, or something to do with the client in general.
@SiddhantSadangi I'll ask someone else to run it too. I found something interesting: adding the following fixes the issue for me:

```python
def on_train_epoch_start(self) -> None:
    root_obj = self.logger.experiment.get_root_object()
    root_obj.wait(disk_only=True)
```

This fix works not only in the script, but also in my actual code. Perhaps this is some race condition.
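If editing the LightningModule itself is inconvenient, the same wait could presumably be wrapped in a standalone Lightning callback. A sketch based on the workaround above; the callback class name is made up:

```python
from pytorch_lightning import Callback
from pytorch_lightning.loggers import NeptuneLogger


class NeptuneDiskWaitCallback(Callback):
    """Hypothetical helper: at each epoch start, block until Neptune's
    in-memory queue has been written to disk (same call as the workaround)."""

    def on_train_epoch_start(self, trainer, pl_module):
        for logger in trainer.loggers:
            if isinstance(logger, NeptuneLogger):
                logger.experiment.get_root_object().wait(disk_only=True)


# Usage with the reproduction script above:
# trainer = Trainer(logger=logger, max_epochs=4, callbacks=[NeptuneDiskWaitCallback()])
```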
So after struggling with AWS' security groups, I tried running your code on EC2 with a mounted EFS volume that served as the `.neptune` folder, but I still could not reproduce the issue. I will still have the engineers take a look in case the lag between writing to memory and flushing to disk might be causing some weird issues.
@wouterzwerink - could you mail us the contents of the `.neptune` folder?
Also, would it be possible for you to run
@wouterzwerink - Could you also share this?
Describe the bug
Here's a comparison of the same plot from two different runs, where the only difference is that I used offline mode and `neptune sync` afterwards for the latter. Both are the result of `LearningRateMonitor` for 4 epochs.

This also happens in other plots, like loss. It seems to duplicate values at the start and end, but sometimes also messes up in between.

It does not matter if I use `NeptuneLogger` or log to the Neptune run directly (for loss or metrics; the callback here always uses the `NeptuneLogger`): the offline version is always messed up.
Reproduction
Use offline mode!
Expected behavior
Same plots in neptune regardless of mode
Traceback
If applicable, add traceback or log output/screenshots to help explain your problem.
Environment
The output of `pip list`: Tried neptune 1.8.6 and 1.9.1, same results.
The operating system you're using:
Linux
The output of `python --version`: 3.9