Cannot resume in offline mode due to lack of sys/id field
#588
Comments
(I've removed my previous comment) @wjaskowski initially we didn't plan to enable resuming runs in offline mode. If I may ask, why do you need to resume an offline run? Are you working with a multiprocessing / multi-script setup, or is there a time break between the execution of the script and its resumption? |
The truth is that I just wanted to use resuming in debug mode, which
initially did not work for me, so I tried offline mode, which also failed.
|
Switching from spreadsheets to Neptune.ai and How it Pushed... |
Hi @Diagrama3 How can I help you? |
@Blaizzy I would also like to be able to resume an init_project in debug mode for testing purposes. Can this be achieved? |
Hi @ljstrnadiii, thanks for reaching out. Yes, it is. Example:
import neptune.new as neptune
project = neptune.init_project(mode="debug") |
@Blaizzy , I tried to stop and init_project again in a separate process, but the key was not present. |
@ljstrnadiii by key you mean api_token, right? If so, you can read more about setting your api_token here: |
Hey there! |
@Blaizzy thanks for checking in. What I want to do is use debug mode in two separate processes:
but this is not possible from what I understand (even though it seems some files get written to tmp somewhere). |
In debug mode, no data is stored or sent anywhere. For the use case you want to test, currently, you have to log metadata to Neptune servers in But I can definitely see your point and I'll submit your comment as a feature request to the product team. |
Hey @ljstrnadiii! Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊 |
@Blaizzy that is what I thought. We test in debug mode and use a neptune run in debug mode as a fixture where we can and that works well, but for some e2e tests, we can only pass a reference to a neptune run or project location. We have created a tests project in neptune for our e2e tests to keep things isolated a bit. Thanks for the clarification! |
It's my pleasure :) You are most welcome @ljstrnadiii! Your solution is quite interesting, and I would love to learn more about it if you don't mind. I think it could provide us with valuable insight that we can incorporate into the product. Let me know what you think |
Resuming offline runs would be very useful. Many people train their models on commercial GPU servers, which often impose a maximum running time on a single session; for example, Kaggle's limit is 12 hours, so we have to split the training into several parts. Training is faster in offline mode, so offline mode is preferred. When the work is done, the offline training data is uploaded to the Neptune server. With my code, Neptune generates several offline outputs in the .neptune directory. I use the command: It executes OK, but only the last run is displayed on the website. It seems the last run overwrites the prior one. |
Hi @bg4xsd Thanks for reaching out and sharing your use case! I have also passed it as feedback to the product team. Regarding your code, I notice that you are using the Each time you run that script and then use the But I can see your point; thanks to your feedback and others, we can now start thinking of a potential solution to this use case. |
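For sessions that are cut off by a time limit, one possible workaround in online mode (not offline mode, which this thread says cannot be resumed) is to save the run ID to disk in the first session and reuse it in later ones, so each session attaches to the same run instead of creating a new one. The sketch below is a minimal illustration under stated assumptions: `start_or_resume`, `load_run_id`, `save_run_id`, and the state file are hypothetical helpers, and the Neptune calls are passed in as callables rather than hard-coded, because the exact resume argument depends on the client version.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional


def load_run_id(state_file: Path) -> Optional[str]:
    """Return the run ID saved by a previous session, or None if there is none."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text()).get("run_id")


def save_run_id(state_file: Path, run_id: str) -> None:
    """Persist the run ID so that a later session can find it."""
    state_file.write_text(json.dumps({"run_id": run_id}))


def start_or_resume(
    state_file: Path,
    create_run: Callable[[], Any],
    resume_run: Callable[[str], Any],
    get_id: Callable[[Any], str],
) -> Any:
    """Create a run in the first session; reattach to it in later sessions.

    All three callables are supplied by the caller. With the Neptune client
    they might wrap neptune.init(...) and run["sys/id"].fetch(), but that
    wiring is an assumption and is not verified here.
    """
    run_id = load_run_id(state_file)
    if run_id is None:
        run = create_run()
        save_run_id(state_file, get_id(run))
        return run
    return resume_run(run_id)
```

The state file has to live somewhere that survives the interruption (for example, a mounted drive or a dataset output directory); otherwise every session starts a fresh run again.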
Hi @Blaizzy , |
Most welcome, and thank you for your kind words! I'm happy you enjoy using Neptune as much as we love making it for you :) |
I will let you know here once the feature is released. Other than that, is there anything else I could help you with? |
Hi @Blaizzy Hope to hear from you soon. By now, no more questions. Anyway, thank you again. |
Perfect, have a great week! :) |
Hi @Blaizzy! Is this feature still on the radar? We train on cloud instances that somewhat frequently get interrupted. This prevents us from using offline mode, as we cannot resume the same run in offline mode. |
This feature is on the radar. However, at the moment, we don't have an ETA for it. Could you share the tracebacks for the times your training gets interrupted? |
Hi @wouterzwerink , Do you still need help with this? |
Offline resume is useful for offline logging. Using online mode slows down long training runs. On cloud GPU services such as Kaggle and Google Colab, training is interrupted every 10 to 12 hours, so an offline resume function is meaningful. |
I understand. Could you share the tracebacks for the times your training gets interrupted? |
@Blaizzy I seem to have missed your question, sorry! |
@wouterzwerink great to hear! If anything pops up feel free to let me know. I'll be happy to help :) |
I am interested in this feature. It'd be very useful for multi-script programs. |
Since it's been a while, I'll add that I'm still very interested in this feature
ends up with:
The thing is that I don't try to fetch data from the server but from the run, wherever it stores its data.