This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Usage of Tensorboard in Distributed MXNet #7341

Closed
LakeCarrot opened this issue Aug 4, 2017 · 4 comments

@LakeCarrot

Hi all,
I am trying to use TensorBoard to visualize my model training process. In single-node training, using TensorBoard is straightforward, but things are different in distributed training. Suppose I have 2 servers and 4 workers in my cluster: how can I use TensorBoard to track the overall training process? As I understand it, there will be 4 different sets of log files, one on each worker, and I would need 4 separate TensorBoard processes to visualize the whole run.
After some research, I found the following question on StackOverflow, which says that in TensorFlow only one of the workers needs to write the log.
https://stackoverflow.com/questions/37411005/unable-to-use-tensorboard-in-distributed-tensorflow
What is the intended usage of TensorBoard in distributed MXNet? My main concern about writing summaries on only one worker is whether the log from a single worker is a good representation of the overall learning process.
@zihaolucky Thanks a lot for your work on bringing TensorBoard to MXNet. Do you have any thoughts on my question?
Thanks in advance!
Bo

@zihaolucky
Member

@LakeCarrot Here is an example, https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/tensorboard.py, in which we use a callback to log training/eval metrics.

In terms of functionality, I suppose using TensorBoard here is no different from using it in TensorFlow, as we just write to the event file. So if you want to log/monitor the overall learning process, I think you should dive into the logging mechanism of MXNet.

But I haven't tried distributed MXNet before, so I'm not sure whether the example above will help. It depends on how these metrics are aggregated and computed, and where the event file is.
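
To make the "where the event file is" point concrete, here is a minimal sketch of one way a worker could decide whether, and where, to log. It is an illustration under assumptions, not the by-design MXNet approach: `worker_log_dir` and `build_callbacks` are hypothetical helper names, the `logs/worker-N` layout is an arbitrary choice, and the callback factory is a stand-in so the sketch runs without MXNet. In a real job, the rank would presumably come from the distributed kvstore (`kv.rank`), and the factory could be `mx.contrib.tensorboard.LogMetricsCallback` from the example linked above.

```python
import os


def worker_log_dir(base_dir, rank):
    """Return a per-worker run directory, e.g. logs/worker-0 (hypothetical layout)."""
    return os.path.join(base_dir, "worker-{}".format(rank))


def build_callbacks(rank, make_callback, base_dir="logs", chief_only=False):
    """Return the list of logging callbacks this worker should attach.

    rank          -- this worker's index (assumed to come from kv.rank in MXNet)
    make_callback -- factory that takes a log directory and returns a callback
    chief_only    -- if True, only rank 0 logs (the StackOverflow suggestion)
    """
    if chief_only and rank != 0:
        return []
    return [make_callback(worker_log_dir(base_dir, rank))]


# Stand-in factory (returns the directory itself) so the sketch runs without MXNet.
for rank in range(4):
    print(rank, build_callbacks(rank, make_callback=lambda log_dir: log_dir))
```

If every worker writes to its own subdirectory and those subdirectories end up under one parent directory on a single machine, one TensorBoard process pointed at the parent shows each worker as a separate run, so the curves can be compared directly. Whether a single worker is representative still depends on how the metrics are aggregated.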

@LakeCarrot
Author

LakeCarrot commented Aug 7, 2017

@zihaolucky Thanks for your reply. I have read through that example. One thing I want to double-check: for now, TensorBoard for MXNet only supports reading data from local disk, right? There is no support for reading from cloud storage such as S3 or Azure, or from a distributed file system such as HDFS.

LakeCarrot reopened this Aug 7, 2017

@zihaolucky
Member

@LakeCarrot There is some discussion of this issue at https://stackoverflow.com/questions/40830085/tensorboard-can-not-read-summaries-on-google-cloud-storage. As far as I know, I haven't seen this feature in the original TensorBoard, and our standalone version only supports reading from local disk. This feature has been requested in dmlc/tensorboard#39; I'll follow that issue and update the status.
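
Given the local-only limitation, one possible workaround is to copy the event files to local disk before starting TensorBoard. The sketch below assumes the remote logs are already reachable as an ordinary directory tree (for example a mounted share or a downloaded copy); it does not talk to S3, Azure, or HDFS itself. `mirror_event_files` is a hypothetical helper, and the only convention it relies on is that TensorBoard event file names contain `tfevents`.

```python
import os
import shutil

# TensorBoard event files are named like events.out.tfevents.<timestamp>.<host>
EVENT_MARKER = "tfevents"


def mirror_event_files(src_root, local_logdir):
    """Copy event files under src_root into local_logdir, keeping subfolders.

    Returns the sorted list of destination paths that were written.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            if EVENT_MARKER not in name:
                continue  # skip checkpoints, notes, and other non-event files
            rel_dir = os.path.relpath(dirpath, src_root)
            dest_dir = os.path.normpath(os.path.join(local_logdir, rel_dir))
            os.makedirs(dest_dir, exist_ok=True)
            dest = os.path.join(dest_dir, name)
            shutil.copy2(os.path.join(dirpath, name), dest)
            copied.append(dest)
    return sorted(copied)
```

Because subfolders are preserved, per-worker directories in the source tree remain separate runs in the local copy. The copy has to be repeated (or scheduled) to pick up new events while training is still running.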

@LakeCarrot
Author

@zihaolucky Good to know. Thanks a lot!
