Usage of Tensorboard in Distributed MXNet #7341
Comments
@LakeCarrot Here is an example https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/tensorboard.py in which we use a callback to log training/eval metrics. Functionally, I suppose the usage of TensorBoard is no different than in TensorFlow, as we just write something to the event file. So if you want to log/monitor the overall learning process, I think you should dive into the logging mechanism of MXNet. But I haven't tried distributed MXNet before, so I'm not sure whether the example above will help. It depends on how these metrics are aggregated and computed, and where the event file is.
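A minimal sketch of the callback pattern that module implements. `EventWriter` here is a hypothetical stand-in for the real SummaryWriter (it records tuples instead of writing an event file), and the class and method names are illustrative, not MXNet's exact API:

```python
# Sketch of the metric-logging callback pattern from
# mxnet/contrib/tensorboard.py. `EventWriter` is a hypothetical stand-in
# for a SummaryWriter: it collects (tag, value, step) records in memory
# instead of writing a TensorBoard event file.
class EventWriter:
    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))


class LogMetricsCallback:
    """Called at each epoch/batch end with a `param` object carrying an
    eval_metric; logs every (name, value) pair to the writer."""
    def __init__(self, writer, prefix=None):
        self.writer = writer
        self.prefix = prefix
        self.step = 0

    def __call__(self, param):
        if param.eval_metric is None:
            return
        for name, value in param.eval_metric.get_name_value():
            tag = "%s-%s" % (self.prefix, name) if self.prefix else name
            self.writer.add_scalar(tag, value, self.step)
        self.step += 1


# Minimal demo with fake metric/param objects standing in for MXNet's:
class FakeMetric:
    def get_name_value(self):
        return [("accuracy", 0.9)]

class FakeParam:
    eval_metric = FakeMetric()

writer = EventWriter()
cb = LogMetricsCallback(writer, prefix="train")
cb(FakeParam())
```

In the real module you would pass the callback to `fit` as an epoch-end or batch-end callback and point the writer at a log directory that TensorBoard watches.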
@zihaolucky Thanks for your reply. I have read through that example. One thing I want to double-check: for now, TensorBoard for MXNet only supports reading data from local disk, right? There is no support for reading from cloud storage like S3 or Azure, or from a distributed file system like HDFS.
@LakeCarrot There are some discussions on this issue: https://stackoverflow.com/questions/40830085/tensorboard-can-not-read-summaries-on-google-cloud-storage. As far as I know, this feature isn't in the original TensorBoard either, and our standalone version only supports reading from local disk. The feature request has been raised in dmlc/tensorboard#39; I'll follow that issue and update the status.
@zihaolucky Good to know. Thanks a lot!
Hi all,
I tried to use TensorBoard to visualize my model training process. In single-node training, the usage of TensorBoard is straightforward. Things are different in distributed training. Suppose I have 2 servers and 4 workers in my cluster: how can I use TensorBoard to track the overall training process? As I understand it, there will be four different sets of log files, one on each worker, and I would need four separate TensorBoard processes to visualize the whole run.
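One common arrangement (an assumption on my part, not an MXNet-specific feature) is to give each worker its own subdirectory under a shared log root. TensorBoard treats subdirectories of its `--logdir` as separate runs, so a single TensorBoard instance can display all workers side by side, provided the workers share a filesystem or the directories are synced to the machine running TensorBoard:

```python
import os

def worker_logdir(base, rank):
    # Each worker writes its event files to base/worker-<rank>.
    # TensorBoard treats subdirectories of --logdir as separate runs,
    # so one TensorBoard process pointed at `base` shows every worker,
    # assuming a shared filesystem or logs synced to one host.
    path = os.path.join(base, "worker-%d" % rank)
    os.makedirs(path, exist_ok=True)
    return path
```

Each worker would call this with its own rank before constructing its summary writer.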
After some research, I found the following question on StackOverflow, which says that in TensorFlow only one of the workers needs to write the log.
https://stackoverflow.com/questions/37411005/unable-to-use-tensorboard-in-distributed-tensorflow
I wonder what the intended usage of TensorBoard is in distributed MXNet. My main concern with writing summaries on only one of the workers is whether the log from a single worker is a good representative of the overall learning process.
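If one does follow the TensorFlow convention of letting a single worker write the log, the guard can be as simple as wrapping the logging callback with a rank check. In MXNet the worker rank would come from the distributed kvstore (e.g. `kv.rank`); the wrapper below is a hypothetical sketch of the pattern, not MXNet API:

```python
def only_on_rank_zero(rank, callback):
    # Wrap a logging callback so that only worker 0 writes summaries.
    # `rank` would come from the distributed kvstore (e.g. kv.rank in MXNet);
    # every other worker gets a no-op, so a single event file is produced.
    def guarded(param):
        if rank == 0:
            callback(param)
    return guarded


# Demo: rank 0 logs, rank 3 does not.
calls = []
cb = only_on_rank_zero(0, lambda p: calls.append(p))
cb("batch-end")
skipped = only_on_rank_zero(3, lambda p: calls.append(p))
skipped("batch-end")
```

Whether that single worker's metrics are representative is a separate question; with a synchronous kvstore the workers share parameters, so per-worker curves should at least track the same trend.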
@zihaolucky Thanks a lot for your work on bringing TensorBoard to MXNet. Do you have any thoughts on this question?
Thanks in advance!
Bo