1. How do I run a program with a custom framework version, or a version different from the one installed on the cluster?
Specify the files required for that framework version and its dependent libraries with submit parameters such as --file, --cacheFile or --cacheArchive. If necessary, also set the PYTHONPATH environment variable, for example export PYTHONPATH=./:$PYTHONPATH.
To view the execution progress in both the XLearning client and the application web interface, the program must print its progress to standard error in the format "report:progress:<float type>".
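The progress line described above can be emitted with a small helper. This is a minimal sketch rather than XLearning's own code: the function name report_progress, the stream parameter, and total_steps are hypothetical, and it assumes the float is a completed fraction between 0.0 and 1.0, which the text above does not state.

```python
import sys


def report_progress(fraction, stream=sys.stderr):
    """Write one progress line in the "report:progress:<float>" format."""
    # Assumption: the value is a completed fraction between 0.0 and 1.0.
    stream.write("report:progress:%f\n" % float(fraction))
    # Flush so the line is not held back in a buffer.
    stream.flush()


if __name__ == "__main__":
    total_steps = 5  # hypothetical number of training steps
    for step in range(1, total_steps + 1):
        report_progress(step / total_steps)
```

Writing through an explicit stream argument keeps the helper easy to check locally; in a real job the default, standard error, is the stream the text above requires.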
In distributed TensorFlow applications, the ClusterSpec is normally defined in advance by setting the host and port of each ps and worker. XLearning constructs the ClusterSpec automatically. The program can read the ClusterSpec, job_name and task_index from the environment variables TF_CLUSTER_DEF, TF_ROLE and TF_INDEX, for example:
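import json
import os
import tensorflow as tf
# XLearning exports the cluster layout and this task's role and index.
cluster_def = json.loads(os.environ["TF_CLUSTER_DEF"])
cluster = tf.train.ClusterSpec(cluster_def)
job_name = os.environ["TF_ROLE"]  # "ps" or "worker"
task_index = int(os.environ["TF_INDEX"])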
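To check the environment parsing without a cluster or a TensorFlow install, the same lookup can be wrapped in a function and fed a fake environment. This is only a sketch: read_task_info and fake_env are hypothetical names, the host and port values are made up, and it assumes TF_CLUSTER_DEF is a JSON object mapping each job name to a list of "host:port" strings.

```python
import json
import os


def read_task_info(environ=os.environ):
    """Return (cluster_def, job_name, task_index, address) for this task."""
    cluster_def = json.loads(environ["TF_CLUSTER_DEF"])
    job_name = environ["TF_ROLE"]
    task_index = int(environ["TF_INDEX"])
    # Assumption: each job name maps to a list of "host:port" strings.
    address = cluster_def[job_name][task_index]
    return cluster_def, job_name, task_index, address


if __name__ == "__main__":
    # Hypothetical values standing in for what XLearning would export.
    fake_env = {
        "TF_CLUSTER_DEF": json.dumps(
            {"ps": ["localhost:2222"],
             "worker": ["localhost:2223", "localhost:2224"]}
        ),
        "TF_ROLE": "worker",
        "TF_INDEX": "1",
    }
    _, job, index, addr = read_task_info(fake_env)
    print(job, index, addr)
```

In a TensorFlow 1.x program, the resulting job_name and task_index would typically be passed together with the ClusterSpec to tf.train.Server; that step is omitted here so the sketch runs without TensorFlow.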