Frequently Asked Questions
This was a design choice as the most obvious, easiest-to-reason-about mapping of a Spark resource to a TF node. The most visible example is that each executor's logs will only contain the logs of a single TF node, which is much easier to debug vs. interspersed logs. Additionally, most Spark/YARN configurations (e.g. memory) can be almost directly mapped to the TF node. Finally, in Hadoop/YARN environments (which was our primary focus), you can still have multiple executors assigned to the same physical host if there are enough available resources.
Note: for YARN, `spark.executor.cores` usually defaults to 1, so we just set `spark.task.cpus=1` as well. If your environment has a different setting for `spark.executor.cores`, you can set `spark.task.cpus` equal to that number to achieve the "one task per executor" goal.
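A minimal PySpark sketch of this configuration (the application name and values are illustrative; adjust them to your cluster):

```python
# Minimal sketch (values are illustrative): one Spark task per executor,
# so each executor hosts exactly one TF node.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tfos_example")
        .set("spark.executor.cores", "1")   # cores per executor (YARN default)
        .set("spark.task.cpus", "1"))       # match executor cores => one task per executor
sc = SparkContext(conf=conf)
```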
Different algorithms are scalable in different ways, depending on whether they're limited by compute, memory, or I/O. In cases where an algorithm can run comfortably on a single node (completing in a few seconds/minutes), adding the overhead of launching nodes in a cluster, coordinating work between nodes, and communicating gradients/weights across the cluster will generally be slower. For example, the distributed MNIST example is provided to demonstrate the process of porting a simple, easy-to-understand, distributed TF application over to TFoS. It is not intended to run faster than a single-node TF instance, where you can actually load the entire training set in memory and finish 1000 steps in a few seconds.
Similarly, the other examples just demonstrate the conversion of different TF apps; they are not intended to demonstrate re-writing the code to be more scalable. Note that these examples were originally designed with different scalability models, as follows:
- cifar10 - single-node, multi-gpu
- mnist - multi-node, single-gpu-per-node
- imagenet - multi-node, single-gpu-per-node
- slim - single-node, multi-gpu OR multi-node, single-gpu-per-node
Finally, if an application takes a "painfully long" time to complete on a single node, then it is a good candidate for scaling. However, you will still need to understand which resource is the bottleneck and how to modify your code in order to scale for that resource in a distributed cluster.
Because we assume that one executor hosts only one TF node, you will encounter this error if your cluster is not configured accordingly. Specifically, this is seen when an RDD feeding task lands on an executor that doesn't contain a TF worker node, e.g. it lands on the PS node or even an idle executor.
There are several possible causes for this symptom:
- In Hadoop/YARN environments, you are likely missing the path to `libhdfs.so` in your `spark.executorEnv.LD_LIBRARY_PATH` (see the sketch after this list). TensorFlow 1.1+ requires this to be set.
- If you are using different settings than the original example, you may need to increase the amount of data being fed to the cluster with the `--epochs` argument, and you may need to increase the `--steps` argument accordingly.
- In some environments, the `hostname` returned for the `cluster_spec` is not routable, so the TF nodes cannot connect to each other to form the cluster. There was a PR (#109, merged 20170728) to switch to using IP addresses internally, so you may need to update your version of TFoS.
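For the first cause, a minimal PySpark sketch of setting the executor environment might look like the following (the native-library paths are illustrative; use the actual locations of `libhdfs.so` and your JVM libraries on your cluster):

```python
# Minimal sketch (paths are illustrative): point executors at libhdfs.so
# so TensorFlow 1.1+ can read from and write to HDFS.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tfos_example")
        .set("spark.executorEnv.LD_LIBRARY_PATH",
             "/usr/lib/hadoop/lib/native:/usr/lib/jvm/default-java/jre/lib/amd64/server"))
sc = SparkContext(conf=conf)
```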
For InputMode.SPARK, how are `epochs`, `steps`, `batch_size`, and `num_workers` correlated and how should they be set?
In general, an epoch is one pass through all of the records in a dataset. For TensorFlow, one step is one pass of the execution graph, e.g. `sess.run()`, which is usually done on one batch of the dataset. So, the number of steps in one epoch is equal to the number of records divided by the batch size. And for distributed TensorFlow, these steps will be "sharded" across your workers. So, putting this all together, you'd have:
steps = epochs * num_records_per_epoch / batch_size
steps_per_worker = steps / num_workers
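As a worked example of these formulas (the numbers are illustrative, roughly the size of the MNIST training set):

```python
# Illustrative arithmetic for the formulas above.
num_records_per_epoch = 60000   # e.g. MNIST training set
batch_size = 100
epochs = 2
num_workers = 3

steps = epochs * num_records_per_epoch // batch_size   # 2 * 60000 / 100 = 1200
steps_per_worker = steps // num_workers                 # 1200 / 3 = 400
```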
For InputMode.SPARK, we use Spark's `UnionRDD` to simulate epochs when feeding a dataset as an RDD. However, TensorFlow applications can often finish before consuming all of the data, e.g. stopping at a maximum number of steps or when a metric reaches some value. To handle these cases, TensorFlowOnSpark provides a `TFDataFeed.terminate()` API to stop the Spark data feeding. Unfortunately, Spark must still fully consume and process all of the remaining partitions in order to consider that stage successful, so TensorFlowOnSpark just consumes the remaining Spark partitions without actually sending the data to the TensorFlow process. Obviously, this can be expensive and time-consuming, so it is highly recommended that you closely match your `epochs`, `batch_size`, `steps`, and workers to avoid unnecessary overhead.
For example, using the formulas above, if you want more `epochs`, you should increase your `steps` accordingly, or if you increase the `batch_size`, you should decrease `steps`, etc.
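As a rough sketch of where the terminate call fits inside a worker's map function (the `TFNode.DataFeed` helper, its `next_batch()`/`should_stop()` methods, and the argument names here are assumptions based on the project's examples; check the API docs for your TFoS version):

```python
# Rough sketch (API names assumed from the TFoS examples): stop Spark feeding
# once the desired number of steps has been reached.
from tensorflowonspark import TFNode

def map_fun(args, ctx):
    tf_feed = TFNode.DataFeed(ctx.mgr)          # queue fed by the Spark RDD
    step = 0
    while step < args.steps and not tf_feed.should_stop():
        batch = tf_feed.next_batch(args.batch_size)
        # ... run one training step on `batch` ...
        step += 1
    tf_feed.terminate()                          # tell Spark to stop feeding data
```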
Only one RDD can be passed into the TensorFlow processes at a time. The data from the Spark RDDs is fed into a queue on each executor, from which the local TensorFlow process will pull data. If you have multiple input RDDs with the same cardinality (e.g. features + labels), you can simply `zip` the two RDDs together and then extract the components on the TensorFlow side (see the sketch below). If you have two distinct RDDs (e.g. train vs. test), you can either run a separate evaluation job that just loads the latest checkpoint on a periodic basis and evaluates on the test RDD, or you can persist the validation RDD to disk and use a `tf.data.Dataset` to load that data during the evaluation phase.
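For the features + labels case, a minimal PySpark sketch might look like the following (the dummy data and the commented-out `cluster.train()` call are illustrative; see the project's examples for the full flow):

```python
# Minimal sketch (dummy data): combine features and labels into a single RDD.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
images = sc.parallelize([[0.0] * 784 for _ in range(1000)])   # fake image vectors
labels = sc.parallelize([[0.0] * 10 for _ in range(1000)])    # fake one-hot labels
dataRDD = images.zip(labels)    # each element becomes (features, label)

# cluster.train(dataRDD, args.epochs)   # assumed TFoS call; extract the tuple
#                                       # components on the TensorFlow side
```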