Ongoing Poseidon/Firmament Work & Concerns #69

deepak-vij · 2018-07-17T18:22:52Z

Hi Ionel & Malte, hope things are great with you two. I wanted to give you an update and ask for your expert opinion/advise on one very important issue we have identified while doing detail throughput performance testing recently. Firstly, we have completed incorporating the following scheduling functionality into Firmament scheduler:

Node level Affinity/Anti-Affinity using the network flow graph approach, using the similar approach as for regular workloads/tasks with no affinity/anti-affinity requirements.
Pod Level Affinity/Anti-Affinity using Pod-at-a-time with multi-round scheduling approach. We had to optimize the multi-round process somewhat in order for better throughput. Currently, we do see that Firmament throughput is approx. 2X better even though we are doing pod-at-a-time processing.
Support for Taints/Tolerations.

Overall, throughput numbers are definitely in favor of Firmament scheduler by a great margin, as we earlier discovered as well. However, there is a one caveat in all this. The Job size (K8S replica-set, deployment or Jobs) is quite large with large no. of tasks per job in these tests. If the Job size is smaller and we have great number of Jobs, Firmament performance really degrades as you can see in the examples below. This is due to the fact that the solver has to deal with great no. of arcs in such cases.

Net-net is, based on our assessment, Firmament scheduler definitely does a great job in use cases where Job size is quite large. This is primarily true because due to equivalence classes as a mechanism for amortizing the work.

Large number of Jobs consisting of smaller number of tasks, throughput benefits are not there due to large number of arcs drawn in the graph.

Question for both of you is if there is a way to optimize all this and reduce the number of arcs in the following examples in order for Firmament to be a general purpose scheduler. Please let us know your thoughts and perspective on all this. I will also create an issue to this regards in CAMSAS as well. Thanks.

Node Anti-Affinity Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 800 replicate-sets with 19 replicas in each set.
• Let us also assume that we have set limit of 19 arcs between task EC and nodes (using machine individual ECs with each EC of capacity 1). This is essentially to load balance the incoming workloads/pods across multiple nodes (each arc increases incrementally in order to do load distribution across eligible machines).
• In a node anti-affinity use case scenario, let us assume that an incoming pod can go to any remaining 799 nodes as one single node in the cluster has conflict with the node level anti-affinity rule for incoming Pods.
• Accordingly, we end up having no. of arcs in the flow graph as = 799 * 800 * 19 = 12,144,800 arcs.
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 12,144,800 arcs in the graph unnecessarily.
• Ideally, we should limit the no of arcs drawn between task EC and nodes (using machine ECs) to lowest cost 15,200 arcs only.

Normal Pod Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 3,040 replicate-sets with 5 replicas in each set. Each replica-set uses a unique CPU-Memory combination.
• Let us also assume that we have set limit of 19 arcs between task EC and nodes (using machine individual ECs with each EC of capacity 1). This is essentially to load balance the incoming workloads/pods across multiple nodes (each arc increases incrementally in order to do load distribution across eligible machines).
• Let us assume incoming pods can go to any of the 800 nodes.
• Accordingly, we end up having no. of arcs in the flow graph as = 3,040 * 19 * 800 = 46,208,000 arcs.
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 46,208,000 arcs in the graph unnecessarily.
• Ideally, we should limit the no of arcs drawn between task EC and nodes (using machine ECs) to lowest cost 15,200 arcs only.

deepak-vij · 2018-07-17T18:25:18Z

@ms705 @ICGog @shivramsrivastava

deepak-vij · 2018-07-20T20:20:39Z

Hi Ionel & Malte, providing more clarity to my earlier note. Too many arcs in the flow graph for small jobs (jobs with few tasks in it) has become a very thorny issue in order to articulate the value proposition of Firmament scheduler. It would be great to get your invaluable feedback/comments.

First of all, a clarity around why do we have a pre-set of predefined number (19 in our case) of machine ECs (M0EC1, M0EC2,.., M2EC2) in order to do load distribution across filtered machines during each scheduling iteration. This is mainly because if we have only one arc from task EC node to machine node, then there is a chance that all the incoming flows (tasks) flow across the single arc, and overloading the single machine with so many tasks even though there are machines with lot of available resources. So to avoid scheduling many tasks on a same machine, we draw multiple arcs between a task EC and machine nodes using intermediate machine EC nodes. We connect task EC to multiple machine ECs and then these machine ECs are connected to corresponding machine node. The capacity on the arc (task EC to machine EC) and arc (machine EC to machine node) is set to 1. And costs on arcs between task EC and machine ECs are assigned incrementally.

Let us take an example where there are three machines M0, M1 & M2 and each machine has a capacity of 2 flows. Load distribution is achieved by assigning two arcs via corresponding machine ECs and the cost on each arc increases incrementally. In this case, arcs connecting to machine ECs for machine M0 have value of cost 100 and 200 respectively. Capacity on these arcs would be 1 each. In a nutshell, for each unit of capacity, there would be corresponding machine EC. The cost on these arcs would increase incremental in order to achieve load distribution across machines.

In a nutshell, this is the crux of the problem where we end up unnecessarily creating inordinate number arcs for even small jobs (jobs with few tasks in it). Essentially, we end up providing too many options for the solver even though the number of incoming tasks to the task EC are quite smaller.

As a way around this problem, we would need to programmatically control the number arcs created from task EC to machine ECs depending on the number of incoming task flow. This could become a complex logic in itself and as such, we would end up doing what graph solver should be doing in the first place.

Thanks and look forward to hearing from you.

Regards,
Deepak Vij

deepak-vij · 2018-07-23T17:54:34Z

In the “Normal Pod Scenario” which I stated earlier, we blindly connect 19 (some pre-determined constant number) arcs between a task EC and machines (via single capacity machine ECs). For example, for a cluster consisting of 800 eligible nodes for incoming 3,040 jobs each consisting of 5 tasks each, we end up having 3,040 * 19 * 800 arcs. This equals 48,640,000 arcs for all these jobs.

We can definitely reduce the number of arcs between task EC and machines (via single capacity machine ECs) depending on the incoming tasks and resources available for a machine. For example, in our example, let us say some eligible node can only fit in 3 tasks out of incoming 5 tasks. In that case, we only connect 3 tasks between task EC and machines (via single capacity machine ECs). Also, for a job consisting of 5 tasks, we will only connect 5 arcs between task EC and machines. This will reduce the number of arcs drawn for a scheduling run.

Also, we were also thinking of reducing the no. of eligible machines for incoming tasks depending on the no. incoming tasks. For example, in our example, as the number of incoming tasks for a job are 5, we only filter top 5 machines out of total eligible 800 machines. This will drastically reduce the number of arcs, especially for the smaller incoming jobs. But the problem arises due to overcommitting of top “n” nodes for a job. This becomes a serious problem with top “n” nodes approach. Unless we subtract the resource usage of a machine for each job, that will give a different set of top "n" nodes for the next job. But at the graph building time we don’t know which particular node to subtract the resource consumption from. Essentially, we end up overcoming top “n” machines with this approach.

We need to be able to figure out a way to filter out large number of machines for smaller jobs (jobs with 2-5 tasks). Otherwise, we end up having too many arcs and hence impact on the overall throughput. Hope this makes sense. Thanks.

deepak-vij · 2018-07-23T20:20:32Z

Also, it is worthwhile highlighting that too many arcs in the flow graph problem was there even prior to adding affinity/anti-affinity functionality within firmament. It so happened that all this while we have been doing throughput performance testing using jobs with lots of tasks in them. Also, for smaller jobs we were able to group multiple jobs in case they had the same CPU/Memory resource requirements combination. As a result, we were able to amortize work across jobs. That is why we saw good throughput numbers (20-30X in most cases).

Now that we have incorporated affinity/anti-affinity functionality, we cannot group multiple Jobs even though they may have same CPU/Memory resource requirements combination. This is because we need to distinguish between jobs in order to match Pod Label for incoming tasks with the already running tasks in order to support pod level anti-affinity. This is why too many arcs problem has surfaced now. Although, it existed before as well in certain cases, we did not happen to test these test cases previously.

deepak-vij · 2018-07-25T19:47:57Z

We are currently using CS2 (Cost Scaling) MCMF solver. Based on the OSDI paper you two published, it seems "Relaxation" algorithm has much better performance at scale for typical workloads. With that in mind, would Flowlessly/Relaxation solver help out in case of large input graph.

ICGog · 2018-07-26T10:57:28Z

I think that Flowlessly relaxation algorithm is very likely to improve performance because your graph has a nice structure: there's a total order of the arc costs between each EC & machine ECs. Relaxation, would first push flow on the lowest-cost arc, which in this graph is also likely to be the best arc to push flow. By contrast, cost scaling does not take advantage of your graph structure.

Regardless, it would be beneficial to spend more time to brainstorm ideas to reduce the number of arcs your graphs have. I don't remember exact numbers, but I think that even Relaxation would not solve a 48 million arcs in less than a second.

deepak-vij · 2018-07-26T18:13:51Z

In regards to reducing the no. of arcs in the flow graph, we have been successfully able to make good inroads into accomplishing this. As mentioned earlier that the main culprit was “pod level anti-affinity symmetry” functionality we recently added. Because of this, we could not group jobs even if they have the same CPU/Mem resource requirements as we need to check the Job specific Pod label in order to provide support for “pod level anti-affinity symmetry” functionality. This, in turn, results in a lot more no. of arcs in the flow graph and less work amortization across jobs.

After brainstorming all this a bit amongst us, we have successfully resolved the negative impact of "pod level anti-affinity symmetry" functionality on the normal pods by isolating conflicts due to “pod level anti-affinity symmetry” functionality on normal pods/tasks. With the latest design changes, for normal pods with no pod level anti-affinity symmetry conflict, we see very substantial benefits as we observed in the earlier throughput performance testing exercise we did couple of months ago (20-30X better).

With the latest design changes, irrespective of size of the "Jobs" (Replicaset, Deployment or Jobs), we are seeing substantial throughput benefits using Firmament scheduler as long as resource requirements (CPU/Mem) for incoming Pods are uniform. This is mainly because Firmament is able to efficiently amortize work across ReplicaSets/Deployments/Jobs.

It is important to highlight that the above mentioned performance gains are based on the fact that jobs within a scheduling run have similar CPU/Mem resource requirements – uniform resource requirements. In case of any differences in CPU/Mem combination for a Job, we don’t have any choice but to create a new task EC for that particular job which results in a higher no. of arcs and performance degradation.

Ionel/Malte, it would be great to get your input/suggestions how to mitigate this so that we can somehow group jobs even though there is a slight difference in CPU/Mem resource requirements. We need to have some kind of an approximation logic to mitigate this.

Although orthogonal to the ongoing discussion, one additional issue we observed related to bulk processing of incoming jobs in Firmament versus pod-at-a-time in Kubernetes default scheduler. This is due to overcommitting of eligible machines at the time of graph building process for a scheduling run. This is because as we are connecting arcs between task EC and machines (via machine ECs), for each task EC we don’t account for the resources consumed by the previous EC unless we subtract the resource usage of a machine for the previous EC/Job. But at the graph building time we don’t know which particular node to subtract the resource consumption from. This is something we don’t have any solution for yet. It would be great to get input on this and possible solution.

Look forward to your feedback/suggestions. Thanks.

deepak-vij · 2018-08-03T02:01:01Z

Hi Malte, one of the culprits for large number of arcs in the flow graph is the number of nodes in a large cluster. In our case, cluster has 800 machines in it. I was reading through your thesis, it seems you talk about mitigating this issue while describing CoCo cost model. In the thesis, you talk about possibly using "maximally aggregated subtrees" of the resource topology. Is this something can be explored to do optimizations for a large cluster consisting of large no. of machines in it. I am assuming what you are proposing is that we group machines in a cluster into a tree like structure (country->region->zone->rack).

As part of the filtering process, we consider connecting arcs to the topology level equivalence classes (rack, zone etc.) only. I am assuming this will reduce the no. of arcs drawn to machines in the cluster. Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ongoing Poseidon/Firmament Work & Concerns #69

Ongoing Poseidon/Firmament Work & Concerns #69

deepak-vij commented Jul 17, 2018 •

edited

Loading

deepak-vij commented Jul 17, 2018

deepak-vij commented Jul 20, 2018

deepak-vij commented Jul 23, 2018

deepak-vij commented Jul 23, 2018

deepak-vij commented Jul 25, 2018

ICGog commented Jul 26, 2018

deepak-vij commented Jul 26, 2018 •

edited

Loading

deepak-vij commented Aug 3, 2018 •

edited

Loading

Ongoing Poseidon/Firmament Work & Concerns #69

Ongoing Poseidon/Firmament Work & Concerns #69

Comments

deepak-vij commented Jul 17, 2018 • edited Loading

deepak-vij commented Jul 17, 2018

deepak-vij commented Jul 20, 2018

deepak-vij commented Jul 23, 2018

deepak-vij commented Jul 23, 2018

deepak-vij commented Jul 25, 2018

ICGog commented Jul 26, 2018

deepak-vij commented Jul 26, 2018 • edited Loading

deepak-vij commented Aug 3, 2018 • edited Loading

deepak-vij commented Jul 17, 2018 •

edited

Loading

deepak-vij commented Jul 26, 2018 •

edited

Loading

deepak-vij commented Aug 3, 2018 •

edited

Loading