Performance of saga python

Andre Merzky edited this page Feb 14, 2013 · 33 revisions

Job Execution via shell adaptor:

Comparing shell adaptor to raw local shell performance (script here) results in the following numbers:

| id | type | jobs/second | Real Time (s) | System Time (s) | User Time (s) | Utilization (%) |
|----|------|-------------|---------------|-----------------|---------------|-----------------|
| 1 | `/bin/sleep 1 &` | 5,000.0 | 0.2 | 0.0 | 0.1 | 58.7 |
| 2 | `RUN sleep 1` | 87.7 | 11.4 | 0.2 | 0.8 | 9.2 |
| 3 | `RUN sleep 1 @ cyder.cct.lsu.edu/` | 28.1 | 35.5 | 0.1 | 0.2 | 0.7 |
| 4 | `js = saga.job.Service ('fork://localhost/')`<br>`j = js.create_job ({executable:'/bin/data', arguments :['1']})` | 57.8 | 17.3 | 8.1 | 0.8 | 51.2 |
| 5 | `js = saga.job.Service ('fork://localhost/')`<br>`j = js.run_job ("/bin/sleep 1")` | 56.8 | 17.6 | 8.2 | 0.8 | 51.3 |
| 6 | `js = saga.job.Service ('ssh://localhost/')`<br>`j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 56.8 | 17.6 | 8.2 | 0.8 | 50.7 |
| 7 | `js = saga.job.Service ('ssh://[email protected]/')`<br>`j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 4.3 | 228.1 | 65.9 | 9.0 | 32.8 |
| 8 | `js = saga.job.Service ('gsissh://trestles-login.sdsc.edu/')`<br>`j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 4.0 | 247.3 | 68.6 | 9.6 | 31.6 |

  1. Plain shell is (as expected) very quick, and needs virtually no system resources. Time is spent mostly in the shell internals.
  2. The shell wrapper we use adds an order of magnitude, mostly in user time. The wrapper makes sure that job information (state, pid) is written to disk, spawns a monitoring daemon, reports the job ID, etc. Most of the time, however, is spent waiting for that state information to become consistent and to be written (waitpid, sync) -- thus the low system utilization.
  3. Running jobs on the shell wrapper over ssh is relatively quick, too -- as input and output are streamed without blocking, almost no time is wasted in waits.
  4. The python layer on top of the local wrapper adds a factor of ~1.5, which is quite acceptable, as that includes the complete SAGA and Python stack, and I/O capturing/parsing etc.
  5. The shortcut method run_job() adds no significant overhead, despite the fact that it creates a new job description per job.
  6. Exchanging fork://localhost/ with ssh://localhost/ again adds almost nothing -- the overhead is solely due to the increased startup time.
  7. As expected, real remote operations add significant overhead, because operations are synchronous: the adaptor waits for confirmation that each operation succeeded. This adds one roundtrip per operation (200ms * 1000 = 200 seconds). Locally the roundtrip does not contribute, so the adaptor and wrapper operations (4, 5, 6) stay close to the nonblocking I/O versions (1, 2, 3).
  8. No significant difference between ssh and gsissh, as expected...
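
The throughput figures and the roundtrip estimate in item 7 can be re-derived from the measured real times. The sketch below assumes that each test submits 1000 jobs (this is inferred from the `200ms * 1000` estimate and is not stated explicitly on this page), and the helper names are made up for illustration.

```python
# Assumption: every test row submits N_JOBS jobs (inferred, not stated on this page).
N_JOBS = 1000

def jobs_per_second(real_time_s, n_jobs=N_JOBS):
    """Throughput as jobs submitted per wall-clock second."""
    return n_jobs / float(real_time_s)

def roundtrip_overhead_s(latency_s, n_jobs=N_JOBS, roundtrips_per_job=1):
    """Wall-clock time spent waiting if every operation blocks for one roundtrip."""
    return latency_s * roundtrips_per_job * n_jobs

# Real times (s) from the table above, keyed by test id.
real_time = {2: 11.4, 4: 17.3, 6: 17.6}

for test_id, t in sorted(real_time.items()):
    print("test %d: %.1f jobs/second" % (test_id, jobs_per_second(t)))

# Python layer vs. plain wrapper (item 4): ratio of real times.
print("python layer factor: %.2f" % (real_time[4] / real_time[2]))

# Synchronous remote submission (item 7): ~200 ms latency per operation.
print("roundtrip wait: %.0f s" % roundtrip_overhead_s(0.2))
```

The roundtrip wait alone accounts for most of the real time measured in tests (7) and (8), which is why the remaining sections look at ways to overlap those waits.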

At this point we have reached saturation for a remote backend -- any adaptor built on top of the PTYShell and PTYProcess infrastructure will see roughly the above limits. There are three options for further scaling, though: (A) concurrent job service instances, (B) asynchronous operations, and (C) bulk operations.

(A) Concurrent Job Services:

We can run test (7) again, but use an increasing number of job services (each in its own application thread), and observe the following behavior:

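| id | threads | Real Time (s) | jobs/second |
|----|---------|---------------|-------------|
| 7.01 | 1 | 225.2 | 4.4 |
| 7.02 | 2 | 122.7 | 8.1 |
| 7.03 | 3 | 80.9 | 12.4 |
| 7.04 | 4 | 62.9 | 15.9 |
| 7.05 | 5 | 52.2 | 19.1 |
| 7.06 | 6 | 50.5 | 19.8 |
| 7.07 | 7 | 40.1 | 24.9 |
| 7.08 | 8 | 36.2 | 27.6 |
| 7.09 | 9 | 38.5 | 25.9 |
| 7.10 | 10 | 32.9 | 30.4 |
| 3 | 1 | 35.5 | 28.2 |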

So, this scales up to what we have seen from a plain piped ssh submit (3)! FWIW, at this point the limiting factor on cyder is concurrent disk access -- for some reason that is not handled well, and beyond a certain load it seems to require unreasonable system CPU time.
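
The threading pattern behind these runs can be sketched as follows: one job service per application thread, each submitting its share of the jobs. `FakeService` is a hypothetical stand-in that only sleeps to mimic a blocking submission roundtrip; in a real run it would be replaced by a `saga.job.Service` instance, and the latency value and URL here are assumptions, not measured numbers.

```python
import threading
import time

class FakeService(object):
    """Hypothetical stand-in for a job service; sleeps to mimic one roundtrip."""
    def __init__(self, url, latency_s=0.002):
        self.url = url
        self.latency_s = latency_s
        self.submitted = 0

    def run_job(self, command):
        time.sleep(self.latency_s)   # blocking, synchronous submission
        self.submitted += 1

def submit_concurrently(make_service, n_threads, n_jobs, command="/bin/sleep 1"):
    """Split n_jobs over n_threads; each thread owns its own service instance."""
    services = [make_service() for _ in range(n_threads)]

    def worker(service, count):
        for _ in range(count):
            service.run_job(command)

    # Distribute jobs as evenly as possible over the threads.
    base, extra = divmod(n_jobs, n_threads)
    counts = [base + (1 if i < extra else 0) for i in range(n_threads)]

    threads = [threading.Thread(target=worker, args=(s, c))
               for s, c in zip(services, counts)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    return sum(s.submitted for s in services), elapsed

if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        done, elapsed = submit_concurrently(
            lambda: FakeService("ssh://example.invalid/"), n, 100)
        print("%2d threads: %3d jobs in %.2f s" % (n, done, elapsed))
```

Because each thread spends most of its time blocked on the roundtrip, the waits overlap and wall-clock time drops with the thread count, until some shared resource (here: disk access on the remote host) becomes the bottleneck.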

Alas, I did not know that most ssh daemons limit the number of concurrent ssh connections to 10 -- so this is the end of scaling for one host. But we can use multiple hosts concurrently:

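| id | threads | hosts | Real Time (s) | jobs/second |
|----|---------|-------|---------------|-------------|
| 7.10 | 10 | cyder | 30.5 | 32.8 |
| 7.20 | 20 | cyder, repex1 | 16.6 | 60.4 |
| 7.30 | 30 | cyder, repex1, trestles | 14.4 | 69.3 |
| 7.40 | 40 | cyder, repex1, trestles, india | 12.7 | 78.7 |
| 7.50 | 50 | cyder, repex1, trestles, india, sierra | 11.6 | 86.5 |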

So that scales nicely, too, over ~200ms latency connections -- but saturation is obviously in sight. At that point, the local python instance consumes all CPU, and is busy doing I/O and string parsing -- that is the limiting factor.
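
One way to see the approaching saturation is to look at the rate each thread contributes and at the gain from every additional host. The following is a small sketch using only the numbers from the multi-host table; the function names are made up for illustration.

```python
# (threads, jobs/second) taken from the multi-host table above.
measurements = [(10, 32.8), (20, 60.4), (30, 69.3), (40, 78.7), (50, 86.5)]

def per_thread_rate(threads, rate):
    """Average jobs/second contributed by each thread."""
    return rate / threads

def marginal_gain(rows):
    """Extra jobs/second gained by each step of 10 added threads (one host)."""
    return [round(b[1] - a[1], 1) for a, b in zip(rows, rows[1:])]

for threads, rate in measurements:
    print("%2d threads: %.2f jobs/second per thread"
          % (threads, per_thread_rate(threads, rate)))

print("gain per added host: %s" % marginal_gain(measurements))
```

The per-thread rate falls from about 3.3 jobs/second with 10 threads to about 1.7 with 50, which is consistent with the local Python process becoming the bottleneck.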

O(100) jobs/second seem well achievable (there are some optimizations on shell wrapper level and on PTYProcess level left to do), but it is unlikely to go much further, really. Also, an obvious drawback is that the user needs to start application threads in order to get that performance.

(B) Asynchronous Operations and (C) Bulk Operations:

Both are actually emulated on SAGA-Python engine level -- but that won't buy you much performance. Asynchronous ops spawn a thread for each operation, which is only useful for really long running, blocking operations -- job submission is relatively quick. Bulk operations are broken down into individual operations, thus, duh!
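
To illustrate why engine-level emulation gains little, here is a rough sketch of the two patterns just described. This is not SAGA-Python's actual code: `emulated_async`, `emulated_bulk` and the `submit` callable are hypothetical names used only for this example.

```python
import threading

def emulated_async(operation, *args):
    """Run one blocking operation in its own thread; the caller joins later."""
    result = {}

    def target():
        result["value"] = operation(*args)

    thread = threading.Thread(target=target)
    thread.start()
    return thread, result

def emulated_bulk(operation, items):
    """A 'bulk' call that is just a loop: one backend operation per item."""
    return [operation(item) for item in items]

if __name__ == "__main__":
    calls = []

    def submit(command):
        # Hypothetical synchronous submission; records each backend call.
        calls.append(command)
        return "job-%d" % len(calls)

    thread, result = emulated_async(submit, "/bin/sleep 1")
    thread.join()
    print(result["value"])

    ids = emulated_bulk(submit, ["/bin/sleep 1"] * 3)
    print(len(calls))   # the backend still sees one call per job
```

In both cases the backend receives exactly as many individual operations as before, so the number of roundtrips does not shrink; only the calling thread is freed up in the asynchronous case.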

Both optimizations cannot sensibly be implemented without cooperation from the remote side, i.e. from the shell wrapper or from the queuing system. For the shell wrapper, this is being implemented at the moment, and performance numbers will be updated once done -- but that is not the highest priority, as the performance numbers are already satisfying.
