Skip to content

Performance of saga python

Andre Merzky edited this page Feb 14, 2013 · 33 revisions

Job Execution via Shell based Adaptors:

Comparing shell adaptor to raw local shell performance (script here) results in the following numbers:

id type jobs/second Real Time (s) System Time (s) User Time (s) Utilization (%)
1
/bin/sleep 1 &
5,000.0 0.2 0.0 0.1 58.7
2
RUN sleep 1
87.7 11.4 0.2 0.8 9.2
3
RUN sleep 1 @ cyder.cct.lsu.edu/
28.1 35.5 0.1 0.2 0.7
4
js = saga.job.Service ('fork://localhost/')
j = js.create_job ({executable:'/bin/data', arguments :['1']})
57.8 17.3 8.1 0.8 51.2
5
js = saga.job.Service ('fork://localhost/')
j = js.run_job ("/bin/sleep 1")
56.8 17.6 8.2 0.8 51.3
6
js = saga.job.Service ('ssh://localhost/')
j = js.create_job ({executable:'/bin/data', arguments:['1']})
56.8 17.6 8.2 0.8 50.7
7
js = saga.job.Service ('ssh://[email protected]/')
j = js.create_job ({executable:'/bin/data', arguments:['1']})
4.3 228.1 65.9 9.0 32.8
8
js = saga.job.Service ('gsissh://trestles-login.sdsc.edu/')
j = js.create_job ({executable:'/bin/data', arguments:['1']})
4.0 247.3 68.6 9.6 31.6
  1. Plain shell is (as expected) very quick, and needs virtually no system resources. Time is spent mostly in the shell internals.
  2. The shell wrapper we use adds an order of magnitude, mostly on user time. That wrapper makes sure that job information (state, pid) are written to disk, spawns a monitoring daemon, and reports job ID etc. But most of the time is actually to wait for those state information to be consistent and to be writtent (waitpid, sync) -- thus the low system utilization.
  3. Running jobs on the shell wrapper over ssh is relatively quick, too -- as the input/output are streamed w/o blocking, there is almost no times wasted in waits.
  4. The python layer on top of the local wrapper adds a factor of ~1.5, which is quite acceptable, as that includes the complete SAGA and Python stack, and I/O capturing/parsing etc.
  5. The shortcut methods run_job() does not add significantly, despite the fact that it creates a new job description per job.
  6. Exchanging fork://localhost/ with ssh://localhost/ again adds almost nothing -- the overhead is solely due to the increased startup time.
  7. As expected, real remote operations add significant overhead -- this is owed to the fact that operations are synchronous, i.e. the adaptor waits for confirmation that the operation succeeded. This adds 1 roundtrip per operation (200ms * 1000 = 200 seconds). Locally that does not contribute, thus the adaptor and wrapper operations (4, 5, 6) are close to the nonblocking I/O versions (1, 2, 3).
  8. No significant difference between ssh and gsissh, as expected...

At this point, we reached saturation for a remote backend -- any adaptor building on top of the PTYShell and PTYProcess infrastructure will see the above limits, basically. There are three options for further scaling though: (A) concurrent job service instances, (B) asynchronous operations, and (C) bulk operations.

(A) Concurrent Job Services:

We can run test (7) again, but use an increasing number of job services (each in its own application thread), and observe the following behavior (script here):

id threads Real Time (s) jobs/second
7.01 1 225.2 4.4
7.02 2 122.7 8.1
7.03 3 80.9 12.4
7.04 4 62.9 15.9
7.05 5 52.2 19.1
7.06 6 50.5 19.8
7.07 7 40.1 24.9
7.08 8 36.2 27.6
7.09 9 38.5 25.9
7.10 10 32.9 30.4
3 1 35.5 28.2

So, this scales up to what we have seen from a plain piped ssh submit (3)! FWIW, at this point the limiting factor on cyder is the concurrent disk access -- that is for some reason not handled well, and from a certain load on that seems to require unreasonable system CPU time.

Alas, I did not know that most ssh deamons limit the number of concurrent ssh connections to 10 -- so, this is the end of scaling -- for one host. But we can use multiple hosts concurrently:

id threads hosts Real Time (s) jobs/second
7.10 10 cyder 30.5 32.8
7.20 20 cyder, repex1 16.6 60.4
7.30 30 cyder, repex1, trestles 14.4 69.3
7.40 40 cyder, repex1, trestles, india 12.7 78.7
7.50 50 cyder, repex1, trestles, india, sierra 11.6 86.5

So that scales nicely, too, over ~200ms latency connections -- but saturation is obviously in sight. At that point, the local python instance consumes all CPU, and is busy doing I/O and string parsing -- that is the limiting factor.

O(100) jobs/second seem well achievable (there are some optimizations on shell wrapper level and on PTYProcess level left to do), but it is unlikely to go much further, really. Also, an obvious drawback is that the user needs to start application threads in order to get that performance.

(B) Asynchronous Operations and (C) Bulk Operations:

Both are actually emulated on SAGA-Python engine level -- but that won't by you much performance: asynchronous ops spawn a thread for each operation -- but that is only useful for really long running, blocking operations -- job submissions is relatively quick; bulk operations are broken down to individual operations, thus, duh!

Both optimizations cannot sensibly implemented without cooperation from the remote side, i.e. from the shell wrapper, or from the queuing system. For the shell wrapper, this is implement at the moment, and performance numbers will be updated once done -- -- but that is not highest priority, as the performance numbers are satisfying already.

Asynchronous Notifications:

Asynchronous notifications are a major tool to achieve performance: instead of pulling remote entities for state updates, the saga-python process is push-notified of state updates. That is easier to code, safes resources, and scales significantly better than pull.

The measurements below show how asynchronous notifications scale for the Redis backend of the advert adaptor -- they are realized via redis pubsub channels: a saga-python process which writes an advert attribute (or performs any other state change) will automatically publish an update message on the respective channel; any other saga-python process which has callbacks registered for the respective event type will receive that notification (the callback is invoked).

The experiment below has one master process listening to n client processes which all continously update an advert attribute on a local redis instance (standard deviation is high):

id servers clients events real time (s) events/second
8.1 1 1 1,000 0.7 1,485.6
8.2 1 2 1,983 1.1 1,864.0
8.3 1 3 2,919 1.3 2,316.3
8.4 1 4 3,981 1.8 2,192.3
8.5 1 5 4,979 2.3 2,194.9
8.6 1 6 5,880 2.7 2,155.8

This plateous at that level, and declines slightly later on.

A profiling shows that the limiting factor is the thread which monitors the Redis pub/sub channel on server side. This immediately offers an optimization approach, should that be needed: instead of having one pub/sub channel for the whole advert namespace, one can distribute the event load over multiple channels (hashing over advert path for example) to achieve load sharing.

Adding more servers will not help at this point: each server will get the same notifications, and the callbacks are called in a separate server thread anyway.

The same experiments with a redis instance on repex1, clients running on repex1, and server running local:

id servers clients events real time (s) events/second
9.1 1 1 1,000 1.1 888.0
9.2 1 2 2,000 2.4 833.0
9.3 1 3 1,999 3.0 805.7

The limiting factor seems to be the pubsub communication from repex1 to local, but this still needs confirmation.

The reversed experiment (redis on repex, server on repex, clients on local) gives an abysmal performance:

id servers clients events real time (s) events/second
10.1 1 1 100 53.7 1.9
10.2 1 2 200 53.0 3.8
10.3 1 3 300 53.2 5.6
...          
10.15 1 15 1,483 63.2 23.5

The very good scaling with number of clients, and the slow performance for a single client, seems to point at too many roundtrips for write operations -- this needs more detailed profiling.

Clone this wiki locally