Performance of SAGA-Python
Comparing the shell adaptor to raw local shell performance (script here) yields the following numbers:
| id | type | jobs/second | Real Time (s) | System Time (s) | User Time (s) | Utilization (%) |
|----|------|-------------|---------------|-----------------|---------------|-----------------|
| 1  | `/bin/sleep 1 &` | 5,000.0 | 0.2 | 0.0 | 0.1 | 58.7 |
| 2  | `RUN sleep 1` | 87.7 | 11.4 | 0.2 | 0.8 | 9.2 |
| 3  | `RUN sleep 1 @ cyder.cct.lsu.edu/` | 28.1 | 35.5 | 0.1 | 0.2 | 0.7 |
| 4  | `js = saga.job.Service ('fork://localhost/')` <br> `j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 57.8 | 17.3 | 8.1 | 0.8 | 51.2 |
| 5  | `js = saga.job.Service ('fork://localhost/')` <br> `j = js.run_job ("/bin/sleep 1")` | 56.8 | 17.6 | 8.2 | 0.8 | 51.3 |
| 6  | `js = saga.job.Service ('ssh://localhost/')` <br> `j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 56.8 | 17.6 | 8.2 | 0.8 | 50.7 |
| 7  | `js = saga.job.Service ('ssh://[email protected]/')` <br> `j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 4.3 | 228.1 | 65.9 | 9.0 | 32.8 |
| 8  | `js = saga.job.Service ('gsissh://trestles-login.sdsc.edu/')` <br> `j = js.create_job ({executable:'/bin/data', arguments:['1']})` | 4.0 | 247.3 | 68.6 | 9.6 | 31.6 |
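The job-description notation in rows 4-8 is shorthand; with the actual SAGA-Python API, the submission path being timed would look roughly like the following minimal sketch (assumptions: the package imports as `saga`, and the payload is `/bin/sleep 1` as in the other rows):

```python
import saga

# case 4: local fork adaptor; cases 6-8 only swap the service URL
js = saga.job.Service('fork://localhost/')

# the {executable: ..., arguments: [...]} shorthand corresponds to a job description
jd            = saga.job.Description()
jd.executable = '/bin/sleep'
jd.arguments  = ['1']

j = js.create_job(jd)      # create the job on the service ...
j.run()                    # ... and submit it

# case 5 uses the run_job() shortcut, which builds the description internally
j2 = js.run_job('/bin/sleep 1')

js.close()
```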
- Plain shell is (as expected) very quick, and needs virtually no system resources. Time is spent mostly in the shell internals.
- The shell wrapper we use adds an order of magnitude, mostly in user time. That wrapper makes sure that job information (state, pid) is written to disk, spawns a monitoring daemon, reports the job ID, etc. Most of the time, however, is spent waiting for that state information to become consistent and to be written out (waitpid, sync) -- thus the low system utilization.
- Running jobs through the shell wrapper over ssh is relatively quick, too -- as input/output are streamed without blocking, almost no time is wasted in waits.
- The Python layer on top of the local wrapper adds a factor of ~1.5, which is quite acceptable, as that includes the complete SAGA and Python stack, I/O capturing/parsing, etc.
- The shortcut method `run_job()` does not add significant overhead, despite the fact that it creates a new job description per job.
- Exchanging `fork://localhost/` with `ssh://localhost/` again adds almost nothing -- the overhead is solely due to the increased startup time.
- As expected, real remote operations add significant overhead. This is owed to the fact that operations are synchronous, i.e. the adaptor waits for confirmation that each operation succeeded, which adds one roundtrip per operation (200 ms * 1000 jobs = 200 seconds); see the timing sketch after this list. Locally that roundtrip does not contribute, so the adaptor and wrapper cases (4, 5, 6) stay close to the non-blocking I/O cases (1, 2, 3).
- No significant difference between ssh and gsissh, as expected...
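To make the roundtrip argument concrete, the per-job cost of the synchronous path can be measured directly. A rough sketch (the host URL is a placeholder; the numbers above come from the linked benchmark script, not from this snippet):

```python
import time
import saga

N  = 100
js = saga.job.Service('ssh://remote.example.org/')   # placeholder host

t0 = time.time()
for _ in range(N):
    # each call blocks until the remote shell wrapper confirms the submission,
    # i.e. it pays (at least) one network roundtrip
    js.run_job('/bin/sleep 1')
dt = time.time() - t0
js.close()

print('%6.1f jobs/second' % (N / dt))
print('%6.1f ms per job'  % (1000.0 * dt / N))
```

At ~200 ms roundtrip latency that alone caps a single synchronous connection at about 5 jobs/second, which is what cases 7 and 8 show.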
At this point we have reached saturation for a remote backend -- any adaptor building on top of the `PTYShell` and `PTYProcess` infrastructure will basically see the above limits. There are three options for further scaling, though: (A) concurrent job service instances, (B) asynchronous operations, and (C) bulk operations.
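Option (A) needs nothing from the library beyond thread safety: each application thread owns its own `saga.job.Service` instance, so submissions in different threads overlap their roundtrips. A minimal sketch of that setup (thread count, jobs per thread and host URL are illustrative):

```python
import threading
import time
import saga

N_THREADS = 10
N_JOBS    = 100                              # jobs per thread
URL       = 'ssh://remote.example.org/'      # placeholder host

def worker():
    js = saga.job.Service(URL)               # one service (one connection) per thread
    for _ in range(N_JOBS):
        js.run_job('/bin/sleep 1')
    js.close()

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
t0 = time.time()
for t in threads: t.start()
for t in threads: t.join()
print('%.1f jobs/second' % (N_THREADS * N_JOBS / (time.time() - t0)))
```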
We can run test (7) again, but use an increasing number of job services (each in its own application thread), and observe the following behavior:
| id   | threads | Real Time (s) | jobs/second |
|------|---------|---------------|-------------|
| 3    | 1       | 35.5          | 28.2        |
| 7.01 | 1       | 225.2         | 4.4         |
| 7.02 | 2       | 122.7         | 8.1         |
| 7.03 | 3       | 80.9          | 12.4        |
| 7.04 | 4       | 62.9          | 15.9        |
| 7.05 | 5       | 52.2          | 19.1        |
| 7.06 | 6       | 50.5          | 19.8        |
| 7.07 | 7       | 40.1          | 24.9        |
| 7.08 | 8       | 36.2          | 27.6        |
| 7.09 | 9       | 38.5          | 25.9        |
| 7.10 | 10      | 32.9          | 30.4        |
So, this scales up to what we have seen from a plain piped ssh submit (3)! FWIW, at this point the limiting factor on cyder is concurrent disk access -- for some reason that is not handled well, and from a certain load onward it seems to require an unreasonable amount of system CPU time.
Alas, I did not know that most ssh daemons limit the number of concurrent ssh connections to 10 -- so this is the end of scaling for a single host. But we can use multiple hosts concurrently:
| id   | threads | hosts                                  | Real Time (s) | jobs/second |
|------|---------|----------------------------------------|---------------|-------------|
| 7.10 | 10      | cyder                                  | 30.5          | 32.8        |
| 7.20 | 20      | cyder, repex1                          | 16.6          | 60.4        |
| 7.30 | 30      | cyder, repex1, trestles                | 14.4          | 69.3        |
| 7.40 | 40      | cyder, repex1, trestles, india         | 12.7          | 78.7        |
| 7.50 | 50      | cyder, repex1, trestles, india, sierra | 11.6          | 86.5        |
So that scales nicely, too, over ~200 ms latency connections -- but saturation is obviously in sight. At that point the local Python instance consumes all CPU, busy with I/O and string parsing -- that is the limiting factor.
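The multi-host runs use the same structure as the threading sketch above, merely distributing the per-thread job services over a list of endpoints while keeping at most 10 threads per host (the ssh connection limit mentioned above). A sketch, using only the two endpoint URLs spelled out in this page; the other hosts (repex1, india, sierra) would be added analogously:

```python
import threading
import saga

def submit_n(url, n=100):
    # one ssh/gsissh connection per thread: open a service, submit n jobs, close
    js = saga.job.Service(url)
    for _ in range(n):
        js.run_job('/bin/sleep 1')
    js.close()

hosts = ['ssh://cyder.cct.lsu.edu/',
         'gsissh://trestles-login.sdsc.edu/']   # extend with further endpoints

# at most 10 threads per host, to stay below the per-host ssh connection limit
threads = [threading.Thread(target=submit_n, args=(url,))
           for url in hosts for _ in range(10)]

for t in threads: t.start()
for t in threads: t.join()
```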
O(100) jobs/second seems well achievable (some optimizations at the shell wrapper and `PTYProcess` level are still left to do), but it is unlikely to go much further than that. Also, an obvious drawback is that the user needs to start application threads in order to get that performance.
Both asynchronous operations (B) and bulk operations (C) are actually emulated at the SAGA-Python engine level -- but that won't buy you much performance: asynchronous operations spawn a thread per operation, which only helps for really long-running, blocking operations, and job submission is relatively quick; bulk operations are simply broken down into individual operations again, so nothing is gained there either.
Neither optimization can sensibly be implemented without cooperation from the remote side, i.e. from the shell wrapper or from the queuing system. For the shell wrapper, this is being implemented at the moment, and the performance numbers will be updated once that is done -- but it is not the highest priority, as the numbers are already satisfactory.