PTY Layer
- What is it?
- Why does it exist?
- What are the problems?
- What are the alternatives?
SAGA performs different kinds of interactions with remote systems. Many of those systems are only accessible via shell-like tools, such as ssh, gsissh, ftp, gsisftp, etc. 'Shell-like' means that those tools are mostly designed for interactive use: after connection setup they present a prompt, wait for commands on stdin, and respond to those commands via stdout/stderr.
The PTY layer in SAGA consists of several components which handle interaction with those tools.
- `pty_process`: a fork/exec'ed process which provides low-level process management (`is_alive`, `kill`, `wait`, ...) and process I/O (`read`, `write`, `find`).
- `pty_shell`: wraps around `pty_process` to provide higher-level methods: `set_prompt`, `find_prompt`, `run_sync`, `run_async`, and simple file staging.
- `pty_shell_factory`: a factory for `pty_shell` instances, and the only part which distinguishes between the different tools and their individual startup and authentication mechanisms. It also provides connection caching for those tools which support it (ssh master channels).
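For illustration, here is a hypothetical usage sketch: the component and method names follow the list above, but the signatures and the factory call are assumptions, not SAGA's exact API.

```python
# hypothetical sketch -- names from the component list above, signatures assumed
factory = pty_shell_factory()

# returns a pty_shell around a fork/exec'ed 'ssh user@host' process; an
# existing ssh master channel for this host would be reused
shell = factory.get_shell('ssh://user@host/')

# pty_shell methods build on the pty_process read/write/find primitives
shell.set_prompt(r'SAGA-PROMPT-\$')         # custom, easy-to-match prompt
ret, out, err = shell.run_sync('ls /tmp')   # blocks until prompt reappears
shell.run_async('tail -f /tmp/job.log')     # caller collects output itself
```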
When switching SAGA from C++ to Python, there were a number of requirements for the interaction with remote systems:
- portability: support for ssh, gsissh
- better scalability, connection reuse (the C++ version used one ssh connection per job, which was slow and limited the number of jobs)
- fast: pexpect was used in BLISS, but pexpect has a number of very conservative timeouts when detecting prompts, which means it needs several seconds for shell startup
- user friendly: we frequently hit problems where user environments from .bashrc etc. were not used, and thus jobs failed due to a missing PATH or missing module loads
- maintainable: paramiko was considered functional, but buggy and hard to maintain
A number of other (Python-native or non-native) tools and libraries exist to perform similar tasks -- popen, pexpect, paramiko, libssl, ... -- but none was considered to provide the combined feature set above.
The PTY layer does relatively complex process management in `pty_process` -- but that part is relatively stable and has seen few problems. However, if problems do happen on this layer, such as connection timeouts, then it is difficult to provide error messages which make sense on the higher (shell) level.
The switching between backend types in `pty_shell_factory` is somewhat messy code with many cases and branches, and has seen some churn due to support for different ssh versions, different system limits, etc. Despite the code complexity, this part does relatively simple things -- it basically creates a suitable command line and hands it to `pty_shell`. It might benefit from refactoring, but is not considered very problematic.
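A sketch of that 'simple thing', under stated assumptions: the OpenSSH flags (master channels via `ControlMaster`/`ControlPath`) are real, but the function, its parameters, and the paths are hypothetical.

```python
# illustrative sketch of the factory's core job (not SAGA's actual code):
# map a scheme plus authentication hints onto a tool command line
def get_command(scheme, host, user, userkey=None):
    if scheme in ('ssh', 'sftp'):
        cmd = ['/usr/bin/%s' % scheme]
        # master channels: share one TCP connection across shell instances
        cmd += ['-o', 'ControlMaster=auto',
                '-o', 'ControlPath=~/.ssh/saga_ctrl_%h_%p_%r']
        if userkey:
            cmd += ['-i', userkey]                # public key authentication
        cmd += ['%s@%s' % (user, host)]
    elif scheme in ('gsissh', 'gsisftp'):
        cmd = ['/usr/bin/%s' % scheme, host]      # auth via grid proxy
    else:
        raise ValueError('unsupported scheme: %s' % scheme)
    return cmd

print(get_command('ssh', 'remote.host.net', 'alice'))
```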
The shell spawning in `pty_shell` is problematic: the code applies various heuristics to ensure that a spawned shell is bootstrapping correctly, and to get the two-way communication channel initialized. This in particular includes the dreaded detection of the shell prompt, which is needed to separate application output from shell noise.
Prompt detection is based on regular expressions. There are two major problems with that approach: it is impossible to know when the remote shell is done putting out new characters (i.e. one cannot detect the 'end' of a stream), and it is impossible to write regexes which match only prompts.
To mitigate that problem, we perform prompt detection only once, very thoroughly, and then set a custom prompt which is easier to detect. The initial detection remains painful, though. Also, setting a custom prompt is not always possible (sftp), and it is error prone: the mechanisms differ depending on shell type, and are subject to side effects from shell customization, such as colorization, post-prompt commands, etc.
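A minimal sketch of this two-step scheme, assuming a local `/bin/sh` on a pty; SAGA's real code is considerably more involved, and the marker value is made up.

```python
import os
import pty
import re
import select

# simplified sketch (not SAGA's actual code): heuristically detect *some*
# initial prompt, then replace it with a unique, safely matchable marker
INITIAL = re.compile(rb'[$#>]\s*$')    # heuristic: common prompt endings
MARKER  = b'SAGA-PROMPT-61c8'          # made-up marker, unlikely in output

pid, fd = pty.fork()
if pid == 0:                           # child: interactive shell on a pty
    os.execv('/bin/sh', ['/bin/sh', '-i'])

buf = b''
while not INITIAL.search(buf):
    # a stream has no detectable 'end' -- we can only poll and time out
    ready, _, _ = select.select([fd], [], [], 10.0)
    if not ready:
        raise RuntimeError('no prompt detected within timeout')
    buf += os.read(fd, 1024)

# from here on we wait for MARKER instead of guessing; note that sftp
# offers no equivalent way to redefine its prompt
os.write(fd, b"PS1='" + MARKER + b"'; export PS1\n")
```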
Once a prompt is detected and a new prompt is set, `pty_shell` performs in a relatively reliable, fast and stable manner.
Post scriptum: in a number of recent cases, the shell does not remain stable for longer-running jobs. Specifically, sftp channels seem to frequently ignore keepalive settings and can time out.
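A partial mitigation is to request keepalives explicitly when the command line is built. The flags below are standard OpenSSH options; whether a given sftp channel honors them is exactly the open question.

```python
# standard OpenSSH keepalive options, passed when the sftp command line is
# built; whether the remote side honors them is a different matter
keepalive = ['-o', 'ServerAliveInterval=60',   # probe the server every 60s
             '-o', 'ServerAliveCountMax=5',    # give up after 5 missed probes
             '-o', 'TCPKeepAlive=yes']         # additionally, TCP-level probes
sftp_cmd = ['/usr/bin/sftp'] + keepalive + ['user@host']
```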
I still don't see any viable alternative as long as we want to adhere to the boundary conditions listed above. Paramiko does not support GSI; pexpect is slow and has the same prompt detection problems; libssl is difficult to code against and does not support GSI; running one command per ssh connection would make life somewhat simpler, but is slow and eats resources; other solutions require a service component which we cannot reliably and generally deploy.
The question thus is: can we drop one or more of the requirements -- and how would that help?
Accepting lower performance would point us to pexpect -- it otherwise provides the same functionality (including support for GSI). But pexpect is prone to the same prompt detection problems -- and as those are the crux, we would not gain much.
Dropping the requirement for GSI would open the path to a wide range of tools, including libssl and wrappers around it, which would be very clean and fast. But GSI does seem to be critical for our target DCIs, so I am not sure this would be viable.
When running exactly one command per ssh connection, the prompt detection mechanism becomes secondary -- the MOTD, prompt, etc. could 'simply' be declared part of the application output. That might not be very user friendly, but I expect it only really screws up some fringe use cases which can be handled as needed.
We have, however, seen problems in the past with the exhaustion of ssh channels -- the total number of shared and non-shared channels to any one target host is limited by the system. Also, connection setup is the single most important factor dominating the performance of the whole chain, so we would suffer a significant slowdown (add ~500ms per interaction).
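For comparison, this alternative is trivially small -- a sketch with placeholder host and command:

```python
import subprocess

# one command per connection: no prompt handling at all.  '-tt' forces tty
# allocation, so MOTD, banners and other shell noise simply become part of
# the captured output -- the tradeoff discussed above.  Note that every
# call pays the full connection setup cost.
result = subprocess.run(['ssh', '-tt', 'user@host', 'ls /tmp'],
                        capture_output=True, timeout=60)
print(result.stdout.decode())   # application output, plus potential noise
```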
When not requiring the remote shell (and the remote application) to find the user environment as set in .bashrc etc., shell startup becomes much easier -- we can always run /bin/sh as a non-interactive, non-login shell with a custom prompt, and voila! We have had, however, several users complain in the past that jobs which ran seamlessly from the command line on the remote host did not run via SAGA, for exactly this reason.
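For contrast, a sketch of that simpler bootstrap, assuming an sh variant which honors `PS1` from the environment (e.g. dash; bash may reset it via rc files):

```python
import os
import pty

# simpler bootstrap if the user environment is not required: the prompt is
# known upfront, so no prompt detection is needed at all
env = dict(os.environ, PS1='SAGA-PROMPT-61c8> ')
pid, fd = pty.fork()
if pid == 0:
    # plain /bin/sh, non-login: .bashrc/.profile are never sourced, so
    # PATH and modules from the user's setup are missing -- the exact
    # problem users complained about
    os.execve('/bin/sh', ['/bin/sh'], env)
```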