-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added design document for remote and multi-machine launching #297
base: gh-pages
Are you sure you want to change the base?
Conversation
# Multi-Machine and Remote Launching | ||
## Goals | ||
- Connect to a remote host and run nodes on it | ||
- Ensure this capability does not compromise security |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're currently working on launch integration for security. In order to make sure it scales to remote launching, we might want to consider ways to send security files over to the remote machine.
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros2-tooling-wg-next-meeting/12545/36 |
|
||
### Nodes | ||
#### launch\_ros.LaunchServiceNode | ||
This node runs on each machine in a system is is repsonsible for launching and managing remote processes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick:
This node runs on each machine in a system is is repsonsible for launching and managing remote processes. | |
This node runs on each machine in a system is is responsible for launching and managing remote processes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This node runs on each machine in a system is it is
Together these components will provide the means to execute processes remotely and should cover most basic use cases. | ||
|
||
The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines. | ||
This component won't be necessary for basic execution of remote nodes, but would allow remote system to persist beyond the life of the initial LaunchService. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not clear on this use case, could you elaborate a little bit? especially on the difference from LifecycleNode
except process statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design is intended to allow you to run a node pretty much like they currently would run, that is, bound to the lifespan of the launch process itself. When the launch process dies, the SSH connection dies, and anything still running on that SSH terminal dies with it.
The LaunchServiceNode
is intended to work a little bit differently. It would be launched from the SSH terminal, but then detach and run without requiring the SSH connection to remain up. However, by doing that you lose the ability to directly kill a process if you actually want it to stop. Therefore, the LaunchServiceNode
would provide a way to go back and ask for particular nodes on that server to be started/stopped ad hoc.
This is certainly somewhat similar to some of the functionality envisioned by LifecycleNode
s, but would allow a way for us to remotely launch and manage nodes which were not natively designed to support the lifecycle feature. Lifecycle would provide nearly all the functionality we would need for this type of behavior, but should the node crash, Lifecycle won't let us bring it back for a full restart, and we also would be lacking other out-of-band control to deal with things such as unresponsive nodes.
(Obviously, these same arguments also apply to the LaunchServiceNode
itself, so that particular entity should be as clean as possible to reduce the possibility of it getting into an unmanageable state itself.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question to explicitly state in the doc: Who starts the LaunchServiceNode? How is its lifecycle expected to be managed?
| passphrase | String | Optional passphrase if the key uses one. | | ||
|
||
#### launch.actions.ExecuteRemote | ||
Base class for actions that use connection informationn contained in `machine` to start a process on a remote machine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: informationn
Remote machines will be described by a Machine class which will also be a base class for protocol-specific implementations. | ||
Together these components will provide the means to execute processes remotely and should cover most basic use cases. | ||
|
||
The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LaunchServiceNode could be designed as an independent component - that is possible to use purely locally to detach a launch command from an ongoing launch context - it could then be used by the remote launching functionality.
#### Params | ||
| Name | Type | Description | | ||
|---|---|---| | ||
| allowable\_processes | List[Executable] | List of executables the node is allowed to launch | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicit flag to allow remote launching any process/node? Early in development this could be super useful for testing out various nodes via remote context from developer machine to robot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or - is it not worth having this allow-list in here at all. Maybe SROS2 takes care of permissions well enough that we don't need to reimplement it. "If you use this you are exposing a vulnerability if you don't use SROS"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, only allow the LaunchServiceNode to only start known ros nodes on the remote system, rather than arbitrary commands. Restricting to ROS-specific in the remote context, since the local launch can execute arbitrary SSH commands anyways
Some high level comments: while some of the aspects of the document are specific to remote execution:
several other parts are applicable to a local launch invocation too:
It might be good to separate these to make each invidual part easier to design, implement and review. |
Noting that we should call out explicitly the expectation that the LaunchServiceNode must be accessible within the current discoverable DDS network. The machine may be WAN-accessible via SSH but that case won't work because we use topics/services for all controls. |
- shutdown | ||
|
||
### Messages | ||
#### Heartbeat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all LaunchServiceNodes use a unique topic, or do all publish on the same topic (in which case would need to put machine/service ID in the messages)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be configurable, "heartbeat_topic", "process_status_topic"
| processor\_load | double | Percent of processor load coming from this process | | ||
| err\_msg | String | Error string | | ||
|
||
### Services |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that each LaunchServiceNode will need to namespace itself to make service calls uniquely-fulfillable
When talking about multi-machine message passing, how does time sync come into play? "Out of Scope" section in design is useful. |
How do we support Mac + Windows? Do we think the SSH process will work out of the box for those platforms? |
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1 |
Hi everyone! For me as a ros2 developer, I really do miss that feature. Any timelines on this? I see that it's only a design doc and it's hard to say, but maybe it was included in some roadmap or so |
So far there is not a roadmap that I know of. We discussed the design in one of the Tooling Working Group meetings, which is where my above comments came from, but I think the next step is just an updated split design - we identified that this is really 2 separate features that can be independently designed and implemented. @mlanting should be able to clarify if any proof-of-concept code exists already, I can't remember. |
@AlexKaravaev, @pjreed made a simpler, temporary implementation of multi-machine launching that you can find here: https://github.com/pjreed/launch I will try to get the next version of the design document out in the next week or so. @roger-strain is currently working on a refactor of the launch system. Once that is complete and this design is approved we can being implementation. I expect it will be a few months before we have anything implemented, but pj's implementation should cover basic use cases. |
Just FYI, my changes in the |
@mlanting et al.: this could have been discussed in the previous PR, so if it has, please just RTFM me, but has there been any thought about using existing orchestration frameworks for deployment, configuration and orchestration of multi-machine ROS applications? I'm not immediately thinking of something like k8s, but perhaps something similar might exist, which is a little less 'heavy' but supports like operations ("infrastructure as code" and all that). Ian Sherman in his keynote at ROSCon18 posited that "backend distributed systems and IoT" (as he phrased it) might have solutions to challenges we still see as problems. Deploying, starting and coordinating a multi-machine distributed application seems like it could be something they might have solutions or best practices for. It does seem like they might have similar requirements (ie: liveness tracking, monitoring, pushing updates, distributed configuration, failover, etc). |
@AlexKaravaev: I haven't polished up my SSH-based launcher mostly because it was intended to be a stopgap solution that just implemented what we needed until a more robust solution is available, and I feel like some of the limitations (most significantly, the executable paths to nodes being resolved on the local computer before the command is sent to the remote computer) make it inappropriate for merging into the official ROS2 ecosystem at this time. Still, hopefully it works for you; let me know if you have any issues, and if it doesn't take too much work to make it more robust, it might make sense to get it into ROS2 proper until we've got a better system. @gavanderhoorn: When I made my |
@pjreed wrote:
the abstraction is nice, but wouldn't that still mean that My assumption is that it's going to take "the ROS community" quite some time to get to the level of functionality which these existing solutions have already reached. Besides that, it would also seem to be duplication of effort, which is never very nice. |
- Be able to reconnect to the system to monitor and manage nodes running remotely | ||
|
||
|
||
To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the re-scope of this design - it seems like this "Goals" section shouldn't be talking about the LaunchServiceNode
since it's not specified within this design. If I remember correctly, we established that the LaunchServiceNode
was independently interesting and designable. I do see that we want to talk about how these two things will interact, but I'm not sure if that belongs here in this doc anymore.
|
||
To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode. | ||
|
||
The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines. | |
The ExecuteRemote action will provide the basic capability to start processes on remote machines. |
To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode. | ||
|
||
The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines. | ||
It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Here" is never a very informative link title :)
It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information. | |
It will build off of the refactor that is described in [Refactoring ExecuteProcess into Execute and Executable](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information. |
The machine class will also be a base class for protocol-specific implementations. | ||
Together these components will provide the means to execute processes remotely and should cover most basic use cases. | ||
|
||
The [LaunchServiceNode(s)](154_roslaunch_remote_launch_service_node.md) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be easier to evaluate this design tie-in if the linked document was available?
|---|---|---| | ||
| hostname | String | Hostname of the machine. | | ||
| port | int | Port on the host to connect to (default 22) | | ||
| ssh\_keys | String | Path to ssh key file | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be singular?
| ssh\_keys | String | Path to ssh key file | | |
| ssh\_key | String | Path to ssh key file | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
| prefix | List[Substitution] | A set of commands/arguments to preceed the `cmd`, used for things like `gdb`/`valgrind` and defaults to the `LaunchConfiguration` called `launch-prefix` | | ||
| output | ?? | Configuration for process output logging. Defaults to `log` i.e. log both `stdout` and `stderr` to launch main log file and stderr to the screen. | | ||
| output\_format | ?? | For logging each output line, supporting `str.format()` substitutions with the following keys in scope: `line` to reference the raw output line and `this` to reference this action instance. | | ||
| log\_cmd | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe should have a note that this will avoid printing any key passphrases or other sensitive information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good point, I will include a note about that.
| log\_cmd | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. | | ||
| on\_exit | List[LaunchDescriptionEntity] | List of actions to execute upon process exit.| | ||
| persistent\_connection | boolean | Whether the LaunchService should maintain a persistent connection to the ssh instance | | ||
| machine | `launch.descriptions.SSHMachine` | The machine on which to execute the process | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want this to take the abstract interface, right?
| machine | `launch.descriptions.SSHMachine` | The machine on which to execute the process | | |
| machine | `launch.descriptions.Machine` | The machine on which to execute the process | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, thanks
I think overall this looks straightforward, but it still has remnants of the process management node, which shouldn't be necessary for the simple version of this design, right? |
The LaunchServiceNode and longer-term goals that motivated it are beyond the scope of this part of the design, so references to them have been removed. Fixed some typos. Added a note to avoid exposing sensitive information when logging commands. Distro A; OPSEC #2893 Signed-off-by: matthew.lanting <[email protected]>
Yeah, I think you're right. I've removed those references and fixed the errors you caught. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me now - though I'm not one with merge access here :)
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros2-distributed-multi-pc-ros-launch-node-management-tool/39889/1 |
The document has been changed so much and it's been so long since the last time I submitted a PR for multi-machine launch changes I figured it'd be more appropriate just to create a new one.