Added design document for remote and multi-machine launching #297

mlanting · 2020-08-20T17:28:38Z

The document has changed so much, and it has been so long since I last submitted a PR for the multi-machine launch changes, that I figured it would be more appropriate to just create a new one.

kyrofa · 2020-08-24T19:44:47Z

articles/153_roslaunch_remote_launch.md

+# Multi-Machine and Remote Launching
+## Goals
+- Connect to a remote host and run nodes on it
+    - Ensure this capability does not compromise security


We're currently working on launch integration for security. In order to make sure it scales to remote launching, we might want to consider ways to send security files over to the remote machine.

ros-discourse · 2020-08-28T19:05:25Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-tooling-wg-next-meeting/12545/36

fujitatomoya · 2020-08-30T13:08:14Z

articles/153_roslaunch_remote_launch.md

+
+### Nodes
+#### launch\_ros.LaunchServiceNode
+This node runs on each machine in a system is is repsonsible for launching and managing remote processes.


nitpick:

Suggested change

This node runs on each machine in a system is is repsonsible for launching and managing remote processes.

This node runs on each machine in a system is is responsible for launching and managing remote processes.

nit: This node runs on each machine in a system is it is

fujitatomoya · 2020-08-30T13:47:15Z

articles/153_roslaunch_remote_launch.md

+Together these components will provide the means to execute processes remotely and should cover most basic use cases.
+
+The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.
+This component won't be necessary for basic execution of remote nodes, but would allow remote system to persist beyond the life of the initial LaunchService.


I am not clear on this use case. Could you elaborate a little, especially on how it differs from LifecycleNode apart from process statistics?

The design is intended to let you run nodes pretty much the way they currently run, that is, bound to the lifespan of the launch process itself. When the launch process dies, the SSH connection dies, and anything still running on that SSH terminal dies with it.

The LaunchServiceNode is intended to work a little bit differently. It would be launched from the SSH terminal, but then detach and run without requiring the SSH connection to remain up. However, by doing that you lose the ability to directly kill a process if you actually want it to stop. Therefore, the LaunchServiceNode would provide a way to go back and ask for particular nodes on that server to be started/stopped ad hoc.

This is certainly somewhat similar to some of the functionality envisioned by LifecycleNodes, but would allow a way for us to remotely launch and manage nodes which were not natively designed to support the lifecycle feature. Lifecycle would provide nearly all the functionality we would need for this type of behavior, but should the node crash, Lifecycle won't let us bring it back for a full restart, and we also would be lacking other out-of-band control to deal with things such as unresponsive nodes.

(Obviously, these same arguments also apply to the LaunchServiceNode itself, so that particular entity should be as clean as possible to reduce the possibility of it getting into an unmanageable state itself.)

Question to explicitly state in the doc: Who starts the LaunchServiceNode? How is its lifecycle expected to be managed?
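To make the attached/detached distinction above concrete, here is a minimal Python sketch. It is not taken from the design: it assumes a POSIX host, uses only the standard library, and `start_detached` is a hypothetical helper name. Starting the child in its own session is one common way to keep it from being tied to the terminal (for example an SSH session) that launched it.

```python
import subprocess
import sys


def start_detached(cmd):
    """Start cmd in a new session so it is not tied to the caller's terminal.

    Hypothetical helper for illustration only; POSIX only.
    """
    return subprocess.Popen(
        cmd,
        # Do not inherit the terminal's standard streams.
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        # The child calls setsid() before exec, becoming a session leader.
        start_new_session=True,
    )
```

The trade-off is the one described above: once the process is detached, nothing kills it when the connection drops, so something else has to offer a way to stop it later.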

jaisontj · 2020-09-04T16:10:51Z

articles/153_roslaunch_remote_launch.md

+| passphrase | String | Optional passphrase if the key uses one. |
+
+#### launch.actions.ExecuteRemote
+Base class for actions that use connection informationn contained in `machine` to start a process on a remote machine.


typo: informationn

emersonknapp · 2020-09-04T16:18:57Z

articles/153_roslaunch_remote_launch.md

+Remote machines will be described by a Machine class which will also be a base class for protocol-specific implementations.
+Together these components will provide the means to execute processes remotely and should cover most basic use cases.
+
+The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.


This LaunchServiceNode could be designed as an independent component that can be used purely locally to detach a launch command from an ongoing launch context. It could then be used by the remote launching functionality.

emersonknapp · 2020-09-04T16:25:21Z

articles/153_roslaunch_remote_launch.md

+#### Params
+| Name | Type | Description |
+|---|---|---|
+| allowable\_processes | List[Executable] | List of executables the node is allowed to launch |


Should there be an explicit flag to allow remotely launching any process/node? Early in development this could be super useful for testing out various nodes via a remote context from a developer machine to a robot.

Or is it not worth having this allow-list in here at all? Maybe SROS2 takes care of permissions well enough that we don't need to reimplement it. "If you use this you are exposing a vulnerability if you don't use SROS"

Perhaps allow the LaunchServiceNode to start only known ROS nodes on the remote system, rather than arbitrary commands. That restricts the remote context to ROS-specific processes, since the local launch can execute arbitrary SSH commands anyway.
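The allow-list and the "allow anything" flag discussed above could be checked along these lines. This is only a sketch under stated assumptions: the design lists `allowable_processes` as `List[Executable]`, while this example uses plain strings, and both `is_launch_allowed` and the `allow_any` flag are hypothetical names.

```python
def is_launch_allowed(cmd, allowable_processes, allow_any=False):
    """Return True if the executable named in cmd may be started.

    cmd: argv-style list, e.g. ["talker", "--ros-args"].
    allowable_processes: iterable of permitted executable names (strings here).
    allow_any: hypothetical development flag that disables the check.
    """
    if allow_any:
        return True
    if not cmd:
        return False
    # Exact match on the executable; a real implementation would also have
    # to decide how paths and symlinks are resolved.
    return cmd[0] in set(allowable_processes)
```

For example, `is_launch_allowed(["talker"], ["talker", "listener"])` is true, while an arbitrary command such as `["rm", "-rf", "/"]` is rejected unless `allow_any` is set.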

dirk-thomas · 2020-09-04T16:37:30Z

Some high-level comments: while some aspects of the document are specific to remote execution:

- ExecuteRemote
- RemoteSubstitution

several other parts are applicable to a local launch invocation too:

- detaching the main process and keeping the launched processes running, as well as continuing to handle events
- Heartbeat
- ProcessStatus
- all the services like QueryStatus, Start/StopProcess, Shutdown

It might be good to separate these to make each individual part easier to design, implement and review.

emersonknapp · 2020-09-04T16:39:00Z

Noting that we should call out explicitly the expectation that the LaunchServiceNode must be accessible within the current discoverable DDS network. The machine may be WAN-accessible via SSH but that case won't work because we use topics/services for all controls.

emersonknapp · 2020-09-04T16:41:35Z

articles/153_roslaunch_remote_launch.md

+- shutdown
+
+### Messages
+#### Heartbeat


Do all LaunchServiceNodes use a unique topic, or do they all publish on the same topic (in which case we would need to put a machine/service ID in the messages)?

Maybe this should be configurable, "heartbeat_topic", "process_status_topic"
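If every node publishes on one shared topic, each message needs to carry the sender's identity so subscribers can tell the nodes apart. A minimal plain-Python sketch of that option follows; the `Heartbeat` fields `machine_id` and `stamp` and the helper `latest_by_machine` are assumptions for illustration, not part of the proposed message definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat payload; field names are assumptions."""

    machine_id: str  # identifies the sending LaunchServiceNode
    stamp: float     # send time in seconds


def latest_by_machine(heartbeats):
    """Keep the most recent heartbeat seen from each machine_id."""
    latest = {}
    for hb in heartbeats:
        seen = latest.get(hb.machine_id)
        if seen is None or hb.stamp > seen.stamp:
            latest[hb.machine_id] = hb
    return latest
```

With per-node topics instead, the ID field becomes redundant, but subscribers must know or discover every topic name.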

emersonknapp · 2020-09-04T16:45:01Z

articles/153_roslaunch_remote_launch.md

+| processor\_load | double | Percent of processor load coming from this process |
+| err\_msg        | String | Error string |
+
+### Services


Noting that each LaunchServiceNode will need to namespace itself to make service calls uniquely-fulfillable
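One way to picture the namespacing is a per-machine prefix on every service name. The sketch below is plain Python and makes several assumptions: the `launch_service` prefix, the use of a machine ID as the namespace, and the snake_case service names are all hypothetical.

```python
def namespaced(namespace, name):
    """Join a namespace and a relative name into one absolute name."""
    parts = [p for p in (namespace.strip("/"), name.strip("/")) if p]
    return "/" + "/".join(parts)


def control_names(machine_id):
    """Service names one LaunchServiceNode might expose (names assumed)."""
    ns = namespaced("launch_service", machine_id)
    services = ("query_status", "start_process", "stop_process", "shutdown")
    return {service: namespaced(ns, service) for service in services}
```

Under these assumptions, a node on `robot1` would offer `/launch_service/robot1/start_process`, which cannot collide with the same service on `robot2`.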

emersonknapp · 2020-09-04T16:51:33Z

When talking about multi-machine message passing, how does time sync come into play?

"Out of Scope" section in design is useful.

emersonknapp · 2020-09-04T16:54:39Z

How do we support Mac + Windows? Do we think the SSH process will work out of the box for those platforms?

ros-discourse · 2020-09-18T22:18:23Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1

AlexKaravaev · 2020-09-28T16:39:37Z

Hi everyone! As a ROS 2 developer, I really miss this feature. Is there a timeline for it? I see that it's only a design doc, so it's hard to say, but maybe it has been included in a roadmap?

emersonknapp · 2020-10-01T16:06:08Z

So far there is not a roadmap that I know of. We discussed the design in one of the Tooling Working Group meetings, which is where my above comments came from, but I think the next step is just an updated split design - we identified that this is really 2 separate features that can be independently designed and implemented.

@mlanting should be able to clarify if any proof-of-concept code exists already, I can't remember.

mlanting · 2020-10-06T17:35:56Z

@AlexKaravaev, @pjreed made a simpler, temporary implementation of multi-machine launching that you can find here: https://github.com/pjreed/launch

I will try to get the next version of the design document out in the next week or so. @roger-strain is currently working on a refactor of the launch system. Once that is complete and this design is approved, we can begin implementation. I expect it will be a few months before we have anything implemented, but pj's implementation should cover basic use cases.

pjreed · 2020-10-06T18:16:26Z

Just FYI, my changes in the launch repository (on the multi-machine-launch branch) basically just add in some changes necessary to abstract out the launching mechanism; I have another repository at https://github.com/pjreed/ssh_machine which adds an SSH-based launcher that can be used to launch nodes on remote machines via SSH. The functionality is similar to how ROS1's remote launching mechanism works, but there are a few limitations to it; there's some more documentation and an example in that ssh_machine repo.

AlexKaravaev · 2020-10-06T19:28:57Z

@mlanting @pjreed wow, thanks !
I will definitely check it out, because I was thinking of implementing a similar SSH-based launcher. I understand the limitations, and there is certainly a need for native multi-machine launch support in ROS 2, but this repo can still solve some current problems.

gavanderhoorn · 2020-10-06T19:43:48Z

@mlanting et al.: this could have been discussed in the previous PR, so if it has, please just RTFM me, but has there been any thought about using existing orchestration frameworks for deployment, configuration and orchestration of multi-machine ROS applications?

I'm not immediately thinking of something like k8s, but perhaps something similar might exist, which is a little less 'heavy' but supports like operations ("infrastructure as code" and all that).

Ian Sherman in his keynote at ROSCon18 posited that "backend distributed systems and IoT" (as he phrased it) might have solutions to challenges we still see as problems. Deploying, starting and coordinating a multi-machine distributed application seems like it could be something they might have solutions or best practices for. It does seem like they might have similar requirements (ie: liveness tracking, monitoring, pushing updates, distributed configuration, failover, etc).

pjreed · 2020-10-07T16:34:50Z

@AlexKaravaev: I haven't polished up my SSH-based launcher mostly because it was intended to be a stopgap solution that just implemented what we needed until a more robust solution is available, and I feel like some of the limitations (most significantly, the executable paths to nodes being resolved on the local computer before the command is sent to the remote computer) make it inappropriate for merging into the official ROS2 ecosystem at this time. Still, hopefully it works for you; let me know if you have any issues, and if it doesn't take too much work to make it more robust, it might make sense to get it into ROS2 proper until we've got a better system.

@gavanderhoorn: When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH.
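The idea of a launch mechanism that is not tied to one transport can be sketched as a small class hierarchy. This is not the code in the `launch` or `ssh_machine` repositories: the class bodies, the `wrap` method, and `LocalMachine` are hypothetical, and only the field names `hostname`, `port`, and `ssh_key` echo the parameter table under review.

```python
import shlex
from abc import ABC, abstractmethod


class Machine(ABC):
    """Hypothetical base class: decides how a command gets started."""

    @abstractmethod
    def wrap(self, cmd):
        """Return the local argv that runs cmd on this machine."""


class LocalMachine(Machine):
    def wrap(self, cmd):
        return list(cmd)


class SSHMachine(Machine):
    def __init__(self, hostname, port=22, ssh_key=None):
        self.hostname = hostname
        self.port = port
        self.ssh_key = ssh_key

    def wrap(self, cmd):
        argv = ["ssh", "-p", str(self.port)]
        if self.ssh_key:
            argv += ["-i", self.ssh_key]
        # The remote side runs the command through a shell, so quote it once.
        argv += [self.hostname, shlex.join(cmd)]
        return argv
```

A `K8sMachine` would be one more subclass with its own `wrap`, which is also why the concern raised in this thread still applies: launch remains the orchestrator and only delegates the transport.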

gavanderhoorn · 2020-10-07T18:40:10Z

@pjreed wrote:

When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH.

the abstraction is nice, but wouldn't that still mean that launch is doing the orchestration? It would be delegating some parts of it to k8s, but that would be it.

My assumption is that it's going to take "the ROS community" quite some time to get to the level of functionality which these existing solutions have already reached. Besides that, it would also seem to be duplication of effort, which is never very nice.


emersonknapp · 2021-01-12T23:01:31Z

articles/153_roslaunch_remote_launch.md

+    - Be able to reconnect to the system to monitor and manage nodes running remotely
+
+
+To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.


Based on the re-scope of this design - it seems like this "Goals" section shouldn't be talking about the LaunchServiceNode since it's not specified within this design. If I remember correctly, we established that the LaunchServiceNode was independently interesting and designable. I do see that we want to talk about how these two things will interact, but I'm not sure if that belongs here in this doc anymore.

emersonknapp · 2021-01-12T23:01:49Z

articles/153_roslaunch_remote_launch.md

+
+To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.
+
+The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.


Suggested change

The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.

The ExecuteRemote action will provide the basic capability to start processes on remote machines.

emersonknapp · 2021-01-12T23:02:57Z

articles/153_roslaunch_remote_launch.md

+To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.
+
+The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.
+It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.


"Here" is never a very informative link title :)

Suggested change

It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.

It will build off of the refactor that is described in [Refactoring ExecuteProcess into Execute and Executable](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.

emersonknapp · 2021-01-12T23:05:28Z

articles/153_roslaunch_remote_launch.md

+The machine class will also be a base class for protocol-specific implementations.
+Together these components will provide the means to execute processes remotely and should cover most basic use cases.
+
+The [LaunchServiceNode(s)](154_roslaunch_remote_launch_service_node.md) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.


It may be easier to evaluate this design tie-in if the linked document was available?

emersonknapp · 2021-01-12T23:07:12Z

articles/153_roslaunch_remote_launch.md

+|---|---|---|
+| hostname   | String | Hostname of the machine. |
+| port       | int | Port on the host to connect to (default 22) |
+| ssh\_keys  | String | Path to ssh key file |


emersonknapp · 2021-01-12T23:09:21Z

articles/153_roslaunch_remote_launch.md

+| prefix               | List[Substitution] | A set of commands/arguments to preceed the `cmd`, used for things like `gdb`/`valgrind` and defaults to the `LaunchConfiguration` called `launch-prefix` |
+| output               | ?? | Configuration for process output logging. Defaults to `log` i.e. log both `stdout` and `stderr` to launch main log file and stderr to the screen. |
+| output\_format       | ?? | For logging each output line, supporting `str.format()` substitutions with the following keys in scope: `line` to reference the raw output line and `this` to reference this action instance. |
+| log\_cmd             | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. |


maybe should have a note that this will avoid printing any key passphrases or other sensitive information

Oh, good point, I will include a note about that.
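A sketch of what such masking might look like when `log_cmd` is enabled. The function name `redact_cmd`, the mask string, and the example command are all hypothetical; the design only states that sensitive values should not be printed.

```python
def redact_cmd(cmd, secrets, mask="****"):
    """Return a printable copy of cmd with any secret values masked.

    cmd: argv-style list of strings.
    secrets: values (such as a key passphrase) that must never be logged.
    """
    secrets = [s for s in secrets if s]  # ignore empty strings
    redacted = []
    for arg in cmd:
        for secret in secrets:
            arg = arg.replace(secret, mask)
        redacted.append(arg)
    return " ".join(redacted)
```

For instance, `redact_cmd(["connect", "--passphrase=hunter2", "robot1"], ["hunter2"])` yields `connect --passphrase=**** robot1`. Substring replacement is deliberately blunt: it may over-mask, but it does not depend on knowing which argument holds the secret.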

emersonknapp · 2021-01-12T23:10:10Z

articles/153_roslaunch_remote_launch.md

+| log\_cmd             | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. |
+| on\_exit             | List[LaunchDescriptionEntity] | List of actions to execute upon process exit.|
+| persistent\_connection | boolean | Whether the LaunchService should maintain a persistent connection to the ssh instance |
+| machine              | `launch.descriptions.SSHMachine` | The machine on which to execute the process |


You are correct, thanks

emersonknapp · 2021-01-12T23:12:12Z

I think overall this looks straightforward, but it still has remnants of the process management node, which shouldn't be necessary for the simple version of this design, right?

The LaunchServiceNode and longer-term goals that motivated it are beyond the scope of this part of the design, so references to them have been removed.
Fixed some typos.
Added a note to avoid exposing sensitive information when logging commands.

Distro A; OPSEC #2893
Signed-off-by: matthew.lanting <[email protected]>

mlanting · 2021-02-03T19:55:26Z

Yeah, I think you're right. I've removed those references and fixed the errors you caught.

emersonknapp

This looks good to me now - though I'm not one with merge access here :)

ros-discourse · 2024-10-02T23:32:55Z

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-distributed-multi-pc-ros-launch-node-management-tool/39889/1

Adding inital draft to version control.

c6a790b

mlanting mentioned this pull request Aug 20, 2020

Multi-Machine Launching #255

Closed

wjwwood self-requested a review August 20, 2020 20:21

kyrofa reviewed Aug 24, 2020


fujitatomoya reviewed Aug 30, 2020


jaisontj reviewed Sep 4, 2020


emersonknapp reviewed Sep 4, 2020


clalancette added the backlog label Oct 26, 2020

Separated LaunchServiceNode design into a separate document, re-worke…

f27101a

…d goals section.

emersonknapp suggested changes Jan 12, 2021


emersonknapp approved these changes Feb 11, 2021


hidmic mentioned this pull request Mar 24, 2021

[launch] Multi-machine Launching ros2/launch#79

Open

vmayoral mentioned this pull request Feb 14, 2022

Architecture discussion BerkeleyAutomation/FogROS2#6

Closed


| ssh\_keys | String | Path to ssh key file |
| ssh\_key | String | Path to ssh key file |

| machine | `launch.descriptions.SSHMachine` | The machine on which to execute the process |
| machine | `launch.descriptions.Machine` | The machine on which to execute the process |
