Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added design document for remote and multi-machine launching #297

Open
wants to merge 3 commits into
base: gh-pages
Choose a base branch
from

Conversation

mlanting
Copy link

The document has been changed so much and it's been so long since the last time I submitted a PR for multi-machine launch changes I figured it'd be more appropriate just to create a new one.

@mlanting mlanting mentioned this pull request Aug 20, 2020
@wjwwood wjwwood self-requested a review August 20, 2020 20:21
# Multi-Machine and Remote Launching
## Goals
- Connect to a remote host and run nodes on it
- Ensure this capability does not compromise security
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're currently working on launch integration for security. In order to make sure it scales to remote launching, we might want to consider ways to send security files over to the remote machine.

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-tooling-wg-next-meeting/12545/36


### Nodes
#### launch\_ros.LaunchServiceNode
This node runs on each machine in a system is is repsonsible for launching and managing remote processes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick:

Suggested change
This node runs on each machine in a system is is repsonsible for launching and managing remote processes.
This node runs on each machine in a system is is responsible for launching and managing remote processes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This node runs on each machine in a system is it is

Together these components will provide the means to execute processes remotely and should cover most basic use cases.

The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.
This component won't be necessary for basic execution of remote nodes, but would allow remote system to persist beyond the life of the initial LaunchService.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear on this use case, could you elaborate a little bit? especially on the difference from LifecycleNode except process statistics?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is intended to allow you to run a node pretty much like they currently would run, that is, bound to the lifespan of the launch process itself. When the launch process dies, the SSH connection dies, and anything still running on that SSH terminal dies with it.

The LaunchServiceNode is intended to work a little bit differently. It would be launched from the SSH terminal, but then detach and run without requiring the SSH connection to remain up. However, by doing that you lose the ability to directly kill a process if you actually want it to stop. Therefore, the LaunchServiceNode would provide a way to go back and ask for particular nodes on that server to be started/stopped ad hoc.

This is certainly somewhat similar to some of the functionality envisioned by LifecycleNodes, but would allow a way for us to remotely launch and manage nodes which were not natively designed to support the lifecycle feature. Lifecycle would provide nearly all the functionality we would need for this type of behavior, but should the node crash, Lifecycle won't let us bring it back for a full restart, and we also would be lacking other out-of-band control to deal with things such as unresponsive nodes.

(Obviously, these same arguments also apply to the LaunchServiceNode itself, so that particular entity should be as clean as possible to reduce the possibility of it getting into an unmanageable state itself.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question to explicitly state in the doc: Who starts the LaunchServiceNode? How is its lifecycle expected to be managed?

| passphrase | String | Optional passphrase if the key uses one. |

#### launch.actions.ExecuteRemote
Base class for actions that use connection informationn contained in `machine` to start a process on a remote machine.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: informationn

Remote machines will be described by a Machine class which will also be a base class for protocol-specific implementations.
Together these components will provide the means to execute processes remotely and should cover most basic use cases.

The LaunchServiceNode(s) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.
Copy link
Contributor

@emersonknapp emersonknapp Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LaunchServiceNode could be designed as an independent component - that is possible to use purely locally to detach a launch command from an ongoing launch context - it could then be used by the remote launching functionality.

#### Params
| Name | Type | Description |
|---|---|---|
| allowable\_processes | List[Executable] | List of executables the node is allowed to launch |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicit flag to allow remote launching any process/node? Early in development this could be super useful for testing out various nodes via remote context from developer machine to robot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or - is it not worth having this allow-list in here at all. Maybe SROS2 takes care of permissions well enough that we don't need to reimplement it. "If you use this you are exposing a vulnerability if you don't use SROS"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, only allow the LaunchServiceNode to only start known ros nodes on the remote system, rather than arbitrary commands. Restricting to ROS-specific in the remote context, since the local launch can execute arbitrary SSH commands anyways

@dirk-thomas
Copy link
Member

Some high level comments: while some of the aspects of the document are specific to remote execution:

  • ExecuteRemote
  • RemoteSubstitution

several other parts are applicable to a local launch invocation too:

  • detach the main process and keep the launched processes running as well as continue handling event
  • `Heartbeat
  • ProcessStatus
  • all the services like QueryStatus, Start/StopProcess, Shutdown.

It might be good to separate these to make each invidual part easier to design, implement and review.

@emersonknapp
Copy link
Contributor

Noting that we should call out explicitly the expectation that the LaunchServiceNode must be accessible within the current discoverable DDS network. The machine may be WAN-accessible via SSH but that case won't work because we use topics/services for all controls.

- shutdown

### Messages
#### Heartbeat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all LaunchServiceNodes use a unique topic, or do all publish on the same topic (in which case would need to put machine/service ID in the messages)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be configurable, "heartbeat_topic", "process_status_topic"

| processor\_load | double | Percent of processor load coming from this process |
| err\_msg | String | Error string |

### Services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that each LaunchServiceNode will need to namespace itself to make service calls uniquely-fulfillable

@emersonknapp
Copy link
Contributor

emersonknapp commented Sep 4, 2020

When talking about multi-machine message passing, how does time sync come into play?

"Out of Scope" section in design is useful.

@emersonknapp
Copy link
Contributor

How do we support Mac + Windows? Do we think the SSH process will work out of the box for those platforms?

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1

@AlexKaravaev
Copy link

Hi everyone! For me as a ros2 developer, I really do miss that feature. Any timelines on this? I see that it's only a design doc and it's hard to say, but maybe it was included in some roadmap or so

@emersonknapp
Copy link
Contributor

So far there is not a roadmap that I know of. We discussed the design in one of the Tooling Working Group meetings, which is where my above comments came from, but I think the next step is just an updated split design - we identified that this is really 2 separate features that can be independently designed and implemented.

@mlanting should be able to clarify if any proof-of-concept code exists already, I can't remember.

@mlanting
Copy link
Author

mlanting commented Oct 6, 2020

@AlexKaravaev, @pjreed made a simpler, temporary implementation of multi-machine launching that you can find here: https://github.com/pjreed/launch

I will try to get the next version of the design document out in the next week or so. @roger-strain is currently working on a refactor of the launch system. Once that is complete and this design is approved we can being implementation. I expect it will be a few months before we have anything implemented, but pj's implementation should cover basic use cases.

@pjreed
Copy link
Contributor

pjreed commented Oct 6, 2020

Just FYI, my changes in the launch repository (on the multi-machine-launch branch) basically just add in some changes necessary to abstract out the launching mechanism; I have another repository at https://github.com/pjreed/ssh_machine which adds an SSH-based launcher that can be used to launch nodes on remote machines via SSH. The functionality is similar to how ROS1's remote launching mechanism works, but there are a few limitations to it; there's some more documentation and an example in that ssh_machine repo.

@AlexKaravaev
Copy link

@mlanting @pjreed wow, thanks !
Will definitely check it out, because I was thinking of implementing similar ssh-based launcher. I understand the limitations and sure there is a need of multi-machine launch in ros2 native support, but still this repo can solve some current problems.

@gavanderhoorn
Copy link
Contributor

@mlanting et al.: this could have been discussed in the previous PR, so if it has, please just RTFM me, but has there been any thought about using existing orchestration frameworks for deployment, configuration and orchestration of multi-machine ROS applications?

I'm not immediately thinking of something like k8s, but perhaps something similar might exist, which is a little less 'heavy' but supports like operations ("infrastructure as code" and all that).

Ian Sherman in his keynote at ROSCon18 posited that "backend distributed systems and IoT" (as he phrased it) might have solutions to challenges we still see as problems. Deploying, starting and coordinating a multi-machine distributed application seems like it could be something they might have solutions or best practices for. It does seem like they might have similar requirements (ie: liveness tracking, monitoring, pushing updates, distributed configuration, failover, etc).

@pjreed
Copy link
Contributor

pjreed commented Oct 7, 2020

@AlexKaravaev: I haven't polished up my SSH-based launcher mostly because it was intended to be a stopgap solution that just implemented what we needed until a more robust solution is available, and I feel like some of the limitations (most significantly, the executable paths to nodes being resolved on the local computer before the command is sent to the remote computer) make it inappropriate for merging into the official ROS2 ecosystem at this time. Still, hopefully it works for you; let me know if you have any issues, and if it doesn't take too much work to make it more robust, it might make sense to get it into ROS2 proper until we've got a better system.

@gavanderhoorn: When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH.

@gavanderhoorn
Copy link
Contributor

@pjreed wrote:

When I made my ssh_machine launcher, actually, one of the considerations I had in mind is that it would be good if the remote launching mechanism wasn't tied to any particular implementation; in theory you could also write, say, a K8sMachine class that extends the launch.machine.Machine class to use k8s for launching instead of SSH.

the abstraction is nice, but wouldn't that still mean that launch is doing the orchestration? It would be delegating some parts of it to k8s, but that would be it.

My assumption is that it's going to take "the ROS community" quite some time to get to the level of functionality which these existing solutions have already reached. Besides that, it would also seem to be duplication of effort, which is never very nice.

- Be able to reconnect to the system to monitor and manage nodes running remotely


To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the re-scope of this design - it seems like this "Goals" section shouldn't be talking about the LaunchServiceNode since it's not specified within this design. If I remember correctly, we established that the LaunchServiceNode was independently interesting and designable. I do see that we want to talk about how these two things will interact, but I'm not sure if that belongs here in this doc anymore.


To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.

The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.
The ExecuteRemote action will provide the basic capability to start processes on remote machines.

To achieve our goals, this update will consist of two primary components: an ExecuteRemote action, and a LaunchServiceNode.

The ExecuteRemote action will provide the basic capaiblity to start processes on remote machines.
It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Here" is never a very informative link title :)

Suggested change
It will build off of the refactor that is described [here](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.
It will build off of the refactor that is described in [Refactoring ExecuteProcess into Execute and Executable](https://github.com/ros2/design/pull/272) and include a machine object as an additional parameter providing implementation-specific connection information.

The machine class will also be a base class for protocol-specific implementations.
Together these components will provide the means to execute processes remotely and should cover most basic use cases.

The [LaunchServiceNode(s)](154_roslaunch_remote_launch_service_node.md) will expose ROS2 services and topics for handling startup and shutdown of processes on remote machines.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be easier to evaluate this design tie-in if the linked document was available?

|---|---|---|
| hostname | String | Hostname of the machine. |
| port | int | Port on the host to connect to (default 22) |
| ssh\_keys | String | Path to ssh key file |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be singular?

Suggested change
| ssh\_keys | String | Path to ssh key file |
| ssh\_key | String | Path to ssh key file |

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

| prefix | List[Substitution] | A set of commands/arguments to preceed the `cmd`, used for things like `gdb`/`valgrind` and defaults to the `LaunchConfiguration` called `launch-prefix` |
| output | ?? | Configuration for process output logging. Defaults to `log` i.e. log both `stdout` and `stderr` to launch main log file and stderr to the screen. |
| output\_format | ?? | For logging each output line, supporting `str.format()` substitutions with the following keys in scope: `line` to reference the raw output line and `this` to reference this action instance. |
| log\_cmd | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe should have a note that this will avoid printing any key passphrases or other sensitive information

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point, I will include a note about that.

| log\_cmd | boolean | If `True`, prints the final cmd before executing the process, which is useful for debugging when substitutions are involved. |
| on\_exit | List[LaunchDescriptionEntity] | List of actions to execute upon process exit.|
| persistent\_connection | boolean | Whether the LaunchService should maintain a persistent connection to the ssh instance |
| machine | `launch.descriptions.SSHMachine` | The machine on which to execute the process |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want this to take the abstract interface, right?

Suggested change
| machine | `launch.descriptions.SSHMachine` | The machine on which to execute the process |
| machine | `launch.descriptions.Machine` | The machine on which to execute the process |

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, thanks

@emersonknapp
Copy link
Contributor

I think overall this looks straightforward, but it still has remnants of the process management node, which shouldn't be necessary for the simple version of this design, right?

The LaunchServiceNode and longer-term goals that motivated it are beyond
the scope of this part of the design, so references to them have been
removed.

Fixed some typos.

Added a note to avoid exposing sensitive information when logging
commands.

Distro A; OPSEC #2893

Signed-off-by: matthew.lanting <[email protected]>
@mlanting
Copy link
Author

mlanting commented Feb 3, 2021

Yeah, I think you're right. I've removed those references and fixed the errors you caught.

Copy link
Contributor

@emersonknapp emersonknapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now - though I'm not one with merge access here :)

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros2-distributed-multi-pc-ros-launch-node-management-tool/39889/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.