Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF worker: Add seccomp restrictions (restrict networking) #2009

Merged
merged 13 commits into from
Oct 31, 2023

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Oct 24, 2023

Overview

We're already working on sandboxing by blocking all unneeded syscalls. However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready.

For security we block the following with seccomp:

  1. creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.

  2. io_uring - as discussed here, io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.

  3. connecting to sockets - the above two points are enough for networking and is what birdcage does (or used to do) to restrict networking. However, it is possible to connect to abstract unix sockets to do some kinds of sandbox escapes, so we also block the connect syscall.

Safety of blocking io_uring

(Intentionally left out of implementer's guide because it felt like too much detail.)

io_uring is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use io_uring in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned io_uring for these reasons.

Considering io_uring's status, and that it very likely would get detected either by our recently-added static analysis or by testing, I think it is fairly safe to block it.

Consensus analysis

If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.

Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:

  1. An update to wasmtime is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)
  2. The new syscall is not detected by our static analysis
  3. It is never triggered in any of our tests
  4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely)
  5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?)

Considering how many things would have to go wrong here, we believe it's safe to block io_uring.

Related

Closes #619
Original PR in Polkadot repo: paritytech/polkadot#7334
Birdcage io_uring PR: phylum-dev/birdcage#34

We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready.

For security we block the following with `seccomp`:

- creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.

- `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.

- `connect`ing to sockets - the above two points are enough for networking and is what birdcage does (or [used to do](phylum-dev/birdcage#47)) to restrict networking. However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/[email protected]/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall.

(Intentionally left out of implementer's guide because it felt like too much detail.)

`io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons.

Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it.

If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.

Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:

1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)
2. The new syscall is not detected by our static analysis
3. It is never triggered in any of our tests
4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely)
5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?)

Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`.

Closes #619
Original PR in Polkadot repo: paritytech/polkadot#7334
Starting the tokio runtime was calling `socketpair` and triggering the new
seccomp filter. Removed tokio since we wanted to do it soon anyway as part of
#649.
@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Oct 24, 2023
@mrcnski mrcnski self-assigned this Oct 24, 2023
Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

polkadot/node/core/pvf/common/src/worker/mod.rs Outdated Show resolved Hide resolved
Comment on lines 73 to 75
//! # Action on caught syscall
//!
//! TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of killing the process on violations, we can have seccomp notify the parent, having it kill the process and log what happened. We anyway want logging to ensure that the syscall detection script is really sound, before enabling a seccomp whitelist based on it in production. I've done some experiments and performance does not suffer too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF?

If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptrace. With SECCOMP_RET_USER_NOTIF we'd need to check if it's supported, as you previously pointed out. Maybe not a big deal as it was introduced in 5.0.

I considered audit logging, but AFAICT there are some issues with it. We'd need to manually parse the log which seems hacky (there is a Linux utility for it but not installed by default). Also, it's possible for operators to have disabled seccomp logging. ptrace seemed like the right way to do it, albeit more complex; but I'd really appreciate your input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's possible for operators to have disabled seccomp logging.

well, users can disable ptrace too. But is this a concern? we only have access to the logs of nodes if they're run by us, right? we can configure the system whichever way we see fit.

I don't have a very strong opinion honestly. whichever option works, doesn't impact performance too much and isn't too complicated to implement.
I am guessing the most performant one is audit logging, because it's all happening in the kernel. But it's indeed more complicated to parse the logs and build the right plubming to feed them in loki

Copy link
Contributor Author

@mrcnski mrcnski Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with parsing the audit logs. Audit events are not guaranteed to be observable, but we don't necessarily need them: the logs we emit are merely informative, so that operators know what happened. Note that if we were to handle seccomp violations differently from regular worker death, then an attacker could perhaps abuse that non-determinism somehow.

Using ptrace instead would be more reliable and (I think) the performance would be fine, but it seemed more convoluted to implement that strategy.

Copy link
Contributor

@alindima alindima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good

polkadot/node/core/pvf/common/Cargo.toml Outdated Show resolved Hide resolved
polkadot/node/core/pvf/common/src/worker/security/mod.rs Outdated Show resolved Hide resolved
Comment on lines 73 to 75
//! # Action on caught syscall
//!
//! TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you plan to have the parent notified? using ptrace or SECCOMP_RET_USER_NOTIF?

If we only want logging, we can just rely on the kernel's Audit logging of seccomp actions

polkadot/node/core/pvf/src/host.rs Outdated Show resolved Hide resolved
polkadot/node/core/pvf/src/host.rs Outdated Show resolved Hide resolved
@mrcnski mrcnski requested a review from alindima October 25, 2023 08:56
NOTE: Log, but don't change the outcome. Not all validators may have auditing
enabled, so we don't want attackers to abuse a non-deterministic outcome.

TESTING: Some manual testing where seccomp events were triggered confirmed that
the logs are parsed correctly:

Prepare worker:
```
Oct 27 09:15:42.725  WARN parachain::pvf: failed to recv a prepare response: Custom { kind: UnexpectedEof, error: "early eof" } worker_pid=2691819
Oct 27 09:15:42.726 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2691819 audit_log_path="/var/log/syslog"
Oct 27 09:15:42.727 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2691819 syscall=41 pvf=Pvf { code, code_hash: 0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35, executor_params: ExecutorParams([]), prep_timeout: 3s }
```

Execute worker:
```
Oct 27 09:17:28.006  WARN parachain::pvf: failed to recv an execute response worker_pid=2692712 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 error=Custom { kind: UnexpectedEof, error: "early eof" }
Oct 27 09:17:28.006 DEBUG parachain::pvf: checking audit log for seccomp violations worker_pid=2692712 audit_log_path="/var/log/syslog"
Oct 27 09:17:28.007 ERROR parachain::pvf: A forbidden syscall was attempted! This is a violation of our seccomp security policy. Report an issue ASAP! worker_pid=2692712 syscall=41 validation_code_hash=0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35 artifact_path="/tmp/.tmp4p6bN0/wasmtime_polkadot_v1.1.0_0xdc4dc649bdebd4a529b6c3eb17da74b36afa9728eb7a0d834443510382490a35_0x03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c111314"
```
@mrcnski mrcnski requested a review from alindima October 27, 2023 09:50
@mrcnski mrcnski requested a review from eskimor October 27, 2023 10:00
Copy link
Contributor

@alindima alindima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable. Another way of getting a log would be to use the Trap return action and install a signal handler for SIGSYS that logs and then kills the process.

But that get's tricky because you can only use async-signal-safe functions in signal handlers (and we'd have to audit all of that code constantly).

polkadot/node/core/pvf/tests/it/main.rs Outdated Show resolved Hide resolved
@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 28, 2023

Perhaps we should only log for now, instead of killing the process. Just for one release. We haven't tested with all parachains, after all, and the socket call we're blocking is in the syscall lists for the processes.

I say we log, just to be safe. The jobs really shouldn't actually do any networking. (And creating a Unix socket would have already failed with our FS restrictions.)

@alindima
Copy link
Contributor

Perhaps we should only log for now, instead of killing the process. Just for one release. We haven't tested with all parachains, after all, and the socket call we're blocking is in the syscall lists for the processes.
I say we log, just to be safe. The jobs really shouldn't actually do any networking. (And creating a Unix socket would have already failed with our FS restrictions.)

Sounds good to me. We'll need some script that parses the audit logs from the kernel and notifies us if there's a logged violation

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 30, 2023

Another way of getting a log would be to use the Trap return action and install a signal handler for SIGSYS that logs and then kills the process.

But that get's tricky because you can only use async-signal-safe functions in signal handlers (and we'd have to audit all of that code constantly).

I agree that signal handlers are better avoided where possible. :P And, according to man:

This value results in the kernel sending a thread-directed SIGSYS signal to the triggering thread.

IIUC, with this scheme, an attacker could overwrite our signal handlers with his own.

We'll need some script that parses the audit logs from the kernel and notifies us if there's a logged violation

I believe we can just parse the logs after every job, similar to how we do it now. We don't need a script running continuously (if I understood you correctly).

I've confirmed that we have time to log for one release before enabling the protections.

@alindima
Copy link
Contributor

IIUC, with this scheme, an attacker could overwrite our signal handlers with his own.

You could add sigaction to the seccomp blacklist. Anyway, it's best to avoid signal handlers if possible, as we both agree

I believe we can just parse the logs after every job, similar to how we do it now. We don't need a script running continuously (if I understood you correctly).

Sure. We still need some kind of rough notification mechanism if a fault occurred. Or just remember to scan the logs regularly :D

@s0me0ne-unkn0wn
Copy link
Contributor

We'll need some script that parses the audit logs from the kernel

I'm not following the whole discussion so a possibly dumb question... Why do we need to parse audit logs instead of connecting to audit via netlink and observing audit events directly?

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 30, 2023

@s0me0ne-unkn0wn I don't know what netlink is. :P Parsing logs is never ideal, but this is the best way I could find since encountering the problem back in April. If your solution is quick to implement I could do it, otherwise I've already implemented and tested the parsing way.

@s0me0ne-unkn0wn
Copy link
Contributor

@mrcnski see man 3 audit_open and further links. It's the interface to the kernel's audit subsys. I think I sent you a usage example to Matrix PM once.

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 30, 2023

Thanks @s0me0ne-unkn0wn, it's a good idea if I can figure out how to do it. Raised a follow-up at #2080.

Use a combination of rusty-fork (separate processes in rust tests) and new
sessions to safely kill child workers in tests.
@mrcnski mrcnski requested a review from alindima October 30, 2023 11:57
@mrcnski mrcnski merged commit 9faea38 into master Oct 31, 2023
10 checks passed
@mrcnski mrcnski deleted the mrcnski/pvf-worker-restrict-networking branch October 31, 2023 10:08
mrcnski added a commit that referenced this pull request Nov 8, 2023
We've merged #2009 which logs seccomp violations for networking, but does not
switch on full seccomp restrictions (voting against blocks on violations).

We should monitor releases and once #2009 has been released, we can switch this
on for the next release.

Closes #2163
@mrcnski mrcnski mentioned this pull request Nov 24, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

PVF worker: restrict network access
5 participants