# System call interception

LXD supports intercepting some specific system calls from unprivileged
containers. If they're considered to be safe, it executes them with
elevated privileges on the host.

Doing so comes with a performance impact for the syscall in question and
will cause some work for LXD to evaluate the request and if allowed,
process it with elevated privileges.

Enabling of specific system call interception options is done on a
per-container basis through container configuration options.

## Available system calls

(syscall-mknod)=
### `mknod` / `mknodat`

The `mknod` and `mknodat` system calls can be used to create a variety of special files.

Most commonly inside containers, they may be called to create block or character devices.
Creating such devices isn't allowed in unprivileged containers as this
is a very easy way to escalate privileges by allowing direct write
access to resources like disks or memory.

But there are files which are safe to create. For those, intercepting
this syscall may unblock some specific workloads and allow them to run
inside an unprivileged containers.

The devices which are currently allowed are:

- OverlayFS whiteout (char 0:0)
- `/dev/console` (char 5:1)
- `/dev/full` (char 1:7)
- `/dev/null` (char 1:3)
- `/dev/random` (char 1:8)
- `/dev/tty` (char 5:0)
- `/dev/urandom` (char 1:9)
- `/dev/zero` (char 1:5)

All file types other than character devices are currently sent to the
kernel as usual, so enabling this feature doesn't change their behavior
at all.

This can be enabled by setting {config:option}`instance-security:security.syscalls.intercept.mknod` to `true`.

### `bpf`

The `bpf` system call is used to manage eBPF programs in the kernel.
Those can be attached to a variety of kernel subsystems.

In general, loading of eBPF programs that are not trusted can be problematic as it
can facilitate timing based attacks.

LXD's eBPF support is currently restricted to programs managing devices
cgroup entries. To enable it, you need to set both
{config:option}`instance-security:security.syscalls.intercept.bpf` and
{config:option}`instance-security:security.syscalls.intercept.bpf.devices` to true.

### `mount`

The `mount` system call allows for mounting both physical and virtual file systems.
By default, unprivileged containers are restricted by the kernel to just
a handful of virtual and network file systems.

To allow mounting physical file systems, system call interception can be used.
LXD offers a variety of options to handle this.

{config:option}`instance-security:security.syscalls.intercept.mount` is used to control the entire
feature and needs to be turned on for any of the other options to work.

{config:option}`instance-security:security.syscalls.intercept.mount.allowed` allows specifying a list of
file systems which can be directly mounted in the container. This is the
most dangerous option as it allows the user to feed data that is not trusted at
the kernel. This can easily be used to crash the host system or to
attack it. It should only ever be used in trusted environments.

{config:option}`instance-security:security.syscalls.intercept.mount.shift` can be set on top of that so
the resulting mount is shifted to the UID/GID map used by the container.
This is needed to avoid everything showing up as `nobody`/`nogroup` inside
of unprivileged containers.

The much safer alternative to those is
{config:option}`instance-security:security.syscalls.intercept.mount.fuse` which can be set to pairs of
file-system name and FUSE handler. When this is set, an attempt at
mounting one of the configured file systems will be transparently
redirected to instead calling the FUSE equivalent of that file system.

As this is all running as the caller, it avoids the entire issue around
the kernel attack surface and so is generally considered to be safe,
though you should keep in mind that any kind of system call interception
makes for an easy way to overload the host system.

### `sched_setscheduler`

The `sched_setscheduler` system call is used to manage process priority.

Granting this may allow a user to significantly increase the priority of
their processes, potentially taking a lot of system resources.

It also allows access to schedulers like `SCHED_FIFO` which are generally
considered to be flawed and can significantly impact overall system
stability. This is why under normal conditions, only the real root user
(or global `CAP_SYS_NICE`) would allow its use.

(syscall-setxattr)=
### `setxattr`

The `setxattr` system call is used to set extended attributes on files.

The attributes which are handled by this currently are:

- `trusted.overlay.opaque` (OverlayFS directory whiteout)

Note that because the mediation must happen on a number of character
strings, there is no easy way at present to only intercept the few
attributes we care about. As we only allow the attributes above, this
may result in breakage for other attributes that would have been
previously allowed by the kernel.

This can be enabled by setting {config:option}`instance-security:security.syscalls.intercept.setxattr` to `true`.

### `sysinfo`

The `sysinfo` system call is used by some distributions instead of `/proc/` entries to report on resource usage.

In order to provide resource usage information specific to the container, rather than the whole system, this
syscall interception mode uses cgroup-based resource usage information to fill in the system call response.