ISOLATE(1)

Table of Contents

1. NAME
2. SYNOPSIS
3. DESCRIPTION
4. BASIC OPTIONS
5. LIMITS
6. ENVIRONMENT RULES
7. DIRECTORY RULES
8. CONTROL GROUPS
9. SPECIAL OPTIONS
10. META-FILES
11. RETURN VALUE
12. INSTALLATION
13. REPRODUCIBILITY
14. SYSTEM CALL RESTRICTIONS
15. LICENSE
16. SEE ALSO

1. NAME

isolate - Isolate a process using Linux Containers

2. SYNOPSIS

isolate options --init

isolate options --run -- program arguments

isolate options --cleanup

3. DESCRIPTION

Run program within a sandbox, so that it cannot communicate with the outside world and its resource consumption is limited. This can be used for example in a programming contest to run untrusted programs submitted by contestants in a controlled environment.

The sandbox is used in the following way:

Run isolate --init, which initializes the sandbox, creates its working directory and prints its name to the standard output. If the sandbox already existed, it is reset.
Populate the directory with the executable file of the program and its input files.
Call isolate --run to run the program. A single line describing the status of the program is written to the standard error stream.
Fetch the output of the program from the directory.
Run isolate --cleanup to remove temporary files. Does nothing if the sandbox was already cleaned up.

Please note that by default, the program is not allowed to start multiple processes of threads. If you need that, turn on the control group mode (see below).

4. BASIC OPTIONS

-b, --box-id=id: When you run multiple sandboxes in parallel, you have to assign unique IDs to them by this option. See the discussion on UIDs in the INSTALLATION section. The ID defaults to 0.
-M, --meta=file: Output meta-data on the execution of the program to a given file. See below for syntax of the meta-files.
-i, --stdin=file: Redirect standard input from file. The file has to be accessible inside the sandbox (which means that the sandboxed program can manipulate it arbitrarily). If not specified, standard input is inherited from the parent process.
-o, --stdout=file: Redirect standard output to file. The file has to be accessible inside the sandbox (which means that the sandboxed program can manipulate it arbitrarily). If not specified, standard output is inherited from the parent process and the sandbox manager does not write anything to it.
-r, --stderr=file: Redirect standard error output to file. The file has to be accessible inside the sandbox (which means that the sandboxed program can manipulate it arbitrarily). If not specified, standard error output is inherited from the parent process. See also --stderr-to-stdout.
--stderr-to-stdout: Redirect standard error output to standard output. This is performed after the standard output is redirected by --stdout. Mutually exclusive with --stderr.
-c, --chdir=dir: Change directory to dir before executing the program. This path must be relative to the root of the sandbox.
-v, --verbose: Tell the sandbox manager to be verbose and report on what is going on. Using -v multiple times produces even more jabber.
-s, --silent: Tell the sandbox manager to keep silence. No status messages are printed to stderr except for fatal errors of the sandbox itself. The combination of --verbose and --silent has an undefined effect.
--wait: Multiple instances of Isolate cannot manage the same sandbox simultaneously. If you attempt to do that, the new instance refuses to run. With this option, the new instance waits for the other instance to finish.

5. LIMITS

The following options can limit system resources consumed by the program.

-m, --mem=size: Limit address space of the program to size kilobytes. If more processes are allowed, this applies to each of them separately. If this limit is reached, further memory allocations fail (e.g., malloc returns NULL).
-t, --time=time: Limit run time of the program to time seconds. Fractional numbers are allowed. Time in which the OS assigns the processor to other tasks is not counted. If this limit is exceeded, the program is killed (after --extra-time, if set).
-w, --wall-time=time: Limit wall-clock time to time seconds. Fractional values are allowed. This clock measures the time from the start of the program to its exit, so it does not stop when the program has lost the CPU or when it is waiting for an external event. We recommend to use --time as the main limit, but set --wall-time to a much higher value as a precaution against sleeping programs. If this limit is exceeded, the program is killed.
-x, --extra-time=time: When the --time limit is exceeded, do not kill the program immediately, but wait until --extra-time seconds elapse since the start of the program. This allows one to report the real execution time, even if it exceeds the limit slightly. Fractional numbers are allowed.
-k, --stack=size: Limit process stack to size kilobytes. By default, the whole address space is available for the stack, but it is subject to the --mem limit. If this limit is exceeded, the program receives the SIGSEGV signal.
-n, --open-files=max: Limit number of open files to max. The default value is 64. Setting this option to 0 will result in unlimited open files. If this limit is reached, system calls creating file descriptors fail with error EMFILE.
-f, --fsize=size: Limit size of each file created (or modified) by the program to size kilobytes. In most cases, it is better to restrict overall disk usage by a disk quota (see below). This option can help in cases when quotas are not enabled on the underlying filesystem. If this limit is reached, system calls expanding files fail with error EFBIG and the program receives the SIGXFSZ signal.
-q, --quota=blocks,inodes: Set disk quota to a given number of blocks and inodes. This requires the filesystem to be mounted with support for quotas. Unlike other options, this one must be given to isolate --init. Please note that filesystems differ in what quotas they support and how these are controlled. The interface used by Isolate has been tested with the ext family of filesystems and tmpfs. If the quota is reached, system calls expanding files fail with error EDQUOT.
--core=size: Limit size of core files created when a process crashes to size kilobytes. Defaults to zero, meaning that no core files are produced inside the sandbox.
-p, --processes[=max]: Permit the program to create up to max processes and/or threads. Please keep in mind that time and memory limit do not work with multiple processes unless you enable the control group mode. If max is not given, an arbitrary number of processes can be run. By default, only one process is permitted. If this limit is exceeded, system calls creating processes fail with error EAGAIN.

6. ENVIRONMENT RULES

UNIX processes normally inherit all environment variables from their parent. The sandbox however passes only those variables which are explicitly requested by environment rules:

-E, --env=var: Inherit the variable var from the parent.
-E, --env=var=value: Set the variable var to value. When the value is empty, the variable is removed from the environment.
-e, --full-env: Inherit all variables from the parent.

The rules are applied in the order in which they were given, except for --full-env, which is applied first.

The list of rules is automatically initialized with -ELIBC_FATAL_STDERR_=1.

7. DIRECTORY RULES

The sandboxed process gets its own filesystem namespace, which contains only subtrees requested by directory rules:

-d, --dir=in=out[:options]: Bind the directory out as seen by the caller to the path in inside the sandbox. Multiple options can be separated by colons. If there already was a directory rule for in, it is replaced.
-d, --dir=dir[:options]: Bind the directory /dir to dir inside the sandbox. Multiple options can be separated by colons. If there already was a directory rule for in, it is replaced.
-d, --dir=in=: Remove a directory rule for the path in inside the sandbox.
-D, --no-default-dirs: Omit the default set of directory rules (see below). Care has to be taken to specify the correct set of rules (using --dir) for the executed program to run correctly. In particular, /box has to be bound.

Unless specified otherwise, all directories are bound read-only and restricted (no devices, no setuid binaries). This behavior can be modified using the options:

rw: Allow read-write access.
dev: Allow access to character and block devices.
noexec: Disallow execution of binaries.
maybe: Silently ignore the rule if the directory to be bound does not exist.
fs: Instead of binding a directory, mount a device-less filesystem called out. Currently, only the following filesystems are supported: proc, sysfs, tmpfs, devpts.
tmp: Bind a freshly created temporary directory writeable for the sandbox user. Accepts no out, implies rw.
norec: Do not bind recursively. Without this option, mount points in the outside directory tree are automatically propagated to the sandbox.

Unless --no-default-dirs is specified, the following rules are added automatically:

/box for the sandbox’s working directory (read-write)
/bin (read-only bind-mount)
/dev (non-recursive bind-mount with devices allowed)
/dev/shm (read-write tmpfs)
/lib and /lib64 (read-only bind mount, the latter only if it exists)
/proc (proc filesystem with only current user’s processes visible)
/tmp (read-write temporary directory)
+/usr* (read-only bind mount)

The rules are executed in the order in which they are given. Default rules come before all user rules. When a rule is replaced, it retains the original position in the order. This matters when one rule’s in is a sub-directory of another rule’s in. For example if you first bind to a and then to a/b, it will work as expected, but a sub-directory b must have existed in the directory bound to a (isolate never creates subdirectories in bound directories for security reasons). If the order is a/b before a, then the directory bound to a/b becomes invisible by the later binding on a.

8. CONTROL GROUPS

Isolate can make use of system control groups provided by the kernel to constrain programs consisting of multiple processes. Please note that this feature needs special system setup described in the INSTALLATION section.

--cg: Enable use of control groups. This should be specified with --init, --run and --cleanup.
--cg-mem=size: Limit total memory usage by the whole control group to size kilobytes. This should be specified with --run. Effect of reaching this limit depends on circumstances. If it happens during memory allocation, the allocation can fail or memory can be over-committed by the kernel. If it happens when handling a page fault, the whole process is killed by the OOM killer with the SIGSEGV signal.
--print-cg-root: Print the root of the control group hierarchy in /sys/ and exit. This is used by the isolate-check-environment script.

9. SPECIAL OPTIONS

The following options can be useful in special cases.

--share-net: By default, isolate creates a new network namespace for its child process. This namespace contains no network devices except for a per-namespace loopback. This prevents the program from communicating with the outside world. If you want to permit communication, you can use this switch to keep the child process in parent’s network namespace.
--inherit-fds: By default, isolate closes all file descriptors passed from its parent except for descriptors 0, 1, and 2. This prevents unintentional descriptor leaks. In some cases, passing extra descriptors to the sandbox can be desirable, so you can use this switch to make them survive.
--tty-hack: Try to handle interactive programs communicating over a tty. The sandboxed program will run in a separate process group, which will temporarily become the foreground process group of the terminal. When the program exits, the process group will be switched back to the caller. Please note that the program can do many nasty things including (but not limited to) changing terminal settings, changing the line discipline, and stuffing characters to the terminal’s input queue using the TIOCSTI ioctl. Use with extreme caution.
--special-files: By default, Isolate removes all special files (other than regular files and directories) created inside the sandbox. If you need them, this option disables that behavior, but you need to carefully check what you open.
--as-uid=uid, --as-gid=gid: Act on behalf of the specified user and group (only if Isolate was invoked by root). This is used in scenarios where a root-controlled process manages creation of sandboxes for regular users, usually in conjunction with the restricted_init option in the configuration file.
--check-config: Parse configuration, check that it is valid, and exit.

10. META-FILES

The meta-file contains miscellaneous meta-information on execution of the program within the sandbox. It is a textual file consisting of lines of format key:value. The following keys are defined:

cg-mem

When control groups are enabled, this is the total memory use by the whole control group (in kilobytes). If you use isolate --run multiple times in the same sandbox, the control group retains cached data from the previous runs, which also contributes to cg-mem.

cg-oom-killed

Present when the program was killed by the out-of-memory killer (e.g., because it has exceeded the memory limit of its control group). This is reported only on Linux 4.13 and later.

csw-forced

Number of context switches forced by the kernel.

csw-voluntary

Number of context switches caused by the process giving up the CPU voluntarily.

exitcode

The program has exited normally with this exit code.

exitsig

The program has exited after receiving this fatal signal.

killed

Present when the program was terminated by the sandbox (e.g., because it has exceeded the time limit).

max-rss

Maximum resident set size of the process (in kilobytes).

message

Status message, not intended for machine processing. E.g., "Time limit exceeded."

status

Two-letter status code:

RE — run-time error, i.e., exited with a non-zero exit code
SG — program died on a signal
TO — timed out
XX — internal error of the sandbox

time

Run time of the program in fractional seconds.

time-wall

Wall clock time of the program in fractional seconds.

Please note that not all keys have to be present. For example, no status nor message is reported upon normal termination.

11. RETURN VALUE

When the program inside the sandbox finishes correctly, the sandbox returns 0. If it finishes incorrectly, it returns 1. All other return codes signal an internal error.

12. INSTALLATION

Isolate depends on several advanced features of the Linux kernel, like different kinds of namespaces and control groups. These features are available in kernels of most Linux distributions now, but if you are building your own kernel, you have to be careful.

Isolate is designed to run setuid to root. The sub-process inside the sandbox then switches to a non-privileged user ID (different for each --box-id). The range of UIDs available and several filesystem paths are set in a configuration file, by default located in /usr/local/etc/isolate.

For control group mode:

Linux supports two incompatible implementations of control groups: cgroup v1 and v2. This version of Isolate requires v2, which is the default on recent systems.
If you are using systemd, you need to start the isolate.service (see service files in the systemd directory in Isolate’s source tree). It establishes isolate.scope whose cgroup subtree is delegated to Isolate by systemd. The service runs a simple daemon isolate-cg-keeper(8) to keep the scope alive.
If you are not using systemd, make sure that Isolate’s configuration file refers to the correct location where you have the cgroup filesystem mounted. Also make sure that whatever service manager you are using, it does not interfere with Isolate’s use of control groups.
Running Isolate in containers is not recommended, since container managers usually do not delegate control groups properly. Besides, you do not want to share the machine with other workloads, which would influence measurement of execution time. If you still want to use containers, you are on your own and you probably have to make them privileged.
Reporting memory usage requires Linux kernel 5.19 or newer.
Since memory limits do not affect swapped-out data, we recommend turning off swap completely.

Isolate expects that the root directory "/" is a mount point. When running isolate inside a chroot, this may not be the case, and isolate may fail with "Cannot privatize mounts". A workaround for this is to convert the root directory of the chroot into a mount point using a bind mount, prior to entering the chroot and running isolate. For example:

mount --bind /path/to/chroot /path/to/chroot

It is recommended to have sys.fs.protected_hardlinks sysctl set to 1 (which is probably default on modern Linux systems). Otherwise, the user running the sandbox could trick isolate to changing the owner of unrelated files.

If you have systemd-coredump installed, please keep in mind that it records core files even for processes inside the sandbox. As it configures the kernel to deliver core dumps using a pipe, it is not affected by the --core limit.

13. REPRODUCIBILITY

The reproducibility of results can be improved by tuning some kernel parameters, listed below. Some of these parameters can be checked using the program isolate-check-environment.

Disable address space randomization: sysctl kernel.randomize_va_space=0. Address space randomization can affect timing, memory usage, and program behavior. This setting can be made persistent through /etc/sysctl.d/.
Disable dynamic CPU frequency scaling. This is done by setting the cpufreq scaling governor in /sys/device/system/cpu/cpufreq/*/scaling_governor to performance. (On Intel CPUs, frequency scaling can be controlled by the intel_pstate driver, but it still provides its own performance controller to the cpufreq subsystem.)
Consider disabling frequency boosting on CPUs that might support it (this includes most i3/i5/i7 Intel CPUs and the AMD Zen architecture). This is done either by writing 1 to /sys/devices/system/cpu/intel_pstate/no_turbo (on Intel CPUs) or by writing 0 to /sys/devices/system/cpu/cpufreq/boost (other machines).
Run evaluations on a single CPU (core). The Linux scheduler has a tendency to randomly migrate tasks between CPUs, incurring cache migration costs. You can use isolate’s configuration file to pin the process to a specified CPU.
If you have CPU with a mix of different cores (e.g., P-cores and E-cores in certain Intel CPUs), pin the sandbox to a homogeneous subset of cores.
Disable automatic kernel support for transparent huge pages. Both /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag should be set to "madvise" or "never", and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag to 0.
Disable swapping. If you really need swap space and you are using cgroups, make sure that you have the memsw controller enabled, so that swap space is properly accounted for.

See further suggestions in the IOI Technical Checklist.

14. SYSTEM CALL RESTRICTIONS

In general, Isolate does not limit what operations can the sandboxed process do. Instead, it places the process within a set of namespaces which restrict what objects can be accessed. E.g., it is allowed to create network sockets, but there is nowhere else to connect them to.

However, there are a few special cases of objects that are not properly namespaced by the Linux kernel. These objects can be exploited to smuggle information between two sandboxes running in parallel, or between two instances of the same sandbox running one after another.

To avoid such information leaks, Isolate forbids the use of a few system calls. This is controlled by the syscall_flags setting in the configuration file, which contains a sum of the following flags:

Keyrings (flag 1) — disables the keyctl system call which maintains keyrings that store cryptographic material. Keyrings can be used to establish system-wide or per-user persistent memory, through which data can be passed between successively run sandboxes.
Virtual machine sockets (flag 2) — disables creation of sockets of the AF_VSOCK address family. These sockets are intended for communication between virtual machines and their hypervisors, but their scope is the whole system. Therefore a vsock can be used to pass data between two sandboxes running simultaneously.
i-node operations (flag 4) — when an inode is accessible in multiple sandboxes (e.g., as a part of a read-only bind-mount of /usr, or even /proc/self/ns/uts), file locks on it are shared even across mount namespace boundaries. This can be used to leak information between simultaneously running sandboxes. This restriction forbids fcntl with F_SETLK, F_SETLKW, F_OFD_SETLK, F_OFD_SETLKW, F_SETLEASE, and F_NOTIFY, and also flock. Unlike the previously mentioned system calls, file locks could be possibly required by libraries used by legitimate programs.
io_uring (flag 8) — the io_uring kernel API for asynchronous submission of I/O operations also allows creation of sockets with arbitrary address families. Since this API is unlikely to be used in programming contests, we disable it completely.
legacy architectures (flag 16) — on some architectures, the kernel can execute code for non-native architectures (e.g., 32-bit binaries on 64-bit systems). This is not a security risk per se, but it brings extra complexity to already non-trivial system call filtering. Since programming contests seldom use non-native architectures, we recommend not allowing them by default and turning this bit off only if necessary. At the moment, the only legacy architecture recognized is x86 (i386) on x86_64 (amd64) systems, please let us know if you need more.

By default, all restrictions are in place.

15. LICENSE

Isolate was written by Martin Mares and Bernard Blackham. It can be distributed and used under the terms of the GNU General Public License version 2 or any later version.

16. SEE ALSO

isolate-check-environment(8), isolate-cg-keeper(8)