Limiting services

It is a common requirement to place limits on what services can do.

Running under the aegises of unprivileged user accounts

One way to limit services is to use the operating system's mechanisms that limit what unprivileged users can do. One "drops privileges" from running as the superuser to running as an unprivileged user. One then …

… controls what files, directories, processes, and other objects the unprivileged user account owns. Ownership grants the privilege to change the access permissions of owned objects.
One common approach is to not have the unprivileged user own any files or directories, but rather have the files and directories that it uses owned by the superuser. An access control rule on the object then grants read/write/execute access without granting ownership.
… controls the access that the unprivileged user account has to files, directories, processes, and other objects. In general such privileges should be minimal, with the unprivileged user account not able to read or to write things that it does not need to read or to write.
… sets per-user disc quotas and the like.
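
As a rough illustration of the "own nothing" approach above, the following Python sketch lists the paths under a directory that a given user ID owns, so that one can check that a service account's list is empty. The function name paths_owned_by is hypothetical and is not part of any toolset.

```python
import os


def paths_owned_by(root, uid):
    """Return the sorted paths at or beneath root whose owner is uid."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    # lstat() so that a symbolic link is judged by its own owner,
    # not by the owner of whatever it points to.
    return sorted(p for p in candidates if os.lstat(p).st_uid == uid)
```

An empty result for the service's account means that the account cannot use ownership to change permissions on anything in that tree.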

Sometimes the main dæmon program itself drops privileges, internally. Usually it does this as part of an overall sequence of setting up a changed root and then dropping privileges. (This is because it requires a lot more setup, and a lot more files and directories exposed, to use chain-loading to set up a changed root. Chain loading involves the overlay of new process images that must be visible, along with the dynamic loader and any dynamically loaded shared objects, in the changed root environment. All of that needs to be set up with hard links, bind mounts, and whatnot.) In such cases, one can use envuidgid to look up the user ID and group ID of the unprivileged user account that the program should switch to, and place them in the environment for it to read.
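
A minimal sketch, in Python, of what the dæmon's side of this might look like. It assumes the daemontools convention that envuidgid exports the IDs as the UID and GID environment variables; the function names are hypothetical, and clearing the supplementary group list is a choice made for this sketch rather than something that every dæmon does.

```python
import os


def ids_from_environment(env=None):
    """Read the target user and group IDs from UID and GID (an assumed convention)."""
    env = os.environ if env is None else env
    return int(env["UID"]), int(env["GID"])


def drop_privileges(uid, gid):
    """Switch from the superuser to an unprivileged account, irreversibly."""
    # Order matters: groups can only be changed while still the superuser,
    # so the user ID is changed last.
    os.setgroups([])
    os.setgid(gid)
    os.setuid(uid)
```

A dæmon would call drop_privileges(*ids_from_environment()) after it has finished setting up its changed root.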

At other times privileges are dropped by a sequence of chain-loading tools leading up to the execution of the main dæmon program itself. Usually this is the case where there is no changed root involved. In such cases, one can use setuidgid or setuidgid-fromenv to drop privileges.

This second approach is heavily used in logging services.

Resource limits

Operating systems provide one or two mechanisms for setting resource limits for services.

Using the original Unix resource limit mechanism

The original Unix resource limit mechanism is controllable in a run program by using the softlimit and the hardlimit utilities, which have the conventional daemontools style of interface (complete with a unified "memory" setting that sets several limits as one), or the ulimit utility, which has an interface similar to the built-in command of the same name in POSIX-conformant shells. This resource limit mechanism has some well-known shortcomings, which one may or may not hit, depending on exactly what one's dæmon does. A dæmon that never spawns child processes will not, for example, run into the well-known problem that some of these Unix resource limits are per-process.

For example: The authors of MongoDB recommend several resource settings for when running MongoDB under a service manager. The run program for the MongoDB service bundle implements them as follows:

hardlimit -o 64000 -p 64000
softlimit -o hard -p hard
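
For readers more familiar with the underlying system calls, here is a hedged Python sketch of the same idea: set a hard limit, then bring the soft limit up to meet it. The function names are hypothetical. Unlike hardlimit run as the superuser, this sketch only ever lowers a hard limit and never raises one, so that it also works unprivileged.

```python
import resource


def lowered_limits(current, ceiling):
    """Return a (soft, hard) pair with hard capped at ceiling and soft equal to hard."""
    _soft, hard = current
    if hard == resource.RLIM_INFINITY or hard > ceiling:
        hard = ceiling
    return (hard, hard)


def apply_ceiling(which, ceiling):
    """Apply lowered_limits() to one resource of the calling process."""
    resource.setrlimit(which, lowered_limits(resource.getrlimit(which), ceiling))


# Roughly what the two lines above ask for (open files and processes):
#   apply_ceiling(resource.RLIMIT_NOFILE, 64000)
#   apply_ceiling(resource.RLIMIT_NPROC, 64000)
```

Because resource limits are inherited, a chain-loading tool applies them to itself and then executes the next program in the chain.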

Using the Linux control groups

The Linux "control groups" mechanism is an enhanced and improved version of the original Unix mechanism, intended to overcome some of its limitations with respect to limits constraining multiple processes. It is used from run programs with the move-to-control-group, the set-control-group-knob, and the delegate-control-group-to utilities.

The basic principles of operation are these:

All service processes start out in the same control group as the one that the service manager itself is running in. This control group is created by the service manager as a child of the original control group that it started out running in. It moves itself into that sub-group, so that the original control group has no processes.
Each service's service bundle is responsible for moving the service processes out of the service manager's control group and into its own dedicated control group, and for placing the limits on the control group. This happens as follows:
- In the start program one uses move-to-control-group, set-control-group-knob, and delegate-control-group-to to create and then change control group, to set the various knobs on the control group to limit processes run within it, and to enable (if appropriate) the creation of further sub-groups by the unprivileged account that the service processes run as.
- In the run program one uses move-to-control-group to change control group to the same control group. This is done before dropping privileges.
Conventionally, the services run in sibling control groups to the service manager's control group.
When system-manager, per-user-manager, or some other program spawns the service manager in the first place, it places it into a control group. This original control group becomes a common root control group for a whole lot of sub-groups, including one for the service manager itself and others for individual services.
Control group names for services conventionally end in .service. The service manager, system manager, and per-user manager use names ending in .slice so that there are no possibilities for name conflicts. (These extensions are determined by the Linux control groups API, which explicitly guarantees that control group knob names will not end in these extensions.)
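
The mechanics behind these principles are ordinary file operations on the control groups filesystem. The following Python sketch shows the general idea under stated assumptions: a unified (version 2) hierarchy, where /proc/PID/cgroup has a line of the form 0::/path, and a mount point passed in by the caller (commonly /sys/fs/cgroup). The function names are hypothetical; this is not how move-to-control-group itself is implemented.

```python
import os


def unified_group(proc_cgroup_text):
    """Extract the version 2 control group path from the text of /proc/PID/cgroup."""
    for line in proc_cgroup_text.splitlines():
        if not line:
            continue
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return path
    raise LookupError("no unified hierarchy entry")


def move_to_control_group(mount_point, current, relative, pid):
    """Create the group named relative to the current one and move pid into it."""
    target = os.path.normpath(
        os.path.join(mount_point, current.lstrip("/"), relative))
    # On the real filesystem, making the directory creates the control group.
    os.makedirs(target, exist_ok=True)
    # Writing a process ID to cgroup.procs moves that process into the group.
    with open(os.path.join(target, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return target
```

A relative name such as ../dbus.service thus resolves to a sibling of the control group that the process is currently in.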

An example of this is the user-services@username service, whose start program sets up a control group for the service, changes to it, and allows the named user to make further sub-groups:

move-to-control-group ../"user-services@".service
move-to-control-group "user-services@username".service
foreground delegate-control-group-to username ;

Its run program changes to the same control group and then drops privileges:

move-to-control-group ../"user-services@".service
move-to-control-group "user-services@username".service
setsid
setuidgid --supplementary username

Notice that this is an instance of a service that is generated (by the external formats conversion mechanism individually for each user) from a template. It employs a convention of a two-level set of control groups, one for all services generated from the template and one for each individual instance.
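
The two-level naming convention can be sketched in Python as follows. The rule used here (split the service name at its first @ to find the template) is an assumption inferred from the examples in this chapter, and control_group_path is a hypothetical name.

```python
def control_group_path(service_name):
    """Return the conventional relative control group path for a service."""
    template, at, _instance = service_name.partition("@")
    parts = [".."]  # sibling of the service manager's own control group
    if at:
        # Templated services get an extra level shared by all instances.
        parts.append(template + "@.service")
    parts.append(service_name + ".service")
    return "/".join(parts)
```

For instance, control_group_path("user-services@jim") yields ../user-services@.service/user-services@jim.service, and a non-templated service such as dbus yields ../dbus.service.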

An example of a service that twiddles control group knobs is the dbus service, whose start program limits the number of processes that can run in the control group:

foreground set-control-group-knob ../cgroup.subtree_control "+pids" ;
move-to-control-group ../dbus.service
oom-kill-protect -- -800
foreground set-control-group-knob --percent-of /proc/sys/kernel/threads-max --infinity-is-max pids.max 20 ;

Its run program only needs to change to the same control group before dropping privileges (which is actually done by the main dæmon program itself):

move-to-control-group ../dbus.service
oom-kill-protect -- -800

This uses set-control-group-knob for two things:

It ensures that the "pids" controller is enabled in the control group, by writing to the cgroup.subtree_control file in its parent control group.
It limits the number of processes in the control group by writing to the pids.max file in the control group itself. (The various additional settings result from this being a generated start program that takes this setting from a data file. The data file allows expressing the number in two other ways, as a percentage of the kernel's threads-max setting and as the word "infinity". Neither of those is actually understood by the Linux control groups mechanism itself. The additional settings translate those into the actual knob values that the Linux control group mechanism accepts.)
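
The translation described above might look something like the following Python sketch. The function and parameter names are hypothetical, and rounding the percentage down is an assumption; the real utility may round differently.

```python
def translate_knob_value(value, percent_of=None, infinity_is_max=False):
    """Turn a data-file setting into text that a control group knob accepts."""
    if infinity_is_max and value == "infinity":
        # The kernel spells "no limit" as "max" in knobs such as pids.max.
        return "max"
    number = int(value)
    if percent_of is not None:
        # Treat the number as a percentage of a reference value, for example
        # the contents of /proc/sys/kernel/threads-max.
        number = percent_of * number // 100
    return str(number)
```

With a threads-max of 127000, a setting of 20 becomes 25400, which is what would then be written to pids.max.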

A full description of what control group knobs there are and what limits they effect is beyond the scope of this Guide. See the documentation that accompanies the kernel, in particular Documentation/cgroup-v2.txt.

There is a notion in circulation that a central "control groups manager" is required for Linux control groups. This is simply untrue. The notion results from a control group "manager" (which merely did some rules matching in order to slap control groups onto processes that did not do control groups themselves) and from a rejected proposal from systemd that was presented on the World Wide Web for many years as if it were accomplished fact. Control groups do not require a central "manager", and were designed to be used in a distributed fashion with no central controller at all. The distributed operation here, where individual services create and configure control groups separately from the system manager and service manager, which also create and configure other control groups, is a demonstration of that.

An example of what this results in

Here is a (slightly shortened) view of what the (unified) control groups tree looks like, as printed by systemd-cgls /, on a system that uses the native system manager, per-user manager, and service manager. The instances of /sbin/init are the system manager (PID 1), its logging service (PID 204), and the system-wide service manager (PID 205).

/:
├━me.slice
│ └━1 /sbin/init
├━service-manager.slice
│ ├━ttylogin@.service
│ │ ├━ttylogin@vc3-tty.service
│ │ │ ├━935 login
│ │ │ └━27326 systemd-cgls /
│ │ └━ttylogin@vc2-tty.service
│ │   └━941 login
│ ├━tinydns.service
│ │ └━926 tinydns
│ ├━dnscache.service
│ │ └━927 dnscache
│ ├━NetworkManager.service
│ │ ├━1020 NetworkManager --no-daemon
│ │ └━1636 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp-helper -p…
│ ├━dbus.service
│ │ └━846 dbus-daemon --config-file ./system-wide.conf --nofork --nopidfile -…
│ ├━udev-log.service
│ │ └━245 cyclog udev/
│ ├━me.slice
│ │ └━205 /sbin/init
│ ├━user-services@.service
│ │ └━user-services@jim.service
│ │   ├━me.slice
│ │   │ └━27299 per-user-manager
│ │   ├━service-manager.slice
│ │   │ └━me.slice
│ │   │   ├━27302 service-manager
│ │   │   ├━simple-servers-log.service
│ │   │   │ └━27309 cyclog jim/simple-servers/
│ │   │   └━urxvt.service
│ │   │     ├━27312 urxvtd
│ │   │     └━27313 urxvtd
│ │   └━per-user-manager-log.slice
│ │     └━27301 cyclog --max-file-size 262144 --max-total-size 1048576 .
│ ├━klogd.service
│ │ └━847 klog-read
│ ├━udev.service
│ │ └━250 udevd --debug
│ └━cyclog@.service
│   ├━cyclog@dnscache.service
│   │ └━725 cyclog dnscache/
│   ├━cyclog@NetworkManager.service
│   │ └━713 cyclog NetworkManager/
│   ├━cyclog@terminal-emulator@vc2.service
│   │ └━724 cyclog terminal-emulator@vc2/
│   ├━cyclog@local-syslog-read.service
│   │ └━738 cyclog local-syslog-read/
│   ├━cyclog@tinydns.service
│   │ └━720 cyclog tinydns/
│   ├━cyclog@dbus.service
│   │ └━735 cyclog dbus/
│   ├━cyclog@terminal-emulator@vc3.service
│   │ └━716 cyclog terminal-emulator@vc3/
│   ├━cyclog@ttylogin@vc2-tty.service
│   │ └━759 cyclog ttylogin@vc2-tty/
│   ├━cyclog@ttylogin@vc3-tty.service
│   │ └━760 cyclog ttylogin@vc3-tty/
│   └━cyclog@klogd.service
│     └━711 cyclog klogd/
└━system-manager-log.slice
  └━204 /sbin/init

Other toolsets and other settings

The nosh toolset is not the only toolset with chain loading tools for affecting dæmon process state. Other toolsets include various useful chain loading tools relating to resource usage control, such as:

rtprio (BSD) and chrt (Linux): Change scheduling priority.
numactl (Linux): Change NUMA settings.

Mounts and namespaces

Linux has a system of namespaces which can be used to limit what a service sees of the rest of the system. (See the Linux kernel doco for details of what the namespaces are.)

Manipulating Linux namespaces is the province of the unshare, set-mount-object, make-private-fs, and make-read-only-fs commands, used in chains in run programs. With them a process detaches from one or more shared namespaces, and then manipulates its (now) private namespaces to show a different view of the system.

For example, one can set up a "no hardware devices" view of the world, where only the "API" devices (for shared memory, pseudo-terminals, file descriptors, randomness, and suchlike) are available, with the following chain:

unshare --mount
set-mount-object --recursive slave /
make-private-fs --devices
set-mount-object --recursive shared /
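
For the curious, the first two steps of that chain correspond to two Linux system calls, unshare(2) and mount(2). The following Python sketch shows them via ctypes. The numeric constants are the values from the Linux headers; the function names are hypothetical; and the system call wrapper needs superuser privileges, so treat it as an illustration rather than a replacement for the tools.

```python
import ctypes

CLONE_NEWNS = 0x00020000  # unshare the mount namespace
MS_REC = 1 << 14          # apply to the whole subtree
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

PROPAGATION = {"private": MS_PRIVATE, "slave": MS_SLAVE, "shared": MS_SHARED}


def propagation_flags(kind, recursive=False):
    """Return the mount(2) flags for a propagation type such as "slave"."""
    return PROPAGATION[kind] | (MS_REC if recursive else 0)


def unshare_mounts_and_set(kind, path=b"/"):
    """Detach from the shared mount namespace, then change propagation at path."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    if libc.unshare(CLONE_NEWNS) != 0:
        raise OSError(ctypes.get_errno(), "unshare failed")
    # Source, filesystem type, and data are ignored for propagation changes.
    if libc.mount(None, path, None, propagation_flags(kind, True), None) != 0:
        raise OSError(ctypes.get_errno(), "mount failed")
```

Marking everything as "slave" first means that mounts made afterwards in the private namespace do not propagate back out to the rest of the system.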