Confinement

Isolating programs

Paul Krzyzanowski

October 1, 2020

Two lessons we learned from experience are that applications can be compromised and that applications may not always be trusted. Server applications, in particular, such as web servers and mail servers have been compromised over and over again. This is particularly harmful as they often run with elevated privileges and on systems on which normal users do not have accounts. The second category of risk is that we may not always trust an application. We trust our web server to work properly but we cannot necessarily trust that the game we downloaded from some unknown developer will not try to upload our files, destroy our data, or try to change our system configuration. In fact, unless we have the ability to scrutinize the codebase of a service, we will not know for sure if it tries to modify any system settings or writes files to unexpected places.

With this resignation to security in mind, we need to turn our attention to limiting the resources available to an application and making sure that a misbehaving application cannot harm the rest of the system. These are the goals of confinement.

Our initial thoughts to achieving confinement may involve proper use of access controls. For example, we can run server applications as low-privilege users and make sure that we have set proper read/write/execute permissions on files, read/write/search permissions on directories, or even set up role-based policies.

However, access controls usually do not give us the ability to set permissions for “don’t allow access to anything else.” For example, we may want our web server to have access to all files in /home/httpd but nothing outside of that directory. Access controls do not let us express that rule. Instead, we are responsible for changing the protections of every file on the system and making sure it cannot be accessed by “other”. We also have to hope that no users change those permissions. In essence, we must disallow the ability for anyone to make files publicly accessible because we never want our web server to access them. We may be able to use mandatory access control mechanisms if they are available but, depending on the system, we may not be able to restrict access properly either. More likely, we will be at risk of comprehension errors and be likely to make a configuration error, leaving parts of the system vulnerable. To summarize, even if we can get access controls to help, we will not have high assurance that they do.

Access controls also only focus on protecting access to files and devices. A system has other resources, such as CPU time, memory, disk space, and network. We may want to control how much of all of these an application is allowed to use. POSIX systems provide a setrlimit system call that allows one to set limits on certain resources for the current process and its children. These controls include the ability to set file size limits, CPU time limits, various memory size limits, and maximum number of open files.

We also may want to control the network identity for an application. All applications share the same IP address on a system but this may allow a compromised application to exploit address-based access controls. For example, you may be able to connect to or even log into system that believe you are a trusted computer. An exploited application may end up confusing network intrusion detection systems.

Just limiting access through resource limits and file permissions is also insufficient for services that run as root. If an attacker can compromise an app and get root access to execute arbitrary functions, she can change resource limits (just call setrlimit with different values), change any file permissions, and even change the IP address and domain name of the system.

In order to truly confine an application, we would like to create a set of mechanisms that enforce access controls to all of a system’s resources, are easy to use so that we have high assurance in knowing that the proper restrictions are in place, and work with a large class of applications. We can’t quite get all of this yet but we can come close.

chroot

The oldest app confinement mechanism is Unix’s chroot system call and command, originally introduced in 1979 in the seventh edition[1]. The chroot system call changes the root directory of the calling process to the directory specified as a parameter.

chroot("/home/httpd/html");

Sets the root of the file system to /home/httpd/html for the process and any processes it creates. The process cannot see any files outside that subset of the directory tree. This isolation is often called a chroot jail.

Jailkits

If you run chroot, you will likely get an error along the lines of:

# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

This is because /bin/bash is not within the root (in this case, the newroot directory). You’ll then create a bin subdirectory and try running chroot again and get the same error:

# mkdir newroot/bin
# ln /bin/bash newroot/bin/bash
# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

You’ll find that is also insufficient and that you’ll need to bring in the shared libraries that /bin/bash needs by mounting /lib, /lib64, and /usr/lib within that root just to enable the shell to run. Otherwise, it cannot load the libraries it needs since it cannot see above its root (i.e., outside its jail). To simplify this process, a jailkit simplifies the process of setting up a chroot jail by providing a set of utilities to make it easier to create the desired environment within the jail and populate it with basic accounts, commands, and directories.

Problems with chroot

Chroot only limits access to the file system namespace. It does not restrict access to resources and does not protect the machine’s network identity. Applications that are compromised to give the attacker root access make the entire system vulnerable since the attacker has access to all system calls.

Chroot is available only to administrators. If this was not the case then any user would be able to get root access within the chroot jail. You would:

  1. Create a chroot jail
  2. Populate it with the shell program and necessary support libraries
  3. Create a link inside your jail to the /bin/su command (set user, which allows you to authenticate to become any user)
  4. Create password files within the jail with a known password for root. On Linux systems, you would typically create an etc/passwd file that contains information about the user account (name, user ID, home directory, startup shell) and an etc/shadow file that contains the actual passwords.
  5. Use the chroot command to enter the jail.
  6. Run su root to become the root user. The command will prompt you for a password and validate it against the password file. Since all processes run within the jail, the password file is the one you set up in your jail, so you knwo the password.

You’re still in the jail but you have root access. Now you will need to escape from the jail.

Escaping from chroot

If someone manages to compromise an application running inside a chroot jail and become root, they are still in the jail but have access to all system calls. For example, they can send signals to kill all other processes or shut down the system. This would be an attack on availability.

Attaining root access also provides a few ways of escaping the jail. On POSIX systems, all non-networked devices are accessible as files within the filesystem. Even memory is accessible via a file (/dev/mem). An intruder in a jail can create a memory device (character device, major number = 1, minor number = 1):

mknod mem c 1 1

With the memory device, the attacker can patch system memory to change the root directory of the jail. More simply, an attacker can create a block device with the same device numbers as that of the main file system. For example, the root file system on one of my Linux systems is /dev/sda1 with a major number of 8 and a minor number of 1. An attacker can recreate that in the jail:

mknod rootdisk b 8 1

and then mount it as a file system within the jail:

mount -t ext4 rootdisk myroot

Now the attacker, still in the jail, has full access to the entire file system, which is as good as being out of the jail. He can add user accounts, change passwords, delete log files, run any commands, and even reboot the system to get a clean login.

FreeBSD Jails

Chroot was good in confining the namespace of an application but useless against providing security if an application had root access and did nothing to restrict access to other resources.

FreeBSD Jails are an enhancement to the idea of chroot. Jails provide a restricted filesystem namespace, just like chroot does, but also place restrictions on what processes are allowed to do within the jail, including selectively removing privileges from the root user in the jail. For example, processes within a jail may be configured to:

  • Bind only to sockets with a specified IP address and specific ports
  • Communicate only with other processes within the jail and none outside
  • Not be able to load kernel modules, even if root
  • Have restricted access to system calls that include:
    • Ability to create raw network sockets
    • Ability to create devices
    • Modify the network configuration
    • Mount or unmount filesystems

FreeBSD Jails are a huge improvement over chroot since known escapes, such as creating devices and mounting filesystems and even rebooting the system are disallowed. Depending on the application, policies may be coarse. The changed root provides all or nothing access to a part of the file system. This does not make Jails suitable for applications such as a web browser, which may be untrusted but may need access to files outside of the jail. Think about web-based applications such as email, where a user may want to upload or download attachments. Jails also do not prevent malicious apps from accessing the network and trying to attack other machines … or from trying to crash the host operating system. Moreover, FreeBSD Jails is a BSD-only solution. With an estimated 0.95…1.7% share of server deployments, it is a great solution on an operating system that is not that widely used.

Linux namespaces, cgroups, and capabilities

Linux’s answer to FreeBSD Jails was a combination of three elements: control groups, namespaces, and capabilities.

Control groups (cgroups)

Linux control groups, also called cgroups, allow you to allocate resources such as CPU time, system memory, disk bandwidth, network bandwidth, and the ability to monitor resource usage among user-defined groups of processes. This allows, for example, an administrator to allocate a larger share of the processor to a critical server application.

An administrator creates one or more cgroups and assigns resource limits to each of them. Then any application can be assigned to a control group and will not be able to use more than the resource limits configured in that control group. Applications are unaware of these limits. Control groups are organized in a hierarchy similar to processes. Child cgroups inherit some attributes from the parents.

Linux namespaces

Chroot only restricted the filesystem namespace. The filesystem namespace is the best known namespace in the system but not the only one. Linux namespaces Namespaces provide control over how processes are isolated in the following namespaces:

Namespace Description Controls
IPC System V IPC, POSIX message queues Objects created in an IPC namespace are only visible to other processes in that namespace (CLONE_NEWIPC)
Network Network devices, stacks, ports Isolates IP protocol stacks, IP routing tables, firewalls, socket port numbers ( CLONE_NEWNET)
Mount Mount points A set of processes can have their own distinct mount points and view of the file system (CLONE_NEWNS)
PID Process IDs Processes in different PID namespaces can have their process IDs – the child cannot see parent processes or other namespaces (CLONE_NEWPID)
User User & group IDs Per-namespace user/group IDs. Also, you can be root in a namespace but have restricted privileges ( CLONE_NEWUSER )
UTS host name and domain name setting hostname and domainname will not affect rest of the system (CLONE_NEWUTS)
Cgroup control group Sets a new control group for a process (CLONE_NEWCGROUP)

A process can dissociate any or all of these namespaces from its parent via the unshare system call. For example, by unsharing the PID namespace, a process gets a no longer sees other processes and will only see itself and any child processes it creates.

The Linux clone system call is similar to fork in that it creates a new process. However, it allows you to pass flags that will specify which parts of the execution context will be shared with the parent. For example, a cloned process may choose to share memory and open file descriptors, which will make it behave like threads. It can also choose to share – or not – any of the elements of the namespace.

Capabilities

A problem that FreeBSD Jails tackled was that of restricting the power of root inside a Jail. You could be a root user but still disallowed from executing certain system calls. POSIX (Linux) capabilities[2] tackle this issue as well.

Traditionally, Unix systems distinguished privileged versus unprivileged processes. Privileged processes were those that ran with a user ID of 0, called the root user. When running as root, the operating system would allow access to all system calls and all access permission checks were bypassed. You could do anything.

Linux capabilities identify groups of operations, called capabilities, that can be controlled independently on a per-thread basis. The list is somewhat long, 38 groups of controls, and includes capabilities such as:

  • CAP_CHOWN: make arbitrary changes to file UIDs and GIDs
  • CAP_DAC_OVERRIDE: bypass read/write/execute checks
  • CAP_KILL: bypass permission checks for sending signals
  • CAP_NET_ADMIN: network management operations
  • CAP_NET_RAW: allow RAW sockets
  • CAP_SETUID: arbitrary manipulation of process UIDs
  • CAP_SYS_CHROOT: enable chroot

The kernel keeps track of four capability sets for each thread. A capability set is a list of zero or more capabilities. The sets are:

  • Permitted: If a capability is not in this set, the thread or its children can never require that capability. This limits the power of what a process and its children can do.

  • Inheritable: These capabilities will be inherited when a thread calls execve to execute a program (POSIX programs are executed with the same thread; we are not creating a new process)

  • Effective: This is the current set of capabilities that the thread is using. The kernel uses these to perform permission checks.

  • Ambient: This is similar to Inheritable and contains a set of capabilities that are preserved across an execve of a program that is not privileged. If a setuid or setgid program is run, will clear the ambient set. These are created to allow a partial use of root features in a controlled manner. It is useful for user-level device drivers or software that needs a specific privilege (e.g., for certain networking operations).

A child process created via fork (the standard way of creating processes) will inherit copies of its parent’s capability sets following the rules of which capabilities have been marked as inheritable.

A set of capabilities can be assigned to an executable file by the administrator. They are stored as a file’s extended attributes (along with access control lists, checksums, and arbitrary user-defined name-value pairs). When the program runs, the executing process may further restrict the set of capabilities under which it operates if it chooses to do so (for example, after performing an operation that required the capability and knowing that it will no longer need to do so).

From a security point of view, the key concept of capabilities is that they allow us to provide limited elevation of privileges to a process. A process does not need to run as root (user ID 0) but can still be granted very specific privileges. For example, we can grant the ping command the ability to access raw sockets so it can send an ICMP ping message on the network but not have any other administrative powers. The application does not need to run as root and even if an attacker manages to inject code, the opportunities for attack will be restricted.

The Linux combination of cgroups, namespaces, and capabilities provides a powerful set of mechanisms to

  1. Set limits on the system resources (processor, disk, network) that a group of processes will use.

  2. Constrain the namespace, making parts of the filesystem or the existence of other processes or users invisible.

  3. Give restricted privileges to specific applicatiosn so they do not need to run as root.

This enables us to create stronger jails and have a fine degree of control as to what processes are or are not allowed to do in that jail.

While bugs have been found these mechanisms, the more serious problem is that of comprehension. The system has become far, far more complex than it was in the days of chroot. A user has to learn quite a lot to use these mechanisms properly. Failure to understand their behavior fully can create vulnerabilities. For example, namespaces do not prohibit a process from making privileged system calls. They simply limit what a process can see. A process may not be able to send a kill signal to another process only because it does not share the same process ID namespace.

Together with capabilities, namespaces allow a restricted environment that also places limits on the abilities to perform operations even if a process is granted root privileges. This enables ordinary users to create namespaces. You can create a namespace and even create a process running as a root user (UID 0) within that namespace but it will have no capabilities beyond those that were granted to the user; the user ID of 0 gets mapped by the kernel to a non-privileged user.

Containers

Software rarely lives as an isolated application. Some software requires multiple applications and most software relies on the installation of other libraries, utilities, and packages. Keeping track of these dependencies can be difficult. Worse yet, updating one shared component can sometimes cause another application to break. What was needed was a way to isolate the installation, execution, and management of multiple software packages that run on the same system.

Various attempts were undertaken to address these problems.

  1. The most basic was to fix problems when they occurred. This required carefully following instructions for installation, update, and configuration of software and extensive testing of all services on the system when anything changed. Should something break, the service would be unavailable until the problems were fixed.

  2. A drastic, but thorough, approach to isolation was to simply run each service on its own computer. That avoids conflicts in library versions and other dependencies. However, it is an expensive solution, is cumbersome, and is often overkill in most environments.

  3. Finally, administrators could deploy virtual machines. This is a technology that allows one to run multiple operating systems on one computer and gives the illusion of services running on distinct systems. However, this is a heavyweight solution. Every service needs its own installation of the operating system and all supporting software for the service as well as standard services (networking, device management, shell, etc.). It is not efficient in terms of CPU, disk, or memory resources – or even administration effort.

Containers are a mechanism that were originally created not for security but to make it easy to package, distribute, relocate, and deploy collections of software. The focus of containers is not to enable end users to install and run their favorite apps but rather for administrators to be able to deploy a variety of services on a system. A container encapsulates all the necessary software for a service, all of its dependencies, and its configuration into one package that can be easily passed around, installed, and removed.

In many ways, a container feels like a virtual machine. Containers provide a service with a private process namespace, its own network interface, and its own set of libraries to avoid problems with incompatible versions used by other software. Containers also allow an administrator to give the service restricted powers even if it runs with root (administrator) privileges. Unlike a virtual machine, however, multiple containers on one system all share the same operating system and kernel modules.

Containers are not a new mechanism. They are implemented using Linux’s control groups, namespaces, and capabilities to provide resource control, isolation, and privilege control, respectively. They also make use of a copy on write (CoW) file system. This makes it easy to create new containers where the file system can track the changes made by that container over a clean base version of a file system. Having multiple containers running the same services becomes extremely efficient. Containers can also take advantage of AppArmor, which is a Linux kernel module that provides a basic form of mandatory access controls based on the pathnames of files. It allows an administrator to restrict the ability of a program to access specific files even within its file system namespace.

The best-known container framework is Docker. A Docker Image is a file format that creates a package of applications, their supporting libraries, and other needed files. This image can be stored and deployed on many environments. Docker made it easy to deploy containers using git-like commands (docker push, docker commit) and also to perform incremental updates. By using a copy on write file system, Docker images can be kept immutable (read-only) while any changes to the container during its execution are stored separately.

As people found Docker to be useful, the next design goal was to make it easier to manage containers across a network of many computers. This is called container orchestration. There are many solutions for this, including Apache Mesos, Kubernetes, Nomad, and Docker Swarm. The best known of these is kubernetes, which was designed by Google. It coordinates storage of containers, failure of hardware and containers, and dynamic scaling: deploying the container on more machines to handle increased load. Kubernetes is coordination software, not a container system; it uses the Docker framework to run the actual container.

Even though containers were designed to simplify software deployment rather than provide security to services, they do offer several benefits in the area of security:

  • They make use of namespaces, cgroups, and capabilities with restricted capabilities configured by default. This provides isolation among containers.

  • Containers provide a strong separation of policy (defined by the container configuration) from the enforcement mechanism (handled by the operating system).

  • They improve availability by providing the ability to have a watchdog timer monitor the running of applications and restarting them if necessary. With orchestration systems such as Kubernetes, containers can be re-deployed on another system if a computer fails.

  • The environment created by a container is reproducible. The same container can be deployed on multiple systems and tested in different environments. This provides consistency and aids in testing and ensuring that the production deployment matches the one used for development and test. Moreover, it is easy to inspect exactly how a container is configured. This avoids problems encountered by manual installation of components where an administrator may forget to configure something or may install different versions of a required library.

  • While containers add nothing new to security, they help avoid comprehension errors. Even default configurations will provide improved security over the defaults in the operating system and configuring containers is easier than learning and defining the rules for capabilities, control groups, and namespaces. Administrators are more likely to get this right or import containers that are already configured with reasonable restrictions.

Containers are not a security panacea. Because all containers run under the same operating system, any kernel exploits can affect the security of all containers. Similarly, any denial of service attacks, whether affecting the network or monopolizing the processor, will impact all containers on the system. If implemented and configured properly, capabilities, namespaces, and control groups should ensure that privilege escalation cannot take place. However, bugs in the implementation or configuration may create a vulnerability. Finally, one has to be concerned with the integrity of the container itself. Who configured it, who validated the software inside of it, and is there a chance that it may have been modified by an adversary either at the server or in transit?

Virtual Machines

As a general concept, virtualization is the addition of a layer of abstraction to physical devices. With virtual memory, for example, a process has the impression that it owns the entire memory address space. Different processes can all access the same virtual memory location and the memory management unit (MMU) on the processor maps each access to the unique physical memory locations that are assigned to the process.

Process virtual machines present a virtual CPU that allows programs to execute on a processor that does not physically exist. The instructions are interpreted by a program that simulates the architecture of the pseudo machine. Early pseudo-machines included o-code for BCPL and P-code for Pascal. The most popular pseudo-machine today is the Java Virtual Machine (JVM). This simulated hardware does not even pretend to access the underlying system at a hardware level. Process virtual machines will often allow “special” calls to invoke system functions or provide a simulation of some generic hardware platform.

Operating system virtualization is provided by containers, where a group of processes is presented with the illusion of running on a separate operating system but in reality shares the operating system with other groups of processes – they are just not visible to the processes in the container.

System virtual machines allow a physical computer to act like several real machines with each machine running its own operating system (on a virtual machine) and applications that interact with that operating system. The key to this machine virtualization is to not allow each operating system to have direct access to certain privileged instructions in the processor. These instructions would allow an operating system to directly access I/O ports, MMU settings, the task register, the halt instruction and other parts of the processor that could interfere with the processor’s behavior and with the other operating systems on the system. Instead, a trap and emulate approach is used. Privileged instructions, as well as system interrupts, are caught by the Virtual Machine Monitor (VMM), also known as a hypervisor. The hypervisor arbitrates access to physical resources and presents a set of virtual device interfaces to each guest operating system (including the memory management unit, I/O ports, disks, and network interfaces). The hypervisor also handles preemption. Just as an operating system may suspend a process to allow another process to run, the hypervisor will suspend an operating system to give other operating systems a chance to run.

The two configurations of virtual machines are hosted virtual machines and native virtual machines. With a hosted virtual machine (also called a type 2 hypervisor), the computer has a primary operating system installed that has access to the raw machine (all devices, memory, and file system). This host operating system does not run in a virtual environment. One or more guest operating systems can then be run on virtual machines. The VMM serves as a proxy, converting requests from the virtual machine into operations that get sent to and executed on the host operating system. A native virtual machine (also called a type 1 hypervisor) is one where there is no “primary” operating system that owns the system hardware. The hypervisor is in charge of access to the devices and provides each operating system drivers for an abstract view of all the devices.

Security implications

Unlike app confinement mechanisms such as jails, containers, or sandboxes, virtual machines enable isolation all the way through the operating system. A compromised application, even with escalated privileges, can wreak havoc only within the virtual machine. Even compromises to the operating system kernel are limited to that virtual machine. However, a compromised virtual machine is not much different form having a compromised physical machine sitting inside your organization: not desirable and capable of attacking other systems in your environment.

Multiple virtual machines are usually deployed on one physical system. In cases such as cloud services (e.g., such as those provided by Amazon), a single physical system may host virtual machines from different organizations or running applications with different security requirements. If a malicious application on a highly secure system can detect that it is co-resident on a computer that is hosting another operating system and that operating system provides fewer restrictions, the malware may be able to create a covert channel to communicate between the highly secure system with classified data and the more open system. A covert channel is a general term to describe the the ability for processes to communicate via some hidden mechanism when they are forbidden by policy to do so. In this case, the channel can be created via a side channel attack. A side channel is the ability to get or transmit information using some aspects of a system’s behavior, such as changes in power consumption, radio emissions, acoustics, or performance. For example, processes on both systems, even though they are not allowed to send network messages, may create a means of communicating by altering and monitoring system load. The malware on the classified VM can create CPU-intensive task at specific times. Listener software on the unclassified VM can do CPU-intensive tasks at a constant rate and periodically measure their completion times. These completion times may vary based on whether the classified system is doing CPU-intensive work. The variation in completion times creates a means of sending 1s and 0s and hence transmitting a message.


  1. Note that Wikipedia and many other sites refer to this as “Version 7 Unix”. Unix has been under continuous evolution at Bell Labs from 1969 through approximately 1989. As such, it did not have versions. Instead, an updated set of manuals was published periodically. Installations of Unix have been referred to by the editions of their manuals.  ↩

  2. Linux capabilities are not to be confused with the concept of capability lists, which are a form of access control that Linux does not use).  ↩

Last modified November 25, 2020.
recycled pixels