
Linux containers
A Linux container is made up of several building blocks, the two most important of which are namespaces and control groups (cgroups). Both are Linux kernel features. Namespaces provide logical partitions of certain kinds of system resources, such as mount points (mnt), process IDs (pid), and the network stack (net). To better understand the concept of isolation, let's look at some simple examples using the pid namespace. The following examples were run on Ubuntu 18.04.1 with util-linux 2.31.1.
When we type ps axf in our Terminal, we'll see a long list of running processes:
$ ps axf
PID TTY STAT TIME COMMAND
2 ? S 0:00 [kthreadd]
4 ? I< 0:00 \_ [kworker/0:0H]
5 ? I 0:00 \_ [kworker/u2:0]
6 ? I< 0:00 \_ [mm_percpu_wq]
7 ? S 0:00 \_ [ksoftirqd/0]
...
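Let's now enter a new pid namespace with unshare, which runs a program with selected namespaces detached from those of its parent. Here, --pid creates the new pid namespace, --fork runs /bin/sh as a child of unshare so that it becomes the first process in that namespace, and --mount-proc mounts a fresh /proc so that ps reports only the processes of the new namespace. We'll then check the processes again: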
$ sudo unshare --fork --pid --mount-proc=/proc /bin/sh
# ps axf
PID TTY STAT TIME COMMAND
1 pts/0 S 0:00 /bin/sh
2 pts/0 R+ 0:00 ps axf
You'll find that the PID of the shell process in the new namespace is 1 and that all other processes are no longer visible. This means you've successfully created a pid container. Let's switch to another session outside the namespace and list the processes again:
$ ps axf ## from another terminal
PID TTY STAT TIME COMMAND
...
1260 pts/0 Ss 0:00 \_ -bash
1496 pts/0 S 0:00 | \_ sudo unshare --fork --pid --mount-proc=/proc /bin/sh
1497 pts/0 S 0:00 | \_ unshare --fork --pid --mount-proc=/proc /bin/sh
1498 pts/0 S+ 0:00 | \_ /bin/sh
1464 pts/1 Ss 0:00 \_ -bash
...
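From the host, you can still see every process, including the shell running inside the new namespace, where it appears under its host PID (1498) rather than PID 1. pid namespace isolation therefore works in one direction: processes in a child namespace can't see processes in the parent or in sibling namespaces, whereas the parent namespace can see everything beneath it. However, namespaces only restrict what a process can see, not how much it can consume. If one process uses a considerable amount of a system resource, such as memory, it could cause the system to run out of that resource and become unstable. In other words, an isolated process could still disrupt other processes or even crash the whole system if we don't impose resource usage restrictions on it.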
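One way to check which pid namespace a process belongs to is to read the symbolic link that the kernel exposes at /proc/<pid>/ns/pid; two processes share a pid namespace when both links resolve to the same identifier. The following is a minimal sketch of that comparison. The helper name same_pid_ns is hypothetical, and reading the link of a process owned by another user (such as the root-owned shell started through sudo) generally requires root:

```shell
# same_pid_ns PID_A PID_B
# Exit status 0 when both processes are in the same pid namespace,
# 1 when they are not, 2 when a namespace link cannot be read.
# Hypothetical helper; it relies only on the /proc/<pid>/ns/pid symlinks.
same_pid_ns() {
    ns_a=$(readlink "/proc/$1/ns/pid") || return 2
    ns_b=$(readlink "/proc/$2/ns/pid") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# A process always shares a pid namespace with itself.
if same_pid_ns "$$" "$$"; then
    echo "same pid namespace"
fi
```

Run as root on the host, comparing the PID of the host shell with 1498 (the /bin/sh inside the namespace in the preceding listing) should report a mismatch, since that shell lives in the child namespace.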
The following diagram illustrates the PID namespaces and how an Out-Of-Memory (OOM) event can affect other processes outside a child namespace. The numbered blocks are the processes in the system, and the numbers are their PIDs. Blocks with two numbers are processes running in the child namespace, where the second number represents their PIDs in the child namespace. In the upper part of the diagram, there's still free memory available in the system. Later on, however, in the lower part of the diagram, the processes in the child namespace exhaust the remaining memory in the system. Due to the lack of free memory, the host kernel then invokes the OOM killer to release memory, the victims of which are likely to be processes outside the child namespace. In the example here, processes 8 and 13 in the system are killed:

This is where cgroups come in: they limit resource usage. Like namespaces, cgroups cover different kinds of system resources. Let's continue from our pid namespace, generate some load on the CPU with yes > /dev/null, and then monitor it with top:
## in the container terminal
# yes > /dev/null & top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2 root 20 0 7468 788 724 R 99.7 0.1 0:15.14 yes
1 root 20 0 4628 780 712 S 0.0 0.1 0:00.00 sh
3 root 20 0 41656 3656 3188 R 0.0 0.4 0:00.00 top
Our CPU load reaches 100%, as expected. Let's now limit it with the cpu cgroup. cgroups are organized as folders under /sys/fs/cgroup/. First, we need to switch to the host session:
## on the host session
$ ls /sys/fs/cgroup
blkio cpu cpuacct cpu,cpuacct cpuset devices freezer hugetlb memory net_cls net_cls,net_prio net_prio perf_event pids rdma systemd unified
Each folder represents the resource it controls. It's pretty easy to create a cgroup and control processes with it: just create a folder with any name under the resource type and append the IDs of the processes you'd like to control to the tasks file inside it. Here, we want to throttle the CPU usage of our yes process, so we find out its PID and create a new folder under cpu. Note that we need the PID as seen from the host (1658 here), not the one shown inside the namespace (2):
## also on the host terminal
$ ps ax | grep yes | grep -v grep
1658 pts/0 R 0:42 yes
$ sudo mkdir /sys/fs/cgroup/cpu/box && \
echo 1658 | sudo tee /sys/fs/cgroup/cpu/box/tasks > /dev/null
We've just added yes to the newly created box cpu cgroup, but no policy has been set yet, so the process still runs without any restrictions. Set a limit by writing a quota, in microseconds of CPU time per scheduling period, into cpu.cfs_quota_us, and then check the CPU usage again:
$ echo 50000 | sudo tee /sys/fs/cgroup/cpu/box/cpu.cfs_quota_us > /dev/null
## go back to namespaced terminal, check stats with top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2 root 20 0 7468 748 684 R 50.3 0.1 6:43.68 yes
1 root 20 0 4628 784 716 S 0.0 0.1 0:00.00 sh
3 root 20 0 41656 3636 3164 R 0.0 0.4 0:00.08 top
The CPU usage drops from nearly 100% to around 50%, meaning that our CPU throttle works.
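The 50% figure follows from how the CFS bandwidth controller works: a cgroup may consume cpu.cfs_quota_us microseconds of CPU time in every cpu.cfs_period_us microseconds, and the period defaults to 100000. The following sketch performs that arithmetic. The helper name cpu_limit_percent is hypothetical, and the default period is an assumption that holds only if cpu.cfs_period_us in the box group was left untouched:

```shell
# cpu_limit_percent QUOTA_US [PERIOD_US]
# Prints the share of one CPU, in percent, that a cgroup may use.
# Hypothetical helper; PERIOD_US defaults to 100000, the kernel default
# for cpu.cfs_period_us (an assumption if the file has been changed).
cpu_limit_percent() {
    quota=$1
    period=${2:-100000}
    if [ "$quota" -lt 0 ]; then
        # A negative quota (-1) means that no limit is applied.
        echo unlimited
    else
        echo $(( quota * 100 / period ))
    fi
}

cpu_limit_percent 50000    # prints 50
```

A quota of -1, the initial value, means no limit, and a quota larger than the period lets the group use more than one CPU on a multi-core machine.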
The previous two examples show how a Linux container isolates system resources and limits their usage. By placing more constraints on an application, we can build a fully isolated box, including filesystems and networks, without encapsulating an operating system within it.
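The cgroup steps from the second example can be gathered into one small helper. This is only a sketch that assumes the cgroup v1 layout used in this section; the function name limit_cpu and the CGROUP_ROOT variable are hypothetical, introduced so that the logic can be tried against a scratch directory before pointing it at /sys/fs/cgroup, where it must run as root:

```shell
# limit_cpu NAME PID QUOTA_US
# Creates the cpu cgroup NAME, moves PID into it, and sets its CFS quota.
# CGROUP_ROOT is a hypothetical override; it defaults to the real hierarchy.
limit_cpu() {
    group="${CGROUP_ROOT:-/sys/fs/cgroup}/cpu/$1"
    mkdir -p "$group" || return 1
    # Writing a PID to tasks attaches that process to the group.
    echo "$2" > "$group/tasks" || return 1
    # The quota is microseconds of CPU time allowed per scheduling period.
    echo "$3" > "$group/cpu.cfs_quota_us"
}

# Dry run against a temporary directory instead of the live hierarchy.
CGROUP_ROOT=$(mktemp -d)
limit_cpu box 1658 50000
cat "$CGROUP_ROOT/cpu/box/cpu.cfs_quota_us"
```

In the dry run, the files are ordinary files and nothing is throttled. On the host, running limit_cpu box 1658 50000 as root with CGROUP_ROOT unset would repeat the steps of the walkthrough.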