Linux containers

A Linux container is made up of several building blocks, the two most important of which are namespaces and control groups (cgroups). Both are Linux kernel features. Namespaces provide logical partitions of certain kinds of system resources, such as mount points (mnt), process IDs (pid), and the network stack (net). To understand this isolation better, let's look at some simple examples using the pid namespace. The following examples were run on Ubuntu 18.04.1 with util-linux 2.31.1.
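
One way to see which namespaces a process belongs to is to look under /proc/<pid>/ns, where the kernel exposes one symbolic link per namespace type. The following is a minimal sketch that prints the identifiers for the three namespace types mentioned above; the helper name show_namespaces is only illustrative.

```shell
#!/bin/sh
# Print the namespace identifier for a few namespace types of a process.
# Usage: show_namespaces [PID]   (defaults to the current shell)
show_namespaces() {
    pid="${1:-$$}"
    for ns in mnt pid net; do
        # Each entry is a symlink whose target looks like "pid:[<inode number>]"
        printf '%s -> %s\n' "$ns" "$(readlink "/proc/$pid/ns/$ns")"
    done
}

show_namespaces
```

Two processes share a namespace of a given type when the number in brackets is the same for both.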

When we type ps axf in our Terminal, we'll see a long list of running processes:

$ ps axf
  PID TTY  STAT TIME COMMAND
    2 ?    S    0:00 [kthreadd]
    4 ?    I<   0:00 \_ [kworker/0:0H]
    5 ?    I    0:00 \_ [kworker/u2:0]
    6 ?    I<   0:00 \_ [mm_percpu_wq]
    7 ?    S    0:00 \_ [ksoftirqd/0]
...

ps is a utility that reports the processes currently running on the system. ps axf lists all processes and draws them as a forest, that is, a tree showing their parent-child relationships.

Let's now enter a new pid namespace with unshare, which runs a program with selected namespaces detached (unshared) from those of its parent. We'll then check the processes again:

$ sudo unshare --fork --pid --mount-proc=/proc /bin/sh
# ps axf
  PID TTY      STAT   TIME COMMAND
    1 pts/0    S      0:00 /bin/sh
    2 pts/0    R+     0:00 ps axf

You'll find that the PID of the shell process in the new namespace is 1 and that all other processes have disappeared from view. This means you've successfully created a pid-isolated container. Let's switch to another session outside the namespace and list the processes again:

$ ps axf ## from another terminal
  PID TTY    STAT TIME COMMAND
 ...
 1260 pts/0  Ss   0:00 \_ -bash
 1496 pts/0  S    0:00 | \_ sudo unshare --fork --pid --mount-proc=/proc /bin/sh
 1497 pts/0  S    0:00 | \_ unshare --fork --pid --mount-proc=/proc /bin/sh
 1498 pts/0  S+   0:00 | \_ /bin/sh
 1464 pts/1  Ss   0:00 \_ -bash
 ...

From the host, you can still see the other processes as well as the shell process running inside the new namespace. The isolation works in one direction: processes inside the child pid namespace can't see processes outside it, whereas the parent namespace can still see everything running in its children. However, isolation alone doesn't limit consumption. If one process uses a considerable amount of a system resource, such as memory, it could cause the system to run out of that resource and become unstable. In other words, an isolated process could still disrupt other processes or even crash the whole system if we don't impose resource usage restrictions on it.
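
To relate the two views of the same process, the NSpid field in /proc/<pid>/status (available since Linux 4.1) lists the process's PID in each nested pid namespace, outermost first. The following is a minimal sketch; the helper name ns_pids is illustrative, and 1498 is simply the host PID of the namespaced /bin/sh from the listing above.

```shell
#!/bin/sh
# Print the PIDs a process has in every pid namespace it is a member of,
# starting with the namespace that the /proc mount being read belongs to.
# Usage: ns_pids /proc/<pid>/status
ns_pids() {
    # Drop the "NSpid:" label and print the remaining numbers.
    awk '/^NSpid:/ { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# On the host, for the namespaced shell from the listing above:
#   ns_pids /proc/1498/status
# would be expected to print the host PID followed by the PID that the
# process has inside the child namespace.

# For a process that lives only in the current namespace, one number is printed:
ns_pids "/proc/$$/status"
```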

The following diagram illustrates the pid namespaces and how an Out-Of-Memory (OOM) event can affect processes outside a child namespace. The numbered blocks are the processes in the system, and the numbers are their PIDs. Blocks with two numbers are processes in the child namespace, where the second number is their PID within that namespace. In the upper part of the diagram, there's still free memory available in the system. In the lower part, however, the processes in the child namespace have exhausted the remaining memory. Due to the lack of free memory, the host kernel invokes the OOM killer to release memory, and its victims are likely to be processes outside the child namespace. In the example here, processes 8 and 13 in the system are killed:

In light of this, cgroups are used to limit resource usage. Like namespaces, they cover different kinds of system resources, but instead of isolating what a process can see, they constrain how much it can use. Let's continue from our pid namespace, generate some load on the CPU with yes > /dev/null, and then monitor it with top:

## in the container terminal
# yes > /dev/null & top
PID USER PR  NI   VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  2 root 20   0   7468  788  724 R 99.7  0.1  0:15.14 yes
  1 root 20   0   4628  780  712 S  0.0  0.1  0:00.00 sh
  3 root 20   0  41656 3656 3188 R  0.0  0.4  0:00.00 top

Our CPU load reaches 100%, as expected. Let's now limit it with the cpu cgroup controller. cgroups are organized as directories under /sys/fs/cgroup/. First, we need to switch to the host session:

## on the host session
$ ls /sys/fs/cgroup
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified
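
The listing above is typical of a cgroup v1 layout, with one directory per controller, which is what the examples in this section assume. Hosts that mount the unified cgroup v2 hierarchy use different file names, so the tasks and cpu.cfs_quota_us files used here won't be present there. The following is a small sketch for checking which layout a host uses, based on the filesystem type that stat reports for the mount point; the helper name cgroup_version_of is illustrative.

```shell
#!/bin/sh
# Map the filesystem type of the cgroup mount point to a cgroup version.
# "cgroup2fs" indicates the unified v2 hierarchy; "tmpfs" is what a
# v1 layout (one directory per controller) reports.
cgroup_version_of() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# stat -f queries the filesystem; %T prints its type in human-readable form.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null) || fstype=""
echo "cgroup hierarchy: $(cgroup_version_of "$fstype")"
```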

Each directory represents the resource it controls. It's pretty easy to create a cgroup and control processes with it: just create a directory with any name under the resource type and append the IDs of the processes you'd like to control to its tasks file. Here, we want to throttle the CPU usage of our yes process, so we find out its PID as seen from the host and then create a new directory under cpu:

## also on the host terminal
$ ps ax | grep yes | grep -v grep
 1658 pts/0    R      0:42 yes
$ sudo mkdir /sys/fs/cgroup/cpu/box && \
  echo 1658 | sudo tee /sys/fs/cgroup/cpu/box/tasks > /dev/null
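
To confirm that a process has actually joined a group, /proc/<pid>/cgroup can be read. On a cgroup v1 host, each line has the form hierarchy-ID:controller-list:path. The following is a minimal sketch that extracts the path for one controller; cgroup_path_of is an illustrative helper rather than a standard tool, and 1658 is the PID of yes found above.

```shell
#!/bin/sh
# Print the cgroup path of a process for a given v1 controller.
# Usage: cgroup_path_of CONTROLLER /proc/<pid>/cgroup
cgroup_path_of() {
    awk -F: -v ctrl="$1" '
        {
            # The second field may name several co-mounted controllers,
            # for example "cpu,cpuacct".
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == ctrl) print $3
        }
    ' "$2"
}

# After the tee command above, this would be expected to print /box:
#   cgroup_path_of cpu /proc/1658/cgroup

# For the current shell (prints nothing if no cpu controller line exists):
cgroup_path_of cpu "/proc/$$/cgroup"
```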

We've just added the yes process to the newly created box cgroup under cpu, but no policy has been set yet, so the process still runs without any restrictions. Set a limit by writing the desired value into the corresponding file and check the CPU usage again:

$ echo 50000 | sudo tee /sys/fs/cgroup/cpu/box/cpu.cfs_quota_us > /dev/null

## go back to the namespaced terminal and check the stats with top
# top
PID USER PR  NI   VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  2 root 20   0   7468  748  684 R 50.3  0.1  6:43.68 yes
  1 root 20   0   4628  784  716 S  0.0  0.1  0:00.00 sh
  3 root 20   0  41656 3636 3164 R  0.0  0.4  0:00.08 top

The CPU usage of yes drops to about 50%, which shows that our CPU throttle works.
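
The roughly 50% reading in top follows from how the CFS bandwidth limit is expressed: cpu.cfs_quota_us is the CPU time, in microseconds, that the group may consume within each period of cpu.cfs_period_us, and the period defaults to 100000 (100 ms). A quota of 50000 therefore allows half of one CPU. The arithmetic can be sketched as follows; the helper name cpu_limit_percent is illustrative.

```shell
#!/bin/sh
# Convert a CFS quota/period pair into a percentage of one CPU.
# A negative quota (-1 is the default) means that no limit is set.
# Usage: cpu_limit_percent QUOTA_US [PERIOD_US]
cpu_limit_percent() {
    quota="$1"
    period="${2:-100000}"   # default period: 100000 us
    if [ "$quota" -lt 0 ]; then
        echo "unlimited"
    else
        echo $(( quota * 100 / period ))
    fi
}

cpu_limit_percent 50000 100000    # prints 50
cpu_limit_percent 200000 100000   # prints 200, that is, two full CPUs
cpu_limit_percent -1              # prints unlimited
```

On the host, the two arguments could be read from cpu.cfs_quota_us and cpu.cfs_period_us in /sys/fs/cgroup/cpu/box.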

The previous two examples show how a Linux container isolates and limits system resources. By placing more constraints on an application, we can build a fully isolated box, including filesystems and networks, without encapsulating an operating system within it.