Docker and Container Isolation

Sigrid Jin
25 min read · Aug 31, 2024


Docker is an open-source platform that provides a virtual execution environment for running software. This virtual execution environment is called a container or containerized process.

Containers bundle an application together with all of its dependencies, so the code runs consistently regardless of what is installed on the host. In other words, applications packaged in containers can run anywhere Docker is installed.

The Docker Client is the primary user interface for interacting with Docker. It’s what most users will directly engage with when working with Docker.

  • Command-Line Interface: Users typically interact with Docker through CLI commands like docker run, docker build, docker pull, etc.
  • Communication Protocol: The client communicates with the Docker daemon using a REST API, which can be over a UNIX socket or a network interface.
  • Multiple Clients: A single Docker daemon can handle requests from multiple Docker clients, allowing for flexible management scenarios.
  • Remote Connections: The Docker client can connect to a remote Docker daemon, enabling management of Docker containers on different machines.
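To illustrate the last two points, the same CLI can be pointed at a different daemon without changing the commands themselves. A minimal sketch (the SSH user and host below are placeholders):

# Point the client at a remote daemon over SSH (remote-host is a placeholder)
export DOCKER_HOST=ssh://user@remote-host
docker ps                                   # lists containers on the remote machine

# Or select the target per command with -H
docker -H unix:///var/run/docker.sock info  # back to the local daemon's socket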

The Docker daemon (dockerd) is the heart of Docker operations. It’s a background service running on the host system that manages Docker objects.

  • API Server: It listens for Docker API requests and processes them.
  • Object Management: The daemon is responsible for creating and managing Docker objects such as images, containers, networks, and volumes.
  • Container Lifecycle: It handles the entire lifecycle of containers from creation to deletion.
  • Image Building: When you run a docker build command, the daemon processes the Dockerfile and creates a new image.
  • Inter-daemon Communication: In a Docker swarm setup, daemons can communicate with each other to manage Docker services across multiple nodes.

Comparing Docker Containers and Virtual Machines

Virtual machines support hardware-level virtualization through a hypervisor on top of the host operating system. On top of this, a separate operating system (Guest OS) is installed, which operates independently and can run applications. Each virtual machine has its own independent operating system kernel, which can result in higher resource consumption.

Individual VMs use separate OSes, which provides better isolation (security) compared to Docker, but they have the disadvantage of being heavier, slower, and having more overhead. Unlike VMs, Docker containers don’t emulate hardware. Instead, they share the kernel of the host operating system. This means that each container runs as an independent process, but they all use the same operating system kernel, which supports OS-level virtualization and allocates individual user spaces.

To create virtualized spaces, Docker uses Linux features such as pivot_root, namespaces, and cgroups, providing process-level isolated environments and resources. Docker containers don’t include an operating system; they only contain the application and its necessary libraries and configurations. This makes them much lighter and faster than virtual machines. In other words, containers don’t have a Guest OS or hypervisor, which reduces overhead, allowing processes to run much more lightly and making container replication and deployment easier.
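You can observe the shared kernel directly: a container reports the same kernel release as its host.

# Both commands print the same kernel release, because the container shares the host kernel
uname -r
docker run --rm alpine uname -r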

Linux Process

A process is an instance of a running program. Each process has a unique PID (Process ID) and includes attributes such as memory space, file descriptors, user and group IDs, current working directory, and environment variables. Additionally, Linux manages processes in a tree structure, operating with parent and child process relationships. Processes run in user mode and kernel mode, and are allocated CPU and memory by the kernel’s scheduler. Processes are the basic unit that consumes CPU and memory, and the OS kernel manages each process’s resources via cgroups (covered later).

One of the most fascinating aspects of Linux systems, which is crucial for understanding how processes (and by extension, containers) work, is the /proc directory. This virtual filesystem provides a real-time glimpse into the heart of the Linux kernel and running processes.

$ docker exec -it 1ce17f355b70 /bin/bash

root@1ce17f355b70:/workspace# ls /proc
1 325 asound cgroups crypto driver filesystems ioports key-users kpagecount mdstat mounts pagetypeinfo scsi stat sysvipc uptime vmstat
1164 7 bootconfig cmdline devices dynamic_debug fs irq keys kpageflags meminfo mtd partitions self swaps thread-self version zoneinfo
1170 857 buddyinfo consoles diskstats execdomains interrupts kallsyms kmsg loadavg misc mtrr pressure slabinfo sys timer_list version_signature
187 acpi bus cpuinfo dma fb iomem kcore kpagecgroup locks modules net schedstat softirqs sysrq-trigger tty vmallocinfo

The /proc directory is a pseudo-filesystem that doesn't exist on disk. Instead, it's created dynamically by the Linux kernel to provide information about the system's state, running processes, hardware configurations, and more. This directory is a goldmine for system administrators, developers, and anyone interested in the inner workings of a Linux system :)

When you run ls /proc, you'll see a mix of numbered directories (each corresponding to a running process) and various files containing system information.

  1. Numbered directories: The numbers you see (like 1, 1164, 1170, 187, 325, 7, 857) represent Process IDs (PIDs). Each of these is a directory containing information about a specific running process.
  2. System information files: The other entries are files or directories that provide various system-wide information.
  3. Kernel information: Entries such as kallsyms, modules, and filesystems offer insights into kernel operations.
  4. Process-specific information: Directories like self and thread-self are symbolic links that processes can use to refer to their own /proc entries.
  5. Hardware information: Files like interrupts, ioports, and dma provide low-level hardware details.
Some of the most frequently consulted entries include:

  • loadavg: Shows system load averages
  • cpuinfo: Contains CPU information
  • meminfo: Provides memory usage details
  • uptime: Indicates how long the system has been running
  • version: Gives kernel version information
root@1ce17f355b70:/workspace# cat /proc/meminfo
MemTotal: 49187644 kB
MemFree: 5477920 kB
MemAvailable: 45807972 kB
Buffers: 1415864 kB
Cached: 37302992 kB
SwapCached: 22368 kB
Active: 10759492 kB
Inactive: 30014936 kB
Active(anon): 1917264 kB
Inactive(anon): 142452 kB
Active(file): 8842228 kB
Inactive(file): 29872484 kB
Unevictable: 64 kB
Mlocked: 64 kB
SwapTotal: 2097148 kB
SwapFree: 1996968 kB

root@1ce17f355b70:/workspace# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
0: 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 2-edge timer
8: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 8-edge rtc0
9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 9-fasteoi acpi
14: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 14-fasteoi INT34C6:00
16: 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 IR-IO-APIC 16-fasteoi i801_smbus
17: 0 0 0 590 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 17-fasteoi snd_hda_intel:card2, snd_hda_intel:card3
120: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 DMAR-MSI 0-edge dmar0
121: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-0000:00:01.0 0-edge PCIe PME, aerdrv, pcie-dpc

root@1ce17f355b70:/workspace# cat /proc/loadavg
0.00 0.00 0.00 1/987 1176

This /proc/loadavg file gives you a quick look at system load:

  • The first three numbers represent the system load averages for the past 1, 5, and 15 minutes.
  • The fourth field shows the number of currently runnable processes and the total number of processes.
  • The last number is the PID of the most recently created process.
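Since the file is a single line of space-separated fields, it is easy to parse; a minimal sketch in shell:

# Read the five fields of /proc/loadavg into named variables
read one five fifteen runnable_total last_pid < /proc/loadavg
echo "1-min load: $one, runnable/total: $runnable_total, newest PID: $last_pid"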

/proc/version file contains information about the Linux kernel version, including:

  • Kernel version number
  • GCC version used to compile the kernel
  • Build date
root@1ce17f355b70:/workspace# cat /proc/version
Linux version 6.5.0-44-generic (buildd@lcy02-amd64-103) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2

/proc/filesystems file lists all the filesystems that the kernel currently supports. It's particularly useful when working with containers, as you might need to ensure certain filesystem support is available.

root@1ce17f355b70:/workspace# cat /proc/filesystems
nodev sysfs
nodev tmpfs
nodev bdev
nodev proc
nodev cgroup
nodev cgroup2
nodev cpuset
nodev devtmpfs
nodev configfs
nodev debugfs
nodev tracefs
nodev securityfs
nodev sockfs
nodev bpf
nodev pipefs
nodev ramfs
nodev hugetlbfs
nodev devpts
ext3
ext2
ext4
squashfs
vfat
nodev ecryptfs
fuseblk
nodev fuse
nodev fusectl
nodev efivarfs
nodev mqueue
nodev pstore
nodev autofs
nodev binfmt_misc
nodev overlay

/proc/partitions file provides information about all recognized disk partitions, including:

  • Major and minor device numbers
  • Number of blocks
  • Partition names
root@1ce17f355b70:/workspace# cat /proc/partitions
major minor #blocks name

7 0 4 loop0
7 1 56996 loop1
7 2 76056 loop2
7 3 76024 loop3
7 4 67040 loop4
7 5 76056 loop5
7 6 508908 loop6
7 7 517212 loop7
259 0 500107608 nvme0n1
259 1 524288 nvme0n1p1
259 2 499581952 nvme0n1p2
259 3 976762584 nvme1n1
8 0 7814026584 sda
7 8 500 loop8
7 9 63380 loop9
7 10 93888 loop10
7 11 13236 loop11
7 12 12620 loop12
7 13 39664 loop13
7 14 39760 loop14
7 15 476 loop15
7 16 7212 loop16
7 17 276244 loop17
7 18 7216 loop18
7 19 275784 loop19

Here’s an example output of the docker info command.

The following shows information about the Docker client:

  • The Docker Engine version (27.0.3)
  • Installed plugins like Docker Buildx and Docker Compose
Client: Docker Engine - Community
Version: 27.0.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.15.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.28.1
Path: /usr/libexec/docker/cli-plugins/docker-compose

This section provides details about the Docker server (daemon):

  • Container statistics (running, paused, stopped)
  • Number of images
  • Server version
  • Storage Driver: In this case, it’s using overlay2, which is a union filesystem that allows Docker to efficiently manage image layers and container filesystems.
  • Cgroup information: This system is using Cgroup Version 2 with systemd as the Cgroup driver. Cgroups are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.
Server:
Containers: 44
Running: 4
Paused: 0
Stopped: 40
Images: 37
Server Version: 27.0.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2

The following shows:

  • Available plugins for volumes, networking, and logging
  • Swarm mode status
  • Container runtimes: The default runtime is runc, which is the reference implementation of OCI (Open Container Initiative) runtime-spec. It’s responsible for spawning and running containers.
  • Versions of key components like containerd and runc
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc version: v1.1.13-0-g58aa920
init version: de40ad0

The section provides:

  • Security options and profiles
  • Host system information (kernel, OS, architecture)
  • Hardware resources available to Docker
  • Docker’s root directory and other configuration details

Docker creates its own network interfaces and modifies the host’s network configuration. When you install Docker, it adds new network interfaces to your system. You can view these with the ip command. Note the docker0 interface, which is the default bridge network Docker creates.

$ ip -br -c addr
lo UNKNOWN 127.0.0.1/8 ::1/128
enp5s0 UP 10.0.0.38/24 fe80::af87:4f6f:dab3:52ed/64
docker0 UP 172.17.0.1/16 fe80::42:e9ff:feee:9670/64
veth0239435@if4 UP fe80::4c6c:45ff:fe71:bf5e/64
veth731e609@if6 UP fe80::3c68:e8ff:fe7a:e618/64
vethc12e350@if44 UP fe80::1003:a3ff:fee0:8ea3/64
br-2a79290d1214 UP 172.18.0.1/16 fe80::42:c9ff:fe16:c6b6/64
veth56336f4@if63 UP fe80::4422:e4ff:fe2b:8620/64

Docker also modifies the host’s routing table; the entries below show how traffic to the Docker networks (172.17.0.0/16 and 172.18.0.0/16) is routed through Docker-created interfaces.

$ ip -c route
default via 10.0.0.1 dev enp5s0 proto dhcp metric 100
10.0.0.0/24 dev enp5s0 proto kernel scope link src 10.0.0.38 metric 100
169.254.0.0/16 dev enp5s0 scope link metric 1000
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.18.0.0/16 dev br-2a79290d1214 proto kernel scope link src 172.18.0.1

Docker extensively uses iptables to manage network isolation and port forwarding; these rules handle traffic forwarding between containers and to/from the host system.

$ sudo iptables -t filter -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION-STAGE-1
-N DOCKER-ISOLATION-STAGE-2
-N DOCKER-USER
...

Checking the status of the docker service:

systemctl status docker -l --no-pager

Looking at Docker’s root directory:

tree -L 3 /var/lib/docker

Unix Domain Socket & Docker Sockets

TCP/IP: Process 1 -> Network Stack -> Loopback Interface -> Network Stack -> Process 2. Unix Domain Socket: Process 1 -> Kernel -> Process 2. (Diagram: https://www.verycosy.net/posts/2023/09/unix-domain-socket)

Unix Domain Sockets are a mechanism for Inter-Process Communication (IPC) within a single operating system. They provide an abstracted interface to protocol elements implemented in the OS kernel and appear on the filesystem as a special socket file. As the output below confirms, /var/run/docker.sock is indeed a socket file, and it is used to communicate with the Docker daemon.

As mentioned above, Docker uses a client-server architecture in which the Docker client communicates with the Docker daemon. By default this communication doesn’t occur over TCP; it uses a Unix Domain Socket instead. Unix Domain Sockets are significantly faster than TCP/IP for local inter-process communication because they avoid the overhead of the network protocol stack.

$ ls -l /run/docker.sock /var/run/docker.sock
srw-rw---- 1 root docker 0 Aug 13 15:25 /run/docker.sock
srw-rw---- 1 root docker 0 Aug 13 15:25 /var/run/docker.sock

$ file /var/run/docker.sock
/var/run/docker.sock: socket
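Because the daemon speaks plain HTTP over this socket, you can bypass the docker CLI entirely and query the Engine API with curl (assuming your user can read the socket):

# Query the Docker Engine API directly over the Unix socket
curl --unix-socket /var/run/docker.sock http://localhost/version
curl --unix-socket /var/run/docker.sock http://localhost/containers/json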

When working with Docker, it’s crucial to consider security implications. The official Docker documentation advises against managing Docker with root privileges due to potential security issues.

By default, the Unix socket used by Docker is owned by the root user and the docker group.

  1. The Docker daemon always runs as root.
  2. Any user with access to this socket effectively has root privileges on the host system.

This is why running Docker commands typically requires sudo or root privileges. The solution to manage Docker safely as a non-root user is to add your user to the docker group.

  1. Check if the docker group exists and if your user is a member:
$ getent group | grep docker
docker:x:999:sigridjineth

  2. If your user isn’t listed, add them to the docker group. Log out and log back in for the changes to take effect.

$ sudo usermod -aG docker $USER
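If you don’t want to log out, newgrp can pick up the new membership in the current session:

# Apply the new group membership in the current shell (spawns a subshell)
newgrp docker
docker ps   # should now work without sudo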

In certain scenarios, such as when running CI/CD pipelines with Jenkins, you might need to execute Docker commands from within a container. While Docker-in-Docker (DinD) is an option, it’s often discouraged due to performance, complexity, and security issues. Instead, we can use a technique called Docker-out-of-Docker (DooD).

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker ubuntu:latest bash

This command not only mounts the host’s Docker socket (/var/run/docker.sock) into the container, but also mounts the Docker binary (/usr/bin/docker) into it.

Now, within this container, you can run Docker commands that will be executed on the host’s Docker daemon.

docker info
docker run -d --rm --name webserver nginx:alpine
docker ps
docker rm -f webserver
docker ps -a

Container Isolation

Container isolation is a fundamental concept in containerization technologies like Docker. It allows multiple containerized applications to run on the same host system while remaining separated from each other and the host. This isolation is achieved through various Linux kernel features, but it all started with a simple command: chroot.

chroot, short for "change root," is a Unix operation that changes the apparent root directory for the current running process and its children. While not a complete containerization solution, chroot laid the groundwork for the isolation techniques used in modern container technologies.

  1. chroot changes the root directory (/) of a process to a specified directory.
  2. This creates a confined environment where the process cannot access files outside its new “root” directory.
  3. It provides a basic level of file system isolation, which is crucial for security and resource management.
  4. It only provides filesystem isolation, not process or network isolation.
  5. It’s possible to “break out” of a chroot environment, making it insufficient for strong security measures.
  6. It doesn’t provide resource limitation capabilities.

Despite these limitations, chroot was a significant step towards the containerization we know today. Let’s walk through a practical example of using chroot to create a basic isolated environment.

First, we’ll create a new directory to serve as our isolated root:

sudo su -
cd /tmp
mkdir myroot

We need to create the basic directory structure and copy essential files:

mkdir -p myroot/bin
cp /usr/bin/sh myroot/bin/
mkdir -p myroot/{lib64,lib/x86_64-linux-gnu}
cp /lib/x86_64-linux-gnu/libc.so.6 myroot/lib/x86_64-linux-gnu/
cp /lib64/ld-linux-x86-64.so.2 myroot/lib64

Let’s check our new root structure:

$ tree myroot

myroot
├── bin
│ └── sh
├── lib
│ └── x86_64-linux-gnu
│ └── libc.so.6
└── lib64
└── ld-linux-x86-64.so.2

4 directories, 3 files

Now, let’s enter our isolated environment. You’ll likely see an error because the ls command isn't available in our isolated environment. This demonstrates the isolation — we only have access to the files and commands we explicitly added to our new root.

chroot myroot /bin/sh

ls
/bin/sh: 1: ls: not found

exit

Let’s check the location and dependencies of the ls command, then copy ls and its dependencies into our myroot directory.

which ls
# Output: /usr/bin/ls

ldd /usr/bin/ls
# Output:
# linux-vdso.so.1 (0x00007ffe7db40000)
# libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007628bbca5000)
# libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007628bba00000)
# libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007628bb969000)
# /lib64/ld-linux-x86-64.so.2 (0x00007628bbd0b000)

------------------------------------------------

cp /usr/bin/ls myroot/bin/
cp /lib/x86_64-linux-gnu/{libselinux.so.1,libc.so.6,libpcre2-8.so.0} myroot/lib/x86_64-linux-gnu/
cp /lib64/ld-linux-x86-64.so.2 myroot/lib64

------------------------------------------------

tree myroot
# Output:
# myroot
# ├── bin
# │ ├── ls
# │ └── sh
# ├── lib
# │ └── x86_64-linux-gnu
# │ ├── libc.so.6
# │ ├── libpcre2-8.so.0
# │ └── libselinux.so.1
# └── lib64
# └── ld-linux-x86-64.so.2
#
# 4 directories, 6 files

Let’s enter our chroot environment again and explore. Inside the chroot environment, you can see the following.

# Check the current directory
pwd
# Output: /

# List files in the current directory
ls
# Output: bin lib lib64

# Try to move up the directory tree
cd ..
pwd
# Output: /

ls
# Output: bin lib lib64

Notice that even when we try to move up the directory tree with cd .., we remain in the root directory of our chroot environment. This demonstrates the isolation effect: the chroot environment sees /tmp/myroot as its root (/), while from the host’s point of view it is still just /tmp/myroot.

# Inside chroot
pwd
# Output: /

# Exit chroot
exit

# Outside chroot, in the host system
pwd
# Output: /tmp/myroot

Add the ps command and its dependencies to our chroot environment.

# Find ps location
which ps
# Output: /usr/bin/ps

# Check dependencies with ldd (output below is for /bin/sh; run ldd /usr/bin/ps the same way)
root@sigridjineth-Z590-VISION-G:~# ldd /bin/sh
linux-vdso.so.1 (0x00007ffc679a1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000079aa9c400000)
/lib64/ld-linux-x86-64.so.2 (0x000079aa9c817000)

# Copy ps and its dependencies
cp /usr/bin/ps /tmp/myroot/bin/
cp /lib/x86_64-linux-gnu/{libprocps.so.8,libc.so.6,libsystemd.so.0,liblzma.so.5,libgcrypt.so.20,libgpg-error.so.0,libzstd.so.1,libcap.so.2} /tmp/myroot/lib/x86_64-linux-gnu/
mkdir -p /tmp/myroot/usr/lib/x86_64-linux-gnu
cp /usr/lib/x86_64-linux-gnu/liblz4.so.1 /tmp/myroot/usr/lib/x86_64-linux-gnu/
cp /lib64/ld-linux-x86-64.so.2 /tmp/myroot/lib64/

# Also copy mount and mkdir commands (they might be useful)
cp /usr/bin/mount /tmp/myroot/bin/
cp /lib/x86_64-linux-gnu/{libmount.so.1,libc.so.6,libblkid.so.1,libselinux.so.1,libpcre2-8.so.0} /tmp/myroot/lib/x86_64-linux-gnu/
cp /usr/bin/mkdir /tmp/myroot/bin/
chroot myroot /bin/sh

# Inside chroot
ps
# Output: Error: /proc must be mounted
# To mount /proc at boot you need an /etc/fstab line like:
# proc /proc proc defaults
# In the meantime, mount /proc /proc -t proc

We receive an error message suggesting that we need to mount the proc filesystem. The ps command reads information about running processes from the proc filesystem. Without access to /proc, ps can't function correctly. chroot only changes the root directory for a process. It doesn't create a new instance of kernel data structures or mount points. This is why we can't see /proc in our chroot environment by default.

In a normal Linux system, procfs is automatically mounted. We can verify this using the mount command. This mounting is why commands like ps (which reads process information from /proc) work in the host system.

$ mount | grep proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=17224)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)

Now, let’s try to mount procfs in our chroot environment. We get an error because the /proc directory doesn't exist in our chroot environment. This illustrates an important point about isolation — our chroot environment starts with only the directories and files we explicitly added to it.

chroot myroot /bin/sh
# mount -t proc proc /proc
mount: /proc: mount point does not exist.

Let’s create the /proc directory and try mounting again.

# mkdir /proc
# mount -t proc proc /proc
# ps
PID TTY TIME CMD
4755 ? 00:00:00 bash
6997 ? 00:00:00 bash
396395 ? 00:00:00 bash
1614567 ? 00:00:00 sudo
1614568 ? 00:00:00 su
1614569 ? 00:00:00 bash
1614596 ? 00:00:00 sh
1614628 ? 00:00:00 ps

While chroot provides basic file system isolation, it's important to understand its limitations, particularly from a security standpoint. Let's explore a practical example that demonstrates why chroot alone is insufficient for secure containerization.

Consider the following C program that attempts to escape from a chroot environment.

  1. Creates a new directory .out
  2. Changes the root to this new directory
  3. Navigates up the directory tree multiple times (potentially escaping the original chroot)
  4. Changes the root again to the current directory (which might be outside the original chroot)
  5. Executes a shell
Let’s compile this program and place it in our myroot directory.
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    mkdir(".out", 0755);        /* create a directory inside the jail           */
    chroot(".out");             /* re-chroot; our cwd is now outside the root   */
    chdir("../../../../../");   /* walk up past the original jail               */
    chroot(".");                /* make the real root our root again            */

    return execl("/bin/sh", "sh", "-i", (char *)NULL);   /* drop into a shell  */
}

$ gcc -o myroot/escape_chroot escape_chroot.c
chroot myroot /bin/sh

# Inside the chroot environment
ls /
# Output: bin escape_chroot lib lib64 proc usr

# Try to navigate outside the chroot
cd ../../
cd ../../
ls
# Output: bin escape_chroot lib lib64 proc usr

# The above shows we're still confined within the chroot

# Now, let's run our escape program
./escape_chroot

# Check our new environment
ls /
# Output: bin boot cdrom dev etc home lib lib32 lib64 libx32 lost+found media mnt opt proc root run sbin snap srv swapfile sys tmp usr var

This vulnerability illustrates why chroot alone is not suitable as the foundation for secure containerization.

  1. Incomplete Isolation: chroot only provides file system isolation, not process or network isolation.
  2. Escapable: As demonstrated, it’s possible to break out of a chroot environment under certain conditions.
  3. Lack of Resource Controls: chroot doesn't provide any mechanism for limiting CPU, memory, or other resource usage.

pivot_root and Mount Namespaces

(Diagram: https://tech.kakaoenterprise.com/154)

Mount namespaces provide a powerful isolation mechanism by allowing processes to have their own view of the file system hierarchy. Each namespace can have its own set of mount points, and changes to mounts in one namespace don’t affect others. Processes can therefore mount and unmount file systems without affecting the host or other containers.

While similar to chroot, pivot_root provides a more secure way to change the root file system for a process. pivot_root works by making a new directory the root file system and moving the current root onto a specified directory. It “pivots” the root, providing a clean separation from the host’s file system.

Copy-on-Write Principle: When a new mount namespace is created, it starts as a copy of the parent’s mount namespace. This is an efficient way to create new namespaces without duplicating all the mount information.

root@sigridjineth-Z590-VISION-G:/tmp# unshare --mount /bin/sh
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 468G 436G 8.3G 99% /
tmpfs 24G 0 24G 0% /dev/shm
tmpfs 4.7G 2.6M 4.7G 1% /run
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 4.7G 104K 4.7G 1% /run/user/1000
efivarfs 256K 92K 160K 37% /sys/firmware/efi/efivars
/dev/nvme0n1p1 511M 6.2M 505M 2% /boot/efi
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/4bba73ef34523c80969c100c4cb9fcaa9801fe041b606ad2abe667d9a8ba3970/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/a3415d332756d5939b206399272a05062f0491f1671979885c8875a6c47fb089/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/bd11bf44b6a013bc98c6696d67cc2eb959f30f668d9b5ac0746da1abdcbf41bd/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/5c7e9032dcfde36af51a09e393a963c45a962f9375af2873d97aa4c3bdbffcc5/merged
# exit

root@sigridjineth-Z590-VISION-G:/tmp# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 4.7G 2.6M 4.7G 1% /run
/dev/nvme0n1p2 468G 436G 8.3G 99% /
tmpfs 24G 0 24G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
efivarfs 256K 92K 160K 37% /sys/firmware/efi/efivars
/dev/nvme0n1p1 511M 6.2M 505M 2% /boot/efi
tmpfs 4.7G 104K 4.7G 1% /run/user/1000
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/4bba73ef34523c80969c100c4cb9fcaa9801fe041b606ad2abe667d9a8ba3970/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/a3415d332756d5939b206399272a05062f0491f1671979885c8875a6c47fb089/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/bd11bf44b6a013bc98c6696d67cc2eb959f30f668d9b5ac0746da1abdcbf41bd/merged
overlay 468G 436G 8.3G 99% /var/lib/docker/overlay2/5c7e9032dcfde36af51a09e393a963c45a962f9375af2873d97aa4c3bdbffcc5/merged

First, we’ll use the unshare command to create a new mount namespace, which starts a new shell in a separate mount namespace.

root@sigridjineth-Z590-VISION-G:/tmp# unshare --mount /bin/sh
#

Let’s create a new directory to serve as our new root and mount a temporary file system on it, which creates an empty, memory-based file system for our new root. Change to the new root directory and execute pivot_root.

pivot_root is a system call and command that changes the root filesystem of the current process and its children. Unlike chroot, which merely alters the perceived root directory, pivot_root actually swaps out the entire root filesystem, providing stronger isolation.

This makes new_root the new root file system and moves the old root to put_old_root. You should see an empty directory, the new isolated root.

mkdir new_root
mount -t tmpfs none new_root

mkdir new_root/put_old_root
cd new_root
pivot_root . put_old_root

cd /
ls -l

tmpfs (Temporary File System) is a temporary filesystem that resides in memory and/or your swap partition. Using tmpfs can significantly enhance container isolation and performance.

# Let's create a new directory and mount a tmpfs filesystem on it.
mkdir /tmp/new_root
mount -t tmpfs none /tmp/new_root

ls -l /tmp
# drwxrwxrwt 2 root root 40 Sep 1 02:34 new_root
# We can see that new_root is now an empty directory backed by tmpfs.

To confirm that our tmpfs is correctly mounted, we can use the df command. The line with the none filesystem shows a 24GB tmpfs mounted at /tmp/new_root.

df -h

Filesystem Size Used Avail Use% Mounted on
...
none 24G 0 24G 0% /tmp/new_root
...

Passing none as the source to the mount command attaches a pseudo-filesystem to the filesystem tree, creating an environment where data is stored in memory and is not retained after a system reboot.

“none” indicates that we’re not mounting a physical device (like a hard drive partition) or a network filesystem.

It’s used when mounting pseudo-filesystems like tmpfs, which don’t correspond to any physical device.

“none” serves as a placeholder in the mount command where you’d normally specify a device.

Unlike our earlier chroot example, you'll find that you cannot escape this environment. The pivot_root command has effectively isolated our filesystem, preventing access to the parent namespace's root.

mkdir new_root
mount -t tmpfs none new_root
df -h

mkdir new_root/put_old_root
cd new_root

pivot_root . put_old_root

cd /
ls -l

./escape_chroot
cd ../../../
ls

Namespaces

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. This is the key mechanism that enables containers to have their own isolated view of the system. It provides the foundational isolation that allows containers to operate securely and independently on a shared host system.

root@sigridjineth-Z590-VISION-G:/tmp/myroot# cd /tmp
root@sigridjineth-Z590-VISION-G:/tmp# ls -al /proc/$$/ns
total 0
dr-x--x--x 2 root root 0 9月 1 02:51 .
dr-xr-xr-x 9 root root 0 9月 1 02:44 ..
lrwxrwxrwx 1 root root 0 9月 1 02:51 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 mnt -> 'mnt:[4026533081]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 9月 1 02:51 uts -> 'uts:[4026531838]'

This command shows the namespaces associated with the current shell process. Each namespace is represented by a symbolic link with a unique inode number.

We can also use the lsns command to list namespaces.

root@sigridjineth-Z590-VISION-G:/tmp# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 392 1 root /sbin/init splash
4026531835 cgroup 363 1 root /sbin/init splash
4026531836 pid 363 1 root /sbin/init splash
4026531837 user 390 1 root /sbin/init splash
4026531838 uts 359 1 root /sbin/init splash
4026531839 ipc 365 1 root /sbin/init splash
4026531840 net 360 1 root /sbin/init splash
4026533081 mnt 2 1615549 root bash
root@sigridjineth-Z590-VISION-G:/tmp# unshare -m
root@sigridjineth-Z590-VISION-G:/tmp# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 392 1 root /sbin/init splash
4026531835 cgroup 363 1 root /sbin/init splash
4026531836 pid 363 1 root /sbin/init splash
4026531837 user 390 1 root /sbin/init splash
4026531838 uts 359 1 root /sbin/init splash
4026531839 ipc 365 1 root /sbin/init splash
4026531840 net 360 1 root /sbin/init splash
4026533082 mnt 2 1617290 root -bash

The key difference between these outputs is the mnt namespace. Before unshare -m, the mount namespace ID was 4026533081; after unshare -m, it is 4026533082. This change in namespace ID indicates that a new mount namespace has been created. The unshare -m command creates this new mount namespace, effectively isolating the mount points of the new process from the parent namespace.
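A quick way to see this isolation in action (assuming /mnt exists and nothing is mounted on it):

# A tmpfs mounted inside the new namespace is invisible to the host
unshare -m /bin/sh -c 'mount -t tmpfs none /mnt && grep /mnt /proc/mounts'
grep /mnt /proc/mounts   # prints nothing: the host never saw the mount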

The UTS (UNIX Time-Sharing System) namespace is another crucial component in container isolation. The name “UTS” comes from the historical context of Unix Time-Sharing systems, where multiple users needed to share server resources efficiently. By isolating these identifiers, containers can have their own unique hostnames and domain names without conflicting with the host system or other containers.

root@sigridjineth-Z590-VISION-G:/tmp# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 392 1 root /sbin/init splash
4026531835 cgroup 363 1 root /sbin/init splash
4026531836 pid 363 1 root /sbin/init splash
4026531837 user 390 1 root /sbin/init splash
4026531838 uts 359 1 root /sbin/init splash
4026531839 ipc 365 1 root /sbin/init splash
4026531840 net 360 1 root /sbin/init splash
4026533081 mnt 2 1615549 root bash

-------------------------------------------------------
$ unshare -u

root@sigridjineth-Z590-VISION-G:/tmp# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 393 1 root /sbin/init splash
4026531835 cgroup 364 1 root /sbin/init splash
4026531836 pid 364 1 root /sbin/init splash
4026531837 user 391 1 root /sbin/init splash
4026531839 ipc 366 1 root /sbin/init splash
4026531840 net 361 1 root /sbin/init splash
4026533082 mnt 3 1617290 root -bash
4026533083 uts 2 1617508 root -bash

The key difference in these outputs is the addition of a new UTS namespace:

  1. Before: The UTS namespace (4026531838) was shared with the init process.
  2. After: A new UTS namespace (4026533083) has been created for the current bash session.

This change indicates that the current process now has its own isolated UTS namespace, separate from the host system.
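A short demonstration (the hostname demo-container is arbitrary):

# Hostname changes inside a new UTS namespace stay local to it
unshare -u /bin/sh -c 'hostname demo-container; hostname'   # prints demo-container
hostname                                                    # the host's name is unchanged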

IPC (Inter-Process Communication) namespaces isolate certain IPC resources, such as System V IPC objects and POSIX message queues.

# The processes in different IPC namespaces cannot communicate using IPC mechanisms.
root@sigridjineth-Z590-VISION-G:/tmp# unshare -i
root@sigridjineth-Z590-VISION-G:/tmp# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 397 1 root /sbin/init splash
4026531835 cgroup 368 1 root /sbin/init splash
4026531836 pid 368 1 root /sbin/init splash
4026531837 user 395 1 root /sbin/init splash
4026531840 net 365 1 root /sbin/init splash
4026533082 mnt 4 1617290 root -bash
4026533083 uts 3 1617508 root -bash
4026533084 ipc 2 1617711 root -bash
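The lsns output confirms the new namespace; you can also observe the isolation itself with System V message queues:

# Create a System V message queue on the host, then look for it from a new IPC namespace
ipcmk -Q
ipcs -q              # the queue is listed here
unshare -i ipcs -q   # a fresh IPC namespace starts with no queues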

PID namespaces isolate the process ID number space. This means that processes in different PID namespaces can have the same PID. Inside the new PID namespace, the first process gets PID 1, just like in a new system. However, from the parent namespace, this process will have a different PID.

$ echo $$                 # note the current PID
1617711
$ unshare -fp --mount-proc /bin/sh
# echo $$                 # inside the new namespace, the shell is PID 1
1
# exit

Running ps -ef inside the new namespace confirms this isolated view; after exiting, the same shell is visible from the parent namespace under its ordinary PID:

root@sigridjineth-Z590-VISION-G:/tmp# unshare -fp --mount-proc /bin/sh
# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 03:04 pts/1 00:00:00 /bin/sh
root 2 1 0 03:04 pts/1 00:00:00 ps -ef

# exit
# ps -ef
root 7418 6745 0 8月13 ? 00:00:00 /bin/sh

User namespaces isolate user and group ID number spaces. This allows a process to have root privileges inside a namespace without having them outside.

whoami  # Note current user
unshare -U --map-root-user /bin/sh
whoami # This should show 'root'

root@sigridjineth-Z590-VISION-G:/tmp # ps -ef | grep "/bin/sh"
root 7418 6745 0 8月13 ? 00:00:00 /bin/sh
root 8634 6745 0 8月13 ? 00:00:00 /bin/sh
root 34268 6745 0 8月13 ? 00:00:00 /bin/sh
root 39650 6745 0 8月13 ? 00:00:00 /bin/sh
root 98024 6745 0 8月13 ? 00:00:00 /bin/sh
root 109256 6745 0 8月13 ? 00:00:00 /bin/sh
root 1618058 1617508 0 03:06 pts/1 00:00:00 grep --color=auto /bin/sh
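You can also inspect the UID mapping from inside the namespace via /proc/self/uid_map (the host UID 1000 shown in the comment is an assumed example for an unprivileged user):

# Show the inside-UID -> outside-UID mapping of the user namespace
unshare -U --map-root-user /bin/sh -c 'whoami; cat /proc/self/uid_map'
# root
#          0       1000          1    (inside ID, outside ID, range)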

Resource Isolation with cgroups

cgroups (control groups) is a Linux kernel feature that allows fine-grained control and monitoring of the system resources used by processes. It plays a vital role in containerization by enabling resource limitation, prioritization, accounting, and control.

  1. Resource Limits: Docker uses cgroups to set limits on CPU, memory, and I/O usage for each container.
  2. Resource Prioritization: cgroups allow for setting relative priorities between containers when competing for resources.
  3. Resource Monitoring: cgroups provide detailed statistics about resource usage, which Docker can use for monitoring and logging.
  4. Isolation: By placing each container in its own cgroup, Docker ensures that containers can’t interfere with each other’s resource allocations.

When Docker creates a container, it:

  1. Creates a new cgroup for the container
  2. Sets resource limits based on the container’s configuration
  3. Assigns the container’s processes to this cgroup
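From the user’s side, these limits are requested with docker run flags; for example:

# Cap a container at half a CPU core and 256 MB of memory
docker run --rm -it --cpus=0.5 --memory=256m ubuntu:latest bash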

cgroups are typically mounted as a virtual file system. In modern Linux systems, you’ll find cgroup-related files and directories under /sys/fs/cgroup/.

tree -L 1 /sys/fs/cgroup

Modern Linux distributions, including Ubuntu 22.04, use cgroup v2. We can verify this with the mount command. cgroup v2 provides a unified hierarchy, simplifying management and improving performance compared to cgroup v1.

mount -t cgroup2

Let’s run a simple process and examine its cgroup association.

sleep 100000 &
cat /proc/$(pgrep sleep)/cgroup

The output shows the cgroup hierarchy to which our sleep process belongs. The process is part of a user slice, which is further divided into user-specific and session-specific scopes.

We can get more detailed information about a process, including its cgroup associations, using the proc filesystem:

tree /proc/$(pgrep sleep) -L 2

To truly understand how cgroups enable resource isolation in containerization, let’s walk through a practical demonstration. We’ll focus on isolating CPU and memory resources, mirroring techniques used in container technologies like Docker.

apt install cgroup-tools stress htop -y

# spawn a single CPU-bound worker that drives one core to full utilization
$ stress -c 1
stress: info: [1619373] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

Let’s create a new cgroup for our experiment:

cd /sys/fs/cgroup
mkdir test_cgroup_parent && cd test_cgroup_parent

root@sigridjineth-Z590-VISION-G:/sys/fs/cgroup/test_cgroup_parent# tree
.
├── cgroup.controllers
├── cgroup.events
├── cgroup.freeze
├── cgroup.kill
├── cgroup.max.depth
├── cgroup.max.descendants
├── cgroup.pressure
├── cgroup.procs
├── cgroup.stat
├── cgroup.subtree_control
├── cgroup.threads
├── cgroup.type
├── cpu.idle
├── cpu.max
├── cpu.max.burst
├── cpu.pressure
├── cpuset.cpus
├── cpuset.cpus.effective
├── cpuset.cpus.partition
├── cpuset.mems
├── cpuset.mems.effective
├── cpu.stat
├── cpu.uclamp.max

We need to enable the CPU controller so that it is available to child cgroups:

echo "+cpu" >> cgroup.subtree_control

Let’s limit the CPU usage to 10% of a single CPU core. This sets the CPU quota to 100,000 microseconds out of every 1,000,000 microsecond period, effectively limiting CPU usage to 10%.

echo "100000 1000000" > cpu.max

Create a child cgroup.

mkdir test_cgroup_child && cd test_cgroup_child

Assign the current shell to the cgroup. This moves the current shell process into our new cgroup, subjecting it to our resource limits.

echo $$ > cgroup.procs

Let’s generate CPU load then monitor the cpu usage.

stress -c 1

htop

You should observe that the stress process is limited to about 10% CPU usage, demonstrating our cgroup-based CPU isolation.
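You can verify the configured quota and see the throttling counters from the cgroup files themselves:

cat /sys/fs/cgroup/test_cgroup_parent/cpu.max    # 100000 1000000
cat /sys/fs/cgroup/test_cgroup_parent/cpu.stat   # nr_throttled and throttled_usec grow under load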

Now, let’s demonstrate memory isolation using cgroups. In the parent cgroup directory, run the following.

/sys/fs/cgroup/test_cgroup_parent# echo "+memory" >> cgroup.subtree_control

Let’s set a 100MB memory limit:

echo "100M" > memory.max

In the child cgroup directory:

cd test_cgroup_child/

stress -m 1 --vm-bytes 512M --vm-keep

You’ll notice that this command fails almost immediately. Unlike CPU limiting, which throttles usage, memory limiting causes the kernel to terminate processes that exceed the limit. This behavior is similar to what happens in container environments when a container exceeds its memory allocation, resulting in an Out of Memory (OOM) error.
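The kill is recorded in the cgroup’s event counters, which you can check afterwards (field names per the cgroup v2 memory controller):

# Inspect the OOM counters after the failed stress run
cat /sys/fs/cgroup/test_cgroup_parent/memory.events
# the oom and oom_kill counters increment each time the limit is enforced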
