Introduction

I wanted to play with systemd-nspawn and got a Fedora 22 container running within a minute:

$ yum --releasever=22 --installroot=/var/lib/container/fedora22 install systemd passwd yum fedora-release vim-minimal iputils
$ systemd-nspawn -D /var/lib/container/fedora22 passwd -d root
$ systemctl start systemd-nspawn@fedora22
$ machinectl login fedora22

Now I wanted to test my network connection by running ping, but suddenly:

-bash-4.3# /bin/ping 8.8.8.8
-bash: /bin/ping: Operation not permitted

What the heck? Clearly something is wrong here, let’s debug this!

strace to the rescue

strace is your first weapon. Let’s see which syscall is denied:

-bash-4.3# strace /bin/ping
execve("/bin/ping", ["/bin/ping"], [/* 16 vars */]) = -1 EPERM (Operation not permitted)
write(2, "strace: exec: Operation not perm"..., 38strace: exec: Operation not permitted
) = 38
exit_group(1)                           = ?
+++ exited with 1 +++
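Here the output is short enough to read by eye, but with a longer strace log the failing calls can be filtered mechanically. A minimal sketch, assuming the usual `name(args) = -1 ERRNO (message)` line format shown above; the names FAILED_CALL and failed_syscalls are made up for this illustration:

```python
import re

# Matches strace lines of the form: name(args) = -1 ERRNO (message)
FAILED_CALL = re.compile(
    r"^(?P<name>\w+)\(.*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\s+\((?P<msg>[^)]*)\)\s*$"
)

def failed_syscalls(lines):
    """Return (syscall, errno, message) for every call that returned -1."""
    found = []
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m:
            found.append((m.group("name"), m.group("errno"), m.group("msg")))
    return found

# Lines taken from the strace output above.
SAMPLE = [
    'execve("/bin/ping", ["/bin/ping"], [/* 16 vars */]) = -1 EPERM (Operation not permitted)',
    ') = 38',
    'exit_group(1)                           = ?',
]

for name, errno_name, msg in failed_syscalls(SAMPLE):
    print(name, errno_name, msg)
```

For our three lines it reports only the execve call with EPERM.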

OK, so it’s execve that is failing. Why is that¹? Let’s check the manual:

$ man execve | grep -e "^[[:space:]]\+EPERM"
   EPERM  The filesystem is mounted nosuid, the user is not the superuser, and the file has the set-user-ID or set-group-ID bit set.
   EPERM  The process is being traced, the user is not the superuser and the file has the set-user-ID or set-group-ID bit set.

I’m a superuser so none of those apply². We have to figure this out by ourselves.

ftrace to the rescue

Thanks to our strace output we know who to blame - it’s the kernel. So let’s see why we can’t ping our favourite DNS server. We’ll use ftrace.

Let’s do some preparations first. We need debugfs³ and (in recent kernel versions) tracefs mounted:

$ mount -t debugfs debugfs /sys/kernel/debug
$ mount -t tracefs tracefs /sys/kernel/debug/tracing

Then we need a helper script⁴:

$ cat <<__EOF__ > test.sh
#!/bin/bash
debugfs=/sys/kernel/debug
echo 0 > \$debugfs/tracing/tracing_on
echo function_graph > \$debugfs/tracing/current_tracer
echo \$\$ > \$debugfs/tracing/set_ftrace_pid
echo 1 > \$debugfs/tracing/tracing_on
exec "\$@"
__EOF__
$ chmod +x test.sh

This script sets the ftrace PID filter to its own PID and then execs the command supplied as its arguments, which keeps that PID. This way we will only see function calls made by our process. Here’s how to use it:

$ ./test.sh /bin/ping; echo 0 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace > trace.txt

Now it’s time to analyze trace.txt. It’s a large file, but since we know we are looking for the return from the execve syscall, we can quickly search for do_execveat_common twice to get to the point where the function exits. Here’s what we can see (after removing some uninteresting parts marked with […]):

3)               |      prepare_binprm() {
3)               |        security_bprm_set_creds() {
3)               |          selinux_bprm_set_creds() {
3)               |            cap_bprm_set_creds() {
3)               |              get_vfs_caps_from_disk() {
3)               |                generic_getxattr() {
3)   1.112 us    |                  xattr_resolve_name();
3)               |                  ext4_xattr_security_get() {
3)               |                    ext4_xattr_get() {
[...]
3)   0.650 us    |                        mb_cache_entry_insert();
3)               |                        mb_cache_entry_free() {
3)               |                          __mb_cache_entry_release() {
3)   0.180 us    |                            _raw_spin_lock();
3)               |                            __mb_cache_entry_forget.isra.2() {
3)   0.078 us    |                              kmem_cache_free();
3)   0.500 us    |                            }
3)   1.564 us    |                          }
3)   2.194 us    |                        }
3)   6.400 us    |                      }
3)   0.316 us    |                      ext4_xattr_find_entry();
3)   0.057 us    |                      __brelse();
3)   0.073 us    |                      up_read();
3) + 31.207 us   |                    }
3) + 31.725 us   |                  }
3) + 33.842 us   |                }
3) + 34.474 us   |              }
3) + 35.516 us   |            }
3) + 36.183 us   |          }
3) + 36.823 us   |        }
3) + 37.294 us   |      }
3)   0.070 us    |      acct_arg_size.isra.12();
3)               |      mmput() {
[...]
3) + 33.073 us   |      }
3)               |      free_bprm() {
[...]
3)   5.873 us    |      }
3)   0.059 us    |      kfree();
3)               |      putname() {
3)               |        kmem_cache_free() {
3)   0.138 us    |          __slab_free();
3)   0.611 us    |        }
3)   1.025 us    |      }
3) ! 366.803 us  |    } /* do_execveat_common.isra.30 */
3) ! 369.865 us  |  } /* SyS_execve */
3)               |  syscall_trace_leave() {

Now let’s look at the source of this function and compare, starting from the bottom:

        retval = prepare_binprm(bprm);
        if (retval < 0)
                goto out;
[...]
out:
        if (bprm->mm) {
                acct_arg_size(bprm, 0);
                mmput(bprm->mm);
        }

out_unmark:
        current->fs->in_exec = 0;
        current->in_execve = 0;

out_free:
        free_bprm(bprm);
        kfree(pathbuf);

out_files:
        if (displaced)
                reset_files_struct(displaced);
out_ret:
        putname(filename);
        return retval;
}

Yeah, that’s our cleanup code and we can clearly see it executed in the trace. Just before this code we have a long run of } lines, which denotes the stack unwinding just after the ext4_xattr_find_entry call that was initiated by prepare_binprm. So that’s where our syscall was interrupted!
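Matching the trace against the source by eye works, but listing the direct callees of do_execveat_common makes the picture more obvious. A rough sketch, assuming the function_graph layout shown above (two spaces of indentation per call level after the | column); direct_callees is a hypothetical helper, not part of any ftrace tooling:

```python
import re

# Everything up to the first '|', then the indentation, then the call text.
GRAPH_LINE = re.compile(r"^[^|]*\|(?P<indent> *)(?P<body>\S.*)$")
CALL = re.compile(r"^(?P<name>[\w.]+)\(\)")

def direct_callees(lines, func):
    """List functions called directly by the last traced exit of func."""
    parsed = []
    for line in lines:
        if line.startswith("#"):          # skip the trace file header
            continue
        m = GRAPH_LINE.match(line.rstrip("\n"))
        if m:
            parsed.append((len(m.group("indent")), m.group("body")))
    # Locate the closing brace annotated with the function name.
    end = None
    for i in range(len(parsed) - 1, -1, -1):
        if parsed[i][1].startswith("} /* " + func):
            end = i
            break
    if end is None:
        return []
    depth = parsed[end][0]
    callees = []
    # Walk backwards until we leave the body of func.
    for indent, body in reversed(parsed[:end]):
        if indent <= depth:
            break
        m = CALL.match(body)
        if indent == depth + 2 and m:
            callees.append(m.group("name"))
    callees.reverse()
    return callees

# A shortened copy of the trace excerpt above.
SAMPLE = """\
3)               |      prepare_binprm() {
3)               |        security_bprm_set_creds() {
3) + 36.823 us   |        }
3) + 37.294 us   |      }
3)   0.070 us    |      acct_arg_size.isra.12();
3)               |      mmput() {
3) + 33.073 us   |      }
3)               |      free_bprm() {
3)   5.873 us    |      }
3)   0.059 us    |      kfree();
3)               |      putname() {
3)   1.025 us    |      }
3) ! 366.803 us  |    } /* do_execveat_common.isra.30 */
"""

print(direct_callees(SAMPLE.splitlines(), "do_execveat_common"))
```

On the excerpt this gives prepare_binprm followed only by the cleanup calls (acct_arg_size, mmput, free_bprm, kfree, putname), so prepare_binprm is the last real work done before the error path.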

Now we’re going up the stack. ext4_xattr_find_entry does not return EPERM, and neither do its callers ext4_xattr_get, ext4_xattr_security_get, generic_getxattr or get_vfs_caps_from_disk. So we finally arrive at cap_bprm_set_creds, where we can’t seem to find a call to get_vfs_caps_from_disk. Looks like something was inlined. Since get_vfs_caps_from_disk is the first thing called inside cap_bprm_set_creds in the trace, the call near the top of that function is our first suspect:

ret = get_file_caps(bprm, &effective, &has_cap);
if (ret < 0)
        return ret;

It matches perfectly, let’s look inside:

rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
[...]
}
rc = bprm_caps_from_vfs_caps(&vcaps, bprm, effective, has_cap);

OK, we found our call to get_vfs_caps_from_disk, but it can’t return EPERM so we have to look further. bprm_caps_from_vfs_caps does not show up in the trace, but after checking we see that it’s an inline function and it does what we were looking for:

/*
 * Calculate the new process capability sets from the capability sets attached
 * to a file.
 */
static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
[...]
    if (permitted & ~new->cap_permitted.cap[i])
        /* insufficient to execute correctly */
        ret = -EPERM;
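To see what this check amounts to, here is a small Python model of it. It is a simplified sketch rather than the kernel code: it assumes the rule from capabilities(7) that the new permitted set is (bounding set & file permitted) | (process inheritable & file inheritable), and that the error is only reported when the file has its effective flag set (the "e" in "+ep"). The function name check_file_caps is made up for the illustration.

```python
EPERM = 1
CAP_NET_ADMIN = 12
CAP_NET_RAW = 13

def check_file_caps(file_permitted, file_inheritable, file_effective,
                    bounding, proc_inheritable):
    """Model of the permitted-set check; returns (retval, new_permitted)."""
    # pP' = (X & fP) | (pI & fI)
    new_permitted = (bounding & file_permitted) | (proc_inheritable & file_inheritable)
    ret = 0
    if file_permitted & ~new_permitted:
        # The file asks for a capability we are not allowed to gain.
        ret = -EPERM
    # Assumption: only a file with the effective flag makes this fatal.
    return (ret if file_effective else 0), new_permitted

# Hypothetical inputs: a file with cap_net_admin,cap_net_raw+ep and a
# bounding set in which bit 13 is set but bit 12 is not.
bounding = 1 << CAP_NET_RAW
both = (1 << CAP_NET_ADMIN) | (1 << CAP_NET_RAW)

print(check_file_caps(both, 0, True, bounding, 0))
print(check_file_caps(1 << CAP_NET_RAW, 0, True, bounding, 0))
```

The first call returns -EPERM because CAP_NET_ADMIN is requested by the file but masked out by the bounding set; the second call, asking only for CAP_NET_RAW, succeeds.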

Conclusions

OK, so apparently our file is trying to get capabilities that are stored in the filesystem. Let’s verify that:

-bash-4.3# getcap /bin/ping
/bin/ping = cap_net_admin,cap_net_raw+ep
-bash-4.3# cat /proc/$$/status |grep ^Cap
CapInh: 0000000000000000
CapPrm: 00000000fdecafff
CapEff: 00000000fdecafff
CapBnd: 00000000fdecafff

Indeed our file is configured through xattrs to grant the process two capabilities:

#define CAP_NET_ADMIN        12
#define CAP_NET_RAW          13

But our capability mask is set to 0xfdecafff, which does not contain the CAP_NET_ADMIN bit (the CAP_NET_RAW bit, 1<<13, is set):

>>> 0xfdecafff & (1<<12)
0
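Checking one bit at a time gets tedious, so here is a small sketch that decodes the whole mask. The name table is my transcription of capabilities 0 to 31 from linux/capability.h, so treat it as an assumption and compare it with your headers; the helper names are made up.

```python
# Capability numbers 0..31 as I read them from linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]

def present_caps(mask):
    """Names of the capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

def missing_caps(mask):
    """Names of the capabilities (0..31) whose bit is clear in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if not mask & (1 << bit)]

mask = 0xfdecafff  # CapBnd from /proc/$$/status above
print("missing:", ", ".join(missing_caps(mask)))
```

For 0xfdecafff the missing ones are CAP_NET_ADMIN, CAP_IPC_LOCK, CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_SYS_PACCT and CAP_SYS_TIME, while CAP_NET_RAW is present.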

Let’s remove CAP_NET_ADMIN from the file’s capabilities and see how this works:

-bash-4.3# setcap cap_net_raw+ep /bin/ping
-bash-4.3# /bin/ping
Usage: ping [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]

Works like a charm. So what is happening? If you go to nspawn.c, you can find that it drops all the capabilities that are not on the arg_retain list:

static int drop_capabilities(void) {
        return capability_bounding_set_drop(~arg_retain, false);
}

That list does not contain CAP_NET_ADMIN unless we use the --private-network switch:

arg_retain = (arg_retain | plus | (arg_private_network ? 1ULL << CAP_NET_ADMIN : 0)) & ~minus;
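The same expression can be replayed in Python to see how the mask changes. This is only a sketch of that one line: I use the bounding set observed in the container (0xfdecafff) as a stand-in for the default arg_retain, and I assume plus and minus are masks built from options such as --capability=; retained_caps is a made-up name.

```python
CAP_NET_ADMIN = 12

def retained_caps(default_retain, plus=0, minus=0, private_network=False):
    """Mirror of the arg_retain expression from nspawn.c quoted above."""
    extra = (1 << CAP_NET_ADMIN) if private_network else 0
    return (default_retain | plus | extra) & ~minus

observed = 0xfdecafff  # assumption: the container's CapBnd equals the default

for label, mask in [
    ("default", retained_caps(observed)),
    ("private network", retained_caps(observed, private_network=True)),
    ("plus CAP_NET_ADMIN", retained_caps(observed, plus=1 << CAP_NET_ADMIN)),
]:
    has_it = bool(mask & (1 << CAP_NET_ADMIN))
    print(f"{label}: {mask:#010x} CAP_NET_ADMIN={has_it}")
```

In this model, bit 12 is set only when the private network flag is on or when the capability is added explicitly through plus.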

This is even documented in the systemd-nspawn man page:

--capability=
    List one or more additional capabilities to grant the container.
    Takes a comma-separated list of capability names, see
    capabilities(7) for more information. Note that the following
    capabilities will be granted in any way: CAP_CHOWN,
    CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_FSETID,
    CAP_IPC_OWNER, CAP_KILL, CAP_LEASE, CAP_LINUX_IMMUTABLE,
    CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETGID,
    CAP_SETFCAP, CAP_SETPCAP, CAP_SETUID, CAP_SYS_ADMIN,
    CAP_SYS_CHROOT, CAP_SYS_NICE, CAP_SYS_PTRACE, CAP_SYS_TTY_CONFIG,
    CAP_SYS_RESOURCE, CAP_SYS_BOOT, CAP_AUDIT_WRITE,
    CAP_AUDIT_CONTROL. Also CAP_NET_ADMIN is retained if
    --private-network is specified. If the special value "all" is
    passed, all capabilities are retained.

And that’s the end of the fun.

¹ This instantly rang a bell and I already knew what was happening, but let’s pretend I didn’t, since it’s a good excuse to show you some debugging techniques and have some fun.
² I have posted a patch adding the relevant information to the man page, so your copy may already contain it.
³ Actually, having debugfs is not needed; tracefs is enough.
⁴ Or we could use some ftrace frontend, but this way is simpler and you can learn how it actually works.