Introduction
So I wanted to play with systemd-nspawn and got a Fedora 22 container running in a minute:
$ yum --releasever=22 --installroot=/var/lib/container/fedora22 install systemd passwd yum fedora-release vim-minimal iputils
$ systemd-nspawn -D /var/lib/container/fedora22 passwd -d root
$ systemctl start systemd-nspawn@fedora22
$ machinectl login fedora22
Next I wanted to test my network connection by running ping, but suddenly:
-bash-4.3# /bin/ping 8.8.8.8
-bash: /bin/ping: Operation not permitted
What the heck? Clearly something is wrong here, let’s debug this!
strace to the rescue
strace is your first weapon. Let’s see which syscall is denied:
-bash-4.3# strace /bin/ping
execve("/bin/ping", ["/bin/ping"], [/* 16 vars */]) = -1 EPERM (Operation not permitted)
write(2, "strace: exec: Operation not perm"..., 38strace: exec: Operation not permitted
) = 38
exit_group(1) = ?
+++ exited with 1 +++
OK, so it’s execve that is failing. Why is that? [1] Let’s check the manual:
$ man execve | grep -e "^[[:space:]]\+EPERM"
EPERM The filesystem is mounted nosuid, the user is not the superuser, and the file has the set-user-ID or set-group-ID bit set.
EPERM The process is being traced, the user is not the superuser and the file has the set-user-ID or set-group-ID bit set.
I’m the superuser, so neither of those applies [2]. We have to figure this out by ourselves.
ftrace to the rescue
Thanks to our strace output we know who to blame - it’s the kernel. So let’s see why we can’t ping our favourite DNS server. We’ll use ftrace.
Let’s do some preparations first. We need debugfs [3] and (in recent kernel versions) tracefs mounted:
$ mount -t debugfs debugfs /sys/kernel/debug
$ mount -t tracefs tracefs /sys/kernel/debug/tracing
Then we need a helper script [4]:
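$ cat <<'__EOF__' > test.sh
#!/bin/bash
# The heredoc delimiter is quoted, so nothing below is expanded while writing test.sh.
debugfs=/sys/kernel/debug
echo 0 > $debugfs/tracing/tracing_on
echo function_graph > $debugfs/tracing/current_tracer
# Trace only this shell's PID; exec keeps the same PID for the traced program.
echo $$ > $debugfs/tracing/set_ftrace_pid
echo 1 > $debugfs/tracing/tracing_on
exec "$@"
__EOF__
$ chmod +x test.sh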
This script sets the ftrace PID filter to its own PID and then execs the program supplied as its arguments. Since exec keeps the PID, we will only see function calls made on behalf of our process. Here’s how to use it:
$ ./test.sh /bin/ping; echo 0 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace > trace.txt
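For comparison, here is a rough Python sketch of the same helper. It writes to the same three control files as test.sh. The function names `configure_ftrace` and `trace_and_exec` are made up for this illustration, and the default path assumes tracefs is mounted where we mounted it above. It needs root to do anything useful on a real system.

```python
import os


def configure_ftrace(tracing_dir, pid):
    """Point the function_graph tracer at a single PID and switch tracing on."""

    def write(name, value):
        # Each control file takes a single value followed by a newline.
        with open(os.path.join(tracing_dir, name), "w") as f:
            f.write(f"{value}\n")

    write("tracing_on", 0)                     # pause while reconfiguring
    write("current_tracer", "function_graph")
    write("set_ftrace_pid", pid)               # trace only this process
    write("tracing_on", 1)


def trace_and_exec(argv, tracing_dir="/sys/kernel/debug/tracing"):
    """Trace the current process, then replace it with argv (exec keeps the PID)."""
    configure_ftrace(tracing_dir, os.getpid())
    os.execvp(argv[0], argv)
```

Calling `trace_and_exec(["/bin/ping"])` as root would behave like `./test.sh /bin/ping`. Keeping the file writes in their own function makes it easy to try against a scratch directory first.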
Now it’s time to analyze trace.txt. It’s a large file, but since we know we are looking for the return from the execve syscall, we can quickly search for do_execveat_common twice to get to the point where the function exits. Here’s what we can see (after removing some uninteresting parts marked with [...]):
3) | prepare_binprm() {
3) | security_bprm_set_creds() {
3) | selinux_bprm_set_creds() {
3) | cap_bprm_set_creds() {
3) | get_vfs_caps_from_disk() {
3) | generic_getxattr() {
3) 1.112 us | xattr_resolve_name();
3) | ext4_xattr_security_get() {
3) | ext4_xattr_get() {
[...]
3) 0.650 us | mb_cache_entry_insert();
3) | mb_cache_entry_free() {
3) | __mb_cache_entry_release() {
3) 0.180 us | _raw_spin_lock();
3) | __mb_cache_entry_forget.isra.2() {
3) 0.078 us | kmem_cache_free();
3) 0.500 us | }
3) 1.564 us | }
3) 2.194 us | }
3) 6.400 us | }
3) 0.316 us | ext4_xattr_find_entry();
3) 0.057 us | __brelse();
3) 0.073 us | up_read();
3) + 31.207 us | }
3) + 31.725 us | }
3) + 33.842 us | }
3) + 34.474 us | }
3) + 35.516 us | }
3) + 36.183 us | }
3) + 36.823 us | }
3) + 37.294 us | }
3) 0.070 us | acct_arg_size.isra.12();
3) | mmput() {
[...]
3) + 33.073 us | }
3) | free_bprm() {
[...]
3) 5.873 us | }
3) 0.059 us | kfree();
3) | putname() {
3) | kmem_cache_free() {
3) 0.138 us | __slab_free();
3) 0.611 us | }
3) 1.025 us | }
3) ! 366.803 us | } /* do_execveat_common.isra.30 */
3) ! 369.865 us | } /* SyS_execve */
3) | syscall_trace_leave() {
Now let’s look at the source of this function and compare. From the bottom:
retval = prepare_binprm(bprm);
if (retval < 0)
goto out;
[...]
out:
if (bprm->mm) {
acct_arg_size(bprm, 0);
mmput(bprm->mm);
}
out_unmark:
current->fs->in_exec = 0;
current->in_execve = 0;
out_free:
free_bprm(bprm);
kfree(pathbuf);
out_files:
if (displaced)
reset_files_struct(displaced);
out_ret:
putname(filename);
return retval;
}
Yeah, that’s our cleanup code, and we can clearly see it executed in the trace. Just before this code we have a massive amount of } which denotes unwinding the stack just after the ext4_xattr_find_entry call that was initiated by prepare_binprm. So that’s where our syscall was interrupted!
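Counting braces by eye gets old quickly, so here is a small sketch that rebuilds the call stack from function_graph output and prints the full path of every leaf call. It assumes the line shapes seen in the trace above (`name() {` to enter, `}` to leave, `name();` for a leaf). The helper name `call_paths` and the shortened sample are mine, not part of ftrace.

```python
import re

ENTRY = re.compile(r"\|\s*([\w.]+)\(\)\s*\{\s*$")   # "| func() {"
LEAF = re.compile(r"\|\s*([\w.]+)\(\);\s*$")        # "| func();"
EXIT = re.compile(r"\|\s*\}")                       # "| }" (optionally with a comment)


def call_paths(lines):
    """Yield the call stack (outermost first) of every leaf call in the trace."""
    stack = []
    for line in lines:
        if line.startswith("#"):
            continue                      # header lines of the trace file
        entry = ENTRY.search(line)
        if entry:
            stack.append(entry.group(1))
            continue
        leaf = LEAF.search(line)
        if leaf:
            yield stack + [leaf.group(1)]
            continue
        if EXIT.search(line) and stack:   # ignore exits of frames we never saw enter
            stack.pop()


sample = """\
3) | prepare_binprm() {
3) | security_bprm_set_creds() {
3) | get_vfs_caps_from_disk() {
3) 1.112 us | xattr_resolve_name();
3) 0.316 us | ext4_xattr_find_entry();
3) + 31.207 us | }
3) + 36.823 us | }
3) + 37.294 us | }
3) 0.070 us | acct_arg_size.isra.12();
"""

for path in call_paths(sample.splitlines()):
    print(" > ".join(path))
```

Feeding it `open("trace.txt")` instead of the sample should list every leaf call together with the chain of callers that led to it.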
Now we’re going up the stack. ext4_xattr_find_entry does not return EPERM, and neither do its callers ext4_xattr_security_get, generic_getxattr or get_vfs_caps_from_disk. So we finally arrive at cap_bprm_set_creds, where we can’t seem to find a call to get_vfs_caps_from_disk. Looks like something was inlined. Since we entered get_vfs_caps_from_disk at the beginning of the function, we have our first suspect:
ret = get_file_caps(bprm, &effective, &has_cap);
if (ret < 0)
return ret;
It matches perfectly, let’s look at it:
rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
[...]
}
rc = bprm_caps_from_vfs_caps(&vcaps, bprm, effective, has_cap);
OK, we found our call to get_vfs_caps_from_disk, but it can’t return EPERM, so we have to look further. bprm_caps_from_vfs_caps does not show up in the trace, but after checking we see it’s an inline function, and it does what we were looking for:
/*
* Calculate the new process capability sets from the capability sets attached
* to a file.
*/
static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
[...]
if (permitted & ~new->cap_permitted.cap[i])
/* insufficient to execute correctly */
ret = -EPERM;
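To make that check easier to reason about, here is a toy Python model of it. It follows the rule from capabilities(7), P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & bounding set), and fails when the file asks for a permitted capability the process cannot get. Only failing when the file's effective flag is set is my reading of capabilities(7), not something shown in the excerpt above. The function name and the bit values are purely illustrative.

```python
import errno


def bprm_caps_check(file_permitted, file_inheritable, file_effective,
                    bounding, proc_inheritable):
    """Return (new_permitted, error) for an exec of a file with capabilities."""
    # pP' = (X & fP) | (pI & fI)
    new_permitted = (bounding & file_permitted) | (proc_inheritable & file_inheritable)
    # Capabilities the file wants to force but the process did not obtain.
    missing = file_permitted & ~new_permitted
    # Assumption: the error is only reported when the file has the effective flag set.
    error = errno.EPERM if (missing and file_effective) else 0
    return new_permitted, error


# Illustrative values: the bounding set lacks bit 2, the file asks for bits 1 and 2.
print(bprm_caps_check(0b0110, 0, True, bounding=0b1011, proc_inheritable=0))
# The same file asking only for bit 1 passes the check.
print(bprm_caps_check(0b0010, 0, True, bounding=0b1011, proc_inheritable=0))
```

The point of the model is that being root does not help here: the result depends only on the bitmasks, never on the UID.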
Conclusions
OK, so apparently our file is trying to get a capability from a filesystem. Let’s verify that:
-bash-4.3# getcap /bin/ping
/bin/ping = cap_net_admin,cap_net_raw+ep
-bash-4.3# cat /proc/$$/status |grep ^Cap
CapInh: 0000000000000000
CapPrm: 00000000fdecafff
CapEff: 00000000fdecafff
CapBnd: 00000000fdecafff
Indeed our file is configured through xattrs to grant the process two capabilities:
#define CAP_NET_ADMIN 12
#define CAP_NET_RAW 13
But our capability mask is set to 0xfdecafff, which lacks bit 12, CAP_NET_ADMIN (bit 13, CAP_NET_RAW, is set):
>>> 0xfdecafff & (1<<12)
0
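Checking one bit at a time works, but decoding the whole mask is more informative. The sketch below maps bit numbers to names. The list is my transcription of include/uapi/linux/capability.h, so check it against your own kernel headers, and the helper name `decode_caps` is made up.

```python
# Capability names indexed by bit number (assumed to match linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


held = set(decode_caps(0xfdecafff))
for wanted in ("CAP_NET_ADMIN", "CAP_NET_RAW"):
    print(wanted, "present" if wanted in held else "MISSING")
```

The same function can be pointed at any of the Cap* values from /proc/$$/status, for example `decode_caps(int("00000000fdecafff", 16))`.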
Let’s remove CAP_NET_ADMIN from the file’s capabilities and see how this works:
-bash-4.3# setcap cap_net_raw+ep /bin/ping
-bash-4.3# /bin/ping
Usage: ping [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]
Works like a charm. So what is happening? If you go to nspawn.c, you can find that it will drop all the capabilities that are not on the arg_retain list:
static int drop_capabilities(void) {
return capability_bounding_set_drop(~arg_retain, false);
}
That list does not contain CAP_NET_ADMIN unless we use the --private-network switch:
arg_retain = (arg_retain | plus | (arg_private_network ? 1ULL << CAP_NET_ADMIN : 0)) & ~minus;
This is even documented in the systemd-nspawn man page:
--capability=
List one or more additional capabilities to grant the container.
Takes a comma-separated list of capability names, see
capabilities(7) for more information. Note that the following
capabilities will be granted in any way: CAP_CHOWN,
CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_FSETID,
CAP_IPC_OWNER, CAP_KILL, CAP_LEASE, CAP_LINUX_IMMUTABLE,
CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETGID,
CAP_SETFCAP, CAP_SETPCAP, CAP_SETUID, CAP_SYS_ADMIN,
CAP_SYS_CHROOT, CAP_SYS_NICE, CAP_SYS_PTRACE, CAP_SYS_TTY_CONFIG,
CAP_SYS_RESOURCE, CAP_SYS_BOOT, CAP_AUDIT_WRITE,
CAP_AUDIT_CONTROL. Also CAP_NET_ADMIN is retained if
--private-network is specified. If the special value "all" is
passed, all capabilities are retained.
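The arg_retain expression from nspawn.c is plain bit arithmetic, so it is easy to replay in Python. This is only a sketch: the function name `retained_caps` is made up, and the default mask is an assumption taken from the CapBnd value we saw inside the container rather than from the nspawn source. I am also assuming that `plus` is what --capability= feeds, which the man page text suggests but the excerpt does not show.

```python
CAP_NET_ADMIN = 12


def retained_caps(base, plus=0, minus=0, private_network=False):
    """Replay: (arg_retain | plus | (private_network ? 1 << CAP_NET_ADMIN : 0)) & ~minus"""
    extra = (1 << CAP_NET_ADMIN) if private_network else 0
    return (base | plus | extra) & ~minus


# Assumption: the default retain set equals the CapBnd observed in the container.
base = 0xfdecafff

print(hex(retained_caps(base)))                              # plain container
print(hex(retained_caps(base, private_network=True)))        # --private-network
print(hex(retained_caps(base, plus=1 << CAP_NET_ADMIN)))     # capability added explicitly
```

In the last two cases bit 12 ends up in the bounding set, so the original file capabilities on /bin/ping would no longer trip the EPERM check.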
And that’s the end of the fun.
[1] This instantly rang a bell and I already knew what was happening, but let’s pretend I didn’t, since it’s a good excuse to show you some debugging techniques and have some fun.
[2] I have posted a patch adding the relevant information to the man page, so your copy may contain it.
[3] Actually, having debugfs is not needed; tracefs is enough.
[4] Or we could use some ftrace frontend, but this way is simpler and you can learn how it actually works.