Day 4: seccomp BPF Filter (filter mode deep dive)

A deep dive into seccomp filter mode: the BPF data layout, why pointer args can’t be dereferenced (TOCTOU and atomic context), the cBPF instruction skeleton, the multi-ABI bypasses every filter must defend against, and the eight SECCOMP_RET_* actions that power modern container security — including the USER_NOTIF + fd-injection pattern that runc/crun use.

1. Filter Mode Overview

Flow

[user] syscall → kernel entry →
  walk the BPF program list on task_struct->seccomp.filter →
    take the strictest return value (action) →
      ALLOW: dispatch through sys_call_table
      ERRNO: immediately return -1 + errno
      KILL_*: terminate the thread or process (as if by an uncatchable SIGSYS)
      TRAP: send SIGSYS
      LOG: like ALLOW, plus an audit record
      TRACE: hand off to ptrace tracer
      USER_NOTIF: kernel notifies a listener fd, supervisor decides

Hard prerequisite for installing a filter

You must first:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

Why:

  • An unprivileged user (no CAP_SYS_ADMIN) installing a filter must enable NNP — the kernel enforces this.
  • NNP semantics: the process promises that neither it nor its descendants will gain privileges via setuid binaries / file capabilities.
  • Without NNP, the kernel worries: a buggy filter + an exec into a setuid binary means the filter rides into the privileged process — that’s a fresh attack surface.
  • NNP itself is also irrevocable, and is inherited across fork/exec.

libseccomp’s seccomp_load() automatically sets NNP (unless you explicitly disable it via SCMP_FLTATR_CTL_NNP). The raw-BPF path requires you to do it manually.
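For the raw-BPF path, the manual step can be wrapped in a small helper. This is a minimal sketch; the function name enable_no_new_privs is made up for illustration, and it only sets the bit and reads it back.

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Set no_new_privs on the calling thread and confirm it took effect.
 * Returns 0 on success, -1 on failure (errno is set by prctl). */
static int enable_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }
    /* PR_GET_NO_NEW_PRIVS returns 1 once the bit is set. */
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 0 : -1;
}
```

Calling it before the seccomp call keeps the EACCES failure mode out of the picture for unprivileged processes.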


2. struct seccomp_data — what the filter can see

struct seccomp_data {
    int   nr;                  // syscall number (4B)
    __u32 arch;                // AUDIT_ARCH_X86_64, etc. (4B)
    __u64 instruction_pointer; // RIP that issued the syscall (8B)
    __u64 args[6];             // syscall args from rdi/rsi/rdx/r10/r8/r9 (8B × 6)
};

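The BPF program reads fields via BPF_LD | BPF_W | BPF_ABS plus offsetof(struct seccomp_data, field). Loads are 32 bits wide, so each 64-bit args[n] has to be read as two separate words (low half and high half).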
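The offsets a filter works with can be pinned down at compile time. The sketch below assumes a little-endian target such as x86_64 (low word of each argument first); the SD_* macro names are invented here and are not part of any header.

```c
#include <stddef.h>
#include <linux/seccomp.h>

/* Offsets a filter passes to BPF_LD | BPF_W | BPF_ABS. */
#define SD_NR    offsetof(struct seccomp_data, nr)
#define SD_ARCH  offsetof(struct seccomp_data, arch)

/* cBPF loads are 32 bits wide, so each 64-bit arg is read as two words.
 * Assumption: little-endian layout (x86_64), low word first. */
#define SD_ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define SD_ARG_HI(n) (SD_ARG_LO(n) + 4)

_Static_assert(SD_NR == 0, "nr is the first field");
_Static_assert(SD_ARCH == 4, "arch follows nr");
_Static_assert(offsetof(struct seccomp_data, instruction_pointer) == 8,
               "instruction_pointer at byte 8");
_Static_assert(SD_ARG_LO(0) == 16, "args start at byte 16");
_Static_assert(sizeof(struct seccomp_data) == 64, "64 bytes in total");
```

A filter that compares a full 64-bit argument needs two load-and-compare pairs, one for SD_ARG_LO(n) and one for SD_ARG_HI(n).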

Important constraint: args are raw values, not contents

  • Integer args (open flags, mmap prot): can be compared directly.
  • Pointer args (path string, buffer addr): args contain only the address as a number; you cannot dereference it.

Example: for mmap(addr, len, PROT_READ|PROT_EXEC, ...), you can filter on args[2] (prot) checking whether PROT_EXEC is set. But you cannot filter “execve’s path must not be /bin/sh” — args[0] is just an address, you don’t see the string contents.
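The mmap example could look roughly like the fragment below. It is a sketch under assumptions: the arch check has already run, the target is little-endian so the low word of args[2] sits at the field's own offset, and the array name deny_exec_mmap is made up. A real W^X policy would also have to cover mprotect.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical fragment: fail mmap() with EPERM when prot has PROT_EXEC. */
static const struct sock_filter deny_exec_mmap[] = {
    /* [0] A = nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not mmap: skip 3 instructions and land on ALLOW at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mmap, 0, 3),
    /* [2] A = low 32 bits of args[2] (prot) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, args[2])),
    /* [3] JSET is true when (A & PROT_EXEC) != 0; false skips to ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, PROT_EXEC, 0, 1),
    /* [4] integer argument matched: fail the call */
    BPF_STMT(BPF_RET | BPF_K,
             SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [5] everything else */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Nothing comparable can be written for execve's path argument, because the only thing available at args[0] is the pointer value itself.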

Why user pointers can’t be dereferenced (two reasons)

1. TOCTOU race The filter reads the path, sees "/bin/echo", and approves. After the check, another thread does strcpy(path, "/bin/sh"). By the time the kernel actually exec’s, the value has changed. There’s a time gap between the filter’s check and the kernel’s use, and what the filter saw isn’t what the kernel uses.

2. Atomic context can’t fault (more fundamental)

  • The cBPF instruction set has no “safe copy from user data” primitive (no copy_from_user equivalent); a seccomp program can only load from the fixed seccomp_data buffer.
  • The filter runs at syscall entry as a non-sleepable BPF program: it must run to completion without blocking.
  • If BPF dereferenced a user pointer and the page wasn’t resident → page fault → the fault handler may need to sleep to bring the page in → a non-sleepable program cannot wait for that.
  • Even if BPF were extended to allow deref, what should the filter do when deref fails (EFAULT)? There’s no clean semantic.

Remember: TOCTOU is “even if you could deref, it wouldn’t be safe”; atomic-context-can’t-fault is “you can’t deref at all”.

What if you need content checks?

  • AppArmor / SELinux (LSM hook): the kernel has already parsed the path; decisions are made on path/inode/label → Day 5.
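  • SECCOMP_RET_TRACE: hand control to a ptrace tracer, which can read the tracee’s memory from user space (the TOCTOU race above still applies unless the tracer copies or rewrites the arguments).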
  • SECCOMP_RET_USER_NOTIF (5.0+): a supervisor receives notifications via fd; user space decides.
  • BPF-LSM (modern): hang eBPF on LSM hooks, with BPF helpers for safe deref.

3. BPF Instruction Skeleton (cBPF)

Three core instruction types

// Load: read data from seccomp_data into the BPF accumulator
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr))

// Jump: compare accumulator against an immediate, branch jt or jf
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, jt, jf)

// Return: hand the kernel an action
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM)

jt/jf semantics (don’t get them backwards)

BPF_JUMP(op, value, jt, jf):

  • Condition true → skip jt instructions
  • Condition false → skip jf instructions
  • jt=0 = on true, don’t skip; just continue to the next instruction
  • jt=1, jf=0 = on true, skip the next instruction; only execute it on false

Two ways to write a simple “if equal then action”:

// Form A: equal → execute action (fall through), otherwise → skip past action
BPF_JUMP(..., NR, 0, 1),
BPF_STMT(..., ACTION),     // executes when equal

// Form B: equal → jump to action, otherwise → continue
BPF_JUMP(..., NR, 1, 0),
BPF_STMT(..., DEFAULT),    // executes when not equal
BPF_STMT(..., ACTION),     // jumped to when equal
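One way to convince yourself of the skip semantics is to dry-run a filter in user space. The evaluator below is a toy written for this post's instruction subset only (toy_run and form_a are made-up names, and 59 stands in for a syscall number); it is not the kernel's interpreter and does no validation beyond a bounds check.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Toy evaluator for LD|W|ABS, JMP|{JEQ,JGE,JSET}|K and RET|K. */
static uint32_t toy_run(const struct sock_filter *f, size_t len,
                        const struct seccomp_data *sd)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *i = &f[pc];
        int cond;
        switch (i->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (i->k > sizeof(*sd) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;   /* out of bounds */
            memcpy(&acc, (const char *)sd + i->k, sizeof(acc));
            continue;
        case BPF_RET | BPF_K:
            return i->k;
        case BPF_JMP | BPF_JEQ | BPF_K:  cond = (acc == i->k);     break;
        case BPF_JMP | BPF_JGE | BPF_K:  cond = (acc >= i->k);     break;
        case BPF_JMP | BPF_JSET | BPF_K: cond = (acc & i->k) != 0; break;
        default:
            return SECCOMP_RET_KILL_PROCESS;       /* unsupported opcode */
        }
        /* The loop's pc++ is the normal advance; jt/jf add the extra skip. */
        pc += cond ? i->jt : i->jf;
    }
    return SECCOMP_RET_KILL_PROCESS;               /* fell off the end */
}

/* Form A: equal falls through to the action, not-equal skips it. */
static const struct sock_filter form_a[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 59, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),  /* nr == 59 */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),          /* otherwise */
};
```

Feeding it a zeroed struct seccomp_data with nr set to 59 yields the ERRNO action; any other nr yields ALLOW. Swapping jt and jf in the jump inverts both outcomes, which is exactly the mistake the heading warns about.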

jt/jf aren’t restricted to (0,1) and (1,0)

Any 0–255 offset works. Multi-branch dispatch can jump straight to the end:

BPF_JUMP(..., __NR_execve,   5, 0),   // equal: skip 5 → straight to KILL
BPF_JUMP(..., __NR_execveat, 4, 0),
BPF_JUMP(..., __NR_ptrace,   3, 0),
BPF_JUMP(..., __NR_mount,    2, 0),
BPF_JUMP(..., __NR_unshare,  1, 0),
BPF_STMT(..., ALLOW),
BPF_STMT(..., KILL),
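Counting those offsets by hand is error-prone, so they can be computed instead. The helper below is an illustrative sketch (build_denylist is a made-up name) that emits the same shape as the listing above: one load, one JEQ per denied syscall, then ALLOW and KILL. It omits the arch and x32 checks.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Emit: load nr, one JEQ per denied syscall, ALLOW, KILL.
 * out must have room for n + 3 entries. Returns the instruction count,
 * or 0 when n does not fit the 8-bit jt field. */
static size_t build_denylist(const int *denied, size_t n,
                             struct sock_filter *out)
{
    size_t pc = 0;

    if (n == 0 || n > 255)
        return 0;

    out[pc++] = (struct sock_filter)BPF_STMT(
        BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr));

    for (size_t i = 0; i < n; i++) {
        /* Skip the remaining (n - 1 - i) compares plus ALLOW to hit KILL. */
        unsigned char jt = (unsigned char)(n - i);
        out[pc++] = (struct sock_filter)BPF_JUMP(
            BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)denied[i], jt, 0);
    }

    out[pc++] = (struct sock_filter)BPF_STMT(
        BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    out[pc++] = (struct sock_filter)BPF_STMT(
        BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
    return pc;
}
```

With five denied syscalls the first jump gets jt=5 and the last gets jt=1, matching the hand-written version.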

libseccomp orders rules by priority and, since 2.5, can optionally emit a binary-tree layout (SCMP_FLTATR_CTL_OPTIMIZE set to 2), so hundreds of rules still execute efficiently.


4. Standard Filter Template (block execve)

#include <unistd.h>
#include <stddef.h>
#include <errno.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/syscall.h>   // SYS_seccomp, __NR_execve, __NR_execveat

int main(void) {
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) return 1;

    struct sock_filter filter[] = {
        // === ARCH CHECK (must be first) ===
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        // === LOAD SYSCALL NR ===
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),

        // === Check x32 ABI bit (extra bypass defense) ===
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        // === Reject execve ===
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),

        // === Reject execveat (bypass-defense) ===
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execveat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),

        // === Default allow ===
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),   // don't hardcode!
        .filter = filter,
    };

    // Modern: syscall(SYS_seccomp, ...) supports flags
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) < 0)
        return 1;

    return 0;
}

5. ARCH CHECK in Detail (the core of bypass defense)

Three ABIs on x86_64

| ABI | arch constant | execve nr | Notes |
| --- | --- | --- | --- |
| x86_64 (LP64) | AUDIT_ARCH_X86_64 | 59 | Normal 64-bit process |
| x32 (ILP32 but 64-bit registers) | AUDIT_ARCH_X86_64 ⚠️ | __X32_SYSCALL_BIT + 520 (0x40000208) | Shares the arch constant, so the filter must check the high bit of nr |
| i386 (32-bit compat) | AUDIT_ARCH_I386 | 11 | Via int 0x80 or 32-bit syscall |

The attack when arch isn’t checked

Goal: invoke execve. Filter blocked __NR_execve (59 on x86_64).

Method 1 (i386 compat):
  Switch to a 32-bit code segment, int 0x80 with eax=11
  → kernel sees nr=11, arch=AUDIT_ARCH_I386
  → filter doesn't check arch, compares nr=11 != 59, allows
  → execve runs ✗

Method 2 (x32, on a kernel built with x32 support):
  syscall with rax = __X32_SYSCALL_BIT | 520 (0x40000208)
  → kernel sees nr=0x40000208, arch=AUDIT_ARCH_X86_64
  → filter compares nr=0x40000208 != 59, allows
  → execve (x32 variant) runs ✗
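The x32 case comes down to simple arithmetic on the syscall number. In the sketch below the constant is redefined locally (it has the same value as __X32_SYSCALL_BIT in the x86 UAPI headers) so that it compiles on any architecture; the function names are invented for illustration.

```c
#include <stdint.h>

/* Same value as __X32_SYSCALL_BIT; redefined so this builds anywhere. */
#define X32_SYSCALL_BIT 0x40000000u

/* Numbers from the x86 syscall tables: native execve, x32 execve slot. */
enum { NR_EXECVE_X86_64 = 59, NR_EXECVE_X32_SLOT = 520 };

/* What a filter that only tests "nr == 59" decides. */
static int naive_filter_blocks(uint32_t nr)
{
    return nr == NR_EXECVE_X86_64;
}

/* The extra test from the template: any nr at or above the x32 bit. */
static int is_x32_syscall(uint32_t nr)
{
    return nr >= X32_SYSCALL_BIT;
}
```

For nr = X32_SYSCALL_BIT | NR_EXECVE_X32_SLOT the naive check reports "not blocked" while is_x32_syscall catches it, which is why the template kills on that range before looking at individual numbers.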

Strict defense

// 1. arch must be x86_64, otherwise kill
BPF_STMT(LD ABS, offsetof(arch)),
BPF_JUMP(JEQ K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(RET K, KILL_PROCESS),       // wrong arch, kill outright

// 2. nr must not have the x32 bit set
BPF_STMT(LD ABS, offsetof(nr)),
BPF_JUMP(JGE K, 0x40000000, 0, 1),   // nr >= X32_BIT?
BPF_STMT(RET K, KILL_PROCESS),       // x32 path, kill

Simplification rule: in modern distros nobody really uses x32 (Debian deprecated it), and i386 is increasingly rare. If the arch isn’t on the allowlist, KILL — safer than writing rules for every ABI.

libseccomp does this for you: seccomp_arch_add(ctx, SCMP_ARCH_X86) to opt in explicitly; without it, only the native arch is supported.


6. Full SECCOMP_RET_* Semantics

Return-value layout

A 32-bit return: high 16 bits = action, low 16 bits = data:

[action: 16 bits][data: 16 bits]

| Action | Value | Syscall executes? | Behavior |
| --- | --- | --- | --- |
| KILL_PROCESS | 0x80000000 | No | Terminates the entire process (all threads together), as if by an uncatchable SIGSYS |
| KILL_THREAD (=KILL) | 0x00000000 | No | Terminates only the calling thread the same way; other threads continue ⚠️ |
| TRAP | 0x00030000 | No | Send SIGSYS, the handler can catch it |
| ERRNO | 0x00050000 \| errno | No | Immediately return -1, errno = data |
| USER_NOTIF | 0x7fc00000 | Suspends | Kernel notifies an fd; supervisor decides |
| TRACE | 0x7ff00000 | Suspends | Hand control to ptrace tracer |
| LOG | 0x7ffc0000 | Yes | Like ALLOW, plus audit log |
| ALLOW | 0x7fff0000 | Yes | Normal execution |

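Lower numerical value = higher priority, with the action compared as a signed 32-bit integer: that is why KILL_PROCESS (0x80000000, negative when signed) ranks below KILL_THREAD (0). When multiple filters stack, the smallest = strictest action wins. KILL_PROCESS is strictest, ALLOW is loosest.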
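The layout and the precedence rule fit in a few lines. This is a sketch with invented helper names; the signed cast mirrors how the action is ordered, and it relies on the usual two's-complement conversion that GCC and Clang perform.

```c
#include <stdint.h>
#include <linux/seccomp.h>

/* Split a filter return value into its action and data halves. */
static uint32_t ret_action(uint32_t ret) { return ret & SECCOMP_RET_ACTION_FULL; }
static uint32_t ret_data(uint32_t ret)   { return ret & SECCOMP_RET_DATA; }

/* Pick the winner between two stacked filters' results. The action is
 * compared as a signed value, so KILL_PROCESS (0x80000000) sorts below
 * KILL_THREAD (0) and therefore wins. */
static uint32_t stricter(uint32_t a, uint32_t b)
{
    return (int32_t)ret_action(b) < (int32_t)ret_action(a) ? b : a;
}
```

For example, stricter(SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | EPERM) picks the ERRNO result, and pairing anything with SECCOMP_RET_KILL_PROCESS picks the kill.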

Practical uses for each action

KILL_PROCESS vs KILL_THREAD: the latter only kills the current thread; others keep running — easy way to get a half-dead process with inconsistent state. Production default: KILL_PROCESS. The KILL_THREAD alias KILL is a historical wart.

TRAP: the program installs its own SIGSYS handler. When the filter hits, control enters the handler, which can read syscall info (siginfo_t carries nr/arch/ip) and decide what to do next. Used in sandboxes to implement “custom syscall emulation”.

ERRNO: program thinks the syscall failed but keeps running.

  • Pro: high compatibility — programs follow their normal error-path fallback.
  • Con: weak security — an attacker who sees the failure can try other attacks.
  • Docker’s default profile uses this.

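LOG: for debugging / audit phases. Set everything to LOG, run the workload, examine the audit log to see which syscalls were invoked, and use that to write the real filter. It needs a kernel with audit support (CONFIG_AUDIT) and “log” listed in /proc/sys/kernel/seccomp/actions_logged. SECCOMP_FILTER_FLAG_LOG is a separate switch: it asks the kernel to also log the other non-ALLOW actions a filter returns.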

Analogous to AppArmor’s complain mode. Same “observe first, tighten later” workflow.

TRACE: hand control to a ptrace tracer.

  • Tracer sets PTRACE_O_TRACESECCOMP via PTRACE_SETOPTIONS.
  • On filter hit, the tracer is notified with a PTRACE_EVENT_SECCOMP stop (a SIGTRAP-based ptrace event); the low 16 bits of the RET data are read with PTRACE_GETEVENTMSG.
  • Tracer can: modify args and let it through / replace with a harmless syscall / return a fake errno / skip.
  • Powerful but expensive (a context switch into the tracer per syscall).
  • Used by older sandboxes: early gVisor, Firejail, minijail.

USER_NOTIF (kernel 5.0+, modern): syscall is suspended; the kernel notifies a supervisor via a listener fd.

  • No ptrace relationship needed — a fully independent process can listen.
  • fd-based — works with epoll / io_uring.
  • 5.9+ SECCOMP_IOCTL_NOTIF_ADDFD: the supervisor can inject an fd into the sandboxed process — a nuclear feature.
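    • Typical pattern for fd-returning syscalls: a process inside a container calls open or socket → seccomp intercepts → notif goes to the container manager → the manager performs the real operation on the host side → it injects the resulting fd with ADDFD and replies with the new fd number as the return value. Intercepted mount calls follow the same flow without the fd step: the manager mounts on the container’s behalf and replies with the result.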
  • Heavily used by runc / crun / youki / containerd for “userspace syscall emulation”.
  • Production-grade container security frontier.
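The supervisor side boils down to a receive/respond pair of ioctls on the listener fd. The sketch below handles a single notification and always denies it; deny_one_notification is a made-up name, the listener fd is assumed to come from SECCOMP_FILTER_FLAG_NEW_LISTENER, and fixed-size structs are used for brevity where a robust supervisor would size its buffers with SECCOMP_GET_NOTIF_SIZES.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Receive one notification from listener_fd and fail the pending syscall
 * with EPERM. Returns 0 on success, -1 if either ioctl fails. */
static int deny_one_notification(int listener_fd)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    memset(&req, 0, sizeof(req));    /* the kernel expects a zeroed buffer */
    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
        return -1;

    /* req.pid is the caller; req.data is the seccomp_data the filter saw. */
    memset(&resp, 0, sizeof(resp));
    resp.id = req.id;                /* ties the reply to this request */
    resp.error = -EPERM;             /* negative errno for the caller */
    resp.val = 0;
    resp.flags = 0;

    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
        return -1;
    return 0;
}
```

A real supervisor would run this in a loop, inspect req.data.nr and the arguments, and for fd-returning calls use SECCOMP_IOCTL_NOTIF_ADDFD before replying.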

7. Filter Lifecycle

| Event | Filter state |
| --- | --- |
| seccomp_load() (default) | Installed on the current thread’s task_struct, per-thread |
| SECCOMP_FILTER_FLAG_TSYNC | Synced to all sibling threads in the same thread group |
| Threads that already existed at install time, without TSYNC | Not covered by the new filter: sandbox hole ⚠️ |
| pthread_create() / fork() / clone() after install | Inherited (BPF prog is refcount-shared, not copied) |
| seccomp_load() on top of an existing filter | Stacked: all filters run, strictest return wins |
| execve() | Inherited (filter rides along with the new image) |
| Process exit | Refcount drops, BPF prog eventually freed |

Operations that don’t exist

  • ❌ Uninstall a filter
  • ❌ Loosen a filter
  • ❌ Clear filters and reinstall

Monotonic accumulation, irrevocable — the core security guarantee of the seccomp model. With container runtimes nesting layers (host → docker → process), each layer adds its own restrictions; once anything tightens, it stays tight.

Why TSYNC matters

A filter installed without TSYNC only affects the thread that installed it and the threads it creates afterwards. In a multi-threaded program:

  • Main thread enters the sandbox, but worker threads started earlier don’t → an attacker hops to a worker thread to bypass.
  • You must use seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, ...).

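TSYNC constraint: the sync fails if any sibling thread is in strict mode or has attached filters of its own that diverge from the caller’s filter chain. In that case no filter is installed and the call returns the ID of the first thread that could not be synced.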


8. Real-World Filter Pitfalls

8.1 Forgetting execveat

Blocking execve requires also blocking execveat (syscall 322), or syscall(SYS_execveat, AT_FDCWD, "/bin/ls", ...) bypasses you.

One “semantic action” often maps to multiple syscalls — miss one and you’re bypassed:

  • open / openat / openat2
  • stat / fstat / lstat / newfstatat / statx
  • accept / accept4
  • recv / recvfrom / recvmsg / recvmmsg
  • pipe / pipe2
  • dup / dup2 / dup3
  • fork / vfork / clone / clone3

A common LLM-generated-filter failure mode is missing these variants. When evaluating filters, “does it cover the syscall family?” deserves to be its own metric.

8.2 Forgetting multi-ABI

See §5. x32 / i386 bypasses.

8.3 Locking yourself out

After a filter blocks execve, if the process that installed the filter then tries to exec the sandbox runner → dead.

Correct ordering:

fork → child {
    set namespace / cgroup / NNP
    execve("/runner", ...)
    // runner installs its own seccomp filter
}

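Or keep execve reachable through USER_NOTIF: install the filter (with a listener) before the exec, and have the supervisor approve that one execve, for example by replying with SECCOMP_USER_NOTIF_FLAG_CONTINUE (5.5+).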
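The fork-then-exec ordering can be sketched in C as follows. The function name spawn_runner is hypothetical, namespace and cgroup setup is left as a comment, and the runner binary is assumed to install its own filter once it starts.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, prepare the child, exec the runner, and return its exit status
 * (or -1 on error). The runner, not this function, installs the filter. */
static int spawn_runner(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Namespace and cgroup setup would go here. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(126);
        execv(argv[0], argv);        /* no exec-blocking filter yet */
        _exit(127);                  /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because NNP is inherited across exec, the runner can install its filter immediately without needing CAP_SYS_ADMIN.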

8.4 Hardcoding .len

struct sock_fprog prog = { .len = 7, .filter = filter };   // ❌ forget to update when adding rules → broken
struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),             // ✓
    .filter = filter
};

8.5 Not checking return values

prctl / seccomp(2) / seccomp_rule_add / seccomp_load all need to be checked. Possible failures:

  • Kernel built without CONFIG_SECCOMP_FILTER
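  • BPF program too big (4096 instructions per filter, BPF_MAXINSNS; the 32768-instruction limit applies to the whole chain of stacked filters)
  • Checker rejected it (the BPF program has a bug, such as an out-of-range jump or load) → EINVAL
  • NNP not set and the caller lacks CAP_SYS_ADMIN → EACCES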

Note: libseccomp returns negative errno — pass -rc to strerror.
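A small lookup makes these failures easier to read in logs. The sketch below maps the errno values documented for seccomp(2) filter installation to likely causes; the function names and the hint strings are this example's own, not part of any library.

```c
#include <errno.h>
#include <string.h>

/* Map the errno from a failed seccomp(SECCOMP_SET_MODE_FILTER, ...) call
 * to a likely cause. The hint wording is specific to this sketch. */
static const char *seccomp_errno_hint(int err)
{
    switch (err) {
    case EACCES: return "no_new_privs not set and no CAP_SYS_ADMIN";
    case EINVAL: return "filter rejected, bad flags, or no kernel support";
    case ENOMEM: return "out of memory or stacked filters too long";
    case EFAULT: return "bad pointer to sock_fprog or its filter array";
    default:     return strerror(err);
    }
}

/* libseccomp calls return a negative errno, so negate before the lookup. */
static const char *libseccomp_rc_hint(int rc)
{
    return seccomp_errno_hint(rc < 0 ? -rc : rc);
}
```

With the raw syscall you would pass errno after a -1 return; with seccomp_load() or seccomp_rule_add() you would pass the return code straight to libseccomp_rc_hint.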


9. libseccomp vs raw cBPF Trade-offs

| Dimension | libseccomp | raw cBPF |
| --- | --- | --- |
| Multi-ABI support | Automatic (seccomp_arch_add) | Manually write each arch branch |
| Expressiveness | High-level API (SCMP_A0(SCMP_CMP_EQ, ...)) | Direct BPF instructions, full control |
| Performance optimization | Priority ordering, optional binary-tree layout | Sequential by default; you optimize |
| NNP automation | Yes | Must do manually |
| Debuggability | High abstraction makes errors hard to trace | The BPF program is what you see |
| Use case | Production | Learning / extreme customization |

In practice: almost all production uses libseccomp (or a higher-level OCI runtime spec). Raw BPF is used only when performance is extreme, rules are tiny, and depending on libseccomp isn’t an option.


10. Modern seccomp(2) Syscall

syscall(SYS_seccomp, op, flags, args)

Advantages over the old prctl(PR_SET_SECCOMP, ...):

  • Supports flags
  • Cleaner API

Important flags

| Flag | Meaning |
| --- | --- |
| SECCOMP_FILTER_FLAG_TSYNC | Sync to all sibling threads |
| SECCOMP_FILTER_FLAG_LOG | Log every action the filter returns except ALLOW (subject to actions_logged) |
| SECCOMP_FILTER_FLAG_SPEC_ALLOW | Disable Spectre v4 mitigation (trusted process) |
| SECCOMP_FILTER_FLAG_NEW_LISTENER | Returns an fd for USER_NOTIF |
| SECCOMP_FILTER_FLAG_TSYNC_ESRCH | On TSYNC failure, return ESRCH instead of conflicting tid |

11. Debugging Filters

See which syscall got rejected

Set SECCOMP_RET_LOG, then look at audit:

sudo ausearch -m SECCOMP -ts recent
# or
sudo dmesg | grep 'type=1326'   # 1326 = AUDIT_SECCOMP; shows up here when auditd is not running

Output includes syscall nr, comm, pid, arch.

Real-time view via bpftrace

sudo bpftrace -e '
  tracepoint:syscalls:sys_enter_* /comm == "myprog"/ {
    @[probe] = count();
  }'

strace still works

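After seccomp is installed, strace can still trace (since Linux 4.8, ptrace stops run before the seccomp filter). Calls denied with ERRNO show up as ordinary failures (for example = -1 EPERM); killed calls are left as <unfinished ...> followed by a “killed by SIGSYS” line.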

Diagnosing when NNP isn’t set

If filter installation fails with EACCES, the cause is a missing NNP (or, for a privileged installer, a missing CAP_SYS_ADMIN). strace will show seccomp(...) returning EACCES.


12. Takeaways

  1. NNP first, filter second — the precondition for unprivileged installation.
  2. First instruction is always an arch check — three ABIs (x86_64 / x32 / i386); x32 shares the arch constant, so check the nr high bit.
  3. Pointer args cannot be dereferenced (TOCTOU + atomic context). For content checks, move up to LSM or use USER_NOTIF.
  4. One semantic action ≠ one syscall — must cover the family (execve+execveat, open+openat+openat2, …).
  5. Filters accumulate monotonically and are irrevocable; inherited across fork/exec; per-thread (use TSYNC).
  6. Eight RET actions ranked by strictness; production should use KILL_PROCESS, not KILL_THREAD.
  7. USER_NOTIF is the modern container-security frontier — fd-based + supports fd injection.
  8. Production uses libseccomp + the seccomp(2) syscall, not prctl.

Day 5 Preview

Tomorrow: AppArmor. The pain points filter mode can’t address (content checks) are exactly what the LSM layer solves. Highlights:

  • AppArmor is path-based; SELinux is label-based.
  • Profile syntax (r/w/ix/Px/Cx/Ux).
  • enforce / complain modes: the same workflow shape as seccomp’s observe-with-LOG, then enforce with ERRNO / KILL.
  • The fundamental weakness of path-based control (hardlink / bind mount / symlink bypasses).
  • How seccomp + AppArmor divide responsibilities.