Day 3: seccomp Basics + Strict Mode

How seccomp intercepts syscalls at the kernel entry, what SECCOMP_MODE_STRICT actually allows (and the _exit() trap that fools nearly everyone), plus the design philosophy that filter mode inherits: monotonic, irrevocable security state.

1. What seccomp Is

Core role: intercept syscalls at the syscall entry point. It’s the syscall-dimension line of defense in a sandbox.

Two modes:

  • SECCOMP_MODE_STRICT — only 4 syscalls allowed. Minimal, hardcoded.
  • SECCOMP_MODE_FILTER — rules expressed as a BPF program. Flexible, mainstream (Day 4).

Where it sits in the stack:

[user-space syscall]
        ↓
[seccomp filter]   ← syscall nr + integer-arg granularity
        ↓
[capability check] ← coarse-grained permissions
        ↓
[LSM hook (AppArmor/SELinux)]  ← path/inode/label granularity (Day 5)
        ↓
[actual kernel handler]

Each layer enforces what it’s good at. seccomp cannot see path-string contents (more on this Day 4).


2. SECCOMP_MODE_STRICT

The 4 allowed syscalls

| syscall | nr (x86_64) | Why it’s allowed |
|---|---|---|
| read | 0 | Read from already-open fd |
| write | 1 | Write to already-open fd |
| exit | 60 | You need a way to exit |
| rt_sigreturn | 15 | Auto-called by the kernel on signal-handler return; banning it would make returning from any signal handler fatal |

Notably not in the list:

  • exit_group (231): what glibc’s _exit() actually invokes. Trap: _exit(0) under strict mode gets SIGKILL’d.
  • openat: want to open a new file? Nope. You only get fds that were already open when strict mode was entered (inherited, or opened earlier by the process itself).
  • getpid, brk, mmap: all banned.

The intended pattern

Pre-open all fds → enter strict → run pure-compute / IO task → exit.

Historically used by:

  • Some early Chrome workers
  • Sandboxed compilers and contest judges

It’s too restrictive in practice; once filter mode landed, strict mode was effectively retired. But understanding strict is the key to understanding seccomp’s design philosophy: security properties accumulate monotonically and cannot be revoked.

How to enable

// Method 1: prctl
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

// Method 2: modern seccomp(2)
syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL);
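
The pre-open, enter strict, compute, raw-exit pattern can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the helper name count_bytes_strict and the exit codes 98/99 are made up for this example, and it assumes Linux with glibc. Note that strict mode cannot be entered from a process that is already in filter mode (for instance inside many container runtimes), which the sketch reports as 99.

```c
#include <fcntl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: count the bytes of `path` inside a strict-mode child
 * and write the total (a long) to out_fd.
 * Returns the child's exit code: 0 = ok, 1 = read error, 98 = open failed,
 * 99 = strict mode unavailable; or -1 if the child was killed by a signal. */
static int count_bytes_strict(const char *path, int out_fd)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* 1. Pre-open everything while open() still works. */
        int in_fd = open(path, O_RDONLY);
        if (in_fd < 0) _exit(98);

        /* 2. Enter strict mode: only read/write/exit/rt_sigreturn from here. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) _exit(99);

        /* 3. Pure compute + IO on already-open fds. The buffer lives on the
         *    stack because brk/mmap (and therefore malloc growth) are banned. */
        char buf[4096];
        long total = 0, n;
        while ((n = syscall(SYS_read, in_fd, buf, sizeof buf)) > 0)
            total += n;
        syscall(SYS_write, out_fd, &total, sizeof total);

        /* 4. Raw exit; _exit() would call exit_group and be killed. */
        syscall(SYS_exit, n < 0 ? 1 : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would typically pass the write end of a pipe as out_fd and read the total from the other end once the helper returns 0.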

3. Experimental Code (verifying the 4 allowed syscalls)

The critical trap

Don’t use _exit(0)! glibc’s _exit() internally calls exit_group (syscall 231), not exit (syscall 60). Strict only allows 60, so:

  • printf succeeds, then _exit(0) → exit_group → SIGKILL
  • The write itself worked, but exit_group failing makes you see “write killed” and conclude (wrongly) that write was banned.

Correct approach: use raw syscall(SYS_exit, 0) to invoke nr 60 directly.
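
To see the trap in isolation, a small sketch like the one below runs the same child twice, once ending with _exit(0) and once with the raw syscall. The function name run_exit_variant and the sentinel exit code 99 are assumptions made for this illustration; the return value -1 covers environments where strict mode cannot be entered at all.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns 0 if the child exited normally, the signal
 * number if it was killed, or -1 if fork failed or strict mode could not
 * be entered. */
static int run_exit_variant(int use_raw_exit)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) _exit(99);
        syscall(SYS_write, 1, "write ok\n", 9L);  /* write is allowed */
        if (use_raw_exit)
            syscall(SYS_exit, 0L);                /* exit (60 on x86_64): allowed */
        _exit(0);                                 /* exit_group: killed by seccomp */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFSIGNALED(status)) return WTERMSIG(status);
    return WEXITSTATUS(status) == 99 ? -1 : 0;
}
```

Both variants print "write ok" first, which shows the write succeeded and only the exit path differs: run_exit_variant(1) should report a normal exit, while run_exit_variant(0) should report SIGKILL.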

Code skeleton

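#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Run fn in a forked child under strict mode and report how the child ended.
static void test_syscall(const char *name, void (*fn)(void)) {
    fflush(stdout);   // keep parent output ordered relative to the child's raw write
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return; }
    if (pid == 0) {
        // child: enter strict, run fn, raw exit
        // (_exit is still safe here because strict mode was not entered)
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(99);
        fn();
        syscall(SYS_exit, 0);   // ★ NOT _exit(0)
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("%s killed by signal %d\n", name, WTERMSIG(status));
    else
        printf("%s exited with %d\n", name, WEXITSTATUS(status));
}

static void do_read(void)      { char buf[1]; syscall(SYS_read, 0, buf, 0); }  // len=0 returns immediately
static void do_write(void)     { syscall(SYS_write, 1, "hi", 2); }
static void do_exit(void)      { syscall(SYS_exit, 0); }
static void do_sigreturn(void) { syscall(SYS_rt_sigreturn); }   // allowed, but no valid frame on stack → SIGSEGV
static void do_getpid(void)    { syscall(SYS_getpid); }
static void do_openat(void)    { syscall(SYS_openat, AT_FDCWD, "/etc/passwd", O_RDONLY); }

int main(void) {
    test_syscall("read", do_read);
    test_syscall("write", do_write);
    test_syscall("exit", do_exit);
    test_syscall("sigreturn", do_sigreturn);
    test_syscall("getpid", do_getpid);
    test_syscall("openat", do_openat);
    return 0;
}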

Expected output

read       exited 0           ← allowed (len=0 returns immediately)
write      "hi" + exited 0    ← allowed
exit       exited 0           ← allowed
sigreturn  killed by SIGSEGV (11)   ← syscall allowed, but no valid frame, program crashes itself
getpid     killed by SIGKILL (9)    ← banned
openat     killed by SIGKILL (9)    ← banned

Reverse-validating sigreturn: signal 11 (SEGV) rather than 9 (KILL) proves the syscall was allowed to execute — it ran, read garbage off the stack, jumped to a bogus address, and the program crashed itself. If sigreturn had been banned, you’d see SIGKILL like getpid.


4. Two Lessons

4.1 _exit() ≠ SYS_exit

| C function | Actual syscall | Behavior |
|---|---|---|
| exit(0) (after atexit + stdio flush) | exit_group (231) | Whole process exits |
| _exit(0) | exit_group (231) | Whole process exits, skips atexit/flush |
| syscall(SYS_exit, 0) | exit (60) | Only the current thread exits |

When reasoning about syscall behavior, look at the syscall number — not the C function name. Under strict mode, this distinction is fatal.

4.2 Don’t only watch for SIGKILL

  • SIGKILL (9): under strict mode, almost certainly seccomp killed it (a violation there is always answered with SIGKILL).
  • SIGSYS (31): filter mode. SECCOMP_RET_KILL_PROCESS / SECCOMP_RET_KILL_THREAD terminate the target as though by a SIGSYS that cannot be caught; SECCOMP_RET_TRAP delivers a SIGSYS the program can catch.
  • SIGSEGV (11): program crashed itself, not seccomp.
  • SIGBUS / SIGILL: program crashed itself.

This distinction is a lifesaver when diagnosing seccomp issues.
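
One way to bake this triage into a test harness is a small classifier over the waitpid() status. The sketch below is illustrative only: the function name diagnose and the wording of the messages are invented here, and the hints are heuristics (SIGKILL can also come from the OOM killer or a plain kill -9).

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: map a waitpid() status to a short diagnosis. */
static const char *diagnose(int status)
{
    if (WIFEXITED(status))
        return "exited normally";
    if (!WIFSIGNALED(status))
        return "stopped or continued, not terminated";

    switch (WTERMSIG(status)) {
    case SIGKILL:
        return "SIGKILL: likely killed by seccomp";
    case SIGSYS:
        return "SIGSYS: raised by a seccomp filter";
    case SIGSEGV:
    case SIGBUS:
    case SIGILL:
        return "program crashed itself, not seccomp";
    default:
        return "terminated by another signal";
    }
}
```

In the experiment above, passing the child's status through such a function would label the sigreturn case as a self-inflicted crash and the getpid case as a seccomp kill.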


5. How to Discover Which Syscalls Strict Allows

Read the kernel source

kernel/seccomp.c:

static const int mode1_syscalls[] = {
    __NR_seccomp_read, __NR_seccomp_write,
    __NR_seccomp_exit, __NR_seccomp_sigreturn,
    -1, /* negative terminated */
};

Brute-force enumerate

Loop through every nr (0 → ~440), launch a child for each, and see which ones don’t die. This kind of “differential experiment” is invaluable when reverse-engineering an unknown sandbox — and is exactly the technique you’d use to evaluate LLM-generated filters.


6. Strict vs Filter Mode Trade-offs

| Dimension | strict | filter |
|---|---|---|
| Expressiveness | Fixed 4 syscalls | Arbitrary BPF program |
| Complexity | Almost zero config | Requires writing BPF (or libseccomp) |
| Debuggability | Easy (alive or dead) | Hard (a wrong filter can lock you out) |
| Use case | Compute + pure IO tasks | Mainstream sandboxing |
| Modern usage | Almost none | Containers, browsers, sandbox runners |

7. Takeaways

  1. Strict mode = read/write/exit/sigreturn — designed for the “pre-open fds, then drop into sandbox” pattern.
  2. _exit calls exit_group, not exit — the glibc wrapper will lie to you.
  3. Distinguish signals: SIGKILL = strict mode killed it, SIGSYS = a filter action (kill or TRAP), SIGSEGV = program self-crashed.
  4. seccomp security state accumulates monotonically and is irrevocable — a design property that filter mode also preserves.
  5. Modern code uses seccomp(2) syscall, not prctl — the syscall supports flags (TSYNC, etc.).

Day 4 Preview

Tomorrow: BPF filters. Key points:

  • prctl(PR_SET_NO_NEW_PRIVS, 1, ...) is mandatory before unprivileged filter installation.
  • BPF programs see struct seccomp_data (nr / arch / ip / args[6]).
  • The very first thing any filter must do is an arch check; mismatch with expected ABI → KILL (x32 / i386 bypass).
  • Pointer-typed args cannot be dereferenced — only their address values can be compared (two reasons: TOCTOU + atomic context).