Day 3: seccomp Basics + Strict Mode
How seccomp intercepts syscalls at the kernel entry, what SECCOMP_MODE_STRICT actually allows (and the _exit() trap that fools nearly everyone), plus the design philosophy that filter mode inherits: monotonic, irrevocable security state.
1. What seccomp Is
Core role: intercept syscalls at the syscall entry point. It’s the syscall-dimension line of defense in a sandbox.
Two modes:
- SECCOMP_MODE_STRICT: only 4 syscalls allowed. Minimal, hardcoded.
- SECCOMP_MODE_FILTER: rules expressed as a BPF program. Flexible, mainstream (Day 4).
Where it sits in the stack:
[user-space syscall]
↓
[seccomp filter] ← syscall nr + integer-arg granularity
↓
[capability check] ← coarse-grained permissions
↓
[LSM hook (AppArmor/SELinux)] ← path/inode/label granularity (Day 5)
↓
[actual kernel handler]
Each layer enforces what it’s good at. seccomp cannot see path-string contents (more on this Day 4).
2. SECCOMP_MODE_STRICT
The 4 allowed syscalls
| syscall | nr (x86_64) | Why it’s allowed |
|---|---|---|
| read | 0 | Read from an already-open fd |
| write | 1 | Write to an already-open fd |
| exit | 60 | You need a way to exit |
| rt_sigreturn | 15 | Invoked by the signal trampoline when a handler returns; if it were banned, returning from any signal handler would get the process killed |
Notably not in the list:
- exit_group (231): what glibc’s _exit() actually invokes. Trap: _exit(0) under strict mode gets SIGKILL’d.
- openat: want to open a new file? Nope. You only get the fds that were already open when you entered strict mode.
- getpid, brk, mmap: all banned.
The intended pattern
Pre-open all fds → enter strict → run pure-compute / IO task → exit.
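A minimal sketch of that pattern, assuming Linux with glibc. `sandboxed_byte_sum` is a hypothetical helper invented for illustration: the child opens its input, enters strict mode, sums the bytes using only read and write, and leaves through the raw exit syscall. If the process is already running under a seccomp filter (common inside containers), the kernel refuses the switch to strict mode and the helper simply reports -1.

```c
#include <fcntl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sum the bytes of `path` inside a strict-mode child.
 * Returns the sum, or -1 if anything went wrong. */
long sandboxed_byte_sum(const char *path) {
    int result_pipe[2];
    if (pipe(result_pipe) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(result_pipe[0]);
        close(result_pipe[1]);
        return -1;
    }

    if (pid == 0) {
        /* 1. Pre-open everything while open() still works. */
        close(result_pipe[0]);
        int in = open(path, O_RDONLY);
        if (in < 0) _exit(2); /* not sandboxed yet, so _exit is safe */

        /* 2. Enter strict mode. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(3);

        /* 3. Pure compute + IO on fds we already hold. */
        unsigned char buf[4096];
        long sum = 0;
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++) sum += buf[i];
        if (write(result_pipe[1], &sum, sizeof sum) != (ssize_t)sizeof sum)
            syscall(SYS_exit, 1);

        /* 4. Raw exit; _exit() would issue exit_group and get killed. */
        syscall(SYS_exit, 0);
    }

    /* Parent: collect the result, then reap the child. */
    close(result_pipe[1]);
    long sum = -1;
    ssize_t got = read(result_pipe[0], &sum, sizeof sum);
    close(result_pipe[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (got != (ssize_t)sizeof sum || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return sum;
}
```

Calling `sandboxed_byte_sum("input.bin")` from an ordinary process should return the byte sum of the file. Note that the child never calls malloc or printf after step 2, since either could reach brk, mmap, or fstat and be killed.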
Historically used by:
- Some early Chrome workers
- Sandboxed compilers and contest judges
It’s too restrictive in practice; once filter mode landed, strict mode was effectively retired. But understanding strict is the key to understanding seccomp’s design philosophy: security properties accumulate monotonically and cannot be revoked.
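The irrevocability claim can be checked with a small experiment. The sketch below assumes Linux with glibc; `signal_after_disable_attempt` is a hypothetical name. The child enters strict mode and then tries to set the seccomp mode back to 0. Since prctl is not one of the four allowed syscalls, the attempt itself should be fatal.

```c
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns the signal that killed the child, 0 if the
 * child exited normally, or -1 if strict mode could not be entered. */
int signal_after_disable_attempt(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(99);
        /* Try to switch seccomp back off (mode 0). prctl is not an
         * allowed syscall any more, so the kernel kills the child here. */
        prctl(PR_SET_SECCOMP, 0);
        syscall(SYS_exit, 0); /* only reached if the attempt was tolerated */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 99) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a kernel where strict mode is available, the expected return value is 9 (SIGKILL). A return of -1 most likely means the process was already under a seccomp filter, which is the same monotonic rule seen from the other side: a thread in filter mode cannot switch to strict mode.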
How to enable
// Method 1: prctl
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
// Method 2: modern seccomp(2)
syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL);
3. Experimental Code (verifying the 4 allowed syscalls)
The critical trap
Don’t use _exit(0)! glibc’s _exit() internally calls exit_group (syscall 231), not exit (syscall 60). Strict only allows 60, so:
- The write succeeds, then _exit(0) → exit_group → SIGKILL.
- The write itself worked, but the SIGKILL triggered by exit_group makes the test report “write killed”, and you conclude (wrongly) that write was banned.
Correct approach: use raw syscall(SYS_exit, 0) to invoke nr 60 directly.
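The difference is easy to observe in isolation. This is a sketch assuming Linux with glibc; `strict_exit_signal` is a hypothetical helper that runs a child in strict mode and leaves either through the raw exit syscall or through the `_exit()` wrapper.

```c
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns the signal that killed the child, 0 if the
 * child exited normally, or -1 if strict mode could not be entered. */
int strict_exit_signal(int use_raw_exit) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(99);
        if (use_raw_exit)
            syscall(SYS_exit, 0); /* exit (nr 60 on x86_64): allowed */
        _exit(0);                 /* wrapper issues exit_group: killed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 99) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With strict mode available, `strict_exit_signal(1)` should return 0 and `strict_exit_signal(0)` should return 9 (SIGKILL), even though the child did nothing else between entering strict mode and exiting.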
Code skeleton
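#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child, enter strict mode in it, run fn, and report how the child ended.
void test_syscall(const char *name, void (*fn)(void)) {
    fflush(stdout);  // don't let the child inherit unflushed parent output
    pid_t pid = fork();
    if (pid == 0) {
        // child: enter strict, run fn, raw exit
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(99);  // not sandboxed yet, _exit is safe
        fn();
        syscall(SYS_exit, 0);  // ★ NOT _exit(0)
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("%s killed by signal %d\n", name, WTERMSIG(status));
    else
        printf("%s exited with %d\n", name, WEXITSTATUS(status));
}

void do_read(void)      { char buf[1]; syscall(SYS_read, 0, buf, 0); }  // len=0: returns immediately
void do_write(void)     { syscall(SYS_write, 1, "hi", 2); }
void do_exit(void)      { syscall(SYS_exit, 0); }
void do_sigreturn(void) { syscall(SYS_rt_sigreturn); }  // allowed, but no valid frame on stack → SIGSEGV
void do_getpid(void)    { syscall(SYS_getpid); }
void do_openat(void)    { syscall(SYS_openat, AT_FDCWD, "/etc/passwd", O_RDONLY); }

int main(void) {
    test_syscall("read", do_read);
    test_syscall("write", do_write);
    test_syscall("exit", do_exit);
    test_syscall("sigreturn", do_sigreturn);
    test_syscall("getpid", do_getpid);
    test_syscall("openat", do_openat);
    return 0;
}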
Expected output
read exited with 0 ← allowed (len=0 returns immediately)
hiwrite exited with 0 ← allowed (the child's "hi" has no trailing newline, so on a terminal it runs into the report line)
exit exited with 0 ← allowed
sigreturn killed by signal 11 ← SIGSEGV: syscall allowed, but no valid frame, program crashes itself
getpid killed by signal 9 ← SIGKILL: banned
openat killed by signal 9 ← SIGKILL: banned
Reverse-validating sigreturn: signal 11 (SEGV) rather than 9 (KILL) proves the syscall was allowed to execute — it ran, read garbage off the stack, jumped to a bogus address, and the program crashed itself. If sigreturn had been banned, you’d see SIGKILL like getpid.
4. Two Lessons
4.1 _exit() ≠ SYS_exit
| C function | Actual syscall | Behavior |
|---|---|---|
| exit(0) | exit_group (231) | Runs atexit handlers and flushes stdio, then the whole process exits |
| _exit(0) | exit_group (231) | Whole process exits, skips atexit/flush |
| syscall(SYS_exit, 0) | exit (60) | Only the current thread exits |
When reasoning about syscall behavior, look at the syscall number — not the C function name. Under strict mode, this distinction is fatal.
4.2 Don’t only watch for SIGKILL
- SIGKILL (9): almost certainly seccomp strict mode killed it (SIGKILL is the only action strict mode has).
- SIGSYS (31): a seccomp filter acted. RET_TRAP delivers a SIGSYS the program can catch; RET_KILL_PROCESS / RET_KILL_THREAD terminate the target as though by a SIGSYS that cannot be caught.
- SIGSEGV (11): program crashed itself, not seccomp.
- SIGBUS / SIGILL: program crashed itself.
This distinction is a lifesaver when diagnosing seccomp issues.
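The triage above can be folded into a small helper for test harnesses. This is a sketch only: the enum and function names are hypothetical, and the mapping is a heuristic, since SIGKILL can also come from the OOM killer or a plain `kill -9`.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status for seccomp triage. */
enum death_cause {
    EXITED_NORMALLY,   /* no terminating signal */
    LIKELY_STRICT_KILL,/* SIGKILL: strict-mode violation is the prime suspect */
    FILTER_SIGSYS,     /* SIGSYS: a seccomp filter action fired */
    SELF_CRASH,        /* SIGSEGV / SIGBUS / SIGILL: the program's own fault */
    OTHER_SIGNAL       /* anything else */
};

enum death_cause classify_wait_status(int status) {
    if (!WIFSIGNALED(status)) return EXITED_NORMALLY;
    switch (WTERMSIG(status)) {
    case SIGKILL: return LIKELY_STRICT_KILL;
    case SIGSYS:  return FILTER_SIGSYS;
    case SIGSEGV:
    case SIGBUS:
    case SIGILL:  return SELF_CRASH;
    default:      return OTHER_SIGNAL;
    }
}
```

In the Section 3 experiment, passing the `status` from `waitpid` through this function would separate the getpid and openat cases (SIGKILL) from the sigreturn case (SIGSEGV) without reading signal numbers by eye.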
5. How to Discover Which Syscalls Strict Allows
Read the kernel source
kernel/seccomp.c:
static const int mode1_syscalls[] = {
__NR_seccomp_read, __NR_seccomp_write,
__NR_seccomp_exit, __NR_seccomp_sigreturn,
-1, /* negative terminated */
};
Brute-force enumerate
Loop through every nr (0 → ~440), launch a child for each, and see which ones don’t die. This kind of “differential experiment” is invaluable when reverse-engineering an unknown sandbox — and is exactly the technique you’d use to evaluate LLM-generated filters.
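One way to sketch that loop, assuming Linux with glibc. `strict_allows` and `list_strict_allowed` are hypothetical names. Each probe passes -1 as the first argument so that read and write fail fast with EBADF instead of blocking, and anything other than SIGKILL counts as "allowed" (rt_sigreturn, for example, is expected to crash with a different signal).

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if syscall `nr` survives strict mode, 0 if the child was
 * SIGKILLed, -1 if strict mode could not be entered at all. */
int strict_allows(long nr) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) _exit(99);
        /* Banned syscalls are killed before they run, so the
         * argument values only matter for the allowed ones. */
        syscall(nr, -1L, 0L, 0L, 0L, 0L, 0L);
        syscall(SYS_exit, 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 99) return -1;
    return !(WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL);
}

/* Probe 0..max_nr and store up to `cap` allowed numbers in `out`.
 * Returns how many were stored. */
int list_strict_allowed(long max_nr, long *out, int cap) {
    int n = 0;
    for (long nr = 0; nr <= max_nr; nr++)
        if (n < cap && strict_allows(nr) == 1)
            out[n++] = nr;
    return n;
}
```

Calling `list_strict_allowed(440, buf, 16)` on x86_64 should reproduce the four numbers from the table in Section 2. The same harness shape carries over to filter mode by swapping the prctl call for a filter installation, though the "not SIGKILL" test would then need to account for SIGSYS and errno-returning actions.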
6. Strict vs Filter Mode Trade-offs
| Dimension | strict | filter |
|---|---|---|
| Expressiveness | Fixed 4 syscalls | Arbitrary BPF program |
| Complexity | Almost zero config | Requires writing BPF (or libseccomp) |
| Debuggability | Easy (alive or dead) | Hard (a wrong filter can lock you out) |
| Use case | Compute + pure IO tasks | Mainstream sandboxing |
| Modern usage | Almost none | Containers, browsers, sandbox runners |
7. Takeaways
- Strict mode = read/write/exit/sigreturn, designed for the “pre-open fds, then drop into sandbox” pattern.
- _exit calls exit_group, not exit: the glibc wrapper will lie to you.
- Distinguish signals: SIGKILL = strict mode killed it, SIGSYS = a filter acted (TRAP or KILL), SIGSEGV = program self-crashed.
- seccomp security state accumulates monotonically and is irrevocable, a design property that filter mode also preserves.
- Modern code uses the seccomp(2) syscall rather than prctl, because the syscall supports flags (TSYNC, etc.).
Day 4 Preview
Tomorrow: BPF filters. Key points:
- prctl(PR_SET_NO_NEW_PRIVS, 1, ...) is mandatory before unprivileged filter installation.
- BPF programs see struct seccomp_data (nr / arch / ip / args[6]).
- The very first thing any filter must do is an arch check; mismatch with expected ABI → KILL (x32 / i386 bypass).
- Pointer-typed args cannot be dereferenced; only their address values can be compared (two reasons: TOCTOU + atomic context).
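As a taste of what those points look like in code, here is a deliberately tiny filter. It is a sketch, not a usable policy: the helper names and the choice of getppid as the denied syscall are made up for illustration, and a real x86_64 filter would also need to reject x32 syscall numbers (nr with bit 30 set), which this one does not.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define EXPECTED_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define EXPECTED_ARCH AUDIT_ARCH_AARCH64
#else
#error "fill in the AUDIT_ARCH_* constant for this architecture"
#endif

/* Hypothetical demo policy: getppid fails with EPERM, everything else runs. */
static int install_demo_filter(void) {
    struct sock_filter prog[] = {
        /* 1. Arch check first: wrong ABI means kill. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EXPECTED_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* 2. Dispatch on the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof prog / sizeof prog[0]),
        .filter = prog,
    };
    /* Required before an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) return -1;
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog) < 0) return -1;
    return 0;
}

/* Returns 1 if getppid failed with EPERM under the filter, 0 if it did
 * not, -1 if the filter could not be installed. */
int getppid_denied_by_demo_filter(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (install_demo_filter() < 0) _exit(99);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 0 : 1); /* exit_group is allowed here */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) == 99) return -1;
    return WEXITSTATUS(status) == 0;
}
```

Unlike the strict-mode experiments, the child here can leave through the ordinary `_exit()` wrapper, because the filter allows exit_group.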