Week 2 Day 3: K8s security (seccomp)

Where Week 1 and Week 2 meet: K8s seccomp is the same kernel seccomp from Week 1, except runc installs the BPF filter for you instead of your application. This post traces the full kubelet → containerd → runc chain (and what runc actually does — namespaces, cgroups, NNP, caps, seccomp, AppArmor, then exec), the three profile types (Unconfined / RuntimeDefault / Localhost), why the profile JSON is just the declarative form of the cBPF you’d hand-write, why SCMP_ACT_ERRNO lets the process survive while mkdir returns EPERM, and the crucial limitation: seccomp blocks syscalls, not intent — block mkdir and touch (which uses openat) still creates files. Plus /proc/<pid>/status as the ground-truth proof a filter is loaded.

0. Where this fits

Week 1 Day 3-4 covered the kernel mechanics of seccomp (BPF filter, PR_SET_SECCOMP, syscall interception, SECCOMP_RET_* actions). Today is how K8s wraps that into a pod spec field — the underlying kernel seccomp is identical; only the act of installing the filter moves from “the application does it” to “runc does it for you.”

So today leans heavily on Week 1: you’ll find that the K8s seccompProfile field is just the BPF you hand-wrote, underneath.


1. In K8s, who installs the filter

1.1 Recap: what seccomp is

seccomp = a BPF program attached to each process; on every syscall entry it checks the syscall number + integer args and decides ALLOW / ERRNO / KILL / LOG / TRACE / USER_NOTIF.

Week 1 key points (used today):

  • per-process, inherited across fork/exec
  • sees the syscall number + integer args, but can’t dereference pointers (the seccomp/AppArmor split)
  • an unprivileged process must set PR_SET_NO_NEW_PRIVS before installing a filter (a caller with CAP_SYS_ADMIN is exempt)
  • a process installs the filter on itself (not parent-installs-on-child), propagated by inheritance

1.2 In K8s: runc installs it

On bare Linux, the program itself calls prctl(PR_SET_SECCOMP). In K8s you don’t touch code, you only write pod YAML — the act of installing the filter is delegated to the container runtime, runc.

The full chain:

kubectl apply → kube-apiserver (store in etcd, schedule)
   → kubelet (node agent, delegates via CRI)
   → containerd (high-level runtime: images/lifecycle, delegates via OCI)
   → runc (low-level runtime: actually creates the isolated process)
       → an isolated Linux process

runc is the bottom of this chain. containerd hands it an OCI spec (JSON describing the namespaces/cgroups/seccomp/caps it wants), and runc:

  1. clones/forks a process into the requested namespaces
  2. places it in the cgroup
  3. sets PR_SET_NO_NEW_PRIVS (if the spec requests noNewPrivileges)
  4. drops capabilities
  5. installs the seccomp BPF filter (compiled from the OCI spec)
  6. attaches the AppArmor profile
  7. execs your actual program

This list is everything from Week 1 + Week 2 Day 1. A container isn’t a kernel object on Linux: there’s no struct container. A container = a process + a set of kernel isolation mechanisms applied to it. runc is the thing that “applies these mechanisms to a process per the spec.”

runc installs the filter on itself, then execs into the container (exec doesn’t swap the process, only the program image; the filter stays on the process). So “runc installs it” really means “runc installs it on itself, then becomes the container.”

1.3 Why so many layers

Standardized interfaces (CRI / OCI) make each layer independently replaceable:

| Layer | Concern | Replace with |
| --- | --- | --- |
| kubelet | Which pods run on this node | (fixed) |
| containerd | Images/lifecycle/CRI | CRI-O |
| runc | Create the isolated process/OCI | crun / gVisor / Kata |

Want stronger isolation? Swap runc for gVisor (a userspace kernel intercepting all syscalls) or Kata (a lightweight VM per container) — the upper layers don’t change.

1.4 Why the filter must be installed before exec

The protection window must have no gap:

Right: runc installs filter → exec program → program's first instruction is protected ✅
Wrong: exec program → program runs (these syscalls are unguarded!) → only then install filter ❌

Install-after-exec leaves an “unguarded window” between the program’s first instruction and the filter being loaded, in which a malicious program could make dangerous syscalls. The filter must be in place first.

This is a unifying principle running through everything: seccomp installed before exec, AppArmor attached at bprm_check_security (execve time), PSA blocking at admission — all take effect before the protected object starts running. Any “run first, protect later” leaves an exploitable window (the root of TOCTOU-class bugs).

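PR_SET_NO_NEW_PRIVS is a precondition for an unprivileged process installing seccomp: it guarantees the program can’t later escalate via setuid, which would otherwise let an attacker-chosen filter ride into a privileged context and create a fresh attack surface. A caller holding CAP_SYS_ADMIN (as runc does while it sets up the container) is exempt, so in K8s the flag is only set when the pod asks for allowPrivilegeEscalation: false.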


2. The three seccomp profile types

securityContext:
  seccompProfile:
    type: <one of three>

| type | What filter is installed | Security |
| --- | --- | --- |
| Unconfined | No filter at all | Weakest (all syscalls allowed) |
| RuntimeDefault | The runtime’s built-in default profile | Medium (blocks a batch of dangerous syscalls) |
| Localhost | A profile file you wrote | Custom |

Unconfined

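All syscalls allowed. This is still what a pod gets when it sets no seccompProfile, unless the kubelet runs with --seccomp-default (the SeccompDefault feature, GA in K8s 1.27), which makes RuntimeDefault the fallback. That is why many clusters, especially older ones, actually have no seccomp protection.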

RuntimeDefault

The runtime (containerd) ships a profile blocking syscalls a container almost never needs but that have a large attack surface (mount/reboot/kexec_load/bpf, etc.) — roughly 40-60 blocked, ~300 allowed. The recommended production baseline — zero config, a free layer of defense, good compatibility. PSA restricted requires at least this.

Localhost

A custom JSON profile placed at /var/lib/kubelet/seccomp/ on the node, referenced by the pod:

seccompProfile:
  type: Localhost
  localhostProfile: profiles/block-mkdir.json   # relative to /var/lib/kubelet/seccomp/

“Localhost” means the profile file is local to this node, not network localhost.
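
A minimal Python sketch of where this field sits in a full pod manifest. The helper name `seccomp_pod`, the container name `main`, and the pod name are assumptions for illustration; the field names (`securityContext.seccompProfile.type`, `localhostProfile`) are the ones used above. The sketch enforces the API rule that `localhostProfile` goes with type Localhost and with no other type.

```python
import json


def seccomp_pod(name, profile_type, localhost_profile=None):
    """Build a minimal Pod manifest with a pod-level seccompProfile."""
    seccomp = {"type": profile_type}
    if profile_type == "Localhost":
        if not localhost_profile:
            raise ValueError("Localhost requires localhostProfile")
        # Path is relative to /var/lib/kubelet/seccomp/ on the node.
        seccomp["localhostProfile"] = localhost_profile
    elif localhost_profile:
        raise ValueError("localhostProfile is only valid with type Localhost")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {"seccompProfile": seccomp},
            "containers": [{
                "name": "main",                  # hypothetical container name
                "image": "busybox",
                "command": ["sleep", "3600"],    # every element must be a string
            }],
        },
    }


manifest = seccomp_pod("no-mkdir", "Localhost", "profiles/block-mkdir.json")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be piped into `kubectl apply -f -`.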

The profile JSON (Week 1 bridge)

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mkdir", "mkdirat"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    }
  ]
}

Mapping to Week 1:

  • defaultAction = the BPF program’s final fallback action
  • syscalls[].names = the if (nr == SYS_mkdir) check
  • SCMP_ACT_ERRNO = SECCOMP_RET_ERRNO, SCMP_ACT_ALLOW = SECCOMP_RET_ALLOW

runc compiles this JSON into a real cBPF program and installs it. You hand-wrote BPF bytecode in Week 1; K8s lets you write JSON and runc compiles it — the same thing runs in the kernel.
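
To make the mapping concrete, here is a toy Python model of the decision the compiled filter makes. It is a sketch, not runc's or libseccomp's code: the function name `decide` is hypothetical, matching is by syscall name only, and argument conditions and architectures are ignored.

```python
import json

# The same profile as above, loaded as data.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {"names": ["mkdir", "mkdirat"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1}
  ]
}
""")


def decide(profile, syscall_name):
    """Return (action, errno) for one syscall name; a toy model of the filter."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule["names"]:     # the `if (nr == SYS_mkdir)` check
            return rule["action"], rule.get("errnoRet")
    return profile["defaultAction"], None     # the final fallback action


for name in ("mkdir", "mkdirat", "openat"):
    print(name, decide(PROFILE, name))
```

Only the two listed names hit the ERRNO rule; `openat` falls through to the default action, which is the whole reason a file-creating command that never calls mkdir is unaffected by this profile.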

Whitelist vs blacklist

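  • Whitelist (defaultAction: ERRNO + list the allowed): forgetting one = deny (fail-closed), safe but hard to maintain (must enumerate every syscall the app uses)
  • Blacklist (defaultAction: ALLOW + list the denied): forgetting one = allow; newly introduced dangerous syscalls slip through by default. The block-mkdir profile above is a blacklist. RuntimeDefault, as shipped by containerd and Docker, is actually structured as a whitelist (defaultAction: ERRNO plus a long allow list), so a new syscall stays blocked until the runtime adds it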

High-security scenarios use a whitelist, but you first run real traffic with SCMP_ACT_LOG to collect the syscall list (same as Week 1 Day 5’s AppArmor complain→logprof→enforce).
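
A minimal sketch of the last step of that workflow, assuming the syscall names seen under SCMP_ACT_LOG have already been extracted into a plain list (how to pull them out of the audit log is outside this sketch). `build_allowlist` is a hypothetical helper, not part of any Kubernetes tool, and the sample list is far too short to run a real container.

```python
import json


def build_allowlist(observed_syscalls):
    """Turn observed syscall names into a fail-closed (whitelist) profile."""
    names = sorted(set(observed_syscalls))     # de-duplicate, stable order
    return {
        "defaultAction": "SCMP_ACT_ERRNO",     # anything not listed is denied
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


# Illustrative input only; a real trace contains many more syscalls.
observed = ["read", "write", "openat", "read", "exit_group"]
print(json.dumps(build_allowlist(observed), indent=2))
```

The design choice mirrors the trade-off above: a syscall missing from the observed list is denied rather than allowed, so an incomplete trace breaks the app loudly instead of leaving a silent hole.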

ERRNO vs KILL

  • SCMP_ACT_ERRNO: the syscall returns -1 + sets errno, the process stays alive and decides what to do (fallback / report / exit on its own)
  • SCMP_ACT_KILL: terminates the offending thread immediately, as if by an uncatchable SIGSYS (SCMP_ACT_KILL_PROCESS takes down the whole process)

Production mostly uses ERRNO (KILL would crash a benign program that merely touched an edge-case syscall — an availability risk). But for a syscall that should never appear and is a strong attack signal (kexec_load inside a container), KILL is safer.
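
What ERRNO looks like from inside the program can be sketched in Python. The real `os.mkdir` is swapped for a stand-in that raises the error a rule with errnoRet 1 would produce, so the snippet runs without any filter loaded; `try_mkdir` and `seccomp_like_mkdir` are hypothetical names.

```python
import errno
import os


def try_mkdir(path, mkdir=os.mkdir):
    """Attempt mkdir and report the outcome instead of crashing."""
    try:
        mkdir(path)
        return "created"
    except OSError as exc:
        if exc.errno == errno.EPERM:
            # The errno alone does not say whether seccomp injected it.
            return "EPERM: blocked or not permitted"
        raise


def seccomp_like_mkdir(path):
    """Stand-in for a mkdir rejected by a SCMP_ACT_ERRNO rule (errnoRet 1)."""
    raise OSError(errno.EPERM, os.strerror(errno.EPERM), path)


print(try_mkdir("/tmp/blocked", mkdir=seccomp_like_mkdir))
print("still alive")   # the process keeps running after the failed call
```

With a KILL action there would be no `except` branch to land in: the thread or process is gone before the call returns.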


3. Profile deployment + experiments

3.1 The Localhost deployment problem

The profile must exist at /var/lib/kubelet/seccomp/ on the node before the pod can reference it. A profile is a node-level resource, but pods are cluster-scheduled: a pod can land on any node, so the profile must exist on every node it could be scheduled to. On a node where it is missing, the container fails to start with CreateContainerError.

Production distribution: Security Profiles Operator (SPO, manages profiles via CRDs and auto-distributes) / DaemonSet / config management. RuntimeDefault has no such problem (built into the runtime, zero distribution).
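
A small sketch of the check that any distribution mechanism has to pass: given the files present on each node, which nodes would fail to start the pod? The inventory dict is hypothetical (for example, collected by config management), and `nodes_missing_profile` is an illustrative helper, not an existing tool.

```python
import posixpath

SECCOMP_ROOT = "/var/lib/kubelet/seccomp"   # kubelet's profile root used above


def nodes_missing_profile(localhost_profile, files_by_node):
    """Return the nodes on which the referenced profile file is absent."""
    wanted = posixpath.join(SECCOMP_ROOT, localhost_profile)
    return sorted(node for node, files in files_by_node.items()
                  if wanted not in files)


# Hypothetical inventory: the worker has not received the profile yet.
inventory = {
    "stripe-day2-control-plane": {
        "/var/lib/kubelet/seccomp/profiles/block-mkdir.json",
    },
    "stripe-day2-worker": set(),
}
print(nodes_missing_profile("profiles/block-mkdir.json", inventory))
```

An empty result is the precondition for scheduling the pod anywhere; a non-empty one names the nodes where it would fail.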

3.2 Experiment 1: RuntimeDefault baseline

Deploy a busybox pod with seccompProfile: type: RuntimeDefault, run ls / mkdir / echo — all succeed. RuntimeDefault is transparent to normal apps (those common syscalls are among the ~300 allowed).

mkdir succeeds with no output = Unix “no news is good news.” Confirm the baseline doesn’t block mkdir first, so when experiment 2 does block it, we can attribute the block to the custom profile.

3.3 Experiment 2: a custom Localhost profile blocking mkdir

# write the profile to both kind nodes (the pod could land on either)
cat > /tmp/block-mkdir.json <<'EOF'
{"defaultAction":"SCMP_ACT_ALLOW","syscalls":[{"names":["mkdir","mkdirat"],"action":"SCMP_ACT_ERRNO","errnoRet":1}]}
EOF
for node in stripe-day2-control-plane stripe-day2-worker; do
  docker exec "$node" mkdir -p /var/lib/kubelet/seccomp/profiles
  docker cp /tmp/block-mkdir.json "$node":/var/lib/kubelet/seccomp/profiles/block-mkdir.json
done

(docker exec/cp into kind nodes works only because kind nodes are containers; production nodes are real machines — use a DaemonSet/SPO/Ansible.)

The pod references type: Localhost + localhostProfile: profiles/block-mkdir.json. Results:

| Command | Result | Meaning |
| --- | --- | --- |
| mkdir /tmp/blocked | Operation not permitted + exit 1 | Blocked by seccomp, EPERM |
| ls / | normal | getdents not in the block list |
| touch /tmp/afile | succeeds | openat(O_CREAT) isn’t mkdir, not blocked |
| echo "still alive" | normal, pod didn’t crash | ERRNO is a gentle block |

Three key observations:

  1. Operation not permitted isn’t a real permission issue — seccomp made mkdir return EPERM; the program can’t tell “really no permission” from “blocked” (information camouflage)
  2. touch succeeding proves blocking is syscall-precise — touch uses openat(O_CREAT), a different syscall from mkdir; the profile only listed mkdir/mkdirat. seccomp blocks by syscall number, it doesn’t understand the intent “create a filesystem object”
  3. The pod didn’t crash — ERRNO doesn’t kill the process; mkdir failed but the process is alive

3.4 Experiment 3: Unconfined control

Same busybox + same mkdir, the only variable being seccompProfile:

  • Localhost block-mkdir → EPERM, blocked
  • Unconfined → succeeds (exit 0)

One variable, different result → confirms the block comes from the profile, not some other factor. This is A/B control to establish causation.

3.5 Experiment 4: /proc confirms the filter is really loaded

kubectl exec ... -- grep Seccomp /proc/1/status

| pod | Seccomp | Seccomp_filters |
| --- | --- | --- |
| no-mkdir (Localhost) | 2 | 1 |
| unconfined | 0 | 0 |

Seccomp values: 0=disabled, 1=strict mode (Week 1’s mode allowing only 4 syscalls), 2=filter mode (a BPF filter loaded). RuntimeDefault and Localhost both show filter mode (2); Unconfined shows 0, as in the table. Strict mode (1) can’t be configured through the pod spec, so you’ll never see it from K8s.

/proc/<pid>/status’s Seccomp field is the ground truth for auditing whether a container is actually protected — unforgeable (reported directly by the kernel). More reliable than trusting the pod spec — a spec saying RuntimeDefault could actually be unprotected if the runtime is too old to support it; only this field is the truth.
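
For auditing at scale, the grep can be replaced by a small parser. This is a sketch: `parse_seccomp` is a hypothetical helper, and the sample text is a trimmed stand-in for the real file, which uses the same tab-separated `Key:` layout.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp(status_text):
    """Extract the Seccomp fields from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("Seccomp", "Seccomp_filters"):
            fields[key] = int(value.strip())
    mode = fields.get("Seccomp")
    return {
        "mode": SECCOMP_MODES.get(mode, "unknown"),
        "filters": fields.get("Seccomp_filters"),  # None if the kernel omits it
        "filter_loaded": mode == 2,
    }


sample = "Name:\tsleep\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(parse_seccomp(sample))
```

Feed it the output of `kubectl exec <pod> -- cat /proc/1/status` and compare `filter_loaded` against what the pod spec claims.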


4. Gotchas log

4.1 The command array must be all strings

command: ["sleep", 3600] errors with cannot unmarshal number into ... type string; 3600 must be "3600". command replaces the image ENTRYPOINT, each element is one command-line token, and arguments in Unix are always strings (argv[] is all char*). Another strict-typing fail-closed.

4.2 command set to mkdir → CrashLoopBackOff

Accidentally command: ["mkdir", "3600"], so the container’s main process is mkdir 3600, which exits instantly (exit 0) → restartPolicy Always restarts it → exits again → BackOff.

The counterintuitive part: Exit Code: 0 (success) also triggers the restart loop. Under restartPolicy: Always, the container restarts regardless of exit code whenever it exits. A long-running container’s main process must never exit (sleep / a long-lived server).

Judge health by restart frequency, not just count: 4 times in 49 seconds = crash loop; 23 times over 23 hours = hourly sleep expiry, normal. BackOff colliding with kubectl exec gives container not found (the container doesn’t exist during the backoff gap) — debug CrashLoop with kubectl logs --previous, not exec.
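
The frequency-versus-count point is plain arithmetic, sketched here in Python. The threshold of 10 restarts per hour is an arbitrary assumption for illustration, not a Kubernetes default, and both function names are hypothetical.

```python
def restarts_per_hour(restart_count, age_seconds):
    """Average restart rate over the pod's lifetime."""
    return restart_count * 3600 / age_seconds


def looks_like_crash_loop(restart_count, age_seconds, threshold_per_hour=10):
    """Flag a pod whose restart rate exceeds an (arbitrary) threshold."""
    return restarts_per_hour(restart_count, age_seconds) > threshold_per_hour


print(looks_like_crash_loop(4, 49))           # 4 restarts in 49 seconds
print(looks_like_crash_loop(23, 23 * 3600))   # 23 restarts over 23 hours
```

The first case works out to roughly 294 restarts per hour and the second to exactly 1, even though the second has the higher raw count.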

4.3 The namespace must exist first

Applying to a nonexistent ns → NotFound; K8s won’t auto-create namespaces (a ns carries RBAC/PSA/quota boundaries, and auto-creating would make those policies uncontrollable).


5. Takeaways: seeds for the Stripe project

5.1 seccomp blocks syscalls, not intent: large bypass space

Block mkdir and touch (openat) still creates files. To restrict by “path/intent” you need AppArmor (LSM intervenes after path resolution). This is exactly the seccomp/AppArmor division: seccomp owns the syscall boundary, AppArmor owns file-path semantics. “Which layer should a given attack’s mitigation live at” is the key judgment.

5.2 adversarial robustness is the core eval dimension

An LLM generating “block mkdir” looks like it stops the PoC, but if the attacker’s real goal is just to drop something on the filesystem, they switch to a different syscall with the same effect (openat(O_CREAT), mknod, symlink) and bypass it. True robustness requires blocking the whole syscall family or switching layers (AppArmor by path). Today’s “blocked mkdir but touch still creates” is the minimal demo of this bypass.

5.3 A/B control to prove causation

To prove a mitigation actually stops an attack, you must run the same PoC in both with-and-without-mitigation environments and verify the difference. “It failed after adding the mitigation” alone isn’t enough (something else might have blocked it). This is the paradigm a mitigation-testing pipeline must automate.
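
A minimal sketch of the verdict logic such a pipeline would automate. `run_poc` is a hypothetical callable that runs the same PoC with or without the mitigation and reports whether the attack succeeded; the two lambdas are fakes standing in for today's mkdir and touch results.

```python
def mitigation_verdict(run_poc):
    """run_poc(mitigated: bool) -> True if the attack succeeded."""
    baseline = run_poc(False)    # control: PoC without the mitigation
    treated = run_poc(True)      # treatment: same PoC with the mitigation
    if not baseline:
        return "inconclusive: PoC fails even without the mitigation"
    if treated:
        return "ineffective: PoC still succeeds with the mitigation"
    return "effective: PoC succeeds without, fails with"


# Fakes modeled on today's experiments.
mkdir_poc = lambda mitigated: not mitigated   # blocked only under the profile
touch_poc = lambda mitigated: True            # openat is never blocked

print(mitigation_verdict(mkdir_poc))
print(mitigation_verdict(touch_poc))
```

The "inconclusive" branch is the case the paragraph warns about: if the PoC already fails in the control run, its failure under the mitigation proves nothing.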

5.4 Verify ground truth, don’t trust declarations

/proc/<pid>/status’s Seccomp field / NetworkPolicy’s actual timeout — both look at the kernel/runtime’s actual state, not the spec declaration. A declaration doesn’t mean it’s in effect (runtime version / CNI support / profile distribution can all make a declaration fall through).


6. Takeaways

  1. K8s seccomp = Week 1 kernel seccomp, runc installs the filter (compiles JSON→cBPF before exec)
  2. runc is the bottom of container creation: kubelet→containerd→runc, applying ns/cgroup/seccomp/caps/AppArmor per the OCI spec then exec
  3. A container isn’t a kernel object: container = process + a set of isolation mechanisms
  4. Three profiles: Unconfined (no defense) / RuntimeDefault (the runtime’s whitelist-style baseline, recommended) / Localhost (custom)
  5. The profile JSON is the declarative form of Week 1’s BPF logic, SCMP_ACT_ERRNO = SECCOMP_RET_ERRNO
  6. ERRNO returns an error without killing; KILL kills — production mostly uses ERRNO
  7. seccomp blocks by syscall number, doesn’t understand intent — block mkdir but touch (openat) bypasses; for path-based control use AppArmor
  8. A Localhost profile is a node-level resource — must be on every node; production uses SPO/DaemonSet
  9. /proc/<pid>/status’s Seccomp field is ground truth — 2=loaded, 0=not; audit by this
  10. Protection must be in place before exec (no window) + the filter must be self-installed and propagated by inheritance

7. Tomorrow’s preview (Day 4)

AppArmor in K8s. Continuing on the stripe-day2 cluster:

  • Write an AppArmor profile denying writes to /data, attach via securityContext.appArmorProfile
  • From inside the pod, touch /data/x to verify the block; check the kind node’s dmesg for apparmor="DENIED"
  • Contrast with today: seccomp blocks syscalls (mkdir blockable but touch bypasses), AppArmor blocks paths (blocks by the /data path regardless of which syscall) — verify the complementarity firsthand
  • Gotcha: kind nodes under macOS Docker Desktop may not have AppArmor enabled by default; Day 4’s first task is confirming node support