Day 5: AppArmor + layered sandbox design
How AppArmor actually attaches to a process (through the exec-time LSM hooks during execve, keyed on binary path), why bash script.sh silently runs without the script’s profile while ./script.sh does not, the exec modifiers (ix/Px/Cx/Ux, plus the lowercase px/cx/ux forms that skip environment scrubbing), the hardlink and bind-mount tricks that bypass path-based MAC, and why a production sandbox layers namespace + capability + seccomp + AppArmor + cgroup, with the argument that, if you can only afford two, seccomp + AppArmor is the highest-ROI pair.
1. Where AppArmor sits
The core: AppArmor is an LSM (Linux Security Module) whose hooks fire inside syscall handling. It enforces path-based MAC (Mandatory Access Control).
Split with seccomp:
- seccomp: at the syscall entry, granularity = nr + integer args
- AppArmor: at LSM hooks, granularity = the resolved result of the syscall — paths, network protocol/family
- seccomp can’t dereference pointers (Day 4); AppArmor fills that blind spot.
Split with SELinux:
- AppArmor: path-based — easy to write, easy to bypass (hardlink / bind mount)
- SELinux: label-based — hard to write, hard to bypass (label = inode xattr)
- Engineering trade-off: Ubuntu/Stripe lean AppArmor because the config is simpler and K8s integration is cleaner.
2. How profiles attach
Trigger point
At execve time, the kernel’s exec-time LSM hooks fire. AppArmor’s handler is apparmor_bprm_creds_for_exec (the bprm_creds_for_exec hook on Linux 5.8+, bprm_set_creds on older kernels), not bprm_check_security:
- Kernel opens the target file and sets up the linux_binprm (before the ELF / shebang handlers run)
- The LSM chain calls AppArmor’s hook
- AppArmor looks up a profile by the target binary’s path
- Match → install the profile’s label into the new credentials (cred->security); per-task exec state such as onexec lives in aa_task_ctx
- No match → unconfined (this is not default-deny, a common misconception)
- execve then continues; by the time the new image’s first instruction runs, the profile is in place
The key is a path string, not an inode
- AppArmor uses the kernel’s d_path() to get a resolved absolute path
- Multiple hardlinks to the same binary each look up independently; this is the root of the path-based weakness (§6)
- No exact match → fall through to wildcards (/foo/**); ultimately fall back to unconfined
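The lookup order above can be sketched as a toy model. This is not AppArmor's real matcher (which compiles attachments into a DFA and prefers the most specific match); the `glob_to_regex` and `find_profile` names and the sample paths are made up for illustration, and only `*` and `**` are handled.

```python
import re

def glob_to_regex(pattern):
    """Translate a simplified AppArmor-style glob: ** crosses '/', * does not."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

def find_profile(profiles, exec_path):
    """Return the attachment that matches exec_path, or 'unconfined'."""
    if exec_path in profiles:            # an exact path wins
        return exec_path
    for pattern in profiles:             # then wildcard attachments
        if glob_to_regex(pattern).fullmatch(exec_path):
            return pattern
    return "unconfined"                  # no match is NOT default-deny

# Hypothetical set of loaded profile attachments.
loaded = ["/usr/bin/myapp", "/opt/tools/**"]
print(find_profile(loaded, "/usr/bin/myapp"))       # /usr/bin/myapp
print(find_profile(loaded, "/opt/tools/bin/x"))     # /opt/tools/**
print(find_profile(loaded, "/tmp/myapp-hardlink"))  # unconfined
```

The last call is the interesting one: a second name for the same binary is a different lookup key, so it falls through to unconfined.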
Fork without exec
- copy_process uses LSM hooks task_alloc / cred_prepare to copy the parent’s security context to the child
- The whole fork chain inherits the same profile until some descendant execves and triggers a new lookup
- So shell built-ins and worker forks without exec all inherit.
The classic trap: bash myscript.sh vs ./myscript.sh
| Invocation | What execve sees | Path AppArmor looks up | Result |
|---|---|---|---|
| ./myscript.sh (script +x) | Kernel handles shebang, then exec’s the script path | /path/to/myscript.sh | Script’s profile attaches |
| bash myscript.sh | /bin/bash | /bin/bash | Script’s profile does not attach; bash’s profile attaches (if any) |
| source myscript.sh | No execve at all | Current shell’s profile | Profile unchanged |
Production lesson: give your script an executable bit and call it as ./script.sh. This is exactly why an earlier experiment of mine — bash myapp.sh — had no profile in effect; switching to /root/.../myapp.sh fixed it.
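The three invocations can be replayed in a small state model. It assumes the launching shell is unconfined and that only the script path has a profile loaded; `Task` and `LOADED` are illustrative names, and exec from an already-confined parent (where the parent's exec rules decide) is deliberately out of scope.

```python
from dataclasses import dataclass

# Hypothetical: only the script path has a profile loaded.
LOADED = {"/path/to/myscript.sh"}

@dataclass
class Task:
    label: str = "unconfined"

    def fork(self):
        # fork copies the parent's security context unchanged
        return Task(self.label)

    def execve(self, path):
        # execve from an unconfined task triggers a lookup keyed on the path
        self.label = path if path in LOADED else "unconfined"
        return self

shell = Task()
direct = shell.fork().execve("/path/to/myscript.sh")  # ./myscript.sh
via_bash = shell.fork().execve("/bin/bash")           # bash myscript.sh
sourced = shell                                       # source myscript.sh: no execve
worker = direct.fork()                                # fork without exec inherits

for name, task in [("direct", direct), ("via_bash", via_bash),
                   ("sourced", sourced), ("worker", worker)]:
    print(name, "->", task.label)
```

Only `direct` and its forked `worker` end up under the script's profile; the `bash myscript.sh` case never presents the script path to the lookup at all.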
change_profile (voluntary switch)
aa-exec -p strict_profile -- /usr/bin/cmd
A confined process can switch itself into a stricter profile, but the original profile must allow it:
change_profile -> /strict_profile,
Use cases: a supervisor that forks and then pins each child to a profile, or a process that self-tightens after some initialization stage.
3. Profile syntax
Basic shape
#include <tunables/global> # globals (@{HOME} etc.)
/path/to/program { # profile head = attach path
#include <abstractions/base> # preset abstractions (dynamic linker / libc)
capability net_bind_service, # re-check capability (stricter than just dropping caps)
network inet stream, # network rule (AF + type)
/etc/myapp/config.conf r, # file rule: path + mode
/var/log/myapp/*.log rw, # wildcard
@{HOME}/.myapp/** rwk, # var + recursive glob + lock
/usr/bin/helper Cx, # exec a child program: exec modifier
}
File access modes
| Modifier | Meaning |
|---|---|
| r | read |
| w | write (includes truncate / append) |
| a | append-only (write but cannot truncate / seek) |
| l | link (may create a hardlink to this file) |
| k | lock (flock / fcntl) |
| m | mmap with PROT_EXEC (plain PROT_READ mmap doesn’t need m) |
Production gotchas:
- Binaries need mr, not just r, or PROT_EXEC mmap of their segments fails
- w implies a; a doesn’t imply w (append-only protects log integrity)
- Forget m and you get the bizarre “can cat it but can’t exec it” error.
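A minimal sketch of the mode arithmetic behind these gotchas, assuming a toy model where a rule is just a string of mode letters. The `IMPLIES` table and the function names are illustrative, not part of any AppArmor tooling.

```python
# Toy implication table: w covers append, a does not cover w.
IMPLIES = {"w": {"a"}}

def expand(granted):
    """Return the granted modes plus anything they imply."""
    modes = set(granted)
    for mode in list(modes):
        modes |= IMPLIES.get(mode, set())
    return modes

def missing(granted, requested):
    """Modes in `requested` that a rule granting `granted` would not cover."""
    return set(requested) - expand(granted)

print(missing("r", "rm"))   # {'m'}: readable, but a PROT_EXEC mmap is refused
print(missing("mr", "rm"))  # set()
print(missing("a", "w"))    # {'w'}: an append-only rule does not allow truncate
print(missing("w", "a"))    # set()
```

The first line is the "can cat it but can't exec it" case: the read succeeds and only the executable mapping is refused.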
Exec modifiers (the heart of it)
| Modifier | Which profile does the child get | Use case | Risk |
|---|---|---|---|
| ix (inherit) | Inherit current profile | Child is a helper, same constraints | safe |
| Px (profile) | Switch to a standalone profile | Child is another program with its own profile | safe |
| Cx (child profile) | Switch to a nested hat | Sub-profile defined inside parent; child enters it | safe |
| Ux (unconfined) | Drop AppArmor entirely | Only for high-trust helpers | dangerous |
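Case matters: environment scrubbing
- Uppercase (Px / Cx / Ux): the exec runs in secure-exec mode, so the environment is scrubbed (the loader ignores LD_PRELOAD and similar variables)
- Lowercase (px / cx / ux): same profile transition, no scrubbing
- ix has no uppercase form; the child keeps the parent’s profile and environment
Production almost always uses uppercase; lowercase is rare.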
Fallback modifiers
/bin/helper Px -> helper_profile, # switch to helper_profile
/bin/helper Pix, # if no target profile, fall back to inherit
/bin/helper Cix, # same idea, Cx + ix fallback
A bare Px returns EACCES if the target profile isn’t loaded, so defensive production rules write Pix / Cix as a safety net.
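The difference between a bare Px and a Pix fallback can be modeled in a few lines. This is a simplified sketch covering only four modifiers; `exec_transition` and the profile names are hypothetical, and the real kernel logic has more cases (named transitions, Cx, stacking).

```python
import errno

def exec_transition(modifier, current, target, loaded):
    """Return the profile a child runs under for a simplified exec rule.

    modifier: 'ix', 'Px', 'Pix' or 'Ux' (toy subset of the real modifiers)
    current:  profile of the process calling execve
    target:   profile the rule wants to switch to
    loaded:   set of profile names currently loaded
    """
    if modifier == "ix":
        return current
    if modifier == "Ux":
        return "unconfined"
    if modifier in ("Px", "Pix"):
        if target in loaded:
            return target
        if modifier == "Pix":
            return current               # fallback: inherit instead of failing
        raise PermissionError(errno.EACCES, "target profile not loaded", target)
    raise ValueError(f"unsupported modifier: {modifier}")

loaded = {"/usr/bin/myapp"}              # helper_profile is NOT loaded
print(exec_transition("Pix", "/usr/bin/myapp", "helper_profile", loaded))
try:
    exec_transition("Px", "/usr/bin/myapp", "helper_profile", loaded)
except PermissionError as exc:
    print("Px failed with EACCES:", exc.errno == errno.EACCES)
```

The trade-off is visible in the fallback branch: Pix keeps the service running when a profile is missing, at the cost of the helper running under the parent's (usually broader) rules.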
Child profile (hat) nesting
/usr/bin/myapp {
/usr/bin/myapp r,
/bin/helper Cx -> helper_hat,
profile helper_hat { # ← hat lives inside parent profile
/tmp/helper.input r,
/tmp/helper.output w,
}
}
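After exec, the helper runs under myapp//helper_hat (the double slash joins the parent and child names; strictly this is a child profile, while hats proper are declared as ^name and entered via change_hat).
- Tighter than Px (a hat can’t see profiles outside its parent)
- Tighter than ix (a hat is normally written as a subset of the parent’s rights)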
Abstractions
Preset rule bundles; production profiles almost always pull these in:
| abstraction | What it covers |
|---|---|
| abstractions/base | The minimum any Linux program needs (mmap libc / read ld.so.cache / vDSO / etc.) |
| abstractions/nameservice | resolv.conf / hosts / nsswitch |
| abstractions/python | Paths a Python interpreter touches |
| abstractions/openssl | SSL libs + CA certs |
| abstractions/X | X11 |
The way a production profile opens: #include <abstractions/base>. Near-boilerplate.
A seed for LLM eval: LLMs tend to enumerate every path explicitly rather than reach for an abstraction → the profile grows long and brittle (one libc patch and it breaks). “Profile-level semantic abstraction” is an independent eval metric.
4. Enforce vs Complain
| Dimension | Enforce | Complain |
|---|---|---|
| On violation | Deny, return -EACCES / -EPERM | Allow, keep running |
| Audit log | apparmor="DENIED" | apparmor="ALLOWED" |
| Process behavior | Constrained, may fail | Unaware, runs normally |
| Use case | Production protection | Profile development / workload study |
| Risk | A buggy profile can break the service | No protection, observation only |
Mirror image of seccomp
- aa-complain ≈ SECCOMP_RET_LOG (allow + log)
- aa-enforce ≈ SECCOMP_RET_ERRNO / SECCOMP_RET_KILL_PROCESS (block)
The same “observe first, then constrain” workflow shows up across LSM and seccomp.
Switching
sudo aa-complain /path/to/program # to complain
sudo aa-enforce /path/to/program # to enforce
sudo aa-disable /path/to/program # unload
sudo aa-status # all profile states
5. Production workflow: complain → logprof → enforce
1. aa-genprof /path/to/program
├─ generate an empty skeleton profile, default complain
└─ tail audit in the background
2. Run real workload (production-like traffic / test suite)
└─ every unauthorized access → audit ALLOWED
3. aa-logprof
├─ reads audit log
├─ for each "unauthorized but allowed" event, asks you
└─ you pick: allow / deny / glob / abstraction / inherit; it writes back into the profile
4. Loop 2-3 until the profile converges
5. aa-enforce
└─ ship to production
6. Keep watching audit DENIED
├─ true attack → alert
└─ false positive → patch profile
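Step 3 of the workflow is, at its core, a fold over ALLOWED events. The sketch below is not how aa-logprof is implemented; it assumes the field order of the audit record shown later in this post, treats each mask letter as a file mode (real masks also carry letters such as c and d), and the sample lines are invented for illustration.

```python
import re
from collections import defaultdict

# Assumes profile=, name= and requested_mask= appear in this order.
EVENT = re.compile(
    r'apparmor="ALLOWED".*?profile="(?P<profile>[^"]+)"'
    r'.*?name="(?P<name>[^"]+)".*?requested_mask="(?P<mask>[^"]+)"'
)

def candidate_rules(log_lines):
    """Merge complain-mode events into one 'path modes,' rule per path."""
    masks = defaultdict(set)
    for line in log_lines:
        m = EVENT.search(line)
        if m:
            masks[(m["profile"], m["name"])] |= set(m["mask"])
    return {
        key: "{} {},".format(key[1], "".join(sorted(modes)))
        for key, modes in masks.items()
    }

# Invented sample lines, shaped like complain-mode audit output.
sample = [
    'apparmor="ALLOWED" operation="open" profile="/usr/bin/myapp" '
    'name="/etc/myapp/config.conf" pid=1 comm="myapp" requested_mask="r"',
    'apparmor="ALLOWED" operation="open" profile="/usr/bin/myapp" '
    'name="/etc/myapp/config.conf" pid=1 comm="myapp" requested_mask="w"',
    'apparmor="DENIED" operation="open" profile="/usr/bin/myapp" '
    'name="/etc/shadow" pid=1 comm="myapp" requested_mask="r"',
]
for rule in candidate_rules(sample).values():
    print(rule)   # /etc/myapp/config.conf rw,
```

What this fold cannot do is the judgment call logprof asks a human for: whether a path should become a glob, an abstraction, or a deny.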
Two traps in the workflow
1. Complain doesn’t just log, it induces violations. Under complain the program “thinks it can do anything” and exercises code paths it wouldn’t otherwise touch (fallback branches). You think complain covered everything, then enforce surfaces new DENIEDs. Counter: do a short enforce run too and collect another round.
2. logprof should be suggesting abstractions, not single rules. It will ask “use abstractions/base?” instead of “allow /etc/ld.so.cache r?”. A high-quality production profile is recognizable by its mix of abstractions + targeted refinements.
6. Path-based weakness
Why it exists
A path is not a property of the file — it’s a property of how you name the file in some namespace. The same inode can have many paths; the same path can point to different inodes over time. Rules are bound to paths → change the path↔inode mapping and you bypass.
Hardlink bypass
Threat model A: profile allows reading /tmp/safe.log; attacker does:
ln /etc/passwd /tmp/safe.log # link passwd to a path the profile allows
cat /tmp/safe.log # program reads the "legal path", actual content is passwd
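Preconditions:
- Write permission on the target directory, with source and target on the same filesystem
- With fs.protected_hardlinks=1 (the default on most modern distros), the caller must also own the source file or have both read and write access to it, unless it holds CAP_FOWNER
/etc/passwd is 0644 and /tmp is writable, so the attack works for root, or on a system with protected_hardlinks=0, and only when /tmp shares a filesystem with /etc; an unprivileged user is stopped by protected_hardlinks.
Note: ln /etc/shadow /tmp/x buys an unprivileged attacker little either way: shadow is 0640, and a hardlink shares the inode’s mode, so DAC still denies the read.
Higher-value variants: linking another sensitive file on the same filesystem into a profile-allowed path. Neither /proc/$$/maps nor a device node under /dev can be hardlinked across filesystems, so those need the bind-mount route.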
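The path↔inode mismatch is easy to reproduce without any privileges by hardlinking a file you own inside a temporary directory. In this sketch `path_allowed` is a stand-in for a path-keyed rule check, not AppArmor itself, and the file names are placeholders.

```python
import os
import tempfile

def path_allowed(path, allowed_paths):
    """Path-keyed check: looks only at the name, never at the inode."""
    return path in allowed_paths

with tempfile.TemporaryDirectory() as d:
    secret = os.path.join(d, "secret.txt")   # stands in for a protected file
    safe = os.path.join(d, "safe.log")       # the only path the toy rule allows
    with open(secret, "w") as f:
        f.write("sensitive\n")
    os.link(secret, safe)                    # second name for the same inode

    same_inode = os.stat(secret).st_ino == os.stat(safe).st_ino
    allowed = {safe}
    with open(safe) as f:                    # read through the "legal" name
        leaked = f.read()

print("same inode:", same_inode)
print("secret path allowed:", path_allowed(secret, allowed))
print("safe path allowed:", path_allowed(safe, allowed))
print("content leaked via safe path:", leaked == "sensitive\n")
```

A label-based check would key on the inode and give the same answer for both names, which is the whole difference from the path-based one.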
Bind mount bypass (nastier than hardlink)
mount --bind /etc /tmp/safe_dir
cat /tmp/safe_dir/passwd # actually /etc/passwd
- Doesn’t need per-file permission, just mount capability (CAP_SYS_ADMIN, or inside a user namespace)
- Rebrands an entire directory tree in one shot
- Works across filesystems (hardlink doesn’t)
Direct K8s relevance: CAP_SYS_ADMIN inside a container = able to bind-mount = bypass the host’s AppArmor. Production K8s pods must drop SYS_ADMIN.
Symlinks
AppArmor matches against the path the kernel has already resolved, after symlinks are followed, which kills most symlink attacks. But:
- TOCTOU races are still possible (create/delete a symlink between check and use)
- Explicit l (link) control can forbid creating hardlinks at all
vs SELinux (label-based)
| | AppArmor (path) | SELinux (label) |
|---|---|---|
| Identity | path string | inode xattr security.selinux |
| Hardlink | rule rides on path → bypass via re-link | label stays with inode → no bypass |
| Bind mount | rule rides on path → bypass | label unchanged → no bypass |
| Config complexity | simple (path globs) | very complex (type/role/user) |
| K8s integration | one line in pod spec | painful |
| Learning curve | gentle | steep |
Why Stripe picks AppArmor: engineering trade-off — config is easy and K8s integration is good, and the bypass paths are mostly catchable by the seccomp + caps + namespace layers underneath.
7. Five-layer sandbox design (the core)
Each layer’s job
| Layer | What it governs | What it stops |
|---|---|---|
| namespace | View isolation (mnt/net/pid/user/uts/ipc/cgroup) | Lateral movement / info leak |
| capability | Slicing root into ~40 caps (41 as of Linux 5.9) | Privilege escalation |
| seccomp | syscall nr + integer args | Kernel attack surface (syscall 0day) |
| AppArmor | Resolved path / network protocol | Info leak / persistence |
| cgroup | Resource quotas (CPU/mem/PID/IO) | DoS |
Why they must be layered
Each layer covers a different semantic dimension; any single layer has structural blind spots:
- seccomp sees syscall nr → can’t see path content → needs AppArmor
- AppArmor sees paths → can’t see namespace transitions → needs caps / ns
- caps see permission families → can’t see specific actions → needs seccomp for fine-grained syscall denial
- namespaces isolate the view → don’t restrict what you do inside the view → needs AppArmor / seccomp
- cgroups cap resources → don’t see access semantics → needs AppArmor
Defense in depth: one CVE breaks one layer, the next layer catches it.
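Canonical case: runc CVE-2019-5736
- Attack: a malicious container reopens /proc/self/exe of the runc process entering it and overwrites the host’s runc binary → container escape
- Bypasses namespace (the procfs link still resolves to the host binary)
- Bypasses caps (default caps were enough)
- Per the runc advisory, the default AppArmor policy did not stop it either; a user namespace with host root unmapped did, and the real fix landed in runc itself (it re-executes from a sealed copy of its own binary)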
One layer doesn’t cut it. Multi-layer defense is mandatory.
Pick just two: seccomp + AppArmor
Why:
- Widest combined coverage — seccomp owns syscall boundary (kernel attack surface), AppArmor owns syscall result (file / network)
- They cover each other’s blind spots (pointer args / kernel attacks)
- Don’t depend on namespaces — bare-metal processes can use them, which is critical for Stripe’s fleet-level mitigation story
Why not just namespace + cap: those do isolation and coarse permissions; they don’t stop application-logic attacks. Modern attacks are mostly logic flaws, not missing isolation.
Why not just AppArmor:
- LSM has path-bypass blind spots
- Kernel-level attacks bypass LSM entirely
- seccomp is the last line for kernel attack surface
Production containers turn on all five, but for simpler threat models (trusted code + supply-chain defense), seccomp + AppArmor is the highest ROI pair.
8. Seeds for the Stripe project
Mitigation is not a single artifact
The best mitigation for a CVE might be “1 seccomp rule + 2 AppArmor rules + drop 1 cap.” LLMs that emit a single artifact will usually under-cover.
Which layer to mitigate at is a design decision
The same attack can be stopped at several layers, but trade-offs differ:
- seccomp: strict but coarse (whole syscall blocked)
- AppArmor: fine but bypassable (path-based)
- cap: wide blast radius (dropping SYS_ADMIN may break unrelated things)
Making the LLM explain why it chose this layer is a key eval signal.
Adversarial robustness is the core metric
Not just “does the mitigation stop the original PoC?”, but:
- Can the attacker bypass with a small tweak?
- Variant syscall (execve → execveat)?
- Path tricks (hardlink / bind mount)?
- Different ABI (i386 / x32)?
Each variant is its own test case.
Typical LLM failures when writing AppArmor profiles
- Doesn’t use abstractions → long and brittle
- Misses the m modifier → PROT_EXEC mmap fails
- Uses Px without a fallback (Pix)
- Doesn’t account for hardlink / bind mount bypass
- Enumerates every path explicitly — you can’t read off “what core asset is being protected”
Each of these is an independent eval metric.
9. Debugging AppArmor
Profile state
sudo aa-status # all profiles, sorted by enforce/complain
sudo aa-status | grep myapp # a specific profile
Is the profile actually attached?
# While the program is running:
cat /proc/<pid>/attr/current # prints "/path/to/program (enforce)"
ps -eo pid,comm,label # system-wide
Note: if the profile doesn’t allow reading /proc/*/attr/current, the program reading itself will EACCES — that EACCES is itself indirect evidence the profile is in effect.
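The same check can be scripted. The sketch below assumes the file holds either unconfined or "name (mode)" as in the comment above; the `confinement` helper and its `proc_root` parameter (there only so the function can be pointed at a fake tree) are illustrative names, and any read failure is collapsed to None.

```python
import re

LABEL = re.compile(r"^(?P<profile>.+?) \((?P<mode>[a-z]+)\)$")

def confinement(pid="self", proc_root="/proc"):
    """Return (profile, mode) for a pid, or None if it cannot be read."""
    path = f"{proc_root}/{pid}/attr/current"
    try:
        with open(path) as f:
            raw = f.read().strip("\x00\n ")
    except OSError:
        # No AppArmor, no such pid, or the profile itself denies the read.
        return None
    if raw == "unconfined":
        return ("unconfined", None)
    m = LABEL.match(raw)
    return (m["profile"], m["mode"]) if m else (raw, None)

print(confinement())   # result depends on the host this runs on
```

A None result is ambiguous by design here; as the note above says, an EACCES on this file can itself mean the profile is active, so a stricter tool would inspect the errno.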
Violation records
sudo dmesg -T | grep apparmor | tail
sudo journalctl -k | grep apparmor | tail
sudo ausearch -m APPARMOR -ts recent # if auditd is installed
sudo grep apparmor /var/log/syslog | tail
A full DENIED record looks like:
audit: type=1400 audit(...): apparmor="DENIED"
operation="open"
profile="/root/apparmor_test/myapp.sh"
name="/etc/shadow"
pid=... comm="cat"
requested_mask="r" denied_mask="r"
fsuid=0 ouid=0
Every field matters:
- operation: open / exec / mount / capable / …
- profile: which profile blocked it
- name: the path that was blocked
- requested_mask vs denied_mask: what was asked for, what was denied
- comm: the program that actually ran
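For triage it helps to turn a record into a dict so these fields can be filtered and counted. A minimal sketch, using the record above joined onto one line (elided values kept as-is); `parse_audit` is a made-up helper, and the regex assumes values are either double-quoted or free of spaces.

```python
import re

# key=value pairs; values are either "quoted" or a run of non-space characters
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_audit(record):
    """Split an AppArmor audit record into a dict of its key=value fields."""
    return {key: value.strip('"') for key, value in FIELD.findall(record)}

record = (
    'audit: type=1400 audit(...): apparmor="DENIED" operation="open" '
    'profile="/root/apparmor_test/myapp.sh" name="/etc/shadow" '
    'pid=... comm="cat" requested_mask="r" denied_mask="r" fsuid=0 ouid=0'
)
fields = parse_audit(record)
print(fields["profile"], fields["name"], fields["denied_mask"])
```

Note that profile and comm differ in this record: the profile belongs to the script, while the denied open came from the cat it spawned.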
Lab pitfalls (ones I hit)
- aa-status shows 0 confined processes: profile is loaded but no live process is currently using it; attach is per-process, so when the program exits the count goes to zero
- dmesg shows no DENIED: kernel ring buffer got flushed by other logs (UFW etc.), or audit rate-limited it
- Script invoked as bash script.sh: profile doesn’t attach (the profile head is the script path, not bash)
- /proc/*/attr/current not readable: the profile didn’t allow it, not a kernel bug
10. Takeaways
- AppArmor = LSM hook + path-based MAC, enforced at the syscall result layer
- Profile attach keys on binary path; inherited across fork, re-looked-up on execve
- bash script.sh does not attach the script’s profile; you must ./script.sh
- Exec modifiers: ix (inherit) / Px (standalone) / Cx (hat) / Ux (drop); uppercase scrubs the environment, lowercase px/cx/ux does not
- mr ≠ r: binaries must allow PROT_EXEC mmap
- Complain ≈ seccomp LOG, workflow: complain → logprof → enforce
- Path-based weakness: hardlink / bind mount changes the path↔inode map, so the rules bypass
- Five-layer sandbox split: ns / cap / seccomp / AppArmor / cgroup — each covers a different blind spot
- If you pick two, pick seccomp + AppArmor — complementary and namespace-independent
- Three dimensions for LLM-mitigation eval: layer correctness / abstraction quality / adversarial robustness
Week 1 wrap-up
| Day | Topic | Core takeaway |
|---|---|---|
| 1 | syscall ABI + the user→kernel path | x86_64 registers / syscall instruction’s HW side effects / kernel dispatch |
| 2 | strace + common syscalls | The three startup layers (ld.so / libc init / main) / isatty pattern / fd reuse race |
| 3 | seccomp strict mode | The 4 allowed syscalls + design philosophy / _exit vs SYS_exit trap |
| 4 | seccomp BPF filter | arch check / no pointer deref (two reasons) / syscall family / 8 RET actions / USER_NOTIF |
| 5 | AppArmor + layered sandbox | profile attach / 6 modifiers / path-based weakness / 5 layers complementary |
Week 2 preview: K8s security is essentially packaging this week’s contents into a pod spec:
- securityContext.capabilities → cap drop
- securityContext.seccompProfile → seccomp filter (RuntimeDefault / Localhost)
- securityContext.appArmorProfile → AppArmor profile (per-container)
- pod.spec.hostNetwork / hostPID → namespace isolation
- resources.limits → cgroup
- Pod Security Standards (restricted / baseline / privileged) → preset combinations of the five layers
The mental model is already in place; K8s is just a declarative wrapper. Week 2 should go quickly.
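As a preview, the mapping above can be written out as a manifest built in Python and dumped as JSON (which kubectl also accepts). This is a sketch with placeholder names, image, and limits; it assumes a cluster new enough to have the securityContext.appArmorProfile field (Kubernetes 1.30+; older clusters used a per-container annotation instead).

```python
import json

def hardened_pod(name, image):
    """Build a pod manifest that touches each of the five layers."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "hostNetwork": False,   # namespace: keep a private net namespace
            "hostPID": False,       # namespace: keep a private pid namespace
            "containers": [{
                "name": name,
                "image": image,
                "securityContext": {
                    "capabilities": {"drop": ["ALL"]},              # capability
                    "seccompProfile": {"type": "RuntimeDefault"},   # seccomp
                    "appArmorProfile": {"type": "RuntimeDefault"},  # AppArmor
                },
                # cgroup: illustrative limits, not a recommendation
                "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
            }],
        },
    }

# Placeholder name and image.
print(json.dumps(hardened_pod("myapp", "example.invalid/myapp:1.0"), indent=2))
```

Swapping RuntimeDefault for Localhost (plus a localhostProfile name) is where a custom profile from this week would plug in.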