Day 5: AppArmor + layered sandbox design

How AppArmor actually attaches to a process (via the bprm_creds_for_exec LSM hook at execve time, called bprm_set_creds before Linux 5.8, keyed on binary path), why bash script.sh silently skips the script’s profile while ./script.sh does not, the exec modifiers (ix/Px/Cx/Ux, plus the lowercase px/cx/ux forms that skip environment scrubbing), the hardlink and bind-mount tricks that bypass path-based MAC, and why a production sandbox layers namespace + capability + seccomp + AppArmor + cgroup. The closing argument: if you can only afford two, seccomp + AppArmor is the highest-ROI pair.

1. Where AppArmor sits

The core: AppArmor is an LSM (Linux Security Module) whose hooks run inside syscall handling, after the kernel has resolved the arguments. It enforces path-based MAC (Mandatory Access Control).

Split with seccomp:

  • seccomp: at the syscall entry, granularity = nr + integer args
  • AppArmor: at LSM hooks, granularity = the resolved result of the syscall — paths, network protocol/family
  • seccomp can’t dereference pointers (Day 4); AppArmor fills that blind spot.

Split with SELinux:

  • AppArmor: path-based — easy to write, easy to bypass (hardlink / bind mount)
  • SELinux: label-based — hard to write, hard to bypass (label = inode xattr)
  • Engineering trade-off: Ubuntu/Stripe lean AppArmor because the config is simpler and K8s integration is cleaner.

2. How profiles attach

Trigger point

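At execve time, the kernel calls the LSM hook bprm_creds_for_exec (bprm_set_creds before Linux 5.8); AppArmor’s handler is apparmor_bprm_creds_for_exec:

  1. Kernel opens the target file and fills in the linux_binprm for it
  2. The LSM chain calls AppArmor’s hook, before the binfmt handlers (ELF / shebang) load anything
  3. AppArmor looks up a profile by the target file’s path (when the caller is unconfined; a confined caller’s exec rules decide the transition instead, see §3)
  4. Match → attach the profile label to the new credentials (cred->security); aa_task_ctx in task->security only tracks transition state such as onexec and change_hat tokens
  5. No match → unconfined (this is not default-deny, a common misconception)
  6. execve then continues; by the time the new image’s first instruction runs, the profile is in place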

The key is a path string, not an inode

  • AppArmor uses the kernel’s d_path() to get a resolved absolute path
  • Multiple hardlinks to the same binary each look up independently; this is the root of the path-based weakness (§6)
  • No exact match → fall through to wildcards (/foo/**); ultimately fall back to unconfined
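The lookup above can be sketched in a few lines. This is a toy model, not the kernel's matcher: the `loaded` table, the profile names, and the "longest literal prefix wins" tie-break are simplifying assumptions, and only the `*` / `**` part of AppArmor globbing is handled.

```python
import re

def aa_glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small subset of AppArmor globbing: ** crosses '/', * does not."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

def find_attachment(exec_path: str, loaded: dict) -> str:
    """Return the profile name for exec_path, or 'unconfined' if nothing matches.

    `loaded` maps attachment pattern -> profile name (hypothetical data).
    """
    if exec_path in loaded:                      # exact path match wins
        return loaded[exec_path]
    best_name, best_len = "unconfined", -1
    for pattern, name in loaded.items():
        if aa_glob_to_regex(pattern).fullmatch(exec_path):
            literal_len = len(pattern.split("*", 1)[0])
            if literal_len > best_len:           # prefer the more specific glob
                best_name, best_len = name, literal_len
    return best_name
```

Note what the key is: a hardlink or copy at `/tmp/myapp-copy` is a different string, so it falls through to `unconfined` even though the inode may be the same.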

Fork without exec

  • copy_process uses LSM hooks task_alloc / cred_prepare to copy the parent’s security context to the child
  • The whole fork chain inherits the same profile until some descendant execves and triggers a new lookup
  • So shell built-ins and worker forks without exec all inherit.

The classic trap: bash myscript.sh vs ./myscript.sh

| Invocation | What execve sees | Path AppArmor looks up | Result |
| --- | --- | --- | --- |
| ./myscript.sh (script +x) | The script path; the kernel then handles the shebang and loads the interpreter | /path/to/myscript.sh | Script’s profile attaches |
| bash myscript.sh | /bin/bash | /bin/bash | Script’s profile does not attach; bash’s profile attaches (if any) |
| source myscript.sh | No execve at all | Current shell’s profile | Profile unchanged |

Production lesson: give your script an executable bit and call it as ./script.sh. This is exactly why an earlier experiment of mine — bash myapp.sh — had no profile in effect; switching to /root/.../myapp.sh fixed it.
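A small model of the three invocations, to make the "which path reaches execve" question mechanical. `execve_target` and the `path_dirs` table are hypothetical helpers for illustration; a real shell does a PATH search and more parsing than this.

```python
import posixpath
import shlex

def execve_target(command: str, cwd: str, path_dirs: dict):
    """Return the path a shell would hand to execve, or None if no execve happens."""
    argv = shlex.split(command)
    first = argv[0]
    if first in ("source", "."):
        return None                      # builtin: runs inside the current shell
    if "/" in first:
        # Explicit path: this is the string AppArmor's lookup is keyed on.
        return posixpath.normpath(posixpath.join(cwd, first))
    # Bare command name: `path_dirs` (name -> directory) stands in for the PATH search.
    return posixpath.join(path_dirs[first], first)
```

With `bash myscript.sh` the script name is only an argument to bash, so it never becomes the attachment key.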

change_profile (voluntary switch)

aa-exec -p strict_profile -- /usr/bin/cmd

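aa-exec sets the profile to apply at the next exec (change_onexec). A confined process can also switch itself into another, typically stricter, profile with change_profile, but its current profile must allow the transition:

change_profile -> strict_profile,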

Use cases: a supervisor that forks and then pins each child to a profile, or a process that self-tightens after some initialization stage.
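For the self-tightening case, here is a minimal sketch of what the switch looks like from inside the process. It writes the same `changeprofile <name>` command that libapparmor's `aa_change_profile()` sends to procfs; the `attr_path` parameter is my addition so the function can be pointed at a scratch file, and `strict_profile` is the placeholder name from the rule above.

```python
import os

def change_profile(profile: str, attr_path: str = "/proc/self/attr/current") -> None:
    """Ask the kernel to move the calling task into `profile`.

    Fails with PermissionError if the current profile lacks a matching
    change_profile rule, or if the target profile is not loaded.
    """
    command = f"changeprofile {profile}".encode()
    fd = os.open(attr_path, os.O_WRONLY)
    try:
        os.write(fd, command)            # the command must arrive as one write
    finally:
        os.close(fd)
```

Typical use is once, right after initialization: open the sockets and config files you need, then call `change_profile("strict_profile")` and carry on with fewer rights.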


3. Profile syntax

Basic shape

#include <tunables/global>        # globals (@{HOME} etc.)

/path/to/program {                # profile head = attach path
    #include <abstractions/base>  # preset abstractions (dynamic linker / libc)

    capability net_bind_service,  # allow this cap; the task must still hold it (AppArmor only narrows caps)
    network inet stream,          # network rule (AF + type)

    /etc/myapp/config.conf r,     # file rule: path + mode
    /var/log/myapp/*.log rw,      # wildcard
    @{HOME}/.myapp/** rwk,        # var + recursive glob + lock

    /usr/bin/helper Cx,           # exec a child program: exec modifier
}

File access modes

| Modifier | Meaning |
| --- | --- |
| r | read |
| w | write (includes truncate / append) |
| a | append-only (write but cannot truncate / seek) |
| l | link (may create a hardlink to this file) |
| k | lock (flock / fcntl) |
| m | mmap with PROT_EXEC (plain PROT_READ mmap doesn’t need m) |

Production gotchas:

  • Binaries need mr, not just r, or PROT_EXEC mmap of their segments fails
  • w implies a; a doesn’t imply w (append-only protects log integrity)
  • Forget m and you get the bizarre “can cat it but can’t exec it” error.
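The gotchas reduce to a subset check with one implication. A sketch, assuming modes are given as the single-letter strings from the table; `rule_allows` is an illustrative helper, not an AppArmor API.

```python
# A w grant also satisfies an append request; the reverse is not true.
IMPLIED = {"w": {"a"}}

def rule_allows(granted: str, requested: str) -> bool:
    """True if every requested mode letter is covered by the granted letters."""
    have = set(granted)
    for mode in granted:
        have |= IMPLIED.get(mode, set())
    return set(requested) <= have

def missing_modes(granted: str, requested: str) -> str:
    """Mode letters the rule would need to gain, e.g. 'm' for a binary granted only 'r'."""
    have = set(granted)
    for mode in granted:
        have |= IMPLIED.get(mode, set())
    return "".join(sorted(set(requested) - have))
```

`missing_modes("r", "mr")` gives `"m"`, which is the “can cat it but can’t exec it” case.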

Exec modifiers (the heart of it)

| Modifier | Which profile does the child get | Use case | Risk |
| --- | --- | --- | --- |
| ix (inherit) | Inherit current profile | Child is a helper, same constraints | safe |
| Px (profile) | Switch to a standalone profile | Child is another program with its own profile | safe |
| Cx (child profile) | Switch to a child profile nested in the parent | Sub-profile defined inside parent; child enters it | safe |
| Ux (unconfined) | Drop AppArmor entirely | Only for high-trust helpers | dangerous |

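Case matters: environment scrubbing

  • Uppercase (Px / Cx / Ux): scrub the environment on the transition; the exec runs in secure-exec mode (AT_SECURE), so the dynamic linker ignores variables such as LD_PRELOAD and LD_LIBRARY_PATH
  • Lowercase (px / cx / ux): switch profiles without scrubbing
  • ix has no uppercase form; the profile does not change, so nothing is scrubbed

Production almost always uses uppercase; lowercase is rare.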

Fallback modifiers

/bin/helper Px -> helper_profile,    # switch to helper_profile
/bin/helper Pix,                     # if no target profile, fall back to inherit
/bin/helper Cix,                     # same idea, Cx + ix fallback

A bare Px returns EACCES if the target profile isn’t loaded, so defensive production rules write Pix / Cix as a safety net.
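The modifier table plus the fallback rule fit in one function. This is a sketch of the decision only; `exec_transition`, the `loaded` set, and the profile names are illustrative, and real policy has more variants (named transitions, pux, stacking) than are modeled here.

```python
def exec_transition(current: str, modifier: str, loaded: set, target: str = None) -> str:
    """Return the profile the child ends up in, or raise PermissionError (EACCES)."""
    base = modifier.lower()              # case does not change which profile is picked
    if base == "ix":
        return current                   # inherit
    if base == "ux":
        return "unconfined"              # drop AppArmor entirely
    if base in ("px", "pix"):
        if target in loaded:
            return target                # standalone profile
        if base == "pix":
            return current               # fallback: inherit
        raise PermissionError(f"profile {target!r} not loaded")
    if base in ("cx", "cix"):
        child = f"{current}//{target}"   # child profiles are named parent//child
        if child in loaded:
            return child
        if base == "cix":
            return current               # fallback: inherit
        raise PermissionError(f"child profile {child!r} not loaded")
    raise ValueError(f"unknown exec modifier {modifier!r}")
```

The Px branch without a fallback is the one that bites in production: a helper profile that failed to load turns into a failed exec.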

Child profile (hat) nesting

/usr/bin/myapp {
    /usr/bin/myapp r,
    /bin/helper Cx -> helper_hat,

    profile helper_hat {           # child profile lives inside the parent profile
        /tmp/helper.input r,
        /tmp/helper.output w,
    }
}

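After exec, the helper runs under /usr/bin/myapp//helper_hat (the double slash separates parent from child). Strictly, a block declared with the profile keyword is a child profile; a hat is declared with hat or ^ and entered with change_hat rather than exec, but both share the // naming.

  • Narrower scope than Px: the child profile is namespaced under its parent and only reachable through the parent’s rules
  • Can be tighter than ix: the helper gets only the rules the child profile lists, not everything the parent may do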

Abstractions

Preset rule bundles; production profiles almost always pull these in:

| Abstraction | What it covers |
| --- | --- |
| abstractions/base | The minimum any Linux program needs (mmap libc / read ld.so.cache / vDSO / etc.) |
| abstractions/nameservice | resolv.conf / hosts / nsswitch |
| abstractions/python | Paths a Python interpreter touches |
| abstractions/openssl | SSL libs + CA certs |
| abstractions/X | X11 |

The way a production profile opens: #include <abstractions/base>. Near-boilerplate.

A seed for LLM eval: LLMs tend to enumerate every path explicitly rather than reach for an abstraction → the profile grows long and brittle (one libc patch and it breaks). “Profile-level semantic abstraction” is an independent eval metric.


4. Enforce vs Complain

| Dimension | Enforce | Complain |
| --- | --- | --- |
| On violation | Deny, return -EACCES / -EPERM | Allow, keep running |
| Audit log | apparmor="DENIED" | apparmor="ALLOWED" |
| Process behavior | Constrained, may fail | Unaware, runs normally |
| Use case | Production protection | Profile development / workload study |
| Risk | A buggy profile can break the service | No protection, observation only |

Mirror image of seccomp

  • aa-complain ≈ SECCOMP_RET_LOG (allow + log)
  • aa-enforce ≈ SECCOMP_RET_ERRNO / SECCOMP_RET_KILL_PROCESS (block)

The same “observe first, then constrain” workflow shows up across LSM and seccomp.

Switching

sudo aa-complain /path/to/program     # to complain
sudo aa-enforce /path/to/program      # to enforce
sudo aa-disable /path/to/program      # unload
sudo aa-status                        # all profile states

5. Production workflow: complain → logprof → enforce

1. aa-genprof /path/to/program
   ├─ generate an empty skeleton profile, default complain
   └─ tail audit in the background

2. Run real workload (production-like traffic / test suite)
   └─ every unauthorized access → audit ALLOWED

3. aa-logprof
   ├─ reads audit log
   ├─ for each "unauthorized but allowed" event, asks you
   └─ you pick: allow / deny / glob / abstraction / inherit; it writes back into the profile

4. Loop 2-3 until the profile converges

5. aa-enforce
   └─ ship to production

6. Keep watching audit DENIED
   ├─ true attack → alert
   └─ false positive → patch profile
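Step 3 is essentially "fold ALLOWED events into candidate rules". A rough sketch of that folding, assuming one audit record per line in the key=value shape AppArmor logs; `suggest_rules` is a hypothetical helper, and the real aa-logprof also proposes globs and abstractions, which this does not.

```python
import re
from collections import defaultdict

# key=value pairs; values are either "quoted" or a bare token
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def suggest_rules(audit_lines):
    """Merge complain-mode events into {(profile, path): modes} candidate rules."""
    modes = defaultdict(set)
    for line in audit_lines:
        fields = {k: v.strip('"') for k, v in FIELD.findall(line)}
        if fields.get("apparmor") != "ALLOWED" or "name" not in fields:
            continue                     # only complain-mode file events
        key = (fields.get("profile", "?"), fields["name"])
        modes[key] |= set(fields.get("requested_mask", ""))
    return {key: "".join(sorted(m)) for key, m in sorted(modes.items())}
```

Each entry maps directly to a rule line (`path modes,`) that a human still has to review before it goes into the profile.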

Two traps in the workflow

1. Complain doesn’t just log, it induces violations. Under complain the program “thinks it can do anything” and exercises code paths it wouldn’t otherwise touch (fallback branches). You think complain covered everything, then enforce surfaces new DENIEDs. Counter: do a short enforce run too and collect another round.

2. logprof should be suggesting abstractions, not single rules. It will ask “use abstractions/base?” instead of “allow /etc/ld.so.cache r?”. A high-quality production profile is recognizable by its mix of abstractions + targeted refinements.


6. Path-based weakness

Why it exists

A path is not a property of the file — it’s a property of how you name the file in some namespace. The same inode can have many paths; the same path can point to different inodes over time. Rules are bound to paths → change the path↔inode mapping and you bypass.

Threat model A: profile allows reading /tmp/safe.log; attacker does:

ln /etc/passwd /tmp/safe.log      # link passwd to a path the profile allows
cat /tmp/safe.log                 # program reads the "legal path", actual content is passwd

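Preconditions:

  • Source and target on the same filesystem (hardlinks cannot cross mounts; on many systems /tmp is a separate tmpfs)
  • Write permission on the target directory
  • With fs.protected_hardlinks=1 (the default on most modern distros), the attacker must own the source, or have read and write access to it, or hold CAP_FOWNER
  • The link call is itself mediated: a confined process needs l on the new name, so the setup usually comes from a less-confined accomplice or a profile with a loose l rule

So for /etc/passwd (0644, root-owned) the link succeeds when the attacker already runs as root inside the sandbox, or when protected_hardlinks is off. That is still the interesting case: a root process the profile was supposed to confine.

Note: ln /etc/shadow /tmp/x doesn’t help an unprivileged user either. shadow is 0640, so even if a link could be created, DAC would still block the read.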

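Higher-value variants go after anything that shares a filesystem with a profile-allowed path. /proc/$$/maps and device nodes live on procfs and devtmpfs, so a hardlink cannot reach them (EXDEV); they need the bind-mount trick below.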
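The hardlink case is easy to reproduce without any privileges, using a scratch directory instead of /etc. A sketch: `path_rule_allows` is a stand-in for a path-keyed check (a plain string comparison, not AppArmor), and the file names are made up.

```python
import os
import tempfile

def path_rule_allows(path: str, allowed_paths: set) -> bool:
    """Stand-in for a path-keyed rule: it only ever sees the name."""
    return path in allowed_paths

def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        secret = os.path.join(d, "secret.txt")
        alias = os.path.join(d, "safe.log")
        with open(secret, "w") as f:
            f.write("sensitive\n")
        os.link(secret, alias)                   # second name, same inode
        allowed = {alias}                        # the rule only mentions safe.log
        with open(alias) as f:
            content = f.read()
        return {
            "same_inode": os.stat(secret).st_ino == os.stat(alias).st_ino,
            "secret_allowed": path_rule_allows(secret, allowed),
            "alias_allowed": path_rule_allows(alias, allowed),
            "content_via_alias": content,
        }
```

The rule says no to `secret.txt` and yes to `safe.log`, yet both names reach the same bytes. A label stored on the inode would not care which name was used.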

Threat model B: the attacker bind-mounts a sensitive tree onto an allowed path:

mount --bind /etc /tmp/safe_dir
cat /tmp/safe_dir/passwd          # actually /etc/passwd

  • Doesn’t need per-file permission, just mount capability (CAP_SYS_ADMIN, or inside a user namespace)
  • Rebrands an entire directory tree in one shot
  • Works across filesystems (hardlink doesn’t)

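Direct K8s relevance: CAP_SYS_ADMIN inside a container = able to bind-mount = able to move paths out from under the container’s AppArmor profile (unless the profile also denies mount). Production K8s pods must drop SYS_ADMIN.

Threat model C: symlinks. AppArmor mediates the path the kernel has already resolved, so a symlink to /etc/shadow is checked as /etc/shadow and the classic symlink swap mostly fails. But:

  • TOCTOU races in the application are still possible (it checks a path, then a symlink is swapped in before use); AppArmor only judges the final resolved path
  • Explicit l (link) control can forbid creating hardlinks at all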

vs SELinux (label-based)

| Aspect | AppArmor (path) | SELinux (label) |
| --- | --- | --- |
| Identity | path string | inode xattr security.selinux |
| Hardlink | rule rides on path → bypass via re-link | label stays with inode → no bypass |
| Bind mount | rule rides on path → bypass | label unchanged → no bypass |
| Config complexity | simple (path globs) | very complex (type/role/user) |
| K8s integration | one line in pod spec | painful |
| Learning curve | gentle | steep |

Why Stripe picks AppArmor: engineering trade-off — config is easy and K8s integration is good, and the bypass paths are mostly catchable by the seccomp + caps + namespace layers underneath.


7. Five-layer sandbox design (the core)

Each layer’s job

| Layer | What it governs | What it stops |
| --- | --- | --- |
| namespace | View isolation (mnt/net/pid/user/uts/ipc/cgroup) | Lateral movement / info leak |
| capability | Slicing root into ~40 caps (41 as of Linux 5.9) | Privilege escalation |
| seccomp | syscall nr + integer args | Kernel attack surface (syscall 0day) |
| AppArmor | Resolved path / network protocol | Info leak / persistence |
| cgroup | Resource quotas (CPU/mem/PID/IO) | DoS |

Why they must be layered

Each layer covers a different semantic dimension; any single layer has structural blind spots:

  • seccomp sees syscall nr → can’t see path content → needs AppArmor
  • AppArmor sees paths → can’t see namespace transitions → needs caps / ns
  • caps see permission families → can’t see specific actions → needs seccomp for fine-grained syscall denial
  • namespaces isolate the view → don’t restrict what you do inside the view → needs AppArmor / seccomp
  • cgroups cap resources → don’t see access semantics → needs AppArmor

Defense in depth: one CVE breaks one layer, the next layer catches it.
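One way to make the blind-spot argument checkable is to write down what each layer can observe and subtract. This is a toy model: the attribute names are labels I made up for the "What it governs" column of the table, not kernel data.

```python
# What each layer can observe (illustrative labels, one or two per layer).
SEES = {
    "namespace": {"visible_resources"},
    "capability": {"privileged_operation"},
    "seccomp": {"syscall_nr", "integer_args"},
    "apparmor": {"resolved_path", "network_family"},
    "cgroup": {"resource_usage"},
}

def blind_spots(enabled) -> set:
    """Attributes that no enabled layer can base a decision on."""
    everything = set().union(*SEES.values())
    covered = set().union(*(SEES[layer] for layer in enabled))
    return everything - covered
```

`blind_spots({"seccomp"})` still contains `resolved_path`, which is the "seccomp can’t see path content" bullet in set form.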

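Canonical case: runc CVE-2019-5736

  • Attack: a malicious container overwrites the host’s runc binary through /proc/self/exe → container escape
  • Bypasses mount/PID namespaces (the runc process entering the container still references the host binary, and procfs exposes it)
  • Bypasses caps (default caps were enough for root in the container)
  • Per the advisory, Docker’s default AppArmor profile did not block it either; user namespaces with host root left unmapped did, and the runc fix itself re-executes from a sealed in-memory copy of the binary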

One layer doesn’t cut it. Multi-layer defense is mandatory.

Pick just two: seccomp + AppArmor

Why:

  • Widest combined coverage — seccomp owns syscall boundary (kernel attack surface), AppArmor owns syscall result (file / network)
  • They cover each other’s blind spots (pointer args / kernel attacks)
  • Don’t depend on namespaces — bare-metal processes can use them, which is critical for Stripe’s fleet-level mitigation story

Why not just namespace + cap: those do isolation and coarse permissions; they don’t stop application-logic attacks. Modern attacks are mostly logic flaws, not missing isolation.

Why not just AppArmor:

  • LSM has path-bypass blind spots
  • Kernel-level attacks bypass LSM entirely
  • seccomp is the last line for kernel attack surface

Production containers turn on all five, but for simpler threat models (trusted code + supply-chain defense), seccomp + AppArmor is the highest ROI pair.


8. Seeds for the Stripe project

Mitigation is not a single artifact

The best mitigation for a CVE might be “1 seccomp rule + 2 AppArmor rules + drop 1 cap.” LLMs that emit a single artifact will usually under-cover.

Which layer to mitigate at is a design decision

The same attack can be stopped at several layers, but trade-offs differ:

  • seccomp: strict but coarse (whole syscall blocked)
  • AppArmor: fine but bypassable (path-based)
  • cap: wide blast radius (dropping SYS_ADMIN may break unrelated things)

Making the LLM explain why this layer is a key eval signal.

Adversarial robustness is the core metric

Not just “does the mitigation stop the original PoC?”, but:

  • Can the attacker bypass with a small tweak?
  • Variant syscall (execve → execveat)?
  • Path tricks (hardlink / bind mount)?
  • Different ABI (i386 / x32)?

Each variant is its own test case.
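Taken literally, "each variant is its own test case" is a cross product. A sketch of how the matrix could be enumerated; the axis values come from the bullets above, while the case-id format is my own convention.

```python
from itertools import product

SYSCALLS = ["execve", "execveat"]
PATH_TRICKS = ["none", "hardlink", "bind_mount"]
ABIS = ["x86_64", "i386", "x32"]

def variant_cases() -> list:
    """One test-case id per (syscall, path trick, ABI) combination."""
    return [f"{s}-{p}-{a}" for s, p, a in product(SYSCALLS, PATH_TRICKS, ABIS)]
```

A mitigation that only passes `execve-none-x86_64` has stopped the original PoC and nothing else.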

Typical LLM failures when writing AppArmor profiles

  • Doesn’t use abstractions → long and brittle
  • Misses the m modifier → PROT_EXEC mmap fails
  • Uses Px without a fallback (Pix)
  • Doesn’t account for hardlink / bind mount bypass
  • Enumerates every path explicitly — you can’t read off “what core asset is being protected”

Each of these is an independent eval metric.


9. Debugging AppArmor

Profile state

sudo aa-status                       # all profiles, sorted by enforce/complain
sudo aa-status | grep myapp          # a specific profile

Is the profile actually attached?

# While the program is running:
cat /proc/<pid>/attr/current         # prints "/path/to/program (enforce)"
ps -eo pid,comm,label                # system-wide

Note: if the profile doesn’t allow reading /proc/*/attr/current, the program reading itself will EACCES — that EACCES is itself indirect evidence the profile is in effect.
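For scripted checks, the label format is simple enough to parse by hand. A sketch, assuming the two shapes shown above (`name (mode)` and plain `unconfined`); `parse_label` and `current_confinement` are illustrative helpers.

```python
def parse_label(raw: str):
    """Split '/path/to/program (enforce)' into (name, mode); mode is None if absent."""
    label = raw.strip("\n\x00 ")
    if label.endswith(")") and " (" in label:
        name, _, mode = label[:-1].rpartition(" (")
        return name, mode
    return label, None

def current_confinement(pid="self"):
    """Read and parse the label of a live process (needs an AppArmor-enabled kernel)."""
    with open(f"/proc/{pid}/attr/current") as f:
        return parse_label(f.read())
```

A test harness can assert on the tuple right after spawning the program, which catches the bash script.sh mistake immediately.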

Violation records

sudo dmesg -T | grep apparmor | tail
sudo journalctl -k | grep apparmor | tail
sudo ausearch -m AVC -ts recent               # if auditd is installed; type=1400 records are AVC
sudo grep apparmor /var/log/syslog | tail

A full DENIED record looks like:

audit: type=1400 audit(...): apparmor="DENIED"
  operation="open"
  profile="/root/apparmor_test/myapp.sh"
  name="/etc/shadow"
  pid=... comm="cat"
  requested_mask="r" denied_mask="r"
  fsuid=0 ouid=0

Every field matters:

  • operation: open / exec / mount / capable / …
  • profile: which profile blocked it
  • name: the path that was blocked
  • requested_mask vs denied_mask: what was asked for, what was denied
  • comm: the program that actually ran
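Since every field is key=value, a record can be turned into a dict with one regex. A sketch that assumes the record has been joined onto one line; `parse_apparmor_record` and `summarize` are illustrative names.

```python
import re

# key=value pairs; values are either "quoted" or a bare token
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_apparmor_record(record: str) -> dict:
    """Return the key=value fields of one audit record, with quotes removed."""
    return {key: value.strip('"') for key, value in FIELD.findall(record)}

def summarize(record: str) -> str:
    """One-line, human-readable version of a DENIED / ALLOWED record."""
    f = parse_apparmor_record(record)
    return (f"{f.get('apparmor', '?')}: {f.get('comm', '?')} under profile "
            f"{f.get('profile', '?')} tried {f.get('operation', '?')} "
            f"on {f.get('name', '?')} (denied: {f.get('denied_mask', '?')})")
```

On the sample record this yields a line naming cat, the myapp.sh profile, the open operation, and /etc/shadow, which is usually all you need to decide between "attack" and "patch the profile".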

Lab pitfalls (ones I hit)

  • aa-status shows 0 confined processes: profile is loaded but no live process is currently using it — attach is per-process, when the program exits the count goes to zero
  • dmesg shows no DENIED: kernel ring buffer got flushed by other logs (UFW etc.), or audit rate-limited it
  • Script invoked as bash script.sh: profile doesn’t attach (the profile head is the script path, not bash)
  • /proc/*/attr/current not readable: the profile didn’t allow it — not a kernel bug

10. Takeaways

  1. AppArmor = LSM hook + path-based MAC, enforced at the syscall result layer
  2. Profile attach keys on binary path; inherited across fork, re-looked-up on execve
  3. bash script.sh does not attach the script’s profile — you must ./script.sh
  4. Exec modifiers: ix (inherit) / Px (standalone) / Cx (child profile) / Ux (drop); uppercase scrubs the environment, lowercase px/cx/ux does not
  5. mr ≠ r: binaries must allow PROT_EXEC mmap
  6. Complain ≈ seccomp LOG, workflow: complain → logprof → enforce
  7. Path-based weakness: hardlink / bind mount changes the path↔inode map, so the rules bypass
  8. Five-layer sandbox split: ns / cap / seccomp / AppArmor / cgroup — each covers a different blind spot
  9. If you pick two, pick seccomp + AppArmor — complementary and namespace-independent
  10. Three dimensions for LLM-mitigation eval: layer correctness / abstraction quality / adversarial robustness

Week 1 wrap-up

| Day | Topic | Core takeaway |
| --- | --- | --- |
| 1 | syscall ABI + the user→kernel path | x86_64 registers / syscall instruction’s HW side effects / kernel dispatch |
| 2 | strace + common syscalls | The three startup layers (ld.so / libc init / main) / isatty pattern / fd reuse race |
| 3 | seccomp strict mode | The 4 allowed syscalls + design philosophy / _exit vs SYS_exit trap |
| 4 | seccomp BPF filter | arch check / no pointer deref (two reasons) / syscall family / 8 RET actions / USER_NOTIF |
| 5 | AppArmor + layered sandbox | profile attach / exec modifiers / path-based weakness / 5 layers complementary |

Week 2 preview: K8s security is essentially packaging this week’s contents into a pod spec:

  • securityContext.capabilities → cap drop
  • securityContext.seccompProfile → seccomp filter (RuntimeDefault / Localhost)
  • securityContext.appArmorProfile → AppArmor profile (per-container)
  • pod.spec.hostNetwork / hostPID → namespace isolation
  • resources.limits → cgroup
  • PodSecurityStandards (restricted / baseline / privileged) → preset combinations of the five layers
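To see the mapping in one place, here is a sketch that builds such a pod manifest as a plain dict (kubectl accepts JSON as well as YAML). The function name, image, profile name, and resource numbers are placeholders; the `appArmorProfile` field assumes a cluster recent enough to have it (Kubernetes 1.30+), older clusters use an annotation instead.

```python
import json

def hardened_pod(name: str, image: str, apparmor_profile: str) -> dict:
    """Pod manifest that sets one knob per sandbox layer."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "hostNetwork": False,                              # namespace
            "hostPID": False,                                  # namespace
            "containers": [{
                "name": name,
                "image": image,
                "securityContext": {
                    "allowPrivilegeEscalation": False,
                    "capabilities": {"drop": ["ALL"]},         # capability
                    "seccompProfile": {"type": "RuntimeDefault"},  # seccomp
                    "appArmorProfile": {                       # AppArmor
                        "type": "Localhost",
                        "localhostProfile": apparmor_profile,
                    },
                },
                "resources": {                                 # cgroup
                    "limits": {"cpu": "500m", "memory": "256Mi"},
                },
            }],
        },
    }

if __name__ == "__main__":
    print(json.dumps(hardened_pod("myapp", "registry.example/myapp:1.0", "myapp-profile"), indent=2))
```

The Localhost profile must already be loaded on the node; the pod spec only names it.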

The mental model is already in place; K8s is just a declarative wrapper. Week 2 should go quickly.