Day 2: strace + Common Syscalls in Practice
A practical guide to strace: flag cheat sheet, how to read program startup as three distinct layers (ld.so → libc init → main), annotated syscalls, and the gotchas you actually hit (fd reuse races, EINTR, short writes, locale noise).
1. strace cheat sheet
Basics
strace ./prog # trace prog
strace -p PID # attach to an existing process
strace -f ./prog # follow forks/clones ★essential★
strace -ff -o trace ./prog # one file per PID: trace.<pid>
strace -o out.log ./prog # write to file (keeps stderr clean)
Filtering
strace -e trace=openat,mmap ./prog # only these
strace -e 'trace=!futex,clock_gettime' ./prog # exclude noise (quoted so the shell does not expand !)
strace -e trace=%process ./prog # process mgmt (fork/execve/wait/exit)
strace -e trace=%file ./prog # file ops
strace -e trace=%network ./prog # sockets
strace -e trace=%signal ./prog # signals
strace -e trace=%ipc ./prog # SysV IPC
strace -e trace=%desc ./prog # fd ops
strace -e trace=%memory ./prog # mmap/brk/mprotect
Debugging / performance
strace -c ./prog # tally per-syscall count + time
strace -tt -T ./prog # wall time + duration per line
strace -s 200 ./prog # print up to 200 bytes of each string argument (the default of 32 is too short)
strace -y ./prog # show path after fd, e.g. "3</tmp/foo>"
strace -yy ./prog # even more detail (socket shows ip:port)
strace -k ./prog # userland stack trace for each syscall
Common combos
# debug where a program is stuck
strace -f -p PID
# see which files get touched at startup
strace -f -e trace=openat -o load.log ./prog
# investigate EPERM/EACCES
strace -f -e trace=openat ./prog 2>&1 | grep -E 'EACCES|EPERM'
# find hotspots
strace -c -f ./prog
2. The three layers of program load (important)
Starting a program is roughly three layers. strace sees all of them — but don’t conflate them:
Layer 1: kernel + ld.so loading the ELF
- The kernel’s execve entry sets up a fresh address space and jumps to ld.so
- ld.so does (this is the very first chunk of any strace):
  - brk(NULL): probe heap top
  - mmap: ld.so’s own scratch space
  - access("/etc/ld.so.preload", ...): check preload
  - openat("/etc/ld.so.cache", ...) + mmap: the .so index
  - For each .so (libc, libpthread, …): openat → read ELF header → fstat → multiple mmaps (text PROT_READ|PROT_EXEC, data PROT_READ|PROT_WRITE, bss anonymous) → mprotect (RELRO) → close
- Jump to the program entry point (_start)
Signature: all mapped paths are .so files under /lib/ or /usr/lib/.
Layer 2: libc runtime init (before main)
- arch_prctl(ARCH_SET_FS, ...): set TLS base (errno, __thread, stack canary all rely on it)
- set_tid_address(...): tell the kernel where the thread ID lives (used by futex / robust list)
- set_robust_list(...): foundation for pthread lock recovery
- rseq(...): restartable sequences (modern glibc/kernel)
- brk(...): grow the heap for malloc
- openat("/usr/lib/locale/...", ...) + mmap: locale data
- May also read /etc/nsswitch.conf, /etc/passwd, /etc/resolv.conf, etc. (program-dependent)
Signature: reading data files rather than .sos.
Layer 3: program main() (business logic)
- ls: ioctl tty detection → openat directory → getdents64 → write → exit
- cat: openat file → read loop → write → close → exit
How to find the layer boundaries: everything after the first arch_prctl is Layer 2; everything after the first business-related syscall (e.g. ls’s ioctl/statx) is Layer 3.
3. Canonical startup sequence (worth memorizing)
execve(...) # entry point (strace's first line)
# === ld.so loading phase ===
brk(NULL) # probe heap
mmap(NULL, 8192, RW, MAP_ANON, ...) # ld.so scratch
access("/etc/ld.so.preload", R_OK) # usually ENOENT
openat("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC)
fstat(3, ...)
mmap(NULL, ..., PROT_READ, MAP_PRIVATE, 3, 0)
close(3)
openat("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC)
read(3, ELF header, 832) # read ELF header
fstat(3, ...)
mmap(..., PROT_READ, MAP_PRIVATE, 3, 0) # probe layout
mmap(..., PROT_READ|PROT_EXEC, ..., 3, off) # text
mmap(..., PROT_READ|PROT_WRITE, ..., 3, off)# data
mprotect(..., PROT_READ) # RELRO
close(3)
# ... repeat for each .so
# === libc init ===
arch_prctl(ARCH_SET_FS, 0x...) # TLS base
set_tid_address(...)
set_robust_list(...)
rseq(...)
mprotect(..., PROT_READ) # lock GOT
prlimit64(0, RLIMIT_STACK, ...)
brk(NULL); brk(0x...) # grow heap
openat("/usr/lib/locale/...", ...) # locale (lots of ENOENT+success pairs)
mmap(...)
# === program main ===
... actual program logic ...
# === exit ===
close(1); close(2)
exit_group(0)
4. Key syscalls annotated (ls /tmp edition)
| syscall | meaning |
|---|---|
| execve(prog, argv, envp) | replace the current process image |
| brk(NULL) | query current heap top |
| brk(0x...) | grow heap to this address |
| arch_prctl(ARCH_SET_FS, addr) | set FS register → TLS base |
| set_tid_address(p) | on thread exit, kernel clears *p and futex_wakes |
| access(path, R_OK) | check readability (TOCTOU-prone; modern code avoids) |
| openat(AT_FDCWD, path, flags) | open (returns lowest available fd) |
| O_CLOEXEC | auto-close on exec (prevents fd leak to children) |
| O_DIRECTORY | must be a directory, else ENOTDIR |
| O_NONBLOCK | non-blocking (not just sockets; avoids blocking on named pipes at open time too) |
| fstat(fd, ...) | fd metadata |
| statx(...) | modern stat with more fields (btime, attrs) |
| mmap(...) | see Day 1’s four-quadrant table |
| mprotect(addr, len, prot) | change page permissions (RELRO locks GOT; JIT marks code PROT_EXEC after writing) |
| ioctl(fd, TCGETS, ...) | = isatty(fd) check |
| ioctl(fd, TIOCGWINSZ, ...) | get terminal rows × cols |
| getdents64(fd, buf, count) | read directory entries (loop until it returns 0) |
| write(fd, buf, n) | write; may be short, must loop |
| close(fd) | close fd; check the return value (NFS / cache flush errors) |
| exit_group(status) | exit all threads (= _exit for single-threaded) |
5. Notable observations (things I tripped over today)
5.1 ld.so doesn’t read locale
/usr/lib/locale/* is read by libc itself (setlocale), not by the loader. Rule of thumb:
- ld.so only mmaps ELF files
- locale / nsswitch / passwd / resolv.conf are data files read by libc during init/runtime
5.2 UTF-8 / utf8 double lookup
glibc’s setlocale normalizes internally: first try the user’s spelling (en_US.UTF-8), and if that fails, fall back to lowercase no-hyphen (en_US.utf8).
To suppress: LC_ALL=C ls /tmp — C is a built-in locale that reads no files. Cuts strace lines roughly in half.
5.3 ls multi-column vs single-column (the isatty pattern)
if (isatty(STDOUT_FILENO)) {
// multi-column + color + query window size
} else {
// single column + no color
}
isatty() is implemented as ioctl(fd, TCGETS, ...): the call succeeds on a terminal and fails with ENOTTY on anything else.
Broadly applied: grep --color=auto, git, pip, curl progress bars — same pattern. Build it into your own CLIs.
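A minimal sketch of that probe, assuming Linux and glibc headers. The helper names fd_is_tty and output_mode are hypothetical, and the raw ioctl is used only to mirror what strace shows; in real code isatty(fd) is the portable call.

```c
#include <sys/ioctl.h>
#include <termios.h>

/* Hypothetical helper: returns 1 if fd refers to a terminal, 0 otherwise.
 * Issues the same TCGETS ioctl that appears in strace output. */
static int fd_is_tty(int fd)
{
    struct termios t;
    /* Succeeds on a terminal; fails (typically ENOTTY) on files and pipes. */
    return ioctl(fd, TCGETS, &t) == 0;
}

/* Hypothetical helper: choose an output mode the way ls does. */
static const char *output_mode(int fd)
{
    return fd_is_tty(fd) ? "multi-column + color" : "single column, no color";
}
```

Calling output_mode(STDOUT_FILENO) once at startup and caching the answer is usually enough; the result only changes if stdout is redirected, which happens before the program starts.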
5.4 fd reuse rules
POSIX guarantees open returns the lowest available fd.
- Single-threaded: close(3) → open() → still 3
- Multi-threaded trap: thread A close(3) concurrent with thread B open() → B grabs 3, then A’s subsequent read(3) reads from B’s file (fd reuse race)
- fds 0/1/2 aren’t kernel-enforced; they’re a shell + libc convention. When daemonizing, explicitly close 0/1/2 and reopen them on /dev/null.
5.5 mmap of tiny files seems wasteful but libc does it anyway
Mapping a whole page for a 54-byte file looks wasteful, but libc pays that tax for code uniformity. LLMs reading strace often flag it as a bug — it’s normal noise.
6. fd limits
Three tiers
- soft limit (what the process actually gets): ulimit -n / getrlimit(RLIMIT_NOFILE)
- hard limit (ceiling the soft limit can be raised to): raising it needs root (CAP_SYS_RESOURCE)
- system-wide:
  - /proc/sys/fs/nr_open: ceiling for a single process’s hard limit
  - /proc/sys/fs/file-max: total open file handles across the whole system
Raising them
- shell: ulimit -n 65536
- persistent: /etc/security/limits.conf or systemd LimitNOFILE=
- code: setrlimit(RLIMIT_NOFILE, ...)
- container: docker --ulimit; k8s has no pod-level ulimit field, so the limit usually comes from the container runtime’s defaults or is raised (up to the hard limit) in the entrypoint
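For the "code" route, a minimal sketch of raising the soft limit from inside the process. The helper name raise_nofile_soft is hypothetical; it assumes the caller is unprivileged, so it caps the request at the current hard limit and never lowers an existing limit.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft RLIMIT_NOFILE to `want`, capped at the
 * hard limit. Returns the resulting soft limit, or 0 on error. */
static rlim_t raise_nofile_soft(rlim_t want)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;

    if (want > rl.rlim_max) want = rl.rlim_max;  /* cannot exceed the hard limit */
    if (want > rl.rlim_cur) {
        rl.rlim_cur = want;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;
    }
    return rl.rlim_cur;
}
```

A server would typically call something like raise_nofile_soft(65536) early in main, before it starts accepting connections, and log the value it actually got.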
Symptoms of hitting the limit
- accept() / open() returning EMFILE (too many open files)
- Classic on long-connection servers in Java/Node/Nginx: the 1024 default isn’t enough
Inspecting a process’s current fds
ls -la /proc/PID/fd
ls -la /proc/self/fd # yourself
7. How the shell launches external commands
Default: fork + execve
shell: clone(...) # fork child
parent: wait4(child, ...)
child: execve("/bin/ls", argv, envp) # replace image
... all of ls's syscalls ...
exit_group(0)
parent receives SIGCHLD, wait4 returns
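The same sequence written out in C, as a sketch rather than what any particular shell does internally. The helper name run_and_wait is hypothetical; it takes the full path in argv[0] and skips PATH lookup, job control, and signal setup.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run argv[0] the way a shell runs an external command.
 * Returns the child's exit status, or -1 on failure or abnormal termination. */
static int run_and_wait(char *const argv[])
{
    pid_t pid = fork();                  /* appears as clone(...) in strace */
    if (pid < 0) return -1;

    if (pid == 0) {
        execve(argv[0], argv, environ);  /* only returns if exec failed */
        _exit(127);                      /* status shells use for "not found" */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {   /* appears as wait4(...) */
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running a program that calls this under strace -f shows the clone, the child’s execve, and the parent’s wait4 in the order listed above.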
Optimization: last-command exec
bash -c "ls" (a single command with nothing after it) doesn’t fork; bash just execves itself into ls. strace ls /tmp (without -f) shows no fork for a different reason: no shell is involved at all. strace forks its own tracee, which execs ls directly, so the trace begins at execve.
bash -c "exec ls" triggers the same behavior explicitly.
How to actually see the fork
strace -f bash -c "ls; echo done" — because there’s still an echo to run afterwards, bash has to keep itself around, so it must fork.
8. Modern tooling for process-related syscalls
# system-wide new-process tracing (container debugging gold)
sudo execsnoop-bpfcc
# bpftrace on execve
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
printf("%s\n", str(args->filename));
}'
# perf trace — lighter than strace
perf trace ./prog
perf trace -e 'syscalls:sys_enter_clone' -a # system-wide clone
# where is the process?
sudo cat /proc/PID/stack # current kernel stack
sudo cat /proc/PID/wchan # which kernel function it's blocked in
sudo cat /proc/PID/syscall # syscall currently executing + args
9. Easy traps
short read / short write
read/write may complete only partially — always loop. Networks and pipes mandate looping, regular files should loop too.
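#include <errno.h>
#include <unistd.h>
/* Write all len bytes of buf to fd. Returns 0 on success, -1 on error (errno set). */
static int write_all(int fd, const char *buf, size_t len) {
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted before writing anything: retry */
            return -1;
        }
        done += (size_t)n; /* short write: advance and go around again */
    }
    return 0;
}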
EINTR
A slow syscall interrupted by a signal returns EINTR. SA_RESTART makes the kernel auto-retry, but not every syscall supports auto-restart. The safe move is to write your own retry loop.
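A retry loop of that kind can be as small as the sketch below. The helper name read_retry is hypothetical; it retries only on EINTR and leaves every other error to the caller.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read() that retries when a signal interrupts it. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;   /* >= 0: bytes read (0 means EOF); < 0: real error, errno set */
}
```

This handles interruption only. A short read is still possible, so callers that need exactly count bytes must loop on top of it.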
ioctl is type-unsafe
The third argument is untyped (the prototype is variadic; by convention it is a pointer or an integer), so the compiler can’t help you. Pass the wrong cmd or a struct of the wrong size and the kernel reads or writes past your buffer. Famously a CVE goldmine.
close() can fail too
NFS / network filesystems / any fd with buffered writes can return errors at close time. Check it in critical paths.
O_CLOEXEC isn’t the default
Classic open doesn’t set close-on-exec, so fds leak into child processes across exec. Modern code always passes O_CLOEXEC. The other fd-creating calls have equivalents: pipe2 and dup3 take O_CLOEXEC, socket and accept4 take SOCK_CLOEXEC.
dense futex calls
Usually pthread lock contention or wakeups. A wall of futex in strace points to lock contention — it’s either a bug or a hot path.
10. Common syscall templates
Startup (fork + exec)
clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, ...) = child_pid
[child] execve(...)
[parent] wait4(child_pid, ...)
Read a whole file
openat(AT_FDCWD, path, O_RDONLY|O_CLOEXEC) = 3
fstat(3, ...) # get size
read(3, buf, size) # may need to loop
close(3)
TCP server accept loop
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, ...) = 3
bind(3, ...)
listen(3, backlog)
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, ...)
loop:
epoll_wait(4, ...)
accept4(3, ..., SOCK_CLOEXEC|SOCK_NONBLOCK) = 5
epoll_ctl(4, EPOLL_CTL_ADD, 5, ...)
... read/write on 5 ...
Log writer with sync
openat(..., O_WRONLY|O_APPEND|O_CLOEXEC)
write(...)
fsync(fd) # flush data + metadata to disk
# or fdatasync: data, plus only the metadata needed to read it back (e.g. size)
close(fd)
11. Today’s takeaways
- Three load layers: ld.so loading .sos / libc init (TLS, locale, …) / program logic. Read strace in segments.
- The isatty pattern is everywhere: TCGETS means tty detection; CLIs use it to flip between interactive and piped modes.
- fd is lowest-available: stable in single-threaded code, racy across threads.
- Shell exec-only optimization: bash -c "single_cmd" doesn’t fork.
- Always pass -f, bump -s 200 for longer strings, use -e trace=%group for grouping.
- ioctl is an escape hatch: type-unsafe + a sprawling interface, hard to filter at fine grain with seccomp.
- LC_ALL=C kills locale noise; common in debugging and minimal containers.
Day 3 preview
Tomorrow: seccomp. We’ll bump into:
- prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0): the first line for unprivileged filter installation
- seccomp(SECCOMP_SET_MODE_STRICT, ...) / SECCOMP_SET_MODE_FILTER
- BPF programs see struct seccomp_data (containing nr, arch, args[6], instruction_pointer)
- strace still works after a filter is set (PTRACE sits above seccomp), but a syscall killed by the filter shows up as the process dying from SIGSYS (SIGKILL in strict mode)
- ioctl is a sore point inside filters (discussed above)
Think ahead: if a seccomp filter denies mmap, what from Day 1 / Day 2 immediately breaks? (Hint: large mallocs, ld.so loading, locale.)