Day 2: strace + Common Syscalls in Practice

A practical guide to strace: flag cheat sheet, how to read program startup as three distinct layers (ld.so → libc init → main), annotated syscalls, and the gotchas you actually hit (fd reuse races, EINTR, short writes, locale noise).

1. strace cheat sheet

Basics

strace ./prog                       # trace prog
strace -p PID                       # attach to an existing process
strace -f ./prog                    # follow forks/clones ★essential★
strace -ff -o trace ./prog          # one file per PID: trace.<pid>
strace -o out.log ./prog            # write to file (keeps stderr clean)

Filtering

strace -e trace=openat,mmap ./prog              # only these
strace -e 'trace=!futex,clock_gettime' ./prog   # exclude noise (quoted: ! is special to the shell)
strace -e trace=%process ./prog                 # process mgmt (fork/execve/wait/exit)
strace -e trace=%file ./prog                    # file ops
strace -e trace=%network ./prog                 # sockets
strace -e trace=%signal ./prog                  # signals
strace -e trace=%ipc ./prog                     # SysV IPC
strace -e trace=%desc ./prog                    # fd ops
strace -e trace=%memory ./prog                  # mmap/brk/mprotect

Debugging / performance

strace -c ./prog                    # tally per-syscall count + time
strace -tt -T ./prog                # wall time + duration per line
strace -s 200 ./prog                # print up to 200 bytes per string (default 32 is too short)
strace -y ./prog                    # show path after fd, e.g. "3</tmp/foo>"
strace -yy ./prog                   # even more detail (socket shows ip:port)
strace -k ./prog                    # userland stack trace for each syscall

Common combos

# debug where a program is stuck
strace -f -p PID

# see which files get touched at startup
strace -f -e trace=openat -o load.log ./prog

# investigate EPERM/EACCES
strace -f -e trace=openat ./prog 2>&1 | grep -E 'EACCES|EPERM'

# find hotspots
strace -c -f ./prog

2. The three layers of program load (important)

Starting a program is roughly three layers. strace sees all of them — but don’t conflate them:

Layer 1: kernel + ld.so loading the ELF

The kernel’s execve entry sets up a fresh address space and jumps to ld.so
ld.so does (this is the very first chunk of any strace):
- brk(NULL) — probe heap top
- mmap — ld.so’s own scratch space
- access("/etc/ld.so.preload", ...) — check preload
- openat("/etc/ld.so.cache", ...) + mmap — the .so index
- For each .so (libc, libpthread, …): openat → read ELF header → fstat → multiple mmaps (text PROT_READ|PROT_EXEC, data PROT_READ|PROT_WRITE, bss anonymous) → mprotect (RELRO) → close
Jump to the program entry point (_start)

Signature: apart from /etc/ld.so.cache, every mapped path is a .so file under /lib/ or /usr/lib/.

Layer 2: libc runtime init (before main)

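arch_prctl(ARCH_SET_FS, ...): set the TLS base on x86-64 (errno, __thread variables, and the stack canary all rely on it)
set_tid_address(...): register the address the kernel zeroes and futex-wakes when the thread exits (this is how pthread_join notices)
set_robust_list(...): register the robust-futex list, the foundation for recovering pthread locks held by a dead thread
rseq(...): register restartable sequences (modern glibc/kernel)
brk(...): grow the heap for malloc
openat("/usr/lib/locale/...", ...) + mmap: locale data, loaded when the program calls setlocale() (usually the first thing main does, but it is libc code, so it groups naturally with this layer)
May also read /etc/nsswitch.conf, /etc/passwd, /etc/resolv.conf, etc. (program-dependent)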

Signature: reading data files rather than .sos.

Layer 3: program main() (business logic)

ls: ioctl tty detection → openat directory → getdents64 → write → exit
cat: openat file → read loop → write → close → exit

How to find the layer boundaries: Layer 2 starts at the first arch_prctl and runs until the first business-related syscall (e.g. ls’s ioctl/statx), where Layer 3 begins.

3. Canonical startup sequence (worth memorizing)

execve(...)                                 # entry point (strace's first line)

# === ld.so loading phase ===
brk(NULL)                                   # probe heap
mmap(NULL, 8192, RW, MAP_ANON, ...)         # ld.so scratch
access("/etc/ld.so.preload", R_OK)          # usually ENOENT
openat("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC)
fstat(3, ...)
mmap(NULL, ..., PROT_READ, MAP_PRIVATE, 3, 0)
close(3)

openat("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC)
read(3, ELF header, 832)                    # read ELF header
fstat(3, ...)
mmap(..., PROT_READ, MAP_PRIVATE, 3, 0)     # probe layout
mmap(..., PROT_READ|PROT_EXEC, ..., 3, off) # text
mmap(..., PROT_READ|PROT_WRITE, ..., 3, off)# data
mprotect(..., PROT_READ)                    # RELRO
close(3)
# ... repeat for each .so

# === libc init ===
arch_prctl(ARCH_SET_FS, 0x...)              # TLS base
set_tid_address(...)
set_robust_list(...)
rseq(...)
mprotect(..., PROT_READ)                    # lock GOT
prlimit64(0, RLIMIT_STACK, ...)
brk(NULL); brk(0x...)                       # grow heap
openat("/usr/lib/locale/...", ...)          # locale (lots of ENOENT+success pairs)
mmap(...)

# === program main ===
... actual program logic ...

# === exit ===
close(1); close(2)
exit_group(0)

4. Key syscalls annotated (ls /tmp edition)

syscall	meaning
`execve(prog, argv, envp)`	replace the current process image
`brk(NULL)`	query current heap top
`brk(0x...)`	grow heap to this address
`arch_prctl(ARCH_SET_FS, addr)`	set FS register → TLS base
`set_tid_address(p)`	on thread exit, kernel clears `*p` and `futex_wake`s
`access(path, R_OK)`	check readability (TOCTOU-prone; modern code avoids)
`openat(AT_FDCWD, path, flags)`	open (returns lowest available fd)
`O_CLOEXEC`	auto-close on exec (prevents fd leak to children)
`O_DIRECTORY`	must be a directory, else ENOTDIR
`O_NONBLOCK`	non-blocking (not just sockets — avoids blocking on named pipes at open time too)
`fstat(fd, ...)`	fd metadata
`statx(...)`	modern stat with more fields (btime, attrs)
`mmap(...)`	see Day 1’s four-quadrant table
`mprotect(addr, len, prot)`	change page permissions (RELRO locks GOT; JIT marks code PROT_EXEC after writing)
`ioctl(fd, TCGETS, ...)`	= isatty(fd) check
`ioctl(fd, TIOCGWINSZ, ...)`	get terminal rows × cols
`getdents64(fd, buf, count)`	read directory entries (loop until it returns 0)
`write(fd, buf, n)`	write — may be short; must loop
`close(fd)`	close fd — check the return value (NFS / cache flush errors)
`exit_group(status)`	exit all threads (= `_exit` for single-threaded)

5. Notable observations (things I tripped over today)

5.1 ld.so doesn’t read locale

/usr/lib/locale/* is read by libc itself (setlocale), not by the loader. Rule of thumb:

ld.so only mmaps ELF files
locale / nsswitch / passwd / resolv.conf are data files read by libc during init/runtime

5.2 UTF-8 / utf8 double lookup

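glibc’s setlocale normalizes the codeset name internally: it first tries the user’s spelling (en_US.UTF-8) and, if that fails, falls back to the lowercase, hyphen-free form (en_US.utf8). In strace this shows up as an ENOENT followed by a successful openat for each locale file.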

To suppress: LC_ALL=C ls /tmp — C is a built-in locale that reads no files. Cuts strace lines roughly in half.

5.3 ls multi-column vs single-column (the isatty pattern)

if (isatty(STDOUT_FILENO)) {
    // multi-column + color + query window size
} else {
    // single column + no color
}

In glibc, isatty() has traditionally come down to ioctl(fd, TCGETS, ...): success means a terminal, failure with ENOTTY means it is not.

Broadly applied: grep --color=auto, git, pip, curl progress bars — same pattern. Build it into your own CLIs.

5.4 fd reuse rules

POSIX guarantees open returns the lowest available fd.

Single-threaded close(3) → open() → still 3
Multi-threaded trap: thread A calls close(3) while thread B calls open() → B is handed 3; if A (or anything still holding the stale number) later calls read(3), it reads from B’s file (fd reuse race)
fds 0/1/2 aren’t kernel-enforced — they’re a shell + libc convention. When daemonizing, explicitly close 0/1/2 and reopen them on /dev/null.

5.5 mmap of tiny files seems wasteful but libc does it anyway

Mapping a whole page for a 54-byte file looks wasteful, but libc pays that tax for code uniformity. LLMs reading strace often flag it as a bug — it’s normal noise.

6. fd limits

Three tiers

soft limit (what the process actually gets): ulimit -n / getrlimit(RLIMIT_NOFILE)
hard limit (ceiling the soft limit can be raised to): raising it needs root (CAP_SYS_RESOURCE); any process may lower its own
system-wide:
- /proc/sys/fs/nr_open: ceiling for a single process’s hard limit
- /proc/sys/fs/file-max: total open file handles across the whole system

Raising them

shell: ulimit -n 65536
persistent: /etc/security/limits.conf or systemd LimitNOFILE=
code: setrlimit(RLIMIT_NOFILE, ...)
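container: docker run --ulimit nofile=65536:65536; Kubernetes has no pod-level ulimit field, so set it in the container runtime’s defaults or raise the soft limit in the entrypoint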
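
For the "code" route, a minimal sketch of raising the soft limit from inside the process. It clamps the request to the hard limit, since an unprivileged process cannot go above that. The function name and the "return 0 on error" convention are assumptions for illustration.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the RLIMIT_NOFILE soft limit to `want`,
   clamped to the hard limit. Never lowers the limit.
   Returns the resulting soft limit, or 0 on error. */
rlim_t raise_nofile_limit(rlim_t want) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    if (want > rl.rlim_max)
        want = rl.rlim_max;             /* cannot exceed the hard limit */
    if (want > rl.rlim_cur) {
        rl.rlim_cur = want;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return 0;
    }
    return rl.rlim_cur;
}
```

A server would call something like raise_nofile_limit(65536) early in main and log the value it actually got.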

Symptoms of hitting the limit

accept() / open() returning EMFILE ("Too many open files", the per-process limit); ENFILE means the system-wide file-max was hit
Classic on long-connection servers in Java/Node/Nginx — the 1024 default isn’t enough

Inspecting a process’s current fds

ls -la /proc/PID/fd
ls -la /proc/self/fd     # yourself

7. How the shell launches external commands

Default: fork + execve

shell: clone(...)         # fork child
  parent: wait4(child, ...)
  child:  execve("/bin/ls", argv, envp)   # replace image
          ... all of ls's syscalls ...
          exit_group(0)
parent receives SIGCHLD, wait4 returns
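
The same sequence from the C side, as a minimal sketch of what a shell does for one external command. The helper name run_and_wait and the exit code 127 for a failed exec (borrowed from shell convention) are my own choices; under glibc on x86-64, fork typically shows up in strace as clone and waitpid as wait4.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run argv[0], return its exit status, or -1. */
int run_and_wait(char *const argv[]) {
    pid_t pid = fork();                     /* typically clone() in strace */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(argv[0], argv, environ);     /* returns only on failure */
        _exit(127);
    }
    int status;
    while (waitpid(pid, &status, 0) < 0) {  /* typically wait4() in strace */
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Run a program that calls this under strace -f to see both sides of the clone in one trace.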

Optimization: last-command exec

bash -c "ls", a single command with nothing after it, doesn’t fork; bash just execves itself into ls. strace ls /tmp (without -f) also shows no fork, but for a different reason: no shell is involved at all. strace forks a child itself, and the trace starts at that child’s execve of ls.

bash -c "exec ls" triggers the same behavior explicitly.

How to actually see the fork

strace -f bash -c "ls; echo done" — because there’s still an echo to run afterwards, bash has to keep itself around, so it must fork.

8. Beyond strace: other tracing tools

# system-wide new-process tracing (container debugging gold)
sudo execsnoop-bpfcc

# bpftrace on execve
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("%s\n", str(args->filename));
}'

# perf trace — lighter than strace
perf trace ./prog
perf trace -e 'syscalls:sys_enter_clone' -a   # system-wide clone

# where is the process?
sudo cat /proc/PID/stack    # current kernel stack
sudo cat /proc/PID/wchan    # which kernel function it's blocked in
sudo cat /proc/PID/syscall  # syscall currently executing + args

9. Easy traps

short read / short write

read/write may complete only partially, so always loop. Sockets and pipes make looping mandatory; regular files should loop too.

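#include <errno.h>
#include <unistd.h>

// Write all n bytes of buf to fd; returns 0 on success, -1 on error (errno set).
int write_all(int fd, const char *buf, size_t n) {
    size_t done = 0;
    while (done < n) {
        ssize_t w = write(fd, buf + done, n - done);
        if (w < 0) {
            if (errno == EINTR) continue;   // interrupted before writing: retry
            return -1;                      // real error
        }
        done += (size_t)w;                  // short write: advance and loop
    }
    return 0;
}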

EINTR

A slow syscall interrupted by a signal returns EINTR. SA_RESTART makes the kernel auto-retry, but not every syscall supports auto-restart. The safe move is to write your own retry loop.

ioctl is type-unsafe

The third argument is variadic (in practice an untyped pointer or integer), so the compiler can’t help you. Pass the wrong cmd or a struct of the wrong size and the kernel reads or writes past your buffer. Famously a CVE goldmine.

close() can fail too

NFS / network filesystems / any fd with buffered writes can return errors at close time. Check it in critical paths.

O_CLOEXEC isn’t the default

Classic open doesn’t set close-on-exec, so fds leak into child processes across exec. Modern code always passes O_CLOEXEC. The other fd-creating calls take an equivalent flag: pipe2(O_CLOEXEC), socket(SOCK_CLOEXEC), accept4(SOCK_CLOEXEC), dup3(O_CLOEXEC).

dense futex calls

Usually pthread lock contention or condition-variable wakeups. A wall of futex calls in strace means threads are spending their time waiting on each other: either a contention bug or a genuinely hot lock.

10. Common syscall templates

Startup (fork + exec)

clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, ...) = child_pid
[child] execve(...)
[parent] wait4(child_pid, ...)

Read a whole file

openat(AT_FDCWD, path, O_RDONLY|O_CLOEXEC) = 3
fstat(3, ...)                               # get size
read(3, buf, size)                          # may need to loop
close(3)
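
The "may need to loop" step is the part people skip. A minimal sketch of the read side, mirroring the write loop from section 9; read_full is a made-up name.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read up to n bytes, looping over short reads.
   Returns bytes read (less than n only at EOF) or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n) {
    size_t done = 0;
    while (done < n) {
        ssize_t r = read(fd, (char *)buf + done, n - done);
        if (r == 0)
            break;                      /* EOF */
        if (r < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry */
            return -1;
        }
        done += (size_t)r;              /* short read: advance and loop */
    }
    return (ssize_t)done;
}
```

With the size from fstat in hand, the template becomes open, fstat, read_full(fd, buf, size), close.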

TCP server accept loop

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, ...) = 3
bind(3, ...)
listen(3, backlog)
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, ...)
loop:
  epoll_wait(4, ...)
  accept4(3, ..., SOCK_CLOEXEC|SOCK_NONBLOCK) = 5
  epoll_ctl(4, EPOLL_CTL_ADD, 5, ...)
  ... read/write on 5 ...

Log writer with sync

openat(..., O_WRONLY|O_APPEND|O_CLOEXEC)
write(...)
fsync(fd)        # flush data + metadata to disk
# or fdatasync — data only
close(fd)

11. Today’s takeaways

Three load layers: ld.so loading .sos / libc init (TLS, locale, …) / program logic. Read strace in segments.
The isatty pattern is everywhere: TCGETS means tty detection — CLIs use it to flip between interactive and piped modes.
fd is lowest-available: stable in single-threaded code, racy across threads.
Shell exec-only optimization: bash -c "single_cmd" doesn’t fork; it execs the command directly.
Always pass -f, bump -s 200 for buffers, use -e trace=%group for grouping.
ioctl is an escape hatch: type-unsafe + a sprawling interface, hard to filter at fine grain with seccomp.
LC_ALL=C kills locale noise — common in debugging and minimal containers.

Day 3 preview

Tomorrow: seccomp. We’ll bump into:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) — the first line for unprivileged filter installation
seccomp(SECCOMP_SET_MODE_STRICT, ...) / SECCOMP_SET_MODE_FILTER
BPF programs see struct seccomp_data (containing nr, arch, args[6], instruction_pointer)
strace still works after a filter is set (PTRACE sits above seccomp), but a syscall killed by the filter shows up as the process dying from SIGSYS (SIGKILL in strict mode)
ioctl is a sore point inside filters (discussed above)

Think ahead: if a seccomp filter denies mmap, what from Day 1 / Day 2 immediately breaks? (Hint: large mallocs, ld.so loading, locale.)