Day 1: Linux Syscall Internals

A deep dive into how Linux system calls actually work on x86_64: the ABI, the syscall instruction’s hardware side effects, the full user→kernel path, and a categorized cheat sheet of the syscalls worth knowing.

1. x86_64 syscall ABI

Register convention

| Purpose | Register | Notes |
| --- | --- | --- |
| syscall number | rax | Set before entering kernel |
| Arg 1 | rdi | |
| Arg 2 | rsi | |
| Arg 3 | rdx | |
| Arg 4 | r10 | Not rcx: the syscall instruction uses rcx for the return address |
| Arg 5 | r8 | |
| Arg 6 | r9 | |
| Return value | rax | -4095..-1 means -errno; the glibc wrapper translates it |

Contrast with a normal function call (System V AMD64 ABI): argument order is rdi/rsi/rdx/rcx/r8/r9. Syscalls move arg 4 from rcx to r10 because the hardware needs rcx for something else.
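As a rough illustration of the convention, the sketch below issues write through libc's generic syscall() function via ctypes, passing the syscall number first and the three arguments after it. The helper name raw_write is made up for this example, and the hard-coded syscall numbers (1 on x86_64, 64 on aarch64) are assumptions that only hold on those two architectures.

```python
import ctypes
import os
import platform

# __NR_write is architecture-specific. Hard-coded assumption for this sketch:
# 64 on aarch64, 1 on x86_64 (anything else is treated as x86_64).
NR_WRITE = 64 if platform.machine() == "aarch64" else 1

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long


def raw_write(fd: int, data: bytes) -> int:
    """Call write through libc's syscall(): number first, then args 1..3."""
    ret = libc.syscall(
        ctypes.c_long(NR_WRITE),      # syscall number (rax on x86_64)
        ctypes.c_long(fd),            # arg 1 (rdi on x86_64)
        ctypes.c_char_p(data),        # arg 2 (rsi on x86_64)
        ctypes.c_size_t(len(data)),   # arg 3 (rdx on x86_64)
    )
    if ret < 0:
        # libc's syscall() already turned the kernel's -errno into -1 + errno.
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


if __name__ == "__main__":
    raw_write(1, b"hi\n")
```

Running the same program under strace -e trace=write should show the call with the same three arguments.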

Hardware side effects of syscall

When the CPU executes syscall, it automatically:

  • rcx ← rip (return address)
  • r11 ← rflags
  • rip ← MSR_LSTAR (jump to kernel entry)
  • Loads kernel CS/SS from MSR_STAR, switches to ring 0
  • rflags &= ~SFMASK (the MSR Linux calls MSR_SYSCALL_MASK; clears IF/TF/DF, etc.)
  • Does not switch the stack — this is the key difference vs. int 0x80

So the syscall clobber list must include rcx and r11; after a syscall, user code can no longer rely on those two registers.


2. syscall vs int 0x80 vs sysenter

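| Mechanism | Vendor | Era | 64-bit |
| --- | --- | --- | --- |
| int 0x80 | Generic interrupt gate | Classic 32-bit Linux | Works but goes through the compat path; args truncated to 32 bits |
| sysenter/sysexit | Intel (Pentium II) | 32-bit | Invalid in long mode on AMD CPUs; on Intel, Linux uses it only for 32-bit compat processes |
| syscall/sysret | AMD (K6-2) | Buggy in 32-bit | The 64-bit standard |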

Why syscall is fast: no IDT lookup, no interrupt stack frame, no TSS-based stack switch. It’s a CPU instruction built specifically for system calls — practically a direct jump.

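Why using int 0x80 in 64-bit is a trap: it enters via the 32-bit compat ABI. The number in eax is looked up in the i386 syscall table (where 1 is exit, not write), arguments are taken from ebx/ecx/edx/esi/edi/ebp, and each one is truncated to 32 bits, so 64-bit pointers get sliced.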
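The two hazards can be shown with plain arithmetic. This is only an illustrative sketch: the helper names are invented here, and the two small dictionaries copy a handful of entries from the x86_64 and i386 syscall tables rather than reading them from the system.

```python
# A few entries copied by hand from the two syscall tables (illustration only).
NR_X86_64 = {"read": 0, "write": 1, "open": 2, "close": 3}
NR_I386 = {"exit": 1, "read": 3, "write": 4, "open": 5, "close": 6}


def truncate_to_32(value: int) -> int:
    """Keep only the low 32 bits, as happens to each argument on the compat path."""
    return value & 0xFFFFFFFF


def i386_name_for(nr: int) -> str:
    """Name the syscall that a number selects when looked up in the i386 table."""
    for name, number in NR_I386.items():
        if number == nr:
            return name
    return "unknown"


if __name__ == "__main__":
    stack_pointer = 0x7FFD12345678              # a typical 64-bit stack address
    print(hex(truncate_to_32(stack_pointer)))   # the address the kernel would see
    print(i386_name_for(NR_X86_64["write"]))    # what eax = 1 means via int 0x80
```

A 64-bit program that loads 1 into eax expecting write and then executes int 0x80 therefore asks for exit, and any buffer above 4 GiB is addressed incorrectly.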


3. The full user → kernel path (using write(1, "hi\n", 3))

[user space]
glibc write() wrapper:
    mov $1, %eax          ; __NR_write
    mov $1, %edi          ; fd
    mov $msg, %rsi        ; buf
    mov $3, %edx          ; count
    syscall               ; --- hardware boundary ---
                          ; rcx ← rip, r11 ← rflags, ring 0, jump MSR_LSTAR
    cmp $-4095, %rax      ; check for error
    jae set_errno
    ret

[kernel space]
entry_SYSCALL_64 (arch/x86/entry/entry_64.S):
    swapgs                    # switch GS base to per-CPU
    use per-CPU data to find the kernel stack
    switch rsp to the kernel stack
    push all user registers → pt_regs

do_syscall_64 (arch/x86/entry/common.c):
    nr = rax & __SYSCALL_MASK
    ret = sys_call_table[nr](pt_regs)   # dispatch; recent kernels use a generated switch (x64_sys_call) instead of an indirect call

__x64_sys_write → ksys_write → vfs_write
    file = fdget_pos(fd)
    file->f_op->write(...) or ->write_iter(...)   # goes through driver / fs

Return:
    return value → pt_regs->rax
    restore registers from pt_regs
    swapgs back to user GS
    sysretq                             # rip ← rcx, rflags ← r11, ring 3

[user space]
    wrapper checks rax, sets errno or returns

What the glibc wrapper actually does (it’s very thin)

  1. Stage registers
  2. syscall
  3. Check whether rax falls in the errno range; if so, set errno and return -1
  4. (For cancellation points such as write, the wrapper also brackets the syscall with pthread cancellation handling)

It does not do any validity checking (fd validity, buffer address validity, etc.) — those are all the kernel’s job.
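The errno check in step 3 can be mimicked in a few lines. The sketch below is not glibc's code; decode_raw_return and write_to_bad_fd are names invented for this example. The first function applies the same unsigned comparison as cmp $-4095, %rax; jae to a raw register value, and the second shows that an invalid fd is rejected by the kernel rather than by the wrapper.

```python
import errno
import os

MAX_ERRNO = 4095


def decode_raw_return(rax: int) -> tuple:
    """Split a raw rax value into (result, errno_value).

    errno_value is 0 on success; on failure result is -1 and
    errno_value is the positive error code.
    """
    rax &= 0xFFFFFFFFFFFFFFFF            # view the register as unsigned 64-bit
    if rax >= 2**64 - MAX_ERRNO:         # unsigned test for the range -4095..-1
        return -1, 2**64 - rax           # negate to recover the positive errno
    return rax, 0                        # anything else is a normal result


def write_to_bad_fd() -> int:
    """Ask the kernel to write to fd -1 and return the errno it reported."""
    try:
        os.write(-1, b"x")
    except OSError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    print(decode_raw_return(-errno.EBADF))
    print(write_to_bad_fd() == errno.EBADF)
```

The unsigned comparison matters because large successful results, such as high mmap addresses, also look negative when read as signed values.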


4. Observing it yourself

Confirm how thin the wrapper is

gdb ./your_program
(gdb) disas write
# mov $1, %eax; syscall; ret path

Inspect syscall arguments

strace -e trace=write ./your_program
strace -f -o out.log ls /tmp      # -f follows forks
strace -c ./your_program          # per-syscall count/time summary

Watch the kernel dispatch (ftrace)

sudo -i
cd /sys/kernel/tracing
echo function_graph > current_tracer
echo __x64_sys_write > set_graph_function
echo 1 > tracing_on
# in another terminal, run the test program
echo 0 > tracing_on
cat trace

Live kernel-function inspection (bpftrace)

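# __x64_sys_write takes a single struct pt_regs *, so arg0/arg2 there are not fd/count.
# Probe ksys_write(fd, buf, count) instead:
sudo bpftrace -e 'kprobe:ksys_write {
    printf("pid=%d fd=%d count=%d\n", pid, arg0, arg2);
}'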

Source locations

  • arch/x86/entry/entry_64.S — entry_SYSCALL_64 assembly
  • arch/x86/entry/common.c — do_syscall_64
  • arch/x86/entry/syscalls/syscall_64.tbl — nr → handler table
  • include/linux/syscalls.h — handler declarations

5. Quick reference: the syscalls that matter (by category)

Process / program loading

| syscall | Purpose | Key points |
| --- | --- | --- |
| fork | Duplicate current process | COW; parent/child share pages until a write |
| clone | Generic process/thread creation | Flags control what’s shared (VM/files/FS/…). pthread_create is built on clone |
| execve | Replace current program | Does not return on success. fds inherited by default; O_CLOEXEC ones are not |
| exit_group | Exit all threads | The raw exit syscall ends only the calling thread; glibc’s exit() and _exit() both end in exit_group |
| wait4 / waitid | Wait for child | Skipping this leaves a zombie |
| getpid / gettid | pid / tid | Equal in the main thread; they differ in every other thread |
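A minimal sketch of two rows from the table, using Python's thin wrappers over the same syscalls: a fork followed by a wait that reaps the child, and a pid/tid comparison from inside a second thread. The function names are made up for this example.

```python
import os
import threading


def run_child_and_reap(exit_code: int) -> int:
    """fork a child that exits at once, then reap it so no zombie is left."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)            # child: exit without Python cleanup
    _, status = os.waitpid(pid, 0)     # parent: collect the exit status
    return os.WEXITSTATUS(status)


def pid_and_tid_in_thread() -> tuple:
    """Return (pid, tid) as seen from inside a newly started thread."""
    seen = []

    def record() -> None:
        seen.append((os.getpid(), threading.get_native_id()))

    worker = threading.Thread(target=record)
    worker.start()
    worker.join()
    return seen[0]


if __name__ == "__main__":
    print(run_child_and_reap(7))
    print(pid_and_tid_in_thread())
```

The pid reported inside the thread matches the process pid, while the tid is a separate kernel task id.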

File I/O

| syscall | Purpose | Key points |
| --- | --- | --- |
| openat | Open a file | dirfd + path. AT_FDCWD is equivalent to open. Prefer openat |
| open | Legacy | Internally glibc does openat(AT_FDCWD, ...) |
| read / write | Read/write fd | Short reads/writes are normal; loop if needed |
| pread / pwrite | Read/write at offset | Doesn’t change file offset; safe with shared fds across threads |
| readv / writev | Scatter/gather | Multiple buffers in one syscall |
| lseek | Change file offset | Pipes/sockets aren’t seekable |
| close | Close fd | Check the return value (NFS / cache flush errors) |
| dup / dup2 / dup3 | Duplicate fd | Shell’s 2>&1 uses dup2 |
| fcntl | Misc fd / file attributes | F_GETFL/F_SETFL to change O_NONBLOCK, etc.; F_SETFD for FD_CLOEXEC |
| fstat / fstatat / statx | File metadata | statx is the modern one with extra fields (btime, attrs) |
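The offset behaviour of pread and dup is easy to check on a temporary file. This is a small sketch with invented function names: the first function shows that pread leaves the file offset where it was, and the second shows that a descriptor made by dup shares its offset with the original, which is the reason pread is the safer choice when an fd is shared.

```python
import os
import tempfile


def pread_leaves_offset_alone() -> tuple:
    """pread from the middle of a file and report the fd offset afterwards."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)            # rewind: offset is now 0
        chunk = os.pread(fd, 4, 3)              # read 4 bytes at offset 3
        offset = os.lseek(fd, 0, os.SEEK_CUR)   # query the current offset
        return chunk, offset
    finally:
        os.close(fd)
        os.unlink(path)


def dup_shares_offset() -> int:
    """Read through one fd and report the offset seen through its dup."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"abcdef")
        os.lseek(fd, 0, os.SEEK_SET)
        fd2 = os.dup(fd)
        os.read(fd, 2)                          # advances the shared offset
        offset_via_dup = os.lseek(fd2, 0, os.SEEK_CUR)
        os.close(fd2)
        return offset_via_dup
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(pread_leaves_offset_alone())
    print(dup_shares_offset())
```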

Directory operations (*at family: anchored on dirfd instead of cwd)

openat, fstatat, unlinkat, renameat2, linkat, symlinkat, mkdirat, faccessat, readlinkat, fchmodat, fchownat, utimensat, mknodat

Why: avoids cwd races, defends against TOCTOU, and enables capability-style sandboxing (pass an O_PATH dirfd around as a capability).
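A sketch of the dirfd pattern, assuming a throwaway temporary directory: every operation is resolved relative to a directory descriptor opened with O_PATH, so the process's cwd never enters the picture. The function name and the file names are placeholders.

```python
import os
import tempfile


def dirfd_demo() -> bytes:
    """Create, read, and remove a file relative to a directory fd."""
    with tempfile.TemporaryDirectory() as base:
        # O_PATH gives a handle that is only usable as an anchor for *at calls.
        dirfd = os.open(base, os.O_PATH | os.O_DIRECTORY)
        try:
            os.mkdir("sub", dir_fd=dirfd)                     # mkdirat
            fd = os.open("sub/note.txt",
                         os.O_CREAT | os.O_WRONLY, 0o600,
                         dir_fd=dirfd)                        # openat
            os.write(fd, b"via dirfd")
            os.close(fd)

            fd = os.open("sub/note.txt", os.O_RDONLY, dir_fd=dirfd)
            data = os.read(fd, 64)
            os.close(fd)

            os.unlink("sub/note.txt", dir_fd=dirfd)           # unlinkat
            os.rmdir("sub", dir_fd=dirfd)                     # unlinkat + AT_REMOVEDIR
            return data
        finally:
            os.close(dirfd)


if __name__ == "__main__":
    print(dirfd_demo())
```

Because the descriptor keeps referring to the same directory even if it is renamed or the cwd changes, it can be handed to other code as a limited capability.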

Memory

| syscall | Purpose | Key points |
| --- | --- | --- |
| mmap | Map memory | Four quadrants: {private, shared} × {anonymous, file-backed} |
| munmap | Unmap | |
| mprotect | Change page permissions | JIT pattern: write code, then mprotect to PROT_EXEC |
| mremap | Move/grow a mapping | Used by realloc for large blocks |
| brk / sbrk | Move heap break | sbrk is a libc wrapper over brk; glibc malloc uses the heap for requests below M_MMAP_THRESHOLD (128 KiB by default, adjusted dynamically) |
| madvise | Hint to the kernel | MADV_DONTNEED returns memory immediately; MADV_HUGEPAGE, etc. |

The four mmap quadrants:

  • anon + private: large malloc, stack, bss. Swap is the fallback. Fork is COW.
  • anon + shared: parent/child IPC
  • file + private: .so text/rodata. Reads share the page cache; writes COW and don’t go back to the file
  • file + shared: mmap large files, shm. Writeback returns to the file. Page cache shared across processes
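The difference between the two file-backed quadrants can be observed directly. In this sketch, which uses an invented helper name and a 4 KiB temporary file, the same write is made through a private and then a shared mapping, and the file is read back with pread to see whether the change reached it.

```python
import mmap
import os
import tempfile


def write_through_mapping(flags: int) -> bytes:
    """Map a temp file with the given flags, modify the mapping,
    then read the first bytes of the file back through the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * 4096)
        mapping = mmap.mmap(fd, 4096, flags=flags,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            mapping[0:5] = b"HELLO"       # private: COW copy; shared: page cache
            return os.pread(fd, 5, 0)     # what the file itself now contains
        finally:
            mapping.close()
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(write_through_mapping(mmap.MAP_PRIVATE))
    print(write_through_mapping(mmap.MAP_SHARED))
```

With MAP_PRIVATE the file keeps its original bytes; with MAP_SHARED the read sees the new bytes because both paths go through the same page cache.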

IPC / synchronization

| syscall | Purpose | Key points |
| --- | --- | --- |
| pipe / pipe2 | Anonymous pipe | pipe2 takes flags (O_CLOEXEC/O_NONBLOCK) |
| socketpair | Full-duplex unix socket pair | Stronger than pipe: can pass fds (SCM_RIGHTS) |
| futex | Userspace fast-lock backing | Underpins pthread mutex/cond/sem. Seeing futex in strace usually means contention |
| eventfd | Counter signal as an fd | Feeds events into epoll |
| signalfd | Signal as an fd | Demuxes async signals into synchronous reads |
| kill / tgkill | Send a signal | tgkill targets a specific thread group + tid precisely |
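Passing a descriptor over a socketpair looks roughly like the sketch below. It sends the read end of a pipe as SCM_RIGHTS ancillary data and then reads from the received copy. For brevity both ends live in one process; in practice the two sockets would sit in different processes. The function name is invented for this example.

```python
import array
import os
import socket


def pass_fd_over_socketpair() -> bytes:
    """Send a pipe's read end across a socketpair and read via the copy."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pipe_r, pipe_w = os.pipe()
    try:
        # One SCM_RIGHTS control message carrying a single descriptor.
        fds = array.array("i", [pipe_r])
        left.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

        _, ancdata, _, _ = right.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
        level, ctype, payload = ancdata[0]
        if (level, ctype) != (socket.SOL_SOCKET, socket.SCM_RIGHTS):
            raise RuntimeError("unexpected control message")
        received = array.array("i")
        received.frombytes(payload[: fds.itemsize])
        new_fd = received[0]              # new descriptor, same underlying pipe

        os.write(pipe_w, b"hello")
        try:
            return os.read(new_fd, 5)
        finally:
            os.close(new_fd)
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
        left.close()
        right.close()


if __name__ == "__main__":
    print(pass_fd_over_socketpair())
```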

I/O multiplexing

| syscall | Purpose | Key points |
| --- | --- | --- |
| select / pselect | Old-school multiplex | fd_set capped at 1024 fds |
| poll / ppoll | Array-based | No cap, but O(n) |
| epoll_create1 / epoll_ctl / epoll_wait | Linux high-performance | Wait cost scales with ready fds, not watched fds; the usual choice for servers with many connections |
| io_uring_setup / io_uring_enter | Next-gen async I/O | Submission/completion queues, fewer syscalls |
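A minimal epoll round trip, sketched with a pipe instead of sockets so it needs no network: the read end is registered, a first wait finds nothing ready, and after one byte is written the second wait reports the descriptor. The function name is invented for this example.

```python
import os
import select


def epoll_pipe_demo() -> tuple:
    """Return (events before the write, events after the write, read fd)."""
    r, w = os.pipe()
    ep = select.epoll()                     # epoll_create1
    try:
        ep.register(r, select.EPOLLIN)      # epoll_ctl(EPOLL_CTL_ADD)
        before = ep.poll(0)                 # epoll_wait with timeout 0
        os.write(w, b"x")                   # make the read end readable
        after = ep.poll(1)                  # epoll_wait, up to 1 second
        return before, after, r
    finally:
        ep.close()
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    before, after, fd = epoll_pipe_demo()
    print(before, after, fd)
```

Only descriptors that are actually ready come back from the wait, which is what keeps the cost independent of how many idle descriptors are registered.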

Networking

| syscall | Purpose | Key points |
| --- | --- | --- |
| socket | Create a socket | family/type/protocol |
| bind / listen / accept4 | Server trio | accept4 takes flags |
| connect | Client connect | A non-blocking socket returns EINPROGRESS immediately |
| send / recv / sendto / recvfrom / sendmsg / recvmsg | I/O | The msg variants are the most general; they carry ancillary data |
| setsockopt / getsockopt | Socket options | SO_REUSEADDR, TCP_NODELAY, etc. |

Time

| syscall | Purpose | Key points |
| --- | --- | --- |
| clock_gettime | Read time | Usually served from the vDSO without entering the kernel |
| nanosleep / clock_nanosleep | Sleep | |
| timerfd_create | Timer as fd | epoll-friendly |

Privilege / security (heavily used in week 2 / 3)

| syscall | Purpose | Key points |
| --- | --- | --- |
| setuid / setgid / setresuid | Change uid/gid | Privilege drop |
| capset / capget | POSIX capabilities | Finer-grained than root/non-root |
| prctl | Various process attributes | PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, PR_SET_DUMPABLE |
| seccomp | Install a seccomp filter | Day 3-4 focus |
| unshare / setns / clone(CLONE_NEW*) | Namespaces | Container foundation: pid/net/mnt/uts/ipc/user/cgroup |
| pivot_root / chroot | Switch root | Container rootfs |
| mount / umount2 | Mount | Containers mount procfs, do bind mounts |
| ptrace | Trace another process | What strace/gdb sit on top of |
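As one small example from this group, the sketch below sets PR_SET_NO_NEW_PRIVS through libc's prctl and reads it back. The flag cannot be cleared once set, so the call is made in a forked child to leave the calling process untouched. The function name is invented here, and the two constants are copied by hand from linux/prctl.h.

```python
import ctypes
import os

# Values copied from <linux/prctl.h>.
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39


def no_new_privs_in_child() -> int:
    """Set no_new_privs in a forked child and return the value it read back.

    Returns 255 if the prctl call failed in the child.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    ul = ctypes.c_ulong                     # prctl takes unsigned long arguments
    pid = os.fork()
    if pid == 0:
        code = 255
        try:
            if libc.prctl(PR_SET_NO_NEW_PRIVS, ul(1), ul(0), ul(0), ul(0)) == 0:
                code = libc.prctl(PR_GET_NO_NEW_PRIVS, ul(0), ul(0), ul(0), ul(0))
        finally:
            os._exit(code)                  # never return into the parent's code
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(no_new_privs_in_child())
```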

vDSO (pseudo-syscalls that usually do not enter the kernel)

clock_gettime, gettimeofday, time, and getcpu usually go through the vDSO: the kernel maps a small code + data region into every process, and userspace reads it directly without a syscall. strace doesn’t see these calls unless the vDSO has to fall back to the real syscall.
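The mapping itself is visible from userspace. This sketch, with invented function names, looks for the [vdso] entry in /proc/self/maps and takes two clock_gettime readings; whether those readings actually avoid the kernel depends on the clock source and can be checked by running the script under strace.

```python
import time


def vdso_mapped() -> bool:
    """Report whether /proc/self/maps lists a [vdso] mapping."""
    try:
        with open("/proc/self/maps") as maps:
            return any(line.rstrip().endswith("[vdso]") for line in maps)
    except OSError:
        return False                      # /proc not available


def monotonic_pair() -> tuple:
    """Take two back-to-back CLOCK_MONOTONIC readings."""
    first = time.clock_gettime(time.CLOCK_MONOTONIC)
    second = time.clock_gettime(time.CLOCK_MONOTONIC)
    return first, second


if __name__ == "__main__":
    print(vdso_mapped())
    print(monotonic_pair())
```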


6. Common pitfalls / easy mix-ups

  • Short read/write: read can return fewer bytes than requested, and so can write. Loop.
  • EINTR: slow syscalls can be interrupted by a signal. Use SA_RESTART or retry yourself.
  • errno is thread-local: no cross-thread interference.
  • A wall of mmap in strace: usually the dynamic linker loading each PT_LOAD segment of every .so.
  • futex calls: typically pthread lock contention — either a bug or a hot path.
  • First syscalls after execve: usually ld.so initialization, commonly brk(NULL) probing the heap + mmap loading .so files.
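The short read/write advice boils down to two small loops. The helpers below are a sketch with invented names. No EINTR retry appears because Python's os.read and os.write already retry interrupted calls themselves; code that issues the syscalls from C has to handle EINTR or rely on SA_RESTART.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Keep calling write until every byte has been accepted."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)      # may accept fewer bytes than offered
        view = view[written:]


def read_exact(fd: int, n: int) -> bytes:
    """Keep calling read until n bytes have arrived or EOF is reached."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)    # may return fewer bytes than asked
        if not chunk:                     # empty result means EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    r, w = os.pipe()
    write_all(w, b"hello")
    os.close(w)
    print(read_exact(r, 10))
    os.close(r)
```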

7. Command cheat sheet

# syscall numbers
grep '^#define __NR_' /usr/include/asm/unistd_64.h | head
# on Debian/Ubuntu the header lives at /usr/include/x86_64-linux-gnu/asm/unistd_64.h

# program dependencies
ldd ./prog
readelf -d ./prog       # DT_NEEDED gives deps
readelf -l libc.so.6    # PT_LOAD segments

# trace
strace ./prog
strace -e trace=openat,mmap ./prog
strace -f -p PID
ltrace ./prog                       # library calls (usually libc)
perf trace ./prog                   # lower overhead than strace (perf events instead of ptrace)

# kernel side
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write { @[comm] = count(); }'
sudo perf top

Key takeaways

  1. The syscall ABI is a register-based contract; the glibc wrapper is just a shim.
  2. The syscall instruction is special-purpose hardware with side effects (rcx/r11) and bypasses the interrupt flow.
  3. Kernel dispatch is just sys_call_table[nr] — extremely minimal.
  4. The *at family uses dirfd instead of cwd, making it both thread-safe and TOCTOU-resistant.
  5. The mmap four-quadrant model determines backing / fork / writeback behavior.
  6. vDSO keeps hot syscalls out of the kernel.