Day 1: Linux Syscall Internals
A deep dive into how Linux system calls actually work on x86_64: the ABI, the syscall instruction’s hardware side effects, the full user→kernel path, and a categorized cheat sheet of the syscalls worth knowing.
1. x86_64 syscall ABI
Register convention
| Purpose | Register | Notes |
|---|---|---|
| syscall number | rax | Set before entering kernel |
| Arg 1 | rdi | |
| Arg 2 | rsi | |
| Arg 3 | rdx | |
| Arg 4 | r10 | Not rcx — the syscall instruction uses rcx for the return address |
| Arg 5 | r8 | |
| Arg 6 | r9 | |
| Return value | rax | -4095..-1 is errno; the glibc wrapper translates it |
Contrast with a normal function call (System V AMD64 ABI): argument order is rdi/rsi/rdx/rcx/r8/r9. Syscalls move arg 4 from rcx to r10 because the hardware needs rcx for something else.
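The errno convention in the return-value row can be sketched in a few lines. This is an illustration of the range check only, not glibc's code; `decode_syscall_return` and `MAX_ERRNO` are names made up for this example.

```python
import errno

MAX_ERRNO = 4095  # the kernel reserves -4095..-1 for error codes


def decode_syscall_return(rax: int):
    """Split a raw 64-bit rax value into (result, errno_name).

    `rax` is the unsigned register value as seen right after `syscall`.
    """
    # Reinterpret the unsigned 64-bit value as signed two's complement.
    signed = rax - (1 << 64) if rax >= (1 << 63) else rax
    if -MAX_ERRNO <= signed <= -1:
        code = -signed
        # What a libc wrapper does at this point: store the code in errno, return -1.
        return -1, errno.errorcode.get(code, f"errno {code}")
    return signed, None


print(decode_syscall_return(3))                         # e.g. write() accepted 3 bytes
print(decode_syscall_return((1 << 64) - errno.EBADF))   # raw -EBADF in rax
```

Values just outside the window, such as -4096, are passed through as ordinary results, which is why calls like mmap can return very large addresses without being mistaken for errors.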
Hardware side effects of syscall
When the CPU executes syscall, it automatically:
- rcx ← rip (return address)
- r11 ← rflags
- rip ← MSR_LSTAR (jump to kernel entry)
- Loads kernel CS/SS from MSR_STAR, switches to ring 0
- rflags &= ~MSR_SFMASK (clears IF/TF/DF, etc.)
- Does not switch the stack: this is the key difference vs. int 0x80
So the syscall clobber list must include rcx and r11; after a syscall, user code can no longer rely on those two registers.
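A toy model makes the clobbering concrete. Everything here is illustrative: the register dictionary, the `cpl` field, and the MSR values are invented for the sketch, and `rip` is taken to be the address of the instruction following `syscall`.

```python
# Flag bit positions in rflags.
X86_EFLAGS_TF = 1 << 8
X86_EFLAGS_IF = 1 << 9
X86_EFLAGS_DF = 1 << 10


def execute_syscall(regs: dict, msrs: dict) -> dict:
    """Return the register state right after a `syscall` instruction (toy model)."""
    new = dict(regs)
    new["rcx"] = regs["rip"]                           # return address, old rcx is lost
    new["r11"] = regs["rflags"]                        # saved user flags, old r11 is lost
    new["rip"] = msrs["LSTAR"]                         # kernel entry point
    new["rflags"] = regs["rflags"] & ~msrs["SFMASK"]   # masked flags
    new["cpl"] = 0                                     # now in ring 0
    # rsp is deliberately untouched: the kernel entry code must switch stacks itself.
    return new


# Hypothetical values, chosen only to make the effect visible.
msrs = {"LSTAR": 0xFFFFFFFF81000000, "SFMASK": X86_EFLAGS_TF | X86_EFLAGS_IF | X86_EFLAGS_DF}
user = {"rip": 0x401005, "rflags": 0x246, "rsp": 0x7FFD0000,
        "rcx": 0x1234, "r11": 0x5678, "cpl": 3}
after = execute_syscall(user, msrs)
print({name: hex(value) for name, value in after.items()})
```

The old contents of rcx and r11 are gone after the call, while rsp still points at the user stack.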
2. syscall vs int 0x80 vs sysenter
| Mechanism | Vendor | Era | 64-bit |
|---|---|---|---|
| int 0x80 | Generic interrupt gate | Classic 32-bit Linux | Works but goes through the compat path; args truncated to 32 bits |
| sysenter/sysexit | Intel (P2) | 32-bit | Not used by 64-bit code: invalid in long mode on AMD, and Linux keeps it only for 32-bit compat processes on Intel |
| syscall/sysret | AMD (K6-2) | Buggy in 32-bit | The 64-bit standard |
Why syscall is fast: no IDT lookup, no interrupt stack frame, no TSS-based stack switch. It’s a CPU instruction built specifically for system calls — practically a direct jump.
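Why using int 0x80 in 64-bit is a trap: it enters via the 32-bit compat ABI, so the number in eax is looked up in the i386 syscall table (where write is 4, not 1) and the arguments are read from ebx/ecx/edx/esi/edi/ebp. Each argument is truncated to 32 bits, so 64-bit pointers get sliced and anything above 4 GiB, including the stack, becomes unreachable.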
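The pointer slicing is easy to see with plain arithmetic. The addresses below are hypothetical, and `as_seen_by_int80` is just a name for the masking step.

```python
def as_seen_by_int80(arg: int) -> int:
    """Keep only the low 32 bits, as happens to each argument on the compat path."""
    return arg & 0xFFFF_FFFF


stack_ptr = 0x7FFD_1234_5678   # made-up address in the usual 64-bit stack range
low_ptr = 0x0040_2000          # made-up address below 4 GiB

print(hex(as_seen_by_int80(stack_ptr)))  # upper half is gone
print(hex(as_seen_by_int80(low_ptr)))    # survives unchanged
```

This is why such code can appear to work with static buffers in a small non-PIE binary and then fail with EFAULT as soon as a stack buffer is passed.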
3. The full user → kernel path (using write(1, "hi\n", 3))
[user space]
    glibc write() wrapper:
        mov $1, %eax            ; __NR_write
        mov $1, %edi            ; fd
        mov $msg, %rsi          ; buf
        mov $3, %edx            ; count
        syscall                 ; --- hardware boundary ---
                                ; rcx ← rip, r11 ← rflags, ring 0, jump MSR_LSTAR
        cmp $-4095, %rax        ; check for error
        jae set_errno
        ret
[kernel space]
    entry_SYSCALL_64 (arch/x86/entry/entry_64.S):
        swapgs                  # switch GS base to per-CPU
        use per-CPU data to find the kernel stack
        switch rsp to the kernel stack
        push all user registers → pt_regs
    do_syscall_64 (arch/x86/entry/common.c):
        nr = rax & __SYSCALL_MASK
        ret = sys_call_table[nr](pt_regs)   # dispatch
    __x64_sys_write → ksys_write → vfs_write
        file = fdget(fd)
        file->f_op->write(...)  # goes through driver / fs
    Return:
        return value → pt_regs->rax
        restore registers from pt_regs
        swapgs back to user GS
        sysretq                 # rip ← rcx, rflags ← r11, ring 3
[user space]
    wrapper checks rax, sets errno or returns
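The dispatch step in the middle of that path is small enough to mimic. This is a toy model, not kernel code: the handlers are stand-ins, the `regs` dictionary imitates pt_regs loosely, and only the syscall numbers (1 = write, 39 = getpid on x86_64) are real.

```python
import errno


def sys_write(regs):
    """Stand-in for __x64_sys_write: unpack arguments from the saved registers."""
    _fd, buf, count = regs["di"], regs["si"], regs["dx"]
    return min(count, len(buf))   # pretend every requested byte was written


def sys_getpid(regs):
    return 4242                   # made-up pid


# nr -> handler, in the spirit of syscall_64.tbl
sys_call_table = {1: sys_write, 39: sys_getpid}


def do_syscall_64(regs):
    """Look up the handler by number and store its result back into ax."""
    handler = sys_call_table.get(regs["ax"])
    regs["ax"] = handler(regs) if handler else -errno.ENOSYS
    return regs


ok = do_syscall_64({"ax": 1, "di": 1, "si": b"hi\n", "dx": 3})
bad = do_syscall_64({"ax": 9999, "di": 0, "si": b"", "dx": 0})
print(ok["ax"], bad["ax"])
```

Each handler receives the whole saved register set and picks its own arguments out of it, which mirrors why the real handlers take a single pt_regs pointer.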
What the glibc wrapper actually does (it’s very thin)
- Stage registers
- syscall
- Check whether rax falls in the errno range; if so, set errno and return -1
- (pthread cancellation point check)
It does not do any validity checking (fd validity, buffer address validity, etc.) — those are all the kernel’s job.
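One way to watch the wrapper's errno translation from outside is to call libc's `write` through ctypes with an fd that cannot be valid. A minimal sketch, assuming a Linux process whose libc symbols are reachable via `ctypes.CDLL(None)`:

```python
import ctypes
import errno
import os

# use_errno=True makes ctypes keep a copy of errno that get_errno() can read back.
libc = ctypes.CDLL(None, use_errno=True)
libc.write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t

ctypes.set_errno(0)
ret = libc.write(-1, b"hi\n", 3)   # fd -1 is never valid; the kernel rejects it
err = ctypes.get_errno()
print(ret, errno.errorcode[err], os.strerror(err))
```

The wrapper passed -1 straight through. The rejection came back from the kernel as -EBADF in rax, and the wrapper turned it into a -1 return plus errno.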
4. Observing it yourself
Confirm how thin the wrapper is
gdb ./your_program
(gdb) disas write
# mov $1, %eax; syscall; ret path
Inspect syscall arguments
strace -e trace=write ./your_program
strace -f -o out.log ls /tmp # -f follows forks
strace -c ./your_program # hotspot summary
Watch the kernel dispatch (ftrace)
sudo -i
cd /sys/kernel/tracing
echo function_graph > current_tracer
echo __x64_sys_write > set_graph_function
echo 1 > tracing_on
# in another terminal, run the test program
echo 0 > tracing_on
cat trace
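Live kernel-function inspection (bpftrace)
# __x64_sys_write takes a single struct pt_regs *, so arg0/arg2 there are not fd/count.
# The syscall tracepoint exposes the decoded arguments by name instead.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write {
printf("pid=%d fd=%d count=%d\n", pid, args->fd, args->count);
}'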
Source locations
- arch/x86/entry/entry_64.S: entry_SYSCALL_64 assembly
- arch/x86/entry/common.c: do_syscall_64
- arch/x86/entry/syscalls/syscall_64.tbl: nr → handler table
- include/linux/syscalls.h: handler declarations
5. Quick reference: the syscalls that matter (by category)
Process / program loading
| syscall | Purpose | Key points |
|---|---|---|
| fork | Duplicate current process | COW; parent/child share pages until a write |
| clone | Generic process/thread creation | Flags control what’s shared (VM/files/FS/…). pthread_create is built on clone |
| execve | Replace current program | Does not return on success. fds inherited by default; O_CLOEXEC ones are not |
| exit_group | Exit all threads | The raw exit syscall ends only the calling thread; glibc’s exit() and _exit() both go through exit_group |
| wait4 / waitid | Wait for child | Skipping this leaves a zombie |
| getpid / gettid | pid / tid | These differ in multithreaded programs |
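A short sketch of two rows from this table: reaping a forked child, and the pid/tid difference inside a thread. It assumes Linux and Python 3.8+ (for `threading.get_native_id`); the exit status 7 is arbitrary.

```python
import os
import threading

pid = os.fork()
if pid == 0:
    # Child: leave immediately with a recognisable status, skipping Python cleanup.
    os._exit(7)

# Parent: reap the child so it does not linger as a zombie.
reaped_pid, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None

# A worker thread shares the pid but has its own tid.
worker_tids = []
worker = threading.Thread(target=lambda: worker_tids.append(threading.get_native_id()))
worker.start()
worker.join()
worker_tid = worker_tids[0]

print(exit_code, os.getpid(), worker_tid)
```

Running it under `strace -f` should show the clone, exit_group, and wait4 calls behind these library functions, though the exact set depends on the libc and Python build.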
File I/O
| syscall | Purpose | Key points |
|---|---|---|
| openat | Open a file | dirfd + path. AT_FDCWD is equivalent to open. Prefer openat |
| open | Legacy | Internally glibc does openat(AT_FDCWD, ...) |
| read / write | Read/write fd | Short reads/writes are normal; loop if needed |
| pread / pwrite | Read/write at offset | Doesn’t change file offset; safe with shared fds across threads |
| readv / writev | Scatter/gather | Multiple buffers in one syscall |
| lseek | Change file offset | Pipes/sockets aren’t seekable |
| close | Close fd | Check the return value (NFS / cache flush errors) |
| dup / dup2 / dup3 | Duplicate fd | Shell’s 2>&1 uses dup2 |
| fcntl | Misc fd / file attributes | F_GETFL/F_SETFL to change O_NONBLOCK, etc.; F_SETFD for FD_CLOEXEC |
| fstat / fstatat / statx | File metadata | statx is the modern one with extra fields (btime, attrs) |
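The pread row is worth seeing in action, since the offset behaviour is the whole point. A minimal sketch using a temporary file and fd-level calls only:

```python
import os
import tempfile

with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    os.write(fd, b"0123456789")
    os.lseek(fd, 2, os.SEEK_SET)                  # shared file offset is now 2
    chunk = os.pread(fd, 4, 5)                    # 4 bytes at absolute offset 5
    offset_after = os.lseek(fd, 0, os.SEEK_CUR)   # still 2: pread did not move it
    next_bytes = os.read(fd, 3)                   # plain read continues from offset 2

print(chunk, offset_after, next_bytes)
```

Because pread carries its own offset, two threads sharing the fd cannot race on an lseek + read pair.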
Directory operations (*at family: anchored on dirfd instead of cwd)
openat, fstatat, unlinkat, renameat2, linkat, symlinkat, mkdirat, faccessat, readlinkat, fchmodat, fchownat, utimensat, mknodat…
Why: avoids cwd races, defends against TOCTOU, and enables capability-style sandboxing (pass an O_PATH dirfd around as a capability).
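Python exposes the same idea through the `dir_fd` parameter, which maps onto the *at calls. A small sketch, with `note.txt` as a made-up file name inside a throwaway directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # O_PATH would also do when the fd is only ever used as an anchor.
    dirfd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Every path below is resolved relative to dirfd, never to the cwd.
        fd = os.open("note.txt", os.O_CREAT | os.O_WRONLY | os.O_CLOEXEC,
                     0o600, dir_fd=dirfd)
        os.write(fd, b"hello")
        os.close(fd)
        size = os.stat("note.txt", dir_fd=dirfd).st_size
        os.unlink("note.txt", dir_fd=dirfd)
        still_there = os.access("note.txt", os.F_OK, dir_fd=dirfd)
    finally:
        os.close(dirfd)

print(size, still_there)
```

If another thread calls chdir, or the directory is renamed mid-sequence, these calls still land in the directory the fd refers to.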
Memory
| syscall | Purpose | Key points |
|---|---|---|
| mmap | Map memory | Four quadrants: {private, shared} × {anonymous, file-backed} |
| munmap | Unmap | |
| mprotect | Change page permissions | JIT pattern: write code, then mprotect to PROT_EXEC |
| mremap | Move/grow a mapping | Used by realloc for large blocks |
| brk / sbrk | Move heap break | glibc malloc uses this for small allocations (below M_MMAP_THRESHOLD, 128 KiB by default); sbrk is a libc wrapper around brk |
| madvise | Hint to the kernel | MADV_DONTNEED returns memory immediately; MADV_HUGEPAGE, etc. |
The four mmap quadrants:
- anon + private: large malloc, stack, bss. Swap is the fallback. Fork is COW.
- anon + shared: parent/child IPC
- file + private: .so text/rodata. Reads share the page cache; writes COW and don’t go back to the file
- file + shared: mmap large files, shm. Writeback returns to the file. Page cache shared across processes
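The two anonymous quadrants can be told apart with a fork. A minimal sketch, assuming Linux; Python's `mmap.mmap(-1, ...)` creates an anonymous mapping, and the single-byte markers are arbitrary.

```python
import mmap
import os

shared = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_SHARED)    # anon + shared
private = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)  # anon + private
shared[0:1] = b"P"
private[0:1] = b"P"

pid = os.fork()
if pid == 0:
    shared[0:1] = b"C"    # lands in the page both processes map
    private[0:1] = b"C"   # triggers COW; only the child's copy changes
    os._exit(0)
os.waitpid(pid, 0)

seen_shared = shared[0:1]
seen_private = private[0:1]
print(seen_shared, seen_private)

shared.close()
private.close()
```

The parent sees the child's write only through the shared mapping, which is the anon + shared IPC case from the list.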
IPC / synchronization
| syscall | Purpose | Key points |
|---|---|---|
| pipe / pipe2 | Anonymous pipe | pipe2 takes flags (O_CLOEXEC/O_NONBLOCK) |
| socketpair | Full-duplex unix socket pair | Stronger than pipe: can pass fds (SCM_RIGHTS) |
| futex | Userspace fast-lock backing | Underpins pthread mutex/cond/sem. Seeing futex in strace usually means contention |
| eventfd | Counter signal as an fd | Feeds events into epoll |
| signalfd | Signal as an fd | Demuxes async signals into synchronous reads |
| kill / tgkill | Send a signal | tgkill targets a specific thread group + tid precisely |
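Passing an fd over a socketpair looks like this in practice. The sketch keeps both ends in one process for brevity; in real use the receiver would be another process. Variable names are illustrative.

```python
import array
import os
import socket

left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe2(os.O_CLOEXEC)

# Send the pipe's write end as SCM_RIGHTS ancillary data next to one payload byte.
fds = array.array("i", [w])
left.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

# Receive it: the kernel installs a new fd number referring to the same pipe.
payload, ancdata, _flags, _addr = right.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
level, ctype, data = ancdata[0]
received = array.array("i")
received.frombytes(data[: fds.itemsize])
new_w = received[0]

os.write(new_w, b"via passed fd")
message = os.read(r, 64)
print(payload, new_w != w, message)

for fd in (r, w, new_w):
    os.close(fd)
left.close()
right.close()
```

The received number differs from the one that was sent, but both refer to the same open pipe, so data written through it arrives at the original read end.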
I/O multiplexing
| syscall | Purpose | Key points |
|---|---|---|
| select / pselect | Old-school multiplex | fd_set capped at 1024 fds |
| poll / ppoll | Array-based | No cap, but O(n) |
| epoll_create1 / epoll_ctl / epoll_wait | Linux high-performance | O(1) ready notification; the usual choice for servers with many connections |
| io_uring_setup / io_uring_enter | Next-gen async I/O | Submission/completion queues, fewer syscalls |
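A minimal epoll round trip, using a pipe as the event source. This is a sketch of the register-then-wait pattern only, not a server loop.

```python
import os
import select

r, w = os.pipe2(os.O_NONBLOCK | os.O_CLOEXEC)
ep = select.epoll()                  # create the epoll instance
ep.register(r, select.EPOLLIN)       # epoll_ctl: watch the read end for input

before = ep.poll(0)                  # nothing written yet, returns at once
os.write(w, b"x")
after = ep.poll(1.0)                 # epoll_wait: the read end is now ready

print(before, after)

ep.close()
os.close(r)
os.close(w)
```

Only ready fds come back from the wait, so the cost of a wakeup tracks the number of active fds rather than the number registered.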
Networking
| syscall | Purpose | Key points |
|---|---|---|
| socket | Create a socket | family/type/protocol |
| bind / listen / accept4 | Server trio | accept4 takes flags |
| connect | Client connect | A non-blocking socket returns EINPROGRESS immediately |
| send / recv / sendto / recvfrom / sendmsg / recvmsg | I/O | The msg variants are the most general: they carry ancillary data |
| setsockopt / getsockopt | Socket options | SO_REUSEADDR, TCP_NODELAY, etc. |
Time
| syscall | Purpose | Key points |
|---|---|---|
| clock_gettime | Read time | Usually served from the vDSO, so it typically does not enter the kernel |
| nanosleep / clock_nanosleep | Sleep | |
| timerfd_create | Timer as fd | epoll-friendly |
Privilege / security (heavily used in week 2 / 3)
| syscall | Purpose | Key points |
|---|---|---|
| setuid / setgid / setresuid | Change uid/gid | Privilege drop |
| capset / capget | POSIX capabilities | Finer-grained than root/non-root |
| prctl | Various process attributes | PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, PR_SET_DUMPABLE |
| seccomp | Install a seccomp filter | Day 3-4 focus |
| unshare / setns / clone(CLONE_NEW*) | Namespaces | Container foundation: pid/net/mnt/uts/ipc/user/cgroup |
| pivot_root / chroot | Switch root | Container rootfs |
| mount / umount2 | Mount | Containers mount procfs, do bind mounts |
| ptrace | Trace another process | What strace/gdb sit on top of |
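As a taste of prctl, here is a sketch that sets PR_SET_NO_NEW_PRIVS in a forked child and reads it back. It assumes Linux with libc reachable through `ctypes.CDLL(None)`; the `prctl` helper is a local convenience, and the constants are copied from <linux/prctl.h>.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38   # values from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39

libc = ctypes.CDLL(None, use_errno=True)


def prctl(option, arg2=0):
    # prctl is variadic and reads unsigned long arguments, so pass them as c_ulong.
    zero = ctypes.c_ulong(0)
    return libc.prctl(ctypes.c_int(option), ctypes.c_ulong(arg2), zero, zero, zero)


pid = os.fork()
if pid == 0:
    # Child only: the flag is one-way and inherited, so keep it out of the parent.
    code = 2                       # reported if anything below fails
    try:
        prctl(PR_SET_NO_NEW_PRIVS, 1)
        if prctl(PR_GET_NO_NEW_PRIVS) == 1:
            code = 1
    finally:
        os._exit(code)

_, status = os.waitpid(pid, 0)
child_flag = os.WEXITSTATUS(status)
print("no_new_privs in child:", child_flag)
```

Once set, the flag cannot be cleared and survives execve, which is what makes it a suitable precondition for unprivileged seccomp filters.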
vDSO (pseudo-syscalls that normally stay out of the kernel)
clock_gettime, gettimeofday, time, and getcpu usually go through the vDSO: the kernel maps a small code + data region into every process and userspace reads it directly, with no syscall, so strace doesn’t see these calls. If the current clocksource cannot be read from userspace, the vDSO code falls back to a real syscall.
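The mapping itself is visible in /proc/self/maps. A small sketch that looks for it; `find_vdso` is a helper written for this example and returns None when no such mapping is listed.

```python
def find_vdso(maps_path="/proc/self/maps"):
    """Return (start, end, perms) of the [vdso] mapping, or None if it is absent."""
    with open(maps_path) as maps:
        for line in maps:
            fields = line.split()
            if fields and fields[-1] == "[vdso]":
                start, end = (int(part, 16) for part in fields[0].split("-"))
                return start, end, fields[1]
    return None


vdso = find_vdso()
if vdso:
    start, end, perms = vdso
    print(f"[vdso] {start:#x}-{end:#x} {perms} ({end - start} bytes)")
else:
    print("no [vdso] mapping listed")
```

On a typical x86_64 system this reports an executable mapping a few pages long, placed at a different address on each run because of ASLR.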
6. Common pitfalls / easy mix-ups
- Short read/write: read can return fewer bytes than requested, and so can write. Loop.
- EINTR: slow syscalls can be interrupted by a signal. Use SA_RESTART or retry yourself.
- errno is thread-local: no cross-thread interference.
- A wall of mmap in strace: usually the dynamic linker loading each PT_LOAD segment of every .so.
- futex calls: typically pthread lock contention — either a bug or a hot path.
- First syscall after execve: usually ld.so initialization, commonly brk(NULL) probing the heap + mmap loading .so files.
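The short-write pitfall from the list above comes down to one loop. A minimal sketch; `write_all` is a name chosen for this example, and the pipe is only there to give it something to write to.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Call write until every byte is accepted; return how many calls it took."""
    view = memoryview(data)
    calls = 0
    while view:
        written = os.write(fd, view)   # may accept fewer bytes than offered
        view = view[written:]          # retry with whatever is left
        calls += 1
    return calls


r, w = os.pipe()
calls = write_all(w, b"hello world")
echoed = os.read(r, 64)
print(calls, echoed)
os.close(r)
os.close(w)
```

Python retries EINTR internally since 3.5 (PEP 475), so the explicit retry from the EINTR bullet is needed in C but not here. The short-write loop is needed in both.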
7. Command cheat sheet
# syscall numbers
grep '^#define __NR_' /usr/include/asm/unistd_64.h | head
# program dependencies
ldd ./prog
readelf -d ./prog # DT_NEEDED gives deps
readelf -l libc.so.6 # PT_LOAD segments
# trace
strace ./prog
strace -e trace=openat,mmap ./prog
strace -f -p PID
ltrace ./prog # library calls (usually libc)
perf trace ./prog # lighter weight than strace
# kernel side
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write { @[comm] = count(); }'
sudo perf top
Key takeaways
- The syscall ABI is a register-based contract; the glibc wrapper is just a shim.
- The syscall instruction is special-purpose hardware with side effects (rcx/r11) and bypasses the interrupt flow.
- Kernel dispatch is just sys_call_table[nr], which is extremely minimal.
- The *at family uses dirfd instead of cwd, making it both thread-safe and TOCTOU-resistant.
- The mmap four-quadrant model determines backing / fork / writeback behavior.
- vDSO keeps hot syscalls out of the kernel.