I/O Systems and Storage
Connecting the CPU to the World
A CPU that could only compute but never receive input or produce output would be useless. I/O (Input/Output) systems connect the CPU to everything else: keyboards, displays, storage, networks. Understanding I/O reveals why database performance depends on storage layout, why network code looks the way it does, and why SSDs changed system design.
Real-world analogy: The CPU is the manager of a company. Registers and cache are the manager’s desk — tiny but instant access. RAM is the filing cabinet — bigger, a short walk away. Storage (disk/SSD) is the off-site archive — huge, but you need to send a request and wait. Network is the postal service — can reach anywhere, but latency is significant. A good manager (OS) orchestrates all these resources efficiently.
Part 1 — I/O System Overview
The Bus System
CPU ←──────────────── CPU Bus (fastest) ──────────────────→ Memory
←──────────────── PCIe Bus ─────────────────────────→ GPU, NVMe SSD
←──────────────── SATA Bus ─────────────────────────→ HDD, SATA SSD
←──────────────── USB Bus ──────────────────────────→ Keyboard, Mouse, USB devices
Modern system (simplified):
┌─────────┐ ┌───────────┐ ┌─────────────────────┐
│ CPU │←──│ Memory │←──│ DRAM │
│ │ │ Controller│ └─────────────────────┘
│ │ └───────────┘
│ │ ┌───────────┐ ┌─────────────────────┐
│ │←──│ PCIe │←──│ GPU / NVMe SSD │
│ │ │ Root Hub │ └─────────────────────┘
│ │ └───────────┘
│ │ ┌───────────┐ ┌─────────────────────┐
│ │←──│ PCH │←──│ SATA / USB / Audio │
│ │ │(Platform │ └─────────────────────┘
└─────────┘ │Controller │
│ Hub) │
└───────────┘
PCIe (Peripheral Component Interconnect Express)
The dominant high-speed I/O bus today.
PCIe 3.0 x1:  ~1 GB/s  (one lane)
PCIe 3.0 x4:  ~4 GB/s  (four lanes)
PCIe 3.0 x16: ~16 GB/s (sixteen lanes: the GPU slot)
PCIe 4.0 x4:  ~8 GB/s  (NVMe SSDs)
PCIe 5.0 x16: ~64 GB/s (next-gen GPUs)
Key: PCIe is a point-to-point, serial link (one bit at a time per lane, at very high signaling rates)
vs. old PCI: parallel bus, shared among all devices, much slower
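As a back-of-the-envelope rule, usable bandwidth scales with lane count and roughly doubles each generation. A minimal sketch that tabulates this (the per-lane figures are approximate, after encoding overhead):

#include <stdio.h>

/* Rough usable bandwidth per lane, in GB/s (approximate values). */
static double per_lane_GBps(int generation) {
    switch (generation) {
        case 3:  return 1.0;   /* PCIe 3.0 */
        case 4:  return 2.0;   /* PCIe 4.0 */
        case 5:  return 4.0;   /* PCIe 5.0 */
        default: return 0.0;
    }
}

int main(void) {
    int lanes[] = {1, 4, 16};
    for (int gen = 3; gen <= 5; gen++)
        for (int i = 0; i < 3; i++)
            printf("PCIe %d.0 x%-2d ~ %4.0f GB/s\n",
                   gen, lanes[i], lanes[i] * per_lane_GBps(gen));
    return 0;
}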
Part 2 — I/O Methods
Programmed I/O (Polling)
CPU directly reads/writes device registers in a loop.
// Pseudocode: send byte to UART (serial port)
// Register addresses are illustrative memory-mapped I/O locations
#define UART_STATUS (*(volatile uint8_t *)0xFE001000)
#define UART_DATA   (*(volatile uint8_t *)0xFE001004)
#define TX_EMPTY    0x01

void uartSendByte(uint8_t byte) {
    // Wait until the transmit buffer is empty
    while (!(UART_STATUS & TX_EMPTY)) {
        // busy-wait: CPU doing nothing useful
    }
    UART_DATA = byte;   // write to the device data register
}
Pros: simple, low latency for fast devices
Cons: CPU wastes 100% of its time waiting (busy-wait)
Unacceptable for slow devices (keyboard, disk)
Analogy: You order food at a restaurant, then stand at the kitchen window staring at the chef until your food is ready. You can’t do anything else.
Interrupt-Driven I/O
Device notifies CPU when ready via interrupt. CPU can do other work meanwhile.
1. CPU issues I/O request to device (e.g., "read sector 42")
2. CPU continues executing other processes
3. Device finishes → raises interrupt signal on interrupt line
4. CPU finishes current instruction
5. CPU saves state (PC, registers) onto stack
6. CPU jumps to Interrupt Service Routine (ISR) / interrupt handler
7. ISR reads data from device, copies to kernel buffer
8. ISR acknowledges interrupt (clears it)
9. CPU restores saved state, continues previous process
Timeline:
CPU: [other work...][other work...][ISR: 2µs][other work...]
Disk: [seek + rotate + read: 5ms]────────►[interrupt]
↑
CPU notified here
CPU was productive the entire 5ms! (not wasted in busy-wait)
Interrupt Handling in Hardware:
Interrupt Controller (e.g., APIC):
- Receives interrupt signals from all devices
- Prioritizes them (higher priority preempts lower)
- Sends IRQ (interrupt request) to CPU
- CPU has interrupt enable/disable flag (IF flag in x86)
Interrupt vector table:
- Array of function pointers, indexed by interrupt number
- IRQ 0: Timer
- IRQ 1: Keyboard
- IRQ 14: IDE disk
- Higher IRQs: other devices (numbers assigned by firmware/OS)
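Conceptually, the vector table is just an array of function pointers indexed by IRQ number. A minimal freestanding sketch, assuming a hypothetical memory-mapped keyboard data register (the address and IRQ assignments are illustrative):

#include <stdint.h>

#define NUM_IRQS 32
typedef void (*isr_t)(void);              /* an interrupt service routine        */
static isr_t vector_table[NUM_IRQS];      /* indexed by IRQ number               */

/* Hypothetical keyboard data register (address is illustrative). */
#define KBD_DATA (*(volatile uint8_t *)0xFE002000)

static void keyboard_isr(void) {
    uint8_t scancode = KBD_DATA;          /* read the device register            */
    (void)scancode;                       /* ...hand it to the input subsystem...*/
    /* then acknowledge/clear the interrupt at the interrupt controller          */
}

void register_isr(int irq, isr_t handler) {
    vector_table[irq] = handler;          /* e.g. register_isr(1, keyboard_isr)  */
}

/* Called by the CPU's low-level entry code after PC and registers are saved. */
void dispatch_interrupt(int irq) {
    if (irq >= 0 && irq < NUM_IRQS && vector_table[irq])
        vector_table[irq]();              /* indirect call via the vector table  */
}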
DMA (Direct Memory Access)
For bulk transfers, let a dedicated DMA controller move data directly to/from memory without CPU involvement.
Without DMA (CPU copies):
CPU: [read 1 byte from disk][write to RAM][read][write][read][write]...
× 512 times for one sector = CPU tied up for entire transfer
With DMA:
1. CPU programs DMA controller:
"Copy 512 bytes from disk controller to address 0x8000"
2. CPU continues other work
3. DMA controller executes the copy autonomously
- Reads from the device, writes to RAM over the memory bus
- CPU keeps running from cache/registers; DMA takes memory-bus cycles the CPU isn't using ("cycle stealing")
4. DMA complete → DMA controller interrupts CPU
5. CPU reads 512 bytes from 0x8000 (already in RAM)
Modern: PCIe devices do DMA directly (GPU, NVMe SSD)
GPU computes → writes result directly to CPU-accessible memory
NVMe SSD → DMAs data directly to kernel buffer, interrupts CPU
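A hypothetical memory-mapped DMA controller could be programmed like this; the register addresses and bit layout below are invented for illustration, not taken from a real chip:

#include <stdint.h>

/* Hypothetical DMA controller registers (addresses are illustrative). */
#define DMA_SRC   (*(volatile uint32_t *)0xFE004000) /* device-side source      */
#define DMA_DST   (*(volatile uint32_t *)0xFE004004) /* destination RAM address */
#define DMA_LEN   (*(volatile uint32_t *)0xFE004008) /* bytes to transfer       */
#define DMA_CTRL  (*(volatile uint32_t *)0xFE00400C) /* bit 0 = start           */

void dma_read_sector(uint32_t disk_buffer_addr, uint32_t dest_phys) {
    DMA_SRC  = disk_buffer_addr;   /* where the disk controller holds the data */
    DMA_DST  = dest_phys;          /* e.g. 0x8000, as in the example above     */
    DMA_LEN  = 512;                /* one sector                               */
    DMA_CTRL = 0x1;                /* start: controller copies autonomously    */
    /* CPU returns to other work; completion arrives later as an interrupt.    */
}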
Comparison:
Method      CPU Usage   Latency   Use Case
Polling     100%        Lowest    Fast devices, real-time
Interrupt   ~0%         Medium    Keyboard, mouse, network
DMA         ~0%         Medium    Disk, network bulk transfer
Part 3 — Storage Devices
HDD (Hard Disk Drive)
Physical structure:
┌────────────────────────────────────────┐
│ Spindle motor spins platters │
│ at 5400 or 7200 RPM │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Platter (magnetic disk) │ │
│ │ ┌─────────────────────────────┐│ │
│ │ │ Tracks (concentric circles)││ │
│ │ │ ┌──────────────────────┐ ││ │
│ │ │ │ Sectors (512B-4KB) │ ││ │
│ │ │ └──────────────────────┘ ││ │
│ │ └─────────────────────────────┘│ │
│ └──────────────────────────────────┘ │
│ Read/Write heads on actuator arm │
└────────────────────────────────────────┘
Access time = Seek time + Rotational latency + Transfer time
Seek time: 1-10ms (arm moves to correct track)
Rotational latency: 0-8ms (wait for sector to rotate under head)
avg ≈ 4.17ms at 7200 RPM (half a rotation: 60s / 7200 / 2 = 4.17ms)
Transfer time: 0.1ms (read the sector)
Total average: ~10ms per random read
Sequential reads: much faster (no seeking between reads)
Throughput:
Sequential: 100-200 MB/s
Random 4KB reads: ~0.5-1 MB/s (100-200 IOPS)
Key insight: HDDs HATE random access. Sequential is 100-400× faster.
This is why on-disk databases use B-trees (wide nodes, few seeks) rather than pointer-chasing structures like linked lists.
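One way to feel the gap on a real disk: time the same 4 MB of reads done sequentially and then at random 4KB offsets. A rough sketch using POSIX pread ("data.bin" is a placeholder for any file much larger than 4 MB; on a warm page cache both runs will be fast, so use O_DIRECT or drop the cache to see the raw device difference):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096
#define COUNT 1024                            /* 1024 × 4KB = 4 MB total */

static double timed_reads(int fd, off_t nblocks, int random_order) {
    char buf[BLOCK];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < COUNT; i++) {
        off_t blk = random_order ? (rand() % nblocks) : i;   /* which 4KB block */
        if (pread(fd, buf, BLOCK, blk * (off_t)BLOCK) < 0) { perror("pread"); exit(1); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int fd = open("data.bin", O_RDONLY);      /* placeholder: any large file */
    if (fd < 0) { perror("open"); return 1; }
    off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK;
    printf("sequential: %.3f s\n", timed_reads(fd, nblocks, 0));
    printf("random:     %.3f s\n", timed_reads(fd, nblocks, 1));
    close(fd);
    return 0;
}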
SSD (Solid State Drive)
Storage: NAND flash memory cells
SLC (Single Level Cell): 1 bit/cell — fast, durable, expensive
MLC (Multi Level Cell): 2 bits/cell — balanced
TLC (Triple Level Cell): 3 bits/cell — slow, less durable, cheap
QLC (Quad Level Cell): 4 bits/cell — very slow, cheap
No moving parts → no seek time, no rotational latency
Access:
Random read: 10-100 µs (roughly 100-1,000× faster than an HDD's ~10ms)
Sequential: 500-7000 MB/s
IOPS (I/O Operations Per Second):
HDD: 100-200 IOPS
SATA SSD: 50,000-100,000 IOPS
NVMe SSD: 500,000-1,000,000+ IOPS
NVMe (Non-Volatile Memory Express):
Protocol designed specifically for SSDs (unlike SATA, designed for HDDs)
Connects via PCIe directly — no legacy overhead
Supports up to ~64K command queues, each ~64K commands deep (SATA/AHCI: 1 queue, 32 commands)
SSD Internals:
Pages: smallest unit of read/write (4KB-16KB)
Blocks: smallest unit of ERASE (256KB-1MB)
Critical constraint: must erase entire block before rewriting!
Write: can write to any empty page O(1)
Overwrite (worst case): read the whole block, erase it, write back the modified block
"Write amplification": a 4KB logical write can cause 256KB of flash I/O
FTL (Flash Translation Layer):
Maps logical block addresses (what OS sees) to physical flash locations
Implements wear leveling: spreads writes across all cells
(Flash cells wear out after 1,000-100,000 program/erase cycles)
Implements garbage collection: reclaims freed space
Hides all complexity from OS
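A toy model of the FTL's core idea: an overwrite never erases in place; it writes the new data to a fresh page, repoints the logical-to-physical mapping, and marks the old page as garbage for later collection. The sketch below is deliberately simplified (no blocks, no real wear leveling or garbage collection):

#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES 1024

static uint32_t l2p[NUM_PAGES];          /* logical page -> physical page       */
static bool     page_valid[NUM_PAGES];   /* does this physical page hold live data? */
static uint32_t next_free = 0;           /* naive "append to the next free page" */

/* Returns the physical page where the new data actually landed. */
uint32_t ftl_write(uint32_t logical_page) {
    uint32_t old = l2p[logical_page];
    if (page_valid[old])
        page_valid[old] = false;         /* old copy becomes garbage            */
    uint32_t phys = next_free++;         /* a real FTL picks pages so writes    */
                                         /* spread evenly (wear leveling)       */
    l2p[logical_page] = phys;
    page_valid[phys] = true;
    return phys;
}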
NVM (Non-Volatile Memory) / Persistent Memory
Intel Optane (3D XPoint):
Byte-addressable (unlike flash which has page/block structure)
Speed: ~300ns latency (between DRAM and NAND)
Endurance: much better than NAND
Used as: fast storage tier, or as DRAM extension
Status: Intel discontinued Optane in 2022
Part 4 — I/O in the OS
Device Drivers
Software architecture:
┌────────────────────────────────┐
│ Application (printf, etc.) │
├────────────────────────────────┤
│ System Call Interface │ (read, write, ioctl)
├────────────────────────────────┤
│ Virtual File System (VFS) │ (unified interface)
├────────────────────────────────┤
│ File System (ext4, NTFS...) │ (directories, files)
├────────────────────────────────┤
│ Block I/O Layer │ (scheduling, caching)
├────────────────────────────────┤
│ Device Driver (kernel code) │ (hardware-specific)
├────────────────────────────────┤
│ Hardware (disk controller) │
└────────────────────────────────┘
I/O Scheduling
Multiple processes competing for the disk. In what order to serve them?
FIFO (First Come First Served):
Requests: cylinder 98, 183, 37, 122, 14, 124, 65, 67
Starting head position: 53
Order: 53→98→183→37→122→14→124→65→67
Movement: 45+85+146+85+108+110+59+2 = 640 cylinders
SCAN (Elevator Algorithm):
Head moves in one direction, serves all requests on the way,
then reverses.
53→65→67→98→122→124→183→37→14
Movement: (183-53) + (183-14) = 299 cylinders, far less than FIFO's 640
Like an elevator: doesn't go back to floor 1 just because
someone pressed it — continues up first, then comes back down.
C-SCAN (Circular SCAN):
Like SCAN but on return sweep, jumps back to start without serving
Provides more uniform wait times
Used in many OS schedulers
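A minimal sketch of the SCAN ordering for the request list above (head at cylinder 53, sweeping toward higher cylinders first); it also confirms the total head movement:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* SCAN / elevator: serve everything at or above the head on the way up,
   then sweep back down for the rest. */
int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67};
    int n = sizeof req / sizeof req[0];
    int head = 53, pos = head, moved = 0;

    qsort(req, n, sizeof req[0], cmp_int);

    for (int i = 0; i < n; i++)              /* upward sweep */
        if (req[i] >= head) {
            moved += req[i] - pos;
            pos = req[i];
            printf("%d ", pos);
        }
    for (int i = n - 1; i >= 0; i--)         /* downward sweep */
        if (req[i] < head) {
            moved += pos - req[i];
            pos = req[i];
            printf("%d ", pos);
        }
    printf("\ntotal head movement: %d cylinders (vs 640 for FIFO)\n", moved);
    return 0;
}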
Modern SSDs: scheduling less important
NVMe SSDs have near-uniform access time for all locations
No physical head to move
OS uses simpler scheduling (often none for NVMe)
Focus: merge small I/Os, maintain ordering where needed
Part 5 — The I/O Stack in Practice
Buffered vs Direct I/O
// Buffered I/O (default in OS):
FILE* f = fopen("data.bin", "rb");
fread(buffer, 4096, 1, f);
// OS caches the page in the page cache
// Second fread of same data → served from RAM (page cache hit)
// Direct I/O (O_DIRECT):
int fd = open("data.bin", O_RDONLY | O_DIRECT);
read(fd, buffer, 4096);
// Bypasses page cache, goes directly to hardware
// Used by: databases (manage their own buffer cache)
// video streaming (huge sequential data, caching useless)
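One practical wrinkle: O_DIRECT generally requires the buffer (and often the offset and length) to be aligned to the device's logical block size. A minimal sketch using posix_memalign, assuming 4096-byte alignment (query the device for the real value):

#define _GNU_SOURCE                      /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buffer;
    if (posix_memalign(&buffer, 4096, 4096) != 0)   /* aligned buffer required */
        return 1;

    ssize_t n = read(fd, buffer, 4096);  /* bypasses the page cache */
    printf("read %zd bytes directly from the device\n", n);

    free(buffer);
    close(fd);
    return 0;
}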
Synchronous vs Asynchronous I/O
// Synchronous (blocking):
ssize_t bytes = read(fd, buffer, 4096);
// Thread blocks here until data is ready
// Simple but wastes CPU time
// Asynchronous (non-blocking), POSIX AIO from <aio.h>:
struct aiocb cb = { .aio_fildes = fd,
                    .aio_buf    = buffer,
                    .aio_nbytes = 4096 };
aio_read(&cb);
// Returns immediately!
// Do other work...
while (aio_error(&cb) == EINPROGRESS) { /* poll or wait */ }
ssize_t bytes = aio_return(&cb);
// Data is now in buffer
// io_uring (modern Linux, extremely fast):
// Zero-copy, no syscall per operation, batch submissions
// Used by: Nginx, databases for high-performance I/O
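A minimal sketch of the same 4KB read using the liburing helpers (error handling omitted; build with -luring; the file name is a placeholder):

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);            /* ring with 8 entries      */

    int fd = open("data.bin", O_RDONLY);
    char buffer[4096];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buffer, sizeof buffer, 0);  /* offset 0      */
    io_uring_submit(&ring);                      /* one syscall, can batch   */

    /* ...do other work while the kernel performs the read...                */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              /* wait for completion      */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}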
Memory-Mapped I/O (MMIO)
Device registers accessible via memory addresses.
// Device registers appear at specific physical addresses
// After mapping: read/write to pointer = read/write device register
// Example: control a hardware timer
volatile uint32_t* timer_control = (volatile uint32_t*)0xFE003000;
volatile uint32_t* timer_value = (volatile uint32_t*)0xFE003004;
*timer_control = 0x00000001; // start timer (write device register)
uint32_t elapsed = *timer_value; // read elapsed time
// volatile: tells compiler NOT to cache this in a register
// (hardware can change the value at any time)
Part 6 — RAID (Redundant Array of Independent Disks)
Combine multiple disks for performance, capacity, or reliability.
RAID 0 — Striping:
[Disk 1] [Disk 2] [Disk 3]
Block0 Block1 Block2
Block3 Block4 Block5
Performance: 3× read and write speed (parallel)
Capacity: 3× single disk
Fault tolerance: NONE — one disk fails → all data lost
Use case: video editing scratch disk, temp data
RAID 1 — Mirroring:
[Disk 1] [Disk 2] (identical copies)
Block0 Block0
Block1 Block1
Performance: 2× read speed (read from either disk)
same write speed (write both simultaneously)
Capacity: same as one disk (50% efficiency)
Fault tolerance: survives 1 disk failure
Use case: OS drives, critical data
RAID 5 — Striping with Parity:
[Disk 1]     [Disk 2]     [Disk 3]     [Disk 4]
Block0       Block1       Block2       Parity0-2
Block3       Block4       Parity3-5    Block5
Block6       Parity6-8    Block7       Block8
Parity = XOR of all data blocks in stripe
Can reconstruct any missing block from parity
Performance: ~3× read; writes slower (each small write reads old data + old parity, then writes new data + new parity)
Capacity: N-1 disks (3/4 efficiency for 4 disks)
Fault tolerance: survives 1 disk failure
Use case: most common for servers
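The parity math is plain XOR: the parity block is the XOR of the data blocks in the stripe, and XOR-ing the parity with the surviving blocks regenerates the missing one (x ^ x = 0 cancels everything except the lost block). A minimal sketch for one stripe of a 4-disk array:

#include <stdint.h>

#define BLOCK 4096

/* parity = block0 XOR block1 XOR block2 */
void compute_parity(const uint8_t *b0, const uint8_t *b1, const uint8_t *b2,
                    uint8_t *parity) {
    for (int i = 0; i < BLOCK; i++)
        parity[i] = b0[i] ^ b1[i] ^ b2[i];
}

/* Disk holding b1 died: XOR the parity with the surviving blocks to rebuild it. */
void reconstruct(const uint8_t *b0, const uint8_t *b2, const uint8_t *parity,
                 uint8_t *recovered_b1) {
    for (int i = 0; i < BLOCK; i++)
        recovered_b1[i] = parity[i] ^ b0[i] ^ b2[i];
}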
RAID 6 — Double Parity:
Like RAID 5 but with 2 parity blocks
Survives 2 simultaneous disk failures
N-2 disks capacity
RAID 10 — Stripe of Mirrors:
Mirror pairs, then stripe across pairs
Survives 1 failure per mirror pair
50% capacity, fastest writes with redundancy
Practice Problems
1. A 7200 RPM HDD has an average seek time of 8ms. What is the average rotational latency? What is the total average random access time?
2. Calculate NVMe SSD vs HDD throughput for random 4KB reads:
   - HDD: 150 IOPS
   - NVMe SSD: 700,000 IOPS
   How many times faster is the SSD?
3. You have 4 × 4TB disks. Compare total usable storage for RAID 0, 1, 5, and 10.
4. Why does a database (like PostgreSQL) use O_DIRECT I/O instead of buffered I/O?
Answers
Problem 1:
7200 RPM = 7200 rotations per minute = 120 rotations per second
One full rotation = 1/120 seconds = 8.33 ms
Average rotational latency = half rotation = 4.17 ms
Total avg random access = seek + rotation + transfer
= 8ms + 4.17ms + 0.1ms ≈ 12.3ms
Problem 2:
HDD: 150 IOPS × 4KB = 600 KB/s
NVMe: 700,000 IOPS × 4KB = 2,800,000 KB/s = 2.8 GB/s
SSD is 700,000/150 ≈ 4,667× faster in IOPS
Problem 3:
4 × 4TB disks:
RAID 0: 4 × 4TB = 16TB (no fault tolerance)
RAID 1: 4TB (strict mirroring: all four disks hold identical data, so only one disk's worth is usable)
RAID 5: (4-1) × 4TB = 12TB (one disk for parity)
RAID 10: 2 mirror pairs, striped = 2 × 4TB = 8TB
Problem 4:
PostgreSQL manages its own buffer cache (shared_buffers).
If it used buffered I/O, data would be cached TWICE:
- Once in PostgreSQL's buffer pool
- Once in the OS page cache
This wastes RAM and adds overhead.
With O_DIRECT:
- PostgreSQL controls exactly what's cached and when
- No redundant copies
- Can implement database-specific eviction policies
- Avoids double-copy: disk → OS cache → userspace buffer
With O_DIRECT: disk → userspace buffer (one copy)
References
- Patterson & Hennessy — Computer Organization and Design — Chapter 5
- Bryant & O’Hallaron — CS:APP — Chapter 10 (System-Level I/O)
- Linux I/O — Brendan Gregg’s Linux Performance
- NVMe Spec — nvmexpress.org
- Storage Performance — USENIX FAST conference