Virtual Memory

Computer Architecture

Virtual Memory

The Illusion of Infinite, Private Memory

Every process on your computer believes it has the entire address space to itself. Your browser thinks it owns addresses 0 to 2^64. Your code editor thinks the same. They’re both wrong — but the illusion is perfect, and it’s one of the most elegant ideas in systems design.

Real-world analogy: Hotel room numbers. Your room is “204” — that’s your virtual address. But physically, room 204 might be anywhere in the building. The hotel (OS + MMU) maps your room number to a physical location. Two hotels can both have a “Room 204” — no conflict, because they’re in different physical locations. Processes work the same way.

Part 1 — Why Virtual Memory?

Problem 1 — Not Enough RAM

System has 8 GB RAM
Photoshop needs:     2 GB
Chrome needs:        3 GB
VS Code needs:       1 GB
Other processes:     1 GB
Total needed:        7 GB  (just fits)

But what about running more programs?
Virtual memory allows programs to use more memory than physically available
by storing some pages on disk (swap space).

Problem 2 — Memory Protection

Without virtual memory:
  Process A stores password at address 0x1000
  Process B reads address 0x1000 → steals password!

With virtual memory:
  Process A's virtual address 0x1000 → Physical frame 42
  Process B's virtual address 0x1000 → Physical frame 99 (different!)
  
  Each process has its own completely separate virtual address space.
  Process B literally cannot access Process A's memory.

Problem 3 — Fragmentation

Without virtual memory:
  RAM has 1GB free but in 1MB chunks scattered across memory
  A program needing 512MB contiguous memory can't run!

With virtual memory:
  Program sees 512MB of contiguous virtual addresses
  Physically, those pages are scattered throughout RAM
  The page table handles the non-contiguous mapping transparently

Part 2 — Address Translation

Pages and Frames

Memory is divided into fixed-size chunks:

Virtual memory → divided into pages (typically 4KB)
Physical memory → divided into frames (same size as pages)

Virtual Address Space (per process):     Physical Memory (shared):
┌─────────────┐                          ┌─────────────┐
│ Page 0      │ ────────────────────────►│ Frame 7     │
│ (4KB)       │                          │ (4KB)       │
├─────────────┤                          ├─────────────┤
│ Page 1      │ ──────────┐              │ Frame 0     │
│             │            │             ├─────────────┤
├─────────────┤            └────────────►│ Frame 12    │
│ Page 2      │  (on disk)               ├─────────────┤
│             │──  - - - - - - - - - -   │ Frame 3     │
├─────────────┤                          ├─────────────┤
│   ...       │                          │   ...       │

Why 4KB pages? - Large enough: amortizes translation overhead - Small enough: doesn’t waste memory for small allocations - Power of 2: efficient bit-manipulation for address splitting

Virtual Address Structure

64-bit virtual address (x86-64 uses 48 bits effectively):

┌─────────┬─────────┬─────────┬─────────┬──────────────┐
│ PML4    │  PDPT   │   PD    │   PT    │    Offset    │
│ [47:39] │ [38:30] │ [29:21] │ [20:12] │   [11:0]     │
│  9 bits │  9 bits │  9 bits │  9 bits │   12 bits    │
└─────────┴─────────┴─────────┴─────────┴──────────────┘
  512      512       512       512        4096 bytes
  entries  entries   entries   entries    per page

4-level page table hierarchy!
Each level has 512 entries (2^9)
Total: 512^4 = 2^36 pages × 4KB = 256 TB addressable

12-bit offset: covers 2^12 = 4096 bytes within a page ✓

Page Table Walk

Virtual address: 0x00007FF1A3B4C5D6

Break into parts:
  PML4 index:   bits [47:39] = 0x00F  (15)
  PDPT index:   bits [38:30] = 0x068  (104)
  PD index:     bits [29:21] = 0x11D  (285)
  PT index:     bits [20:12] = 0x1B4  (436)
  Page offset:  bits [11:0]  = 0x5D6  (1494)

Translation:
1. CR3 register → physical address of PML4 table
2. PML4[15]     → physical address of PDPT table
3. PDPT[104]    → physical address of PD table
4. PD[285]      → physical address of PT table
5. PT[436]      → physical frame number (PFN)
6. Physical address = PFN × 4096 + 0x5D6

Page Table Entry (PTE)

64-bit PTE structure:
┌─────────────────────────────┬─────────────────────────────┐
│      Physical Frame Number  │          Flags               │
│         [51:12]             │          [11:0]              │
└─────────────────────────────┴─────────────────────────────┘

Key flags:
  P (Present):    1 = page in RAM, 0 = on disk (page fault on access)
  R/W:            0 = read-only, 1 = read-write
  U/S:            0 = kernel only, 1 = user accessible
  A (Accessed):   set by hardware when page is read
  D (Dirty):      set by hardware when page is written
  NX:             1 = no-execute (prevents code injection)
  G (Global):     don't flush from TLB on context switch (kernel pages)

Part 3 — TLB (Translation Lookaside Buffer)

The Problem: Translation is Slow

Without a TLB, every memory access requires 4 additional memory accesses (the page table walk). A 4-cycle operation becomes 5 × 60 ns = 300 ns. Unusable.

Solution: Cache recent translations in a tiny, fast hardware structure — the TLB.

TLB Structure

TLB (fully associative, 64-1024 entries):
┌──────────┬─────────────────┬────────┐
│  VPN     │    PFN          │ Flags  │
│(virtual  │  (physical      │ PRWAD  │
│page num) │   frame num)    │       │
├──────────┼─────────────────┼────────┤
│ 0x7F1A3  │    0x0042F      │ 11110  │
│ 0x00001  │    0x00001      │ 11010  │
│ ...      │    ...          │ ...   │
└──────────┴─────────────────┴────────┘

Lookup: hash VPN, check for match (fully associative = check all entries)
Hit time: 1-2 cycles (same level as L1 cache)

TLB Hit vs Miss

TLB Hit (>99% of accesses):
  Virtual address → TLB lookup (1-2 cycles) → Physical address → memory
  Fast! Like cache hit.

TLB Miss (<1% of accesses):
  Virtual address → TLB lookup MISS → Page table walk (4 memory accesses)
  Hardware page table walker (x86) or software handler (RISC-V)
  → Load PTE into TLB → retry memory access
  Cost: 200+ cycles

TLB Reach = TLB entries × page size
  64 entries × 4KB = 256KB
  If working set > 256KB: frequent TLB misses

Huge Pages (2MB or 1GB pages):
  64 entries × 2MB = 128MB reach
  Huge pages dramatically reduce TLB misses for large datasets

Context Switch and TLB

When OS switches between processes:
  Old process: VPN 0x1000 → PFN 0x500
  New process: VPN 0x1000 → PFN 0x200  (completely different mapping!)

Options:
1. Flush entire TLB on context switch
   Cost: next process starts with 0 TLB entries → many misses
   
2. ASID (Address Space ID) tag each TLB entry
   TLB entry: [ASID | VPN | PFN | flags]
   Process A uses ASID=1, Process B uses ASID=2
   No flush needed — entries from different ASIDs coexist
   x86 calls this PCID (Process Context ID)

Part 4 — Page Faults

When a page is not in RAM (P bit = 0 in PTE), a page fault occurs.

Page Fault Handler

1. CPU detects P=0 in PTE → triggers page fault exception
2. Hardware saves CPU state (registers, PC, etc.)
3. OS page fault handler runs:
   a. Is this a valid virtual address? (In process's VMA)
      NO → Segmentation fault (SIGSEGV) → process killed
      YES → continue
   b. Find the page on disk (swap space or file)
   c. If RAM is full: evict a page (write dirty page to disk if needed)
   d. Load the needed page from disk into a free frame
   e. Update PTE: set P=1, set frame number
   f. Update TLB
   g. Resume faulting instruction (re-executes, now succeeds)

Total time: ~1-10 ms (disk access dominates)
vs. normal memory access: 60-100 ns
Page fault = 10,000-100,000× slower!

Types of Page Faults

Minor (soft) fault: page exists in physical memory, just not mapped
  - Stack growth: just allocate a new frame, no disk I/O
  - Copy-on-write: just copy the frame, no disk I/O
  - Cost: ~1-10 µs

Major (hard) fault: page must be loaded from disk
  - First access to mmap'd file
  - Swapped-out page being accessed again
  - Cost: 1-10 ms

Part 5 — Page Replacement Algorithms

When RAM is full and a page must be evicted, which page to choose?

OPT (Optimal) — Theoretical Best

Evict the page that will be used furthest in the future.

Pages: A B C D A B E A B C D E
Frames: 3

Load A: [A _ _]  miss
Load B: [A B _]  miss
Load C: [A B C]  miss
Load D: [D B C]  miss  evict A (used furthest: position 4)
Load A: [D A C]  miss  evict B (used furthest: position 5)
Load B: [D A B]  miss  evict C (never used again!)
Load E: [E A B]  miss  evict D (never used again!)
Load A: [E A B]  HIT
...

OPT is optimal but requires knowing the future → impossible in practice
Used as a benchmark to compare other algorithms

LRU (Least Recently Used)

Evict the page used longest ago. Approximates OPT well.

Pages: A B C D A B E A B C D E
Frames: 3

Load A: [A _ _]  miss
Load B: [A B _]  miss
Load C: [A B C]  miss
Load D: [D B C]  miss  evict A (LRU: longest ago)
Load A: [D A C]  miss  evict B
Load B: [D A B]  miss  evict C
Load E: [E A B]  miss  evict D (LRU)
Load A: [E A B]  HIT
Load B: [E A B]  HIT
Load C: [E C B]  miss  evict A
Load D: [E C D]  miss  evict B
Load E: [E C D]  HIT
Misses: 8

LRU is near-optimal but expensive to implement exactly in hardware.

Approximate LRU — Clock Algorithm:

Pages arranged in circle. "Clock hand" sweeps around.
Each page has a reference bit (R), set by hardware on access.

On page fault:
  While current page's R=1:
    Set R=0 (give second chance)
    Advance hand
  Evict current page (R=0, not used since last sweep)

This is the algorithm Linux actually uses!

FIFO (First In, First Out)

Evict the oldest loaded page. Simple but poor performance.

Belady's Anomaly: adding more frames can cause MORE misses with FIFO!
(This anomaly does NOT occur with LRU or OPT)

Comparison

Algorithm    Misses (example above)  Implementation
OPT          6                       Impossible (needs future)
LRU          8                       Expensive (stack or counter)
Clock        ~8-10                   O(1), used in practice
FIFO         12                      O(1) but poor performance

Part 6 — Memory-Mapped Files

Files can be mapped directly into virtual address space — no explicit read/write calls needed.

#include <sys/mman.h>

// Map a file into virtual memory
int fd = open("data.bin", O_RDONLY);
size_t fileSize = getFileSize(fd);

void* ptr = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
// ptr now points to the file contents in virtual memory

// Access file like a regular array
int* data = (int*)ptr;
printf("%d\n", data[0]);    // reads first 4 bytes of file
printf("%d\n", data[1000]); // reads bytes 4000-4003

// OS uses demand paging: pages loaded only when accessed
// Perfect for large files — only accessed parts use RAM

// Unmap when done
munmap(ptr, fileSize);
close(fd);

Advantages:

1. Demand paging: only read parts actually accessed
2. OS caches pages — multiple processes can share the same physical pages
3. Simpler code: no read() loops, no buffer management
4. Can be faster than read() for random access patterns

Part 7 — The Memory of a Process

Virtual address space layout (Linux x86-64):

0xFFFFFFFFFFFFFFFF ┌─────────────────┐
                   │   Kernel Space  │  Not accessible to user processes
0xFFFF800000000000 ├─────────────────┤
                   │                 │
                   │   (unmapped)    │
                   │                 │
0x00007FFFFFFFFFFF ├─────────────────┤
                   │     Stack       │  Grows downward
                   │       ↓        │
                   ├─────────────────┤
                   │       ↑        │
                   │  Memory Maps   │  mmap(), shared libs
                   ├─────────────────┤
                   │       ↑        │
                   │      Heap      │  malloc() grows upward
                   ├─────────────────┤
                   │   BSS segment  │  Uninitialized global variables
                   ├─────────────────┤
                   │  Data segment  │  Initialized global/static variables
                   ├─────────────────┤
                   │  Text segment  │  Executable code (read-only)
0x0000000000400000 └─────────────────┘

Practice Problems

A system has 4KB pages and 32-bit virtual addresses. How many pages in the virtual address space?
A TLB has 64 entries with 4KB pages. What is the TLB reach? What percentage of a 1MB working set is covered?
Given reference string: 1 2 3 4 1 2 5 1 2 3 4 5 with 3 frames, calculate misses for FIFO and LRU.
Why does stack overflow cause a segfault (not just a page fault)?

Answers

Problem 1:

Virtual address space = 2^32 = 4 GB
Page size = 4 KB = 2^12
Number of pages = 2^32 / 2^12 = 2^20 = 1,048,576 pages (1M pages)

Problem 2:

TLB reach = 64 entries × 4KB = 256 KB
1MB working set coverage = 256KB / 1024KB = 25%
Only 25% covered → many TLB misses for a 1MB working set.
Solution: use huge pages (2MB) → 64 × 2MB = 128MB reach

Problem 3:

Reference: 1 2 3 4 1 2 5 1 2 3 4 5  (3 frames)

FIFO:
1:[1_ _] M  2:[12 _] M  3:[123] M  4:[423] M  1:[413] M  2:[412] M
5:[512] M  1:[512] H  2:[512] H  3:[312] M  4:[314] M  5:[514] M
FIFO misses: 9

LRU:
1:[1_ _] M  2:[12 _] M  3:[123] M  4:[124] M  1:[124] H  2:[124] H
5:[524] M  1:[512] H  2:[512] H  3:[512] M  → 3:[312] H wait...
Actually:
After 5→[524]: 1=[524]H, 2=[524]H, 3=[523]M(evict 4), 4=[423]M(evict 5), 5=[423]H
LRU misses: 8

Problem 4:

The stack has a maximum size (typically 8MB on Linux, ulimit -s).
Above the stack is unmapped virtual address space.
When the stack grows beyond its limit, it accesses an unmapped VMA.
The OS sees the fault is for an invalid VMA → sends SIGSEGV.
There's a small "guard page" at the stack boundary with no permissions
specifically to catch stack overflow and generate SIGSEGV immediately.

References

Patterson & Hennessy — Computer Organization and Design — Chapter 5.7
Bryant & O’Hallaron — Computer Systems: A Programmer’s Perspective — Chapter 9
Linux kernel — virtual memory documentation
MIT 6.004 — Virtual Memory lecture