File Systems

Operating Systems

File Systems

Persistent, Organized Storage

A file system gives you the illusion of organized, named, persistent storage sitting on top of what is physically just a bunch of numbered disk sectors. It’s one of the OS’s most elegant abstractions.

Real-world analogy: Think of a file system as a library. The library building is the hard disk. Each book is a file. The library catalog (card catalog or digital index) is the file system’s metadata structures. The Dewey Decimal System is the directory hierarchy. A librarian who knows where every book is located corresponds to the OS file system driver. Without the catalog, you’d have to search every shelf to find any book.

Part 1 — File System Concepts

What is a File?

A file is a named, persistent sequence of bytes managed by the OS.

From the user's perspective:
  - Has a name (path)
  - Has contents (bytes)
  - Has metadata (size, owner, permissions, timestamps)

From the OS perspective:
  - Stored as blocks on disk
  - Has an inode (metadata structure)
  - Accessed via file descriptor
  - Can be type: regular, directory, symlink, device, pipe, socket

File operations:

// Open returns a file descriptor (small integer)
int fd = open("file.txt", O_RDONLY);           // open existing
int fd = open("file.txt", O_WRONLY|O_CREAT, 0644); // create/open for write

// Read and write move the file offset
ssize_t n = read(fd, buffer, 1024);  // read up to 1024 bytes
ssize_t n = write(fd, buffer, 1024); // write 1024 bytes

// Seek changes the file offset
off_t pos = lseek(fd, 0, SEEK_SET);  // seek to beginning
off_t pos = lseek(fd, 0, SEEK_END);  // seek to end (get file size)
off_t pos = lseek(fd, -10, SEEK_CUR); // seek 10 bytes backward

// Metadata
struct stat st;
fstat(fd, &st);
printf("size=%ld, permissions=%o\n", st.st_size, st.st_mode);

// Close when done
close(fd);

Directories

Directories map names to inode numbers. A directory is a special file containing name→inode pairs.

Directory /home/user/:
┌─────────────────────────────────┐
│  Name         Inode Number      │
│  .            1234  (itself)    │
│  ..           1000  (parent)    │
│  documents    1235              │
│  photos       1236              │
│  code         1237              │
│  resume.pdf   1238              │
└─────────────────────────────────┘

Path resolution:
  /home/user/documents/notes.txt
  
  1. Start at root /  → inode 2 (root always inode 2)
  2. Look up "home"   → inode 1000
  3. Look up "user"   → inode 1234
  4. Look up "documents" → inode 1235
  5. Look up "notes.txt" → inode 5678
  6. Read inode 5678 to find disk blocks
  7. Read those blocks = file contents

Part 2 — Inode (Index Node)

The core metadata structure for every file. On disk, the inode stores everything about a file except its name.

inode contents:
┌────────────────────────────────────────────────────┐
│  File type:      regular, directory, symlink...    │
│  Permissions:    rwxrwxrwx (owner, group, other)   │
│  Owner UID:      1000 (user ID)                    │
│  Owner GID:      1000 (group ID)                   │
│  File size:      1048576 (bytes)                   │
│  Link count:     2 (number of directory entries)   │
│  atime:          last access timestamp             │
│  mtime:          last modification timestamp       │
│  ctime:          last status change timestamp      │
│  Block count:    2048 (512-byte blocks used)       │
│                                                    │
│  Block pointers:                                   │
│    Direct[0]:  block 1024   ─────────────────────► │
│    Direct[1]:  block 1025                          │
│    ...                                             │
│    Direct[11]: block 1035                          │
│    Indirect:   block 2000  → [ptr][ptr][ptr]...    │
│    Double ind: block 3000  → [...] → [ptr]...      │
│    Triple ind: block 4000  → [...] → [...] → ...   │
└────────────────────────────────────────────────────┘

Block Pointers

How inodes point to file data blocks (ext2/ext3 style):

Small files (≤ 12 blocks ≈ 48KB with 4KB blocks):
  12 direct pointers → 12 × 4KB = 48KB

Larger files use indirect blocks:
  1 indirect pointer → block of 1024 pointers → 1024 × 4KB = 4MB
  1 double indirect → 1024 × 1024 × 4KB = 4GB
  1 triple indirect → 1024³ × 4KB = 4TB

Total max file size (ext2):
  12 × 4KB + 4MB + 4GB + 4TB ≈ 4TB

Extent-based inodes (ext4, APFS, NTFS):

Instead of individual block pointers, store extents:
  extent = (start_block, length)
  
  A 10MB file stored contiguously:
    Old: 2560 block pointers
    New: 1 extent: (start=5000, length=2560)
  
  Advantages:
    - Fewer metadata reads for large sequential files
    - Less fragmentation (allocator tries to keep extents contiguous)
    - Faster large file operations

Part 3 — File System Layout on Disk

ext2/ext3 layout:

Disk divided into block groups:

Block Group 0:
┌──────────────┬──────────────┬──────────────┬──────────────┐
│  Superblock  │ Group Desc.  │  Inode Table │  Data Blocks │
│  (FS info)   │  (bg info)   │  (all inodes)│  (file data) │
└──────────────┴──────────────┴──────────────┴──────────────┘
Block Group 1:
┌──────────────┬──────────────┬──────────────┬──────────────┐
│  Superblock  │ Group Desc.  │  Inode Table │  Data Blocks │
│  (backup)    │  (backup)    │              │              │
└──────────────┴──────────────┴──────────────┴──────────────┘
...

Superblock contains:
  - Total inode count, free inode count
  - Total block count, free block count
  - Block size (1KB, 2KB, 4KB)
  - Magic number (0xEF53 for ext2)
  - Last mount time, mount count

Inode Table:
  Fixed-size array of inodes
  inode number → table index → inode on disk
  
Block Bitmap:
  One bit per data block: 0=free, 1=used
  Fast allocation: scan for first 0 bit

Inode Bitmap:
  One bit per inode: 0=free, 1=used

Part 4 — Journaling

The crash consistency problem:

Creating a file requires multiple disk writes:
  1. Update inode bitmap (mark inode used)
  2. Write inode (initialize metadata)
  3. Write directory entry (name → inode mapping)
  4. Write file data
  
If power fails after step 1 but before step 3:
  Inode marked used but no directory points to it → leaked inode!
  
If power fails after step 3 but before step 2:
  Directory points to uninitialized inode → garbage data!
  
Inconsistency!

Solution: Journaling (write-ahead logging)

Before writing to the actual filesystem, write the operation
to a journal (log) first.

Steps:
  1. Write "BEGIN" marker to journal
  2. Write all planned changes to journal
  3. Write "COMMIT" marker to journal
  4. Actually apply changes to filesystem
  5. Write "DONE" to journal (can now reuse journal space)

On crash recovery:
  Find incomplete operations in journal
  If COMMITTED but not DONE: redo the operation
  If not COMMITTED: ignore (incomplete transaction)
  
  File system is always consistent after recovery!

Journal modes (ext4):
  data=writeback: only metadata journaled (fastest, some data risk)
  data=ordered:   data written before metadata (default, safe)
  data=journal:   both data and metadata journaled (slowest, safest)

Journaling adds ~5-10% overhead but eliminates fsck on every crash.
Before journaling: after crash → must run fsck on entire disk (hours!)
With journaling: after crash → replay journal → ready in seconds

Part 5 — Hard Links and Symbolic Links

Hard link:

Multiple directory entries pointing to the SAME inode.
The inode's link count tracks how many entries point to it.

ln file.txt hardlink.txt
# Both entries point to same inode

file.txt ──────────────────► inode 1234
hardlink.txt ───────────────►    (link count = 2)

Deletion:
  rm file.txt → link count becomes 1, inode not freed
  rm hardlink.txt → link count becomes 0, inode freed, blocks freed

Limitations:
  - Can't hard link directories (would create cycles)
  - Can't cross filesystem boundaries (inode numbers are per-filesystem)
  - Both names are equal — no "original" vs "link"

Symbolic link (symlink):

A special file whose content is a path to another file.

ln -s /home/user/documents/notes.txt shortcut.txt
# shortcut.txt is a file containing the string "/home/user/documents/notes.txt"

shortcut.txt (inode 9999) contains: "/home/user/documents/notes.txt"
↓ follow symlink
notes.txt (inode 1234) = actual file

Differences from hard link:
  - Can point to directories
  - Can cross filesystem boundaries
  - The target may not exist (dangling symlink)
  - Symlink has its own inode, own link count
  - Following symlink involves extra path traversal
  - stat() follows symlink; lstat() doesn't

rm notes.txt → symlink still exists but dangling (broken)

Part 6 — File System Implementations

FAT (File Allocation Table)

Used in USB drives, SD cards, historical Windows.

Disk layout:
┌─────────┬─────────┬──────────────────────────────┐
│  Boot   │   FAT   │           Data Area           │
│  sector │  table  │  cluster 2, 3, 4, 5, ...     │
└─────────┴─────────┴──────────────────────────────┘

FAT table: one entry per cluster
  0x000:    free
  0x002-FFx: next cluster in file
  0xFFF:    end of file

File "hello.txt" stored in clusters 2, 5, 7:
  FAT[2] = 5     (next cluster after 2 is 5)
  FAT[5] = 7     (next cluster after 5 is 7)
  FAT[7] = 0xFFF (end of file)

Simple linked list on disk.
Pros: simple, universal, no fragmentation check needed
Cons: FAT must be read into RAM; slow for large disks;
      no permissions, no journaling, 4GB file size limit (FAT32)

ext4 (Linux Standard)

Modern features added to ext2/ext3 design:
  - Extent-based allocation (contiguous ranges vs individual blocks)
  - Delayed allocation (batch allocations for better placement)
  - Journaling (data=ordered by default)
  - Large file support (up to 16TB files, 1EB filesystem)
  - Extents tree in inode (replaces indirect blocks for large files)
  - Checksums on journal and metadata
  - Supports Unix permissions, ACLs
  
Performance features:
  - Multiblock allocator (allocates multiple contiguous blocks at once)
  - Persistent preallocation (guarantee space before writing)
  - Online defragmentation

NTFS (Windows)

Master File Table (MFT):
  Every file and directory = one MFT record (1KB each)
  First 16 records reserved for metadata
  MFT can grow as files added
  
MFT record structure:
┌────────────────────────────────────────────┐
│ Header (magic, flags, size)                │
│ $STANDARD_INFORMATION: times, flags        │
│ $FILE_NAME: name, parent directory         │
│ $DATA: file data (inline if < ~700 bytes!) │
│         or run list for larger files       │
└────────────────────────────────────────────┘

Run list (like extents):
  File stored as list of runs (contiguous blocks)
  (start_cluster, length) pairs

Features:
  - ACLs (Access Control Lists) for permissions
  - Encryption (EFS)
  - Compression
  - Journaling (change journal)
  - Hard links and junctions (directory hard links)
  - Max file size: 16TB, max volume: 256TB

Copy-on-Write File Systems (ZFS, BTRFS, APFS)

Traditional file system: update in place
  Old data: [A][B][C] → write B' → [A][B'][C]
  Problem: interrupted write = partial update = inconsistency

COW file system: never overwrite
  Old: [A][B][C]
  Write B': [A][B][C][B'] → update pointers → [A][B'][C]
  Old [B] still exists until pointer removed

Benefits:
  - No corruption: partially written state is never exposed
  - Snapshots are FREE: just preserve old root pointer
  - Data integrity: checksums on every block
  - No need for journal (COW IS the atomicity mechanism)

Snapshots example (ZFS):
  zfs snapshot pool/data@backup   # instant, zero cost!
  # If files change, only changed blocks are copied
  # Unmodified blocks shared between snapshot and current

ZFS features:
  - Built-in RAID (raidz = software RAID-5)
  - Checksums on all data and metadata (detects bit rot)
  - Deduplication (same data stored once)
  - Compression (transparent)
  - Unlimited snapshots and clones
  - Send/receive (incremental backup)

Part 7 — File System Caching

Disk is 1000× slower than RAM. Cache frequently used disk data in memory.

Linux Page Cache:
  File data cached in page-sized chunks (4KB)
  Reads: first time = disk, subsequent = cache (if not evicted)
  Writes: written to cache first, written to disk later (write-back)
  
  Cache coherence: mmap and read/write see same data
  
  free command shows:
    Mem:  8G total, 2G used, 1G free, 5G buff/cache
    The 5G "buff/cache" is pages used for disk cache
    This is NOT wasted — it will be freed if programs need RAM

Buffer Cache:
  Caches raw disk blocks (for file system metadata)
  Directory entries, inodes, superblock all cached
  
  Without buffer cache: every path traversal = multiple disk reads
  With buffer cache: frequently used directories served from RAM

Practice Problems

A file is stored at blocks 4, 9, 15, 22 on disk. Draw the FAT entries for this file.
A file is created, written, and deleted. The user’s data was written before the system crashed. Will the data be recovered? Why?
What is the difference between stat and lstat? When would you use each?
Why can’t you create a hard link to a directory? (Most systems prevent this)
An inode has 12 direct pointers, 1 indirect, and block size = 4KB with 1024 pointers per indirect block. What is the maximum single-file size?

Answers

Problem 1:

FAT entries for file at blocks 4, 9, 15, 22:

FAT index:  1    2    3    4    5    6    7    8    9   10   11...22   23...
FAT value: ...  ...  ...   9   free free free free  15  ...  ...   22  EOF

FAT[4] = 9    (next block after 4 is 9)
FAT[9] = 15   (next block after 9 is 15)
FAT[15] = 22  (next block after 15 is 22)
FAT[22] = EOF (end of file, 0xFFF in FAT16)

Problem 2:

The data IS on disk (we're told it was written).
Whether it's accessible depends on the crash timing:

Scenario 1 — journaling, data=ordered mode:
  Data written to disk BEFORE metadata
  Journal has COMMIT record
  After crash: journal replay updates metadata
  Result: file exists and contains correct data ✓

Scenario 2 — no journaling or crash before commit:
  Data blocks written but inode/directory not updated
  After crash: blocks may be "free" according to bitmap
  fsck might not find them as part of any file
  Data is on disk but inaccessible (no inode points to it)
  "Orphan blocks" might be recoverable with forensic tools

Scenario 3 — write-back caching, data not yet on disk:
  Data in page cache, power failure → data LOST
  Unless: sync(), fsync(), O_SYNC, or write-through mode used

Problem 3:

stat(path, &buf):
  Follows symbolic links
  Returns info about the target file
  
  symlink.txt → /etc/passwd
  stat("symlink.txt") → info about /etc/passwd (the target)
  
lstat(path, &buf):
  Does NOT follow symbolic links
  Returns info about the symbolic link itself
  
  lstat("symlink.txt") → info about symlink.txt itself
    (type=symlink, size=length of target path string, etc.)

Use stat when you want info about what the path refers to.
Use lstat when checking if something IS a symlink (st_mode check),
or when traversing a directory tree (avoid following links).

Problem 4:

If directory hard links were allowed:

mkdir /tmp/a
mkdir /tmp/a/b
ln -d /tmp/a /tmp/a/b/link_to_a  # if allowed!

File tree structure:
/tmp/a/b/link_to_a → points to /tmp/a → CYCLE!

Problems with directory cycles:
1. find, du, ls -R → infinite loop!
2. Path resolution: /tmp/a/b/link_to_a/b/link_to_a/b/... infinite!
3. File system consistency checks (fsck) break

Linux prevents it: open(2) with O_PATH + link syscalls reject directories.
Exception: . and .. are special hard links managed entirely by the kernel.

Symbolic links to directories ARE allowed but tools handle them:
  find handles -follow flag
  ls shows the symlink, not its contents by default

Problem 5:

12 direct pointers × 4KB = 48KB

1 indirect pointer:
  Points to one block of 1024 pointers
  1024 × 4KB = 4MB

Total = 48KB + 4MB = 4,194,304 + 49,152 ≈ 4.05MB

If we also had double and triple indirect:
  Double: 1024 × 1024 × 4KB = 4GB
  Triple: 1024³ × 4KB = 4TB

With only 12 direct + 1 indirect: max ≈ 4MB

References

Arpaci-Dusseau — OSTEP — Chapters 36-45 (Persistence)
McKusick et al. — The Design and Implementation of the FreeBSD Operating System
ext4 documentation — kernel.org ext4
ZFS on Linux — openzfs.org
Silberschatz — OS Concepts — Chapters 13-14