CPU Architecture & Instruction Set Architecture
CPU Architecture & Instruction Set Architecture
The Heart of the Computer
The CPU (Central Processing Unit) executes your program. Every second, a modern CPU executes billions of instructions — each one a tiny, precise operation like “add these two numbers” or “load this value from memory.”
Real-world analogy: A CPU is like a highly efficient factory assembly line. Instructions are the work orders. Registers are the workbenches — small, fast, right next to the workers. RAM is the warehouse — much larger but further away. The pipeline is the assembly line — multiple stages processing different work orders simultaneously.
Part 1 — Von Neumann Architecture
The architecture underlying virtually every computer ever built.
┌─────────────────────────────────────────────────────┐
│ │
│ ┌───────────────────┐ ┌────────────────────┐ │
│ │ CPU │ │ Memory │ │
│ │ ┌─────────────┐ │ │ │ │
│ │ │ Control │ │ │ Programs stored │ │
│ │ │ Unit │◄─┼──────┼─ as data (!) │ │
│ │ └─────────────┘ │ │ │ │
│ │ ┌─────────────┐ │ │ Variables stored │ │
│ │ │ Arithmetic │◄─┼──────┼─ here too │ │
│ │ │ Logic Unit │ │ │ │ │
│ │ └─────────────┘ │ └────────────────────┘ │
│ │ ┌─────────────┐ │ │
│ │ │ Registers │ │ ┌────────────────────┐ │
│ │ │ (fast!) │ │ │ I/O Devices │ │
│ │ └─────────────┘ │ │ Keyboard, Screen │ │
│ └───────────────────┘ └────────────────────┘ │
│ │
│ System Bus (data highway) │
└─────────────────────────────────────────────────────┘
Key insight — stored-program concept: Programs and data live in the same memory. This allows programs to be treated as data, modified, and loaded dynamically. This is why you can run different programs without hardware changes.
Von Neumann bottleneck: Only one bus between CPU and memory. CPU is much faster than memory. The CPU often sits idle waiting for data. This bottleneck is why cache memory exists.
Part 2 — CPU Components
Registers
The fastest storage in a computer. Typically 32 or 64 flip-flops. Access time: ~1 clock cycle.
x86-64 registers:
General Purpose (64-bit):
RAX RBX RCX RDX — accumulator, base, counter, data
RSI RDI — source index, destination index
RSP RBP — stack pointer, base pointer
R8 R9 R10 R11
R12 R13 R14 R15
Each 64-bit register has sub-registers:
RAX (64-bit)
EAX (32-bit, lower half)
AX (16-bit, lower quarter)
AH / AL (8-bit each)
Special Purpose:
RIP — instruction pointer (current instruction address)
RFLAGS — status flags (zero, carry, overflow, sign, ...)
Floating Point / SIMD:
XMM0-XMM15 (128-bit)
YMM0-YMM15 (256-bit, AVX)
ZMM0-ZMM31 (512-bit, AVX-512)
RISC-V registers (cleaner architecture for learning):
x0 (zero) — always 0, writes ignored
x1 (ra) — return address
x2 (sp) — stack pointer
x3 (gp) — global pointer
x4 (tp) — thread pointer
x5-x7 — temporaries
x8 (s0/fp) — saved register / frame pointer
x9 (s1) — saved register
x10-x17 — function arguments / return values (a0-a7)
x18-x27 — saved registers (s2-s11)
x28-x31 — temporaries (t3-t6)
ALU — Arithmetic Logic Unit
Performs all arithmetic and logic operations.
┌────────────┐
A ─────┤ ├───── Result
│ ALU │
B ─────┤ ├───── Flags
└──────┬─────┘
│
Operation
(ADD, SUB, AND, OR, ...)
Operations:
Arithmetic: ADD, SUB, MUL, DIV
Logical: AND, OR, XOR, NOT
Shift: SHL, SHR, SAR (shift left/right/arithmetic)
Compare: sets flags without storing result
Flags register:
Z (Zero flag): result was zero
N (Negative flag): result was negative
C (Carry flag): unsigned overflow
V (Overflow flag): signed overflow
Control Unit
Decodes instructions and coordinates all CPU components.
Fetch instruction from memory (using PC)
↓
Decode: what operation? which registers?
↓
Execute: tell ALU what to do
↓
Memory access: load/store if needed
↓
Write back: store result in register
↓
Update PC (next instruction)
↓
Repeat
Part 3 — Instruction Set Architecture (ISA)
The ISA is the contract between hardware and software — the set of instructions a CPU understands. It’s the lowest-level programming interface.
RISC vs CISC
CISC (Complex Instruction Set Computer):
Philosophy: complex instructions do more work per instruction
Example: x86
MOVS — copies a string (loop + load + store in ONE instruction)
FSIN — computes sine in hardware
Pros: fewer instructions per program, backward compatibility
Cons: complex decoder, variable instruction length, hard to pipeline
Examples: x86, x86-64 (Intel/AMD)
RISC (Reduced Instruction Set Computer):
Philosophy: simple instructions, do one thing each
Example: RISC-V, ARM
All instructions same size (32 bits)
Load/store only accesses memory
ALU operations only on registers
Pros: simple, fast, easy to pipeline, lower power
Cons: more instructions per program
Examples: RISC-V, ARM (used in phones, Apple Silicon), MIPS
Modern reality: x86 chips internally translate CISC instructions to RISC-like micro-operations. The ISA is CISC but the implementation is RISC.
Instruction Format (RISC-V)
32-bit instruction, several formats:
R-type (register operations):
┌─────────┬──────┬──────┬───┬──────┬───────┐
│ funct7 │ rs2 │ rs1 │fn3│ rd │ opcode│
│ [31:25]│[24:20]│[19:15]│[14:12]│[11:7]│[6:0]│
└─────────┴──────┴──────┴───┴──────┴───────┘
7 bits 5 5 3 5 7
Example: ADD x1, x2, x3 (x1 = x2 + x3)
funct7=0000000, rs2=x3, rs1=x2, funct3=000, rd=x1, opcode=0110011
I-type (immediate operations):
┌─────────────┬──────┬───┬──────┬───────┐
│ imm[11:0]│ rs1 │fn3│ rd │ opcode│
│ 12 bits │ 5 │ 3 │ 5 │ 7 │
└─────────────┴──────┴───┴──────┴───────┘
Example: ADDI x1, x2, 10 (x1 = x2 + 10)
Immediate can encode constants up to 12 bits (-2048 to 2047)
S-type (stores):
┌──────────┬──────┬──────┬───┬──────────┬───────┐
│ imm[11:5]│ rs2 │ rs1 │fn3│ imm[4:0] │ opcode│
└──────────┴──────┴──────┴───┴──────────┴───────┘
B-type (branches):
Similar to S-type but immediate encodes PC-relative offset
J-type (jumps):
20-bit immediate for long-range jumps
Types of Instructions
Arithmetic/Logic:
ADD x1, x2, x3 ; x1 = x2 + x3
SUB x1, x2, x3 ; x1 = x2 - x3
AND x1, x2, x3 ; x1 = x2 & x3
OR x1, x2, x3 ; x1 = x2 | x3
XOR x1, x2, x3 ; x1 = x2 ^ x3
SLL x1, x2, x3 ; x1 = x2 << x3 (shift left logical)
SRL x1, x2, x3 ; x1 = x2 >> x3 (shift right logical)
SRA x1, x2, x3 ; x1 = x2 >> x3 (shift right arithmetic, sign-extend)
; Immediate variants (I-type):
ADDI x1, x2, 10 ; x1 = x2 + 10
ANDI x1, x2, 0xFF ; x1 = x2 & 0xFF
Load/Store (memory access):
; Only way to access memory in RISC!
; All operations must be register-to-register
; Except load and store
LW x1, 8(x2) ; x1 = Memory[x2 + 8] (load word = 4 bytes)
LH x1, 4(x2) ; load halfword (2 bytes, sign-extended)
LB x1, 0(x2) ; load byte (sign-extended)
LWU x1, 8(x2) ; load word unsigned (zero-extended)
SW x1, 8(x2) ; Memory[x2 + 8] = x1 (store word)
SH x1, 4(x2) ; store halfword
SB x1, 0(x2) ; store byte
; Why offset addressing?
; Accessing a struct field:
; struct Point { int x; int y; } (x at offset 0, y at offset 4)
; LW x5, 0(x10) ; load point.x
; LW x6, 4(x10) ; load point.y
Control Flow:
; Conditional branches (PC-relative):
BEQ x1, x2, label ; if x1==x2, jump to label
BNE x1, x2, label ; if x1!=x2, jump
BLT x1, x2, label ; if x1 < x2 (signed)
BGE x1, x2, label ; if x1 >= x2 (signed)
BLTU x1, x2, label ; if x1 < x2 (unsigned)
BGEU x1, x2, label ; if x1 >= x2 (unsigned)
; Unconditional:
JAL x1, label ; x1 = PC+4, jump to label (call)
JALR x1, x2, 0 ; x1 = PC+4, jump to x2+0 (return)
; Example: if (a == b) { ... }
; a in x10, b in x11
BNE x10, x11, skip ; if not equal, skip body
; ... body ...
skip:
Part 4 — Assembly Language
Assembly is a human-readable form of machine code. Each assembly instruction corresponds to one machine instruction.
C to Assembly Translation
// C code
int sum(int a, int b) {
return a + b;
}
; RISC-V Assembly
; Arguments: a in a0, b in a1
; Return value: a0
sum:
add a0, a0, a1 ; a0 = a0 + a1
ret ; return (jalr x0, ra, 0)
Loop example:
// Sum array of n integers
int arraySum(int* arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i];
}
return sum;
}
; arr in a0, n in a1
; sum in a2, i in a3
arraySum:
li a2, 0 ; sum = 0
li a3, 0 ; i = 0
loop:
bge a3, a1, done ; if i >= n, exit loop
slli t0, a3, 2 ; t0 = i * 4 (byte offset for int)
add t1, a0, t0 ; t1 = &arr[i]
lw t2, 0(t1) ; t2 = arr[i]
add a2, a2, t2 ; sum += arr[i]
addi a3, a3, 1 ; i++
j loop
done:
mv a0, a2 ; return sum
ret
If-else example:
int max(int a, int b) {
if (a > b)
return a;
else
return b;
}
; a in a0, b in a1
max:
bge a1, a0, return_b ; if b >= a, return b
ret ; return a (already in a0)
return_b:
mv a0, a1
ret
Addressing Modes
Different ways to specify where an operand is:
Immediate: ADDI x1, x2, 42 operand = 42 (in instruction)
Register: ADD x1, x2, x3 operand = value in x3
Base+Offset: LW x1, 8(x2) operand = Memory[x2 + 8]
PC-Relative: BEQ x1, x2, +100 target = PC + 100
x86 has more modes (complex addressing):
[base + index*scale + displacement]
e.g. [rbx + rcx*4 + 8] → arr[rbx/4 + rcx]
Part 5 — x86 Assembly (Brief)
Since most desktops run x86, useful to recognize:
; x86-64 (AT&T syntax used by GCC)
; Format: instruction src, dst (reversed from Intel syntax!)
mov $5, %rax ; rax = 5
add %rbx, %rax ; rax = rax + rbx
sub $3, %rax ; rax = rax - 3
imul %rcx, %rax ; rax = rax * rcx
push %rax ; push onto stack
pop %rbx ; pop from stack
call function ; call function
ret ; return from function
cmp %rbx, %rax ; set flags based on rax-rbx
je label ; jump if equal (ZF=1)
jg label ; jump if greater (signed)
; Intel syntax (used by MASM, NASM):
mov rax, 5 ; rax = 5 (destination first)
add rax, rbx ; rax = rax + rbx
Seeing assembly from C++:
# Compile with assembly output
g++ -O0 -S -masm=intel file.cpp -o file.asm
# With optimization
g++ -O2 -S -masm=intel file.cpp -o file_opt.asm
# Or use Compiler Explorer online:
# godbolt.org — type C++, see assembly live
Part 6 — Calling Conventions
How functions pass arguments and return values. This is the ABI (Application Binary Interface).
RISC-V Calling Convention
Argument registers: a0-a7 (x10-x17)
First 8 arguments passed in registers
Remaining: pushed on stack
Return value: a0, a1 (for large returns)
Caller-saved (caller must save if needed):
ra, t0-t6, a0-a7
Callee can freely modify these
Callee-saved (callee must restore):
s0-s11 (x8-x9, x18-x27)
If callee uses these, must save/restore
Stack frame:
┌─────────────────┐ ← old sp
│ saved ra │ (if function calls others)
│ saved s0 │ (if used)
│ saved s1 │ (if used)
│ local vars │
│ ... │
└─────────────────┘ ← sp (points here during function)
Function call example:
; int factorial(int n)
factorial:
addi sp, sp, -16 ; allocate stack frame
sw ra, 12(sp) ; save return address
sw s0, 8(sp) ; save s0
mv s0, a0 ; s0 = n
li t0, 1
ble a0, t0, base ; if n <= 1, base case
addi a0, a0, -1 ; a0 = n-1
call factorial ; recursive call
mul a0, s0, a0 ; a0 = n * factorial(n-1)
j done
base:
li a0, 1 ; return 1
done:
lw ra, 12(sp) ; restore return address
lw s0, 8(sp) ; restore s0
addi sp, sp, 16 ; deallocate frame
ret
Practice Problems
Translate to RISC-V assembly:
int absolute(int x) { if (x < 0) return -x; return x; }What does this assembly do?
li t0, 1 li t1, 0 loop: mul t1, t1, a0 ; what's wrong here? mul t0, t0, a0 addi t2, t2, -1 bnez t2, loopHow many bytes does
LW x5, 0(x6)load?What is the difference between
SRLandSRA?
Answers
Problem 1:
; x in a0, result in a0
absolute:
bge a0, x0, done ; if x >= 0, done
neg a0, a0 ; a0 = -a0 (pseudo: sub a0, x0, a0)
done:
ret
Problem 3: 4 bytes (LW = Load Word = 32 bits = 4 bytes)
Problem 4:
SRL (Shift Right Logical): fills with 0s from left
1000 1010 >> 2 = 0010 0010 (unsigned right shift)
SRA (Shift Right Arithmetic): fills with sign bit from left
1000 1010 >> 2 = 1110 0010 (signed right shift, preserves sign)
Use SRL for unsigned numbers
Use SRA for signed numbers (equivalent to division by 2^n)
References
- Patterson & Hennessy — Computer Organization and Design RISC-V ed. — Chapters 2-3
- Nisan & Schocken — The Elements of Computing Systems (Nand2Tetris)
- RISC-V Spec — riscv.org
- Godbolt Compiler Explorer — godbolt.org
- MIT 6.004 — Computation Structures