25 min read

CPU Architecture Deep Dive: From Basics to Modern Multi-Core Design

A comprehensive technical walkthrough of processor fundamentals. We bridge the gap between abstract instruction sets and physical silicon, exploring how features like branch prediction, hyper-threading, and dynamic frequency scaling power modern computing.

CPU basics

The CPU (Central Processing Unit) is the "brain" of the computer, executing billions of instructions per second through a fetch-decode-execute cycle: fetching instructions from memory, decoding which operation to perform, executing it using the ALU (Arithmetic Logic Unit), and storing the results. Modern CPUs have multiple cores (parallel processors) and a cache hierarchy (L1/L2/L3 for fast data access), run at gigahertz speeds, and use sophisticated techniques like pipelining, branch prediction, and out-of-order execution to maximize performance.

CPU ARCHITECTURE

┌────────────────────────────────────────────────────────────┐
│                        CPU PACKAGE                         │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              L3 CACHE (Shared)   32MB                │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐   │
│  │  CORE 0   │ │  CORE 1   │ │  CORE 2   │ │  CORE 3   │   │
│  │ L2 512KB  │ │ L2 512KB  │ │ L2 512KB  │ │ L2 512KB  │   │
│  │ L1D 32K   │ │ L1D 32K   │ │ L1D 32K   │ │ L1D 32K   │   │
│  │ L1I 32K   │ │ L1I 32K   │ │ L1I 32K   │ │ L1I 32K   │   │
│  │ ALU / FPU │ │ ALU / FPU │ │ ALU / FPU │ │ ALU / FPU │   │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘   │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Memory Controller │ PCIe Controller │ IMC           │  │
│  └──────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────┤
│  FETCH → DECODE → EXECUTE → WRITEBACK  (Pipeline)          │
│                                                            │
│  Clock: 5.0 GHz = 5 billion cycles per second              │
│  IPC: Instructions Per Cycle (efficiency metric)           │
│  Performance ≈ Cores × Clock Speed × IPC                   │
└────────────────────────────────────────────────────────────┘
// CPU concepts demonstration
const cpu = {
  model: 'AMD Ryzen 9 7950X',
  cores: 16,
  threads: 32,        // SMT (Simultaneous Multi-Threading)
  baseClock: 4.5,     // GHz
  boostClock: 5.7,    // GHz
  cache: {
    L1: '1MB total',  // Fastest, smallest
    L2: '16MB',       // Fast
    L3: '64MB'        // Slower, shared
  },
  tdp: 170,           // Watts (thermal design power)
  process: '5nm'      // Manufacturing node
};

// Cache latency comparison (approximate cycles)
const cacheLatency = {
  L1: 4,      // ~1 nanosecond
  L2: 12,     // ~3 nanoseconds
  L3: 40,     // ~10 nanoseconds
  RAM: 200    // ~50+ nanoseconds
};

// Why cache matters: Example
function calculateCacheImpact(dataSizeKB, cacheHitRate) {
  const L1AccessTime = 1;    // ns
  const RAMAccessTime = 50;  // ns
  const avgAccessTime =
    (cacheHitRate * L1AccessTime) +
    ((1 - cacheHitRate) * RAMAccessTime);
  console.log(`Cache hit rate: ${cacheHitRate * 100}%`);
  console.log(`Average access: ${avgAccessTime}ns`);
  // 90% hit rate: 5.9ns, 50% hit rate: 25.5ns (4x slower!)
}

Arithmetic Logic Unit (ALU)

The ALU is the mathematical brain of the CPU, performing all arithmetic operations (add, subtract, multiply, divide) and logical operations (AND, OR, NOT, XOR) on binary data. Every calculation your computer makes ultimately passes through the ALU.

ALU Block Diagram

    Operand A      Operand B
        │              │
        ▼              ▼
  ┌───────────────────────┐
  │          ALU          │
  │    +   -   ×   ÷      │
  │   AND  OR  NOT  XOR   │
  └───────────┬───────────┘
        ┌─────┴─────┐
        ▼           ▼
     Result      Flags (Zero, Carry, Overflow)
// Simple ALU simulation
class ALU {
  static operations = {
    ADD: (a, b) => ({ result: a + b, zero: (a + b) === 0 }),
    SUB: (a, b) => ({ result: a - b, zero: (a - b) === 0 }),
    AND: (a, b) => ({ result: a & b, zero: (a & b) === 0 }),
    OR:  (a, b) => ({ result: a | b, zero: (a | b) === 0 }),
    XOR: (a, b) => ({ result: a ^ b, zero: (a ^ b) === 0 }),
    NOT: (a) => ({ result: ~a, zero: (~a) === 0 })
  };

  execute(op, a, b = 0) {
    return ALU.operations[op](a, b);
  }
}

const alu = new ALU();
console.log(alu.execute('ADD', 5, 3));            // { result: 8, zero: false }
console.log(alu.execute('AND', 0b1100, 0b1010));  // { result: 8 (0b1000), zero: false }

Control Unit

The Control Unit is the CPU's traffic controller, directing all operations by interpreting instructions and sending signals to coordinate the ALU, registers, and memory. It orchestrates the fetch-decode-execute cycle that makes program execution possible.

Control Unit Operation

┌─────────────────────────────────────────┐
│              CONTROL UNIT               │
│  ┌───────────────────────────────────┐  │
│  │        Instruction Decoder        │  │
│  └─────────────────┬─────────────────┘  │
│          ┌─────────┼─────────┐          │
│          ▼         ▼         ▼          │
│     ┌─────────┬─────────┬─────────┐     │
│     │ Timing  │ Control │ Sequenc-│     │
│     │ Signals │ Lines   │ ing     │     │
│     └────┬────┴────┬────┴────┬────┘     │
└──────────┼─────────┼─────────┼──────────┘
           ▼         ▼         ▼
        Memory      ALU    Registers
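To make the decoder's job concrete, here is a minimal sketch of a control unit that maps opcodes to the control lines it would assert each cycle. The opcode names and signal fields (memRead, memWrite, aluOp, regWrite) are illustrative assumptions, not taken from any real ISA.

```javascript
// Hypothetical control unit sketch: opcode → control signals.
// Signal names are illustrative, not from a real processor.
class ControlUnit {
  static signalTable = {
    LOAD:  { memRead: true,  memWrite: false, aluOp: null,  regWrite: true  },
    STORE: { memRead: false, memWrite: true,  aluOp: null,  regWrite: false },
    ADD:   { memRead: false, memWrite: false, aluOp: 'ADD', regWrite: true  },
    JMP:   { memRead: false, memWrite: false, aluOp: null,  regWrite: false }
  };

  // Decode: look up which control lines to assert for this instruction
  decode(opcode) {
    const signals = ControlUnit.signalTable[opcode];
    if (!signals) throw new Error(`Unknown opcode: ${opcode}`);
    return signals;
  }
}

const cu = new ControlUnit();
console.log(cu.decode('ADD'));  // asserts the ALU-op and register-write lines
```

Real control units generate dozens of such signals per cycle (bus enables, register selects, ALU function codes), but the table-lookup structure is the same idea.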

Registers

Registers are the fastest storage locations in a computer, built directly into the CPU and accessed in a single clock cycle. Common registers include the Program Counter (PC), Accumulator, Stack Pointer, and general-purpose registers for temporary data storage.

// CPU Register simulation
class CPURegisters {
  constructor() {
    // Common x86-like registers (simplified)
    this.registers = {
      // General purpose (32-bit represented as numbers)
      EAX: 0,  // Accumulator
      EBX: 0,  // Base
      ECX: 0,  // Counter
      EDX: 0,  // Data
      // Special purpose
      EIP: 0,  // Instruction Pointer (Program Counter)
      ESP: 0,  // Stack Pointer
      EBP: 0,  // Base Pointer
      // Flags register
      FLAGS: { zero: false, carry: false, sign: false, overflow: false }
    };
  }

  get(name) { return this.registers[name]; }
  set(name, value) { this.registers[name] = value; }
  incrementIP(bytes = 1) { this.registers.EIP += bytes; }
}
x86 Register Layout

64-bit │ RAX                     │
───────┼────────────┬────────────┤
32-bit │            │ EAX        │
───────┼────────────┼────┬───────┤
16-bit │            │    │ AX    │
───────┼────────────┼────┼───┬───┤
 8-bit │            │    │AH │AL │
       └────────────┴────┴───┴───┘

Instruction Cycle (Fetch-Decode-Execute)

Every instruction executes through three fundamental phases: Fetch retrieves the instruction from memory, Decode interprets what operation to perform, and Execute carries out the operation. This cycle repeats billions of times per second in modern CPUs.

Instruction Cycle (Von Neumann)

      ┌───────────────────────────────────┐
      ▼                                   │
┌──────────┐    ┌──────────┐    ┌──────────┐
│  FETCH   │───▶│  DECODE  │───▶│ EXECUTE  │
│          │    │          │    │          │
│ Get next │    │Interpret │    │ Perform  │
│ instruct.│    │ opcode & │    │   the    │
│ from RAM │    │ operands │    │operation │
└──────────┘    └──────────┘    └──────────┘
      │                               │
      └────────── Memory Bus ─────────┘
// Simplified instruction cycle simulation
class SimpleCPU {
  constructor() {
    this.memory = new Uint8Array(256);
    this.PC = 0;         // Program Counter
    this.ACC = 0;        // Accumulator
    this.running = true;
  }

  // Opcodes: 0=HALT, 1=LOAD, 2=ADD, 3=STORE, 4=JUMP
  cycle() {
    // FETCH
    const instruction = this.memory[this.PC];
    const operand = this.memory[this.PC + 1];

    // DECODE & EXECUTE
    switch (instruction) {
      case 0: this.running = false; break;              // HALT
      case 1: this.ACC = this.memory[operand]; break;   // LOAD
      case 2: this.ACC += this.memory[operand]; break;  // ADD
      case 3: this.memory[operand] = this.ACC; break;   // STORE
      case 4: this.PC = operand - 2; break;             // JUMP
    }

    this.PC += 2;  // Move to next instruction
    return { instruction, operand, ACC: this.ACC };
  }

  run() {
    while (this.running) console.log(this.cycle());
  }
}

Clock Speed and Cycles

The CPU clock is an electronic oscillator that generates timing pulses, measured in Hertz (cycles per second). Modern CPUs run at several GHz (billions of cycles per second), but different instructions may require different numbers of clock cycles to complete.

Clock Signal

Voltage
  1 │ ┌───┐   ┌───┐   ┌───┐   ┌───┐
    │ │   │   │   │   │   │   │   │
  0 │─┘   └───┘   └───┘   └───┘   └───
    └──────────────────────────────────── Time
      │◄─────►│
       1 cycle

At 4 GHz: 1 cycle = 0.25 nanoseconds (light travels ~7.5 cm)
// Clock speed and instruction timing
const clockSpeedGHz = 4.0;
const cyclesPerSecond = clockSpeedGHz * 1e9;
const secondsPerCycle = 1 / cyclesPerSecond;

const instructionCycles = {
  ADD_REG: 1,     // Add registers: 1 cycle
  MUL_REG: 3,     // Multiply: 3 cycles
  LOAD_MEM: 4,    // Load from L1 cache: ~4 cycles
  LOAD_RAM: 100,  // Load from RAM: ~100 cycles
  DIV: 20         // Division: ~20 cycles
};

Object.entries(instructionCycles).forEach(([op, cycles]) => {
  const timeNs = (cycles * secondsPerCycle * 1e9).toFixed(3);
  console.log(`${op}: ${cycles} cycles = ${timeNs} ns`);
});

Word Size

Word size refers to the number of bits a CPU processes in one operation, determining the size of registers, memory addressing, and data bus width. Common word sizes have evolved from 8-bit (early microprocessors) to 16-bit, 32-bit, and today's 64-bit architectures.

Word Size Evolution

 8-bit: 256 values (0-255)
16-bit: 65,536 values (0-65,535)
32-bit: 4.29 billion values (0-4,294,967,295), 4 GB addressable
64-bit: 18.4 quintillion values, 16 EB addressable
// Word size implications
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB'];
  let i = 0;
  let size = Number(bytes);
  while (size >= 1024 && i < units.length - 1) {
    size /= 1024;
    i++;
  }
  return `${size.toFixed(2)} ${units[i]}`;
}

const wordSizes = [8, 16, 32, 64];
wordSizes.forEach(bits => {
  const maxValue = BigInt(2) ** BigInt(bits) - BigInt(1);
  const addressableMemory = BigInt(2) ** BigInt(bits);
  console.log(`${bits}-bit:`);
  console.log(`  Max unsigned value: ${maxValue.toLocaleString()}`);
  console.log(`  Addressable memory: ${formatBytes(addressableMemory)}`);
});

Instruction Sets Overview

An Instruction Set Architecture (ISA) defines the machine language instructions a CPU understands. The two main philosophies are CISC (Complex Instruction Set Computing, like x86) with many specialized instructions, and RISC (Reduced Instruction Set Computing, like ARM) with simpler, uniform instructions.

CISC vs RISC Philosophy

CISC (x86)                      RISC (ARM)
──────────────────────────      ──────────────────────────
Complex instructions            Simple instructions
Variable length (1-15 bytes)    Fixed length (4 bytes)
Memory-to-memory ops            Load/Store architecture
Fewer registers                 Many registers
Hardware decoding               Simpler decoding

Example: Add memory to register

CISC (1 instruction):           RISC (2 instructions):
ADD EAX, [address]              LOAD R1, [address]
                                ADD  R0, R0, R1   (result in R0)
// Simple instruction set simulation
const instructionSet = {
  // RISC-like instructions
  NOP:   { opcode: 0x00, cycles: 1, desc: "No operation" },
  LOAD:  { opcode: 0x01, cycles: 2, desc: "Load register from memory" },
  STORE: { opcode: 0x02, cycles: 2, desc: "Store register to memory" },
  ADD:   { opcode: 0x10, cycles: 1, desc: "Add two registers" },
  SUB:   { opcode: 0x11, cycles: 1, desc: "Subtract registers" },
  MUL:   { opcode: 0x12, cycles: 3, desc: "Multiply registers" },
  AND:   { opcode: 0x20, cycles: 1, desc: "Bitwise AND" },
  OR:    { opcode: 0x21, cycles: 1, desc: "Bitwise OR" },
  JMP:   { opcode: 0x30, cycles: 1, desc: "Unconditional jump" },
  JZ:    { opcode: 0x31, cycles: 1, desc: "Jump if zero" },
  CALL:  { opcode: 0x40, cycles: 2, desc: "Call subroutine" },
  RET:   { opcode: 0x41, cycles: 2, desc: "Return from subroutine" },
  HALT:  { opcode: 0xFF, cycles: 1, desc: "Stop execution" }
};

// Print instruction set table
console.log("Opcode │ Mnemonic │ Cycles │ Description");
console.log("───────┼──────────┼────────┼─────────────────────");
Object.entries(instructionSet).forEach(([name, info]) => {
  const opcode = `0x${info.opcode.toString(16).padStart(2, '0')}`;
  console.log(` ${opcode.padEnd(6)}│ ${name.padEnd(9)}│ ${String(info.cycles).padEnd(7)}│ ${info.desc}`);
});

Pipelining

Pipelining divides instruction execution into stages (typically Fetch, Decode, Execute, Memory, Writeback), allowing multiple instructions to be in-flight simultaneously—like an assembly line. A 5-stage pipeline can theoretically achieve 5x throughput, but hazards (data dependencies, branches) can stall the pipeline and reduce efficiency.

Classic 5-Stage RISC Pipeline

Time →    1    2    3    4    5    6    7    8
        ┌────┬────┬────┬────┬────┐
Inst 1  │ IF │ ID │ EX │MEM │ WB │
        └────┴────┴────┴────┴────┘
             ┌────┬────┬────┬────┬────┐
Inst 2       │ IF │ ID │ EX │MEM │ WB │
             └────┴────┴────┴────┴────┘
                  ┌────┬────┬────┬────┬────┐
Inst 3            │ IF │ ID │ EX │MEM │ WB │
                  └────┴────┴────┴────┴────┘
                       ┌────┬────┬────┬────┬────┐
Inst 4                 │ IF │ ID │ EX │MEM │ WB │
                       └────┴────┴────┴────┴────┘

Without pipeline: 4 instructions × 5 cycles = 20 cycles
With pipeline:    5 + 3 = 8 cycles (fill, then one completion per cycle)

IF = Instruction Fetch    MEM = Memory Access
ID = Instruction Decode   WB  = Write Back
EX = Execute
// Pipeline simulation
class Pipeline {
  constructor(stages = ['IF', 'ID', 'EX', 'MEM', 'WB']) {
    this.stages = stages;
    this.pipe = new Array(stages.length).fill(null);
    this.cycle = 0;
    this.completed = [];
  }

  tick(newInstruction = null) {
    this.cycle++;
    // Instruction leaving pipeline
    if (this.pipe[this.pipe.length - 1]) {
      this.completed.push(this.pipe[this.pipe.length - 1]);
    }
    // Shift all instructions forward
    for (let i = this.pipe.length - 1; i > 0; i--) {
      this.pipe[i] = this.pipe[i - 1];
    }
    this.pipe[0] = newInstruction;
    return this.getState();
  }

  getState() {
    return this.stages.map((stage, i) =>
      `${stage}: ${this.pipe[i] || '---'}`
    ).join(' | ');
  }
}

const pipe = new Pipeline();
['ADD', 'SUB', 'MUL', 'LOAD', 'STORE', null, null, null, null].forEach((inst, i) => {
  console.log(`Cycle ${i + 1}: ${pipe.tick(inst)}`);
});

Branch Prediction Basics

Branch prediction allows the CPU to speculatively fetch and execute instructions before knowing a branch's outcome, avoiding pipeline stalls. Modern predictors achieve 95%+ accuracy using techniques like branch history tables and pattern recognition—a misprediction costs 10-20 cycles to flush and restart the pipeline.

Branch Prediction Problem

Without prediction:
  ┌────┬────┬────┐
  │ IF │ ID │ EX │──► Branch result known
  └────┴────┴────┘
                  ┌────┬────┬────┐
                  │ IF │ ID │ EX │   3 cycles wasted!
                  └────┴────┴────┘

With prediction (guess correctly):
  ┌────┬────┬────┐
  │ IF │ ID │ EX │
  └────┴────┴────┘
       ┌────┬────┬────┐
       │ IF │ ID │ EX │   Speculative, confirmed!
       └────┴────┴────┘
            ┌────┬────┬────┐
            │ IF │ ID │ EX │   No stall!
            └────┴────┴────┘

2-Bit Saturating Counter (common predictor):

          Predict Taken                         Predict Not Taken
┌────────┐ Not Taken ┌────────┐ Not Taken ┌────────┐ Not Taken ┌────────┐
│ Strong │ ────────► │  Weak  │ ────────► │  Weak  │ ────────► │ Strong │
│ Taken  │ ◄──────── │ Taken  │ ◄──────── │Not Tkn │ ◄──────── │Not Tkn │
└────────┘   Taken   └────────┘   Taken   └────────┘   Taken   └────────┘
    11                   10                   01                   00
// 2-bit saturating counter branch predictor
class BranchPredictor {
  constructor(tableSize = 256) {
    // 2-bit counters: 0,1 = not taken; 2,3 = taken
    this.table = new Uint8Array(tableSize).fill(2);  // Start weakly taken
    this.stats = { correct: 0, incorrect: 0 };
  }

  predict(address) {
    const index = address % this.table.length;
    return this.table[index] >= 2;  // true = taken
  }

  update(address, actuallyTaken) {
    const index = address % this.table.length;
    const predicted = this.table[index] >= 2;
    // Update counter
    if (actuallyTaken && this.table[index] < 3) this.table[index]++;
    if (!actuallyTaken && this.table[index] > 0) this.table[index]--;
    // Track accuracy
    if (predicted === actuallyTaken) this.stats.correct++;
    else this.stats.incorrect++;
  }

  get accuracy() {
    const total = this.stats.correct + this.stats.incorrect;
    return total ? (this.stats.correct / total * 100).toFixed(1) : 0;
  }
}

// Simulate a loop (predictable pattern)
const predictor = new BranchPredictor();
const loopAddr = 0x1000;
for (let i = 0; i < 1000; i++) {
  for (let j = 0; j < 10; j++) {
    const taken = j < 9;  // Loop 9 times, then exit
    predictor.update(loopAddr, taken);
  }
}
console.log(`Prediction accuracy: ${predictor.accuracy}%`);  // ~90%

Cache Memory (L1, L2, L3)

CPU caches are small, fast SRAM memories that store frequently accessed data and instructions, dramatically reducing average memory access time. L1 is fastest but smallest (~32KB, ~4 cycles), L2 is larger and slower (~256KB-1MB, ~12 cycles), and L3 is shared across cores (~8-64MB, ~40 cycles)—compared to ~100+ cycles for main memory.

Cache Hierarchy in Modern CPU

┌─────────────────────────────────────────────┐
│                  CPU Die                    │
│   ┌───────────┐          ┌───────────┐      │
│   │  Core 0   │          │  Core 1   │      │
│   │ ┌───────┐ │          │ ┌───────┐ │      │
│   │ │ L1-I  │ │  32KB    │ │ L1-I  │ │      │
│   │ │ 4 cyc │ │ per core │ │ 4 cyc │ │      │
│   │ ├───────┤ │          │ ├───────┤ │      │
│   │ │ L1-D  │ │  32KB    │ │ L1-D  │ │      │
│   │ │ 4 cyc │ │ per core │ │ 4 cyc │ │      │
│   │ └───┬───┘ │          │ └───┬───┘ │      │
│   │ ┌───┴───┐ │ 256KB-1MB│ ┌───┴───┐ │      │
│   │ │  L2   │ │ per core │ │  L2   │ │      │
│   │ │12 cyc │ │          │ │12 cyc │ │      │
│   │ └───┬───┘ │          │ └───┬───┘ │      │
│   └─────┼─────┘          └─────┼─────┘      │
│         └───────────┬──────────┘            │
│   ┌─────────────────┴─────────────────┐     │
│   │             L3 Cache              │     │
│   │        8-64MB, ~40 cycles         │     │
│   │       (Shared by all cores)       │     │
│   └─────────────────┬─────────────────┘     │
└─────────────────────┼───────────────────────┘
              ┌───────┴───────┐
              │  Main Memory  │
              │   DDR4/DDR5   │
              │ ~100+ cycles  │
              └───────────────┘
// Cache hit/miss simulation
class CacheSimulator {
  constructor(cacheLines = 64, lineSize = 64) {
    this.cacheLines = cacheLines;
    this.lineSize = lineSize;
    this.cache = new Map();
    this.stats = { hits: 0, misses: 0 };
  }

  access(address) {
    const lineAddr = Math.floor(address / this.lineSize) * this.lineSize;
    const index = (lineAddr / this.lineSize) % this.cacheLines;

    if (this.cache.get(index) === lineAddr) {
      this.stats.hits++;
      return { hit: true, latency: 4 };
    } else {
      this.stats.misses++;
      this.cache.set(index, lineAddr);
      return { hit: false, latency: 100 };
    }
  }

  get hitRate() {
    const total = this.stats.hits + this.stats.misses;
    return (this.stats.hits / total * 100).toFixed(1);
  }
}

// Sequential access (cache-friendly)
const cache = new CacheSimulator();
for (let i = 0; i < 1000; i++) {
  cache.access(i * 4);  // Sequential, reuses cache lines
}
console.log(`Sequential hit rate: ${cache.hitRate}%`);

// Random access (cache-unfriendly)
const cache2 = new CacheSimulator();
for (let i = 0; i < 1000; i++) {
  cache2.access(Math.floor(Math.random() * 100000));
}
console.log(`Random hit rate: ${cache2.hitRate}%`);

Cache Coherence Introduction

In multi-core systems, each core has its own cache, creating the problem of keeping cached copies of the same memory location consistent. The MESI protocol (Modified, Exclusive, Shared, Invalid) is widely used, where cores snoop the bus to detect when they need to update or invalidate their cached copies.

MESI Protocol States

┌───────────┬────────┬────────┬─────────┬───────────────────────┐
│ State     │ Valid? │ Dirty? │ Shared? │ Description           │
├───────────┼────────┼────────┼─────────┼───────────────────────┤
│ Modified  │ Yes    │ Yes    │ No      │ Only copy, changed    │
│ Exclusive │ Yes    │ No     │ No      │ Only copy, clean      │
│ Shared    │ Yes    │ No     │ Yes     │ Multiple clean copies │
│ Invalid   │ No     │ -      │ -       │ Not in cache          │
└───────────┴────────┴────────┴─────────┴───────────────────────┘

Example: Core 0 writes, Core 1 reads same address

Initial:  Core0: Invalid    Core1: Invalid
Step 1:   Core0 reads address X
          Core0: Exclusive  Core1: Invalid
Step 2:   Core0 writes to X
          Core0: Modified   Core1: Invalid
Step 3:   Core1 wants to read X
          - Core0 snoops request, provides data
          - Core0: Shared   Core1: Shared
          - Value written back to memory
// Simplified MESI state machine
class CacheLine {
  constructor() {
    this.state = 'Invalid';
    this.data = null;
  }
}

class MESISimulator {
  constructor() {
    this.caches = [new Map(), new Map()];  // 2 cores
    this.memory = new Map();
  }

  read(core, address) {
    const otherCore = 1 - core;
    const line = this.caches[core].get(address) || new CacheLine();
    const otherLine = this.caches[otherCore].get(address);

    if (line.state !== 'Invalid') {
      return { state: line.state, data: line.data, event: 'Hit' };
    }

    // Cache miss - check other core
    if (otherLine?.state === 'Modified') {
      // Snoop: get data from other core
      this.memory.set(address, otherLine.data);
      otherLine.state = 'Shared';
      line.state = 'Shared';
      line.data = otherLine.data;
    } else if (otherLine?.state === 'Exclusive' || otherLine?.state === 'Shared') {
      otherLine.state = 'Shared';
      line.state = 'Shared';
      line.data = this.memory.get(address);
    } else {
      line.state = 'Exclusive';
      line.data = this.memory.get(address);
    }

    this.caches[core].set(address, line);
    return { state: line.state, data: line.data, event: 'Miss' };
  }

  write(core, address, value) {
    const otherCore = 1 - core;
    const line = this.caches[core].get(address) || new CacheLine();
    const otherLine = this.caches[otherCore].get(address);

    // Invalidate other core's copy
    if (otherLine) otherLine.state = 'Invalid';

    line.state = 'Modified';
    line.data = value;
    this.caches[core].set(address, line);
    return { state: 'Modified', invalidated: !!otherLine };
  }
}

Superscalar Execution

Superscalar processors can execute multiple instructions per clock cycle by having multiple execution units (ALUs, FPUs, load/store units) operating in parallel. The CPU analyzes instruction dependencies and dispatches independent instructions simultaneously, achieving IPC (Instructions Per Cycle) greater than 1—modern CPUs can sustain 4-6 IPC.

Superscalar Execution (Multiple Issue)

Scalar (1 instruction/cycle):
Cycle:   1     2     3     4     5     6
        ADD   SUB   MUL   LOAD  AND   OR

Superscalar (4-wide, can issue 4/cycle if independent):
Cycle:        1                2
      ┌─────────────┐  ┌─────────────┐
      │ ADD (ALU 0) │  │ AND (ALU 0) │
      │ SUB (ALU 1) │  │ OR  (ALU 1) │
      │ MUL (FPU)   │  │ ...         │
      │ LOAD (LSU)  │  │ ...         │
      └─────────────┘  └─────────────┘

Execution Units in Modern CPU:
┌───────────────────────────────────────────────┐
│            Instruction Scheduler              │
└──┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬──┘
   ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ALU 0│ALU 1│ALU 2│ALU 3│FPU 0│FPU 1│Load │Store│
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Modern CPUs: 8-12 execution units
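The dependency analysis described above can be sketched in a few lines. This is a deliberately simplified scheduler, not a model of any real CPU: it issues up to `width` instructions per cycle, holding back any instruction whose source registers are written by an instruction issued earlier that same cycle (later independent instructions are allowed to pass it, as an out-of-order machine would). The instruction format is an assumption for illustration.

```javascript
// Simplified 4-wide issue sketch: greedy, dependency-aware dispatch.
// Real schedulers also track execution-unit availability and latencies.
function scheduleSuperscalar(instructions, width = 4) {
  const cycles = [];
  let pending = [...instructions];
  while (pending.length > 0) {
    const issued = [];
    const writtenThisCycle = new Set();  // dests of instructions issued this cycle
    const remaining = [];
    for (const inst of pending) {
      const dependsOnEarlier = inst.src.some(r => writtenThisCycle.has(r));
      if (!dependsOnEarlier && issued.length < width) {
        issued.push(inst.name);
        writtenThisCycle.add(inst.dest);
      } else {
        remaining.push(inst);  // stalls to a later cycle
      }
    }
    cycles.push(issued);
    pending = remaining;
  }
  return cycles;
}

const program = [
  { name: 'ADD R1,R2,R3', dest: 'R1', src: ['R2', 'R3'] },
  { name: 'SUB R4,R5,R6', dest: 'R4', src: ['R5', 'R6'] },
  { name: 'MUL R7,R1,R4', dest: 'R7', src: ['R1', 'R4'] },  // needs R1 and R4
  { name: 'AND R8,R2,R5', dest: 'R8', src: ['R2', 'R5'] }   // independent
];
console.log(scheduleSuperscalar(program));
// Cycle 1: ADD, SUB, AND issue together; Cycle 2: MUL waits for R1 and R4
```

Four instructions complete in two cycles (IPC of 2) instead of four, which is the essence of the superscalar win: the hardware finds the independence the program order hides.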

CISC vs RISC

CISC (Complex Instruction Set Computing) like x86 offers many specialized instructions of variable length, while RISC (Reduced Instruction Set Computing) like ARM uses simple, fixed-size instructions that execute in one cycle. Modern x86 CPUs internally convert CISC instructions to RISC-like micro-ops, blurring the distinction—the real difference now is legacy compatibility vs clean design.

CISC vs RISC Philosophy

┌─────────────────────┬────────────────────────┐
│ CISC (x86)          │ RISC (ARM/MIPS)        │
├─────────────────────┼────────────────────────┤
│ Variable length     │ Fixed length (32-bit)  │
│ 1-15 bytes          │                        │
├─────────────────────┼────────────────────────┤
│ Complex addressing  │ Simple addressing      │
│ Memory operands     │ Load/Store only        │
├─────────────────────┼────────────────────────┤
│ Fewer registers     │ Many registers (32+)   │
│ (8-16 architectural)│                        │
├─────────────────────┼────────────────────────┤
│ Microcode decode    │ Hardwired decode       │
└─────────────────────┴────────────────────────┘

Example: A = B + C (from memory)

CISC (1 instruction):    RISC (4 instructions):
ADD [A], [B], [C]        LDR R1, [B]
                         LDR R2, [C]
                         ADD R3, R1, R2
                         STR R3, [A]

Modern Reality:
┌────────────────────────────────────────────┐
│               x86 Processor                │
│  ┌──────────────────────────────────────┐  │
│  │    CISC Instructions → Decoder       │  │
│  └──────────────────┬───────────────────┘  │
│                     ▼                      │
│  ┌──────────────────────────────────────┐  │
│  │    RISC-like Micro-ops (μops)        │  │
│  │    (Internally looks like RISC!)     │  │
│  └──────────────────────────────────────┘  │
└────────────────────────────────────────────┘

Microcode

Microcode is a layer of low-level instructions stored in ROM that translates complex machine instructions into sequences of simpler operations the hardware can execute. It allows CPU designers to implement complex instructions in "software" and enables bug fixes via microcode updates—Intel regularly releases patches this way.

Microcode Architecture

High-Level View:
┌────────────────────────────────────────────────────┐
│               Machine Instruction                  │
│                  DIV EAX, EBX                      │
└────────────────────────┬───────────────────────────┘
                         ▼
┌────────────────────────────────────────────────────┐
│               Microcode Sequencer                  │
│  ┌──────────────────────────────────────────────┐  │
│  │               Microcode ROM                  │  │
│  │  ┌────────────────────────────────────────┐  │  │
│  │  │ DIV entry point: μop1, μop2, ... μop40 │  │  │
│  │  └────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────┬───────────────────────────┘
                         ▼
┌────────────────────────────────────────────────────┐
│            Micro-operations (μops)                 │
│    μop1:  Load dividend low                        │
│    μop2:  Load dividend high                       │
│    μop3:  Load divisor                             │
│    ...                                             │
│    μop40: Store quotient                           │
└────────────────────────────────────────────────────┘

Microcode Update Process:
BIOS/OS loads microcode patch at boot
┌─────────────────┐
│ Microcode RAM   │ ← Updated microcode
│ (patches ROM)   │
└─────────────────┘
// Conceptual microcode representation
const microcode = {
  // Complex CISC instruction broken into micro-ops
  'DIV': [
    { op: 'LOAD', dest: 'temp1', src: 'dividend_low' },
    { op: 'LOAD', dest: 'temp2', src: 'dividend_high' },
    { op: 'LOAD', dest: 'temp3', src: 'divisor' },
    { op: 'TEST', src: 'temp3', action: 'check_zero' },
    { op: 'BRANCH', cond: 'zero', target: 'div_by_zero_handler' },
    // ... 30+ more micro-ops for actual division algorithm
    { op: 'STORE', src: 'quotient', dest: 'EAX' },
    { op: 'STORE', src: 'remainder', dest: 'EDX' },
    { op: 'END' }
  ],
  // Simple instruction = single micro-op
  'ADD_REG': [
    { op: 'ADD', dest: 'reg1', src1: 'reg1', src2: 'reg2' }
  ]
};

console.log(`DIV requires ${microcode['DIV'].length} micro-ops`);
console.log(`ADD requires ${microcode['ADD_REG'].length} micro-op`);

CPU Sockets and Packaging

The CPU socket is the interface between the processor and motherboard, with different sockets supporting specific CPU families (Intel LGA 1700, AMD AM5). Modern CPUs use either LGA (Land Grid Array) with pins on the motherboard, or PGA (Pin Grid Array) with pins on the CPU—the package also includes the heat spreader and substrate connecting the die to external pins.

CPU Package Anatomy

Side View (Cross-section):
     Heat Spreader (IHS)
  ═══════════════════════
     Thermal Interface
  ───────────────────────
  ┌─────────────────────┐
  │      CPU Die(s)     │
  └─────────────────────┘
  ┌─────────────────────┐
  │      Substrate      │
  │    (PCB routing)    │
  └──┬────┬────┬────┬───┘
     Contact pads/pins

LGA (Land Grid Array)           PGA (Pin Grid Array)
Intel, newer AMD                Classic AMD
┌─────────────────┐             ┌─────────────────┐
│   CPU Package   │             │   CPU Package   │
│   ░░░░░░░░░░░   │◄ Contacts   │   ▼▼▼▼▼▼▼▼▼▼▼   │◄ Pins on CPU
└─────────────────┘             └─────────────────┘
┌─────────────────┐             ┌─────────────────┐
│   ▲▲▲▲▲▲▲▲▲▲▲   │◄ Pins on    │   ○○○○○○○○○○○   │◄ Holes in
│   Motherboard   │  board      │   Motherboard   │  socket
└─────────────────┘             └─────────────────┘

Common Desktop Sockets (2024):
┌────────────────┬──────┬──────────────────────┐
│ Socket         │ Type │ CPUs                 │
├────────────────┼──────┼──────────────────────┤
│ Intel LGA 1700 │ LGA  │ 12th-14th Gen Core   │
│ AMD AM5        │ LGA  │ Ryzen 7000+          │
│ AMD AM4        │ PGA  │ Ryzen 1000-5000      │
└────────────────┴──────┴──────────────────────┘
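The socket table above is essentially a compatibility lookup, which can be expressed directly in code. The pairings mirror the table; the `isCompatible` helper and the CPU family strings are illustrative assumptions, not a real API.

```javascript
// Illustrative socket compatibility lookup based on the table above.
// Family names are simplified labels, not exact product strings.
const sockets = {
  'LGA 1700': { type: 'LGA', vendor: 'Intel', cpus: ['12th Gen Core', '13th Gen Core', '14th Gen Core'] },
  'AM5':      { type: 'LGA', vendor: 'AMD',   cpus: ['Ryzen 7000'] },  // "Ryzen 7000+" per the table
  'AM4':      { type: 'PGA', vendor: 'AMD',   cpus: ['Ryzen 1000', 'Ryzen 2000', 'Ryzen 3000', 'Ryzen 5000'] }
};

function isCompatible(socketName, cpuFamily) {
  const socket = sockets[socketName];
  return !!socket && socket.cpus.includes(cpuFamily);
}

console.log(isCompatible('AM5', 'Ryzen 7000'));  // true
console.log(isCompatible('AM4', 'Ryzen 7000'));  // false: AM5 parts don't fit AM4
```

The physical incompatibility is the point: pin counts, pad layouts, and keying differ per socket, so a mismatched CPU simply cannot be seated.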