25 min read

CPU Architecture Deep Dive: From Basics to Modern Multi-Core Design

A comprehensive technical walkthrough of processor fundamentals. We bridge the gap between abstract instruction sets and physical silicon, exploring how features like branch prediction, hyper-threading, and dynamic frequency scaling power modern computing.

CPU basics

The CPU (Central Processing Unit) is the "brain" of the computer, executing billions of instructions per second through a fetch-decode-execute cycle: fetching instructions from memory, decoding which operation to perform, executing it using the ALU (Arithmetic Logic Unit), and storing the results. Modern CPUs have multiple cores (parallel processors) and a cache hierarchy (L1/L2/L3 for fast data access), run at gigahertz speeds, and use sophisticated techniques like pipelining, branch prediction, and out-of-order execution to maximize performance.

CPU ARCHITECTURE

┌────────────────────────────────────────────────────────────┐
│                        CPU PACKAGE                         │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              L3 CACHE (Shared)   32MB                │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐   │
│  │  CORE 0   │ │  CORE 1   │ │  CORE 2   │ │  CORE 3   │   │
│  │ L2 512KB  │ │ L2 512KB  │ │ L2 512KB  │ │ L2 512KB  │   │
│  │ L1D 32K   │ │ L1D 32K   │ │ L1D 32K   │ │ L1D 32K   │   │
│  │ L1I 32K   │ │ L1I 32K   │ │ L1I 32K   │ │ L1I 32K   │   │
│  │ ALU / FPU │ │ ALU / FPU │ │ ALU / FPU │ │ ALU / FPU │   │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘   │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Memory Controller │ PCIe Controller │ IMC           │  │
│  └──────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────┤
│  FETCH → DECODE → EXECUTE → WRITEBACK  (Pipeline)          │
│                                                            │
│  Clock: 5.0 GHz = 5 billion cycles per second              │
│  IPC: Instructions Per Cycle (efficiency metric)           │
│  Performance ≈ Cores × Clock Speed × IPC                   │
└────────────────────────────────────────────────────────────┘
// CPU concepts demonstration
const cpu = {
  model: 'AMD Ryzen 9 7950X',
  cores: 16,
  threads: 32,        // SMT (Simultaneous Multi-Threading)
  baseClock: 4.5,     // GHz
  boostClock: 5.7,    // GHz
  cache: {
    L1: '1MB total',  // Fastest, smallest
    L2: '16MB',       // Fast
    L3: '64MB'        // Slower, shared
  },
  tdp: 170,           // Watts (thermal design power)
  process: '5nm'      // Manufacturing node
};

// Cache latency comparison (approximate cycles)
const cacheLatency = {
  L1: 4,      // ~1 nanosecond
  L2: 12,     // ~3 nanoseconds
  L3: 40,     // ~10 nanoseconds
  RAM: 200    // ~50+ nanoseconds
};

// Why cache matters: Example
function calculateCacheImpact(dataSizeKB, cacheHitRate) {
  const L1AccessTime = 1;    // ns
  const RAMAccessTime = 50;  // ns
  const avgAccessTime =
    (cacheHitRate * L1AccessTime) +
    ((1 - cacheHitRate) * RAMAccessTime);
  console.log(`Cache hit rate: ${cacheHitRate * 100}%`);
  console.log(`Average access: ${avgAccessTime}ns`);
  // 90% hit rate: 5.9ns, 50% hit rate: 25.5ns (4x slower!)
}

Arithmetic Logic Unit (ALU)

The ALU is the mathematical brain of the CPU, performing all arithmetic operations (add, subtract, multiply, divide) and logical operations (AND, OR, NOT, XOR) on binary data. Every calculation your computer makes ultimately passes through the ALU.

ALU Block Diagram

    Operand A      Operand B
        │              │
        ▼              ▼
  ┌───────────────────────┐
  │          ALU          │
  │    +   -   ×   ÷      │
  │   AND  OR  NOT  XOR   │
  └───────────┬───────────┘
        ┌─────┴─────┐
        ▼           ▼
     Result      Flags (Zero, Carry, Overflow)
// Simple ALU simulation
class ALU {
  static operations = {
    ADD: (a, b) => ({ result: a + b, zero: (a + b) === 0 }),
    SUB: (a, b) => ({ result: a - b, zero: (a - b) === 0 }),
    AND: (a, b) => ({ result: a & b, zero: (a & b) === 0 }),
    OR:  (a, b) => ({ result: a | b, zero: (a | b) === 0 }),
    XOR: (a, b) => ({ result: a ^ b, zero: (a ^ b) === 0 }),
    NOT: (a) => ({ result: ~a, zero: (~a) === 0 })
  };

  execute(op, a, b = 0) {
    return ALU.operations[op](a, b);
  }
}

const alu = new ALU();
console.log(alu.execute('ADD', 5, 3));            // { result: 8, zero: false }
console.log(alu.execute('AND', 0b1100, 0b1010));  // { result: 8 (0b1000), zero: false }

Control Unit

The Control Unit is the CPU's traffic controller, directing all operations by interpreting instructions and sending signals to coordinate the ALU, registers, and memory. It orchestrates the fetch-decode-execute cycle that makes program execution possible.

Control Unit Operation

┌─────────────────────────────────────────┐
│              CONTROL UNIT               │
│  ┌───────────────────────────────────┐  │
│  │        Instruction Decoder        │  │
│  └─────────────────┬─────────────────┘  │
│          ┌─────────┼─────────┐          │
│          ▼         ▼         ▼          │
│     ┌─────────┬─────────┬─────────┐     │
│     │ Timing  │ Control │ Sequenc-│     │
│     │ Signals │ Lines   │ ing     │     │
│     └────┬────┴────┬────┴────┬────┘     │
└──────────┼─────────┼─────────┼──────────┘
           ▼         ▼         ▼
        Memory      ALU    Registers
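To make the decoder's job concrete, here is a minimal sketch of a control unit that maps opcodes to the control lines it would assert each cycle. The opcode names and signal fields (memRead, memWrite, aluOp, regWrite) are illustrative assumptions, not taken from any real ISA.

```javascript
// Hypothetical control unit sketch: opcode → control signals.
// Signal names are illustrative, not from a real processor.
class ControlUnit {
  static signalTable = {
    LOAD:  { memRead: true,  memWrite: false, aluOp: null,  regWrite: true  },
    STORE: { memRead: false, memWrite: true,  aluOp: null,  regWrite: false },
    ADD:   { memRead: false, memWrite: false, aluOp: 'ADD', regWrite: true  },
    JMP:   { memRead: false, memWrite: false, aluOp: null,  regWrite: false }
  };

  // Decode: look up which control lines to assert for this instruction
  decode(opcode) {
    const signals = ControlUnit.signalTable[opcode];
    if (!signals) throw new Error(`Unknown opcode: ${opcode}`);
    return signals;
  }
}

const cu = new ControlUnit();
console.log(cu.decode('ADD'));  // asserts the ALU-op and register-write lines
```

Real control units generate dozens of such signals per cycle (bus enables, register selects, ALU function codes), but the table-lookup structure is the same idea.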

Registers

Registers are the fastest storage locations in a computer, built directly into the CPU and accessed in a single clock cycle. Common registers include the Program Counter (PC), Accumulator, Stack Pointer, and general-purpose registers for temporary data storage.

// CPU Register simulation
class CPURegisters {
  constructor() {
    // Common x86-like registers (simplified)
    this.registers = {
      // General purpose (32-bit represented as numbers)
      EAX: 0,  // Accumulator
      EBX: 0,  // Base
      ECX: 0,  // Counter
      EDX: 0,  // Data
      // Special purpose
      EIP: 0,  // Instruction Pointer (Program Counter)
      ESP: 0,  // Stack Pointer
      EBP: 0,  // Base Pointer
      // Flags register
      FLAGS: { zero: false, carry: false, sign: false, overflow: false }
    };
  }

  get(name) { return this.registers[name]; }
  set(name, value) { this.registers[name] = value; }
  incrementIP(bytes = 1) { this.registers.EIP += bytes; }
}
x86 Register Layout

64-bit │ RAX                     │
───────┼────────────┬────────────┤
32-bit │            │ EAX        │
───────┼────────────┼────┬───────┤
16-bit │            │    │ AX    │
───────┼────────────┼────┼───┬───┤
 8-bit │            │    │AH │AL │
       └────────────┴────┴───┴───┘

Instruction Cycle (Fetch-Decode-Execute)

Every instruction executes through three fundamental phases: Fetch retrieves the instruction from memory, Decode interprets what operation to perform, and Execute carries out the operation. This cycle repeats billions of times per second in modern CPUs.

Instruction Cycle (Von Neumann)

      ┌───────────────────────────────────┐
      ▼                                   │
┌──────────┐    ┌──────────┐    ┌──────────┐
│  FETCH   │───▶│  DECODE  │───▶│ EXECUTE  │
│          │    │          │    │          │
│ Get next │    │Interpret │    │ Perform  │
│ instruct.│    │ opcode & │    │   the    │
│ from RAM │    │ operands │    │operation │
└──────────┘    └──────────┘    └──────────┘
      │                               │
      └────────── Memory Bus ─────────┘
// Simplified instruction cycle simulation
class SimpleCPU {
  constructor() {
    this.memory = new Uint8Array(256);
    this.PC = 0;         // Program Counter
    this.ACC = 0;        // Accumulator
    this.running = true;
  }

  // Opcodes: 0=HALT, 1=LOAD, 2=ADD, 3=STORE, 4=JUMP
  cycle() {
    // FETCH
    const instruction = this.memory[this.PC];
    const operand = this.memory[this.PC + 1];

    // DECODE & EXECUTE
    switch (instruction) {
      case 0: this.running = false; break;              // HALT
      case 1: this.ACC = this.memory[operand]; break;   // LOAD
      case 2: this.ACC += this.memory[operand]; break;  // ADD
      case 3: this.memory[operand] = this.ACC; break;   // STORE
      case 4: this.PC = operand - 2; break;             // JUMP
    }

    this.PC += 2;  // Move to next instruction
    return { instruction, operand, ACC: this.ACC };
  }

  run() {
    while (this.running) console.log(this.cycle());
  }
}

Clock Speed and Cycles

The CPU clock is an electronic oscillator that generates timing pulses, measured in Hertz (cycles per second). Modern CPUs run at several GHz (billions of cycles per second), but different instructions may require different numbers of clock cycles to complete.

Clock Signal

Voltage
  1 │ ┌───┐   ┌───┐   ┌───┐   ┌───┐
    │ │   │   │   │   │   │   │   │
  0 │─┘   └───┘   └───┘   └───┘   └───
    └──────────────────────────────────── Time
      │◄─────►│
       1 cycle

At 4 GHz: 1 cycle = 0.25 nanoseconds (light travels ~7.5 cm)
// Clock speed and instruction timing
const clockSpeedGHz = 4.0;
const cyclesPerSecond = clockSpeedGHz * 1e9;
const secondsPerCycle = 1 / cyclesPerSecond;

const instructionCycles = {
  ADD_REG: 1,     // Add registers: 1 cycle
  MUL_REG: 3,     // Multiply: 3 cycles
  LOAD_MEM: 4,    // Load from L1 cache: ~4 cycles
  LOAD_RAM: 100,  // Load from RAM: ~100 cycles
  DIV: 20         // Division: ~20 cycles
};

Object.entries(instructionCycles).forEach(([op, cycles]) => {
  const timeNs = (cycles * secondsPerCycle * 1e9).toFixed(3);
  console.log(`${op}: ${cycles} cycles = ${timeNs} ns`);
});

Word Size

Word size refers to the number of bits a CPU processes in one operation, determining the size of registers, memory addressing, and data bus width. Common word sizes have evolved from 8-bit (early microprocessors) to 16-bit, 32-bit, and today's 64-bit architectures.

Word Size Evolution

 8-bit: 256 values (0-255)
16-bit: 65,536 values (0-65,535)
32-bit: 4.29 billion values (0-4,294,967,295), 4 GB addressable
64-bit: 18.4 quintillion values, 16 EB addressable
// Word size implications
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB'];
  let i = 0;
  let size = Number(bytes);
  while (size >= 1024 && i < units.length - 1) {
    size /= 1024;
    i++;
  }
  return `${size.toFixed(2)} ${units[i]}`;
}

const wordSizes = [8, 16, 32, 64];
wordSizes.forEach(bits => {
  const maxValue = BigInt(2) ** BigInt(bits) - BigInt(1);
  const addressableMemory = BigInt(2) ** BigInt(bits);
  console.log(`${bits}-bit:`);
  console.log(`  Max unsigned value: ${maxValue.toLocaleString()}`);
  console.log(`  Addressable memory: ${formatBytes(addressableMemory)}`);
});

Instruction Sets Overview

An Instruction Set Architecture (ISA) defines the machine language instructions a CPU understands. The two main philosophies are CISC (Complex Instruction Set Computing, like x86) with many specialized instructions, and RISC (Reduced Instruction Set Computing, like ARM) with simpler, uniform instructions.

CISC vs RISC Philosophy

CISC (x86)                      RISC (ARM)
──────────────────────────      ──────────────────────────
Complex instructions            Simple instructions
Variable length (1-15 bytes)    Fixed length (4 bytes)
Memory-to-memory ops            Load/Store architecture
Fewer registers                 Many registers
Hardware decoding               Simpler decoding

Example: Add memory to register

CISC (1 instruction):           RISC (2 instructions):
ADD EAX, [address]              LOAD R1, [address]
                                ADD  R0, R0, R1   (result in R0)
// Simple instruction set simulation
const instructionSet = {
  // RISC-like instructions
  NOP:   { opcode: 0x00, cycles: 1, desc: "No operation" },
  LOAD:  { opcode: 0x01, cycles: 2, desc: "Load register from memory" },
  STORE: { opcode: 0x02, cycles: 2, desc: "Store register to memory" },
  ADD:   { opcode: 0x10, cycles: 1, desc: "Add two registers" },
  SUB:   { opcode: 0x11, cycles: 1, desc: "Subtract registers" },
  MUL:   { opcode: 0x12, cycles: 3, desc: "Multiply registers" },
  AND:   { opcode: 0x20, cycles: 1, desc: "Bitwise AND" },
  OR:    { opcode: 0x21, cycles: 1, desc: "Bitwise OR" },
  JMP:   { opcode: 0x30, cycles: 1, desc: "Unconditional jump" },
  JZ:    { opcode: 0x31, cycles: 1, desc: "Jump if zero" },
  CALL:  { opcode: 0x40, cycles: 2, desc: "Call subroutine" },
  RET:   { opcode: 0x41, cycles: 2, desc: "Return from subroutine" },
  HALT:  { opcode: 0xFF, cycles: 1, desc: "Stop execution" }
};

// Print instruction set table
console.log("Opcode │ Mnemonic │ Cycles │ Description");
console.log("───────┼──────────┼────────┼─────────────────────");
Object.entries(instructionSet).forEach(([name, info]) => {
  const opcode = `0x${info.opcode.toString(16).padStart(2, '0')}`;
  console.log(` ${opcode.padEnd(6)}│ ${name.padEnd(9)}│ ${String(info.cycles).padEnd(7)}│ ${info.desc}`);
});

Pipelining

Pipelining divides instruction execution into stages (typically Fetch, Decode, Execute, Memory, Writeback), allowing multiple instructions to be in-flight simultaneously—like an assembly line. A 5-stage pipeline can theoretically achieve 5x throughput, but hazards (data dependencies, branches) can stall the pipeline and reduce efficiency.

Classic 5-Stage RISC Pipeline

Time →    1    2    3    4    5    6    7    8
        ┌────┬────┬────┬────┬────┐
Inst 1  │ IF │ ID │ EX │MEM │ WB │
        └────┴────┴────┴────┴────┘
             ┌────┬────┬────┬────┬────┐
Inst 2       │ IF │ ID │ EX │MEM │ WB │
             └────┴────┴────┴────┴────┘
                  ┌────┬────┬────┬────┬────┐
Inst 3            │ IF │ ID │ EX │MEM │ WB │
                  └────┴────┴────┴────┴────┘
                       ┌────┬────┬────┬────┬────┐
Inst 4                 │ IF │ ID │ EX │MEM │ WB │
                       └────┴────┴────┴────┴────┘

Without pipeline: 4 instructions × 5 cycles = 20 cycles
With pipeline:    5 + 3 = 8 cycles (fill, then one completion per cycle)

IF = Instruction Fetch    MEM = Memory Access
ID = Instruction Decode   WB  = Write Back
EX = Execute
// Pipeline simulation
class Pipeline {
  constructor(stages = ['IF', 'ID', 'EX', 'MEM', 'WB']) {
    this.stages = stages;
    this.pipe = new Array(stages.length).fill(null);
    this.cycle = 0;
    this.completed = [];
  }

  tick(newInstruction = null) {
    this.cycle++;
    // Instruction leaving pipeline
    if (this.pipe[this.pipe.length - 1]) {
      this.completed.push(this.pipe[this.pipe.length - 1]);
    }
    // Shift all instructions forward
    for (let i = this.pipe.length - 1; i > 0; i--) {
      this.pipe[i] = this.pipe[i - 1];
    }
    this.pipe[0] = newInstruction;
    return this.getState();
  }

  getState() {
    return this.stages.map((stage, i) =>
      `${stage}: ${this.pipe[i] || '---'}`
    ).join(' | ');
  }
}

const pipe = new Pipeline();
['ADD', 'SUB', 'MUL', 'LOAD', 'STORE', null, null, null, null].forEach((inst, i) => {
  console.log(`Cycle ${i + 1}: ${pipe.tick(inst)}`);
});

Branch Prediction Basics

Branch prediction allows the CPU to speculatively fetch and execute instructions before knowing a branch's outcome, avoiding pipeline stalls. Modern predictors achieve 95%+ accuracy using techniques like branch history tables and pattern recognition—a misprediction costs 10-20 cycles to flush and restart the pipeline.

Branch Prediction Problem

Without prediction:
  ┌────┬────┬────┐
  │ IF │ ID │ EX │──► Branch result known
  └────┴────┴────┘
                  ┌────┬────┬────┐
                  │ IF │ ID │ EX │   3 cycles wasted!
                  └────┴────┴────┘

With prediction (guess correctly):
  ┌────┬────┬────┐
  │ IF │ ID │ EX │
  └────┴────┴────┘
       ┌────┬────┬────┐
       │ IF │ ID │ EX │   Speculative, confirmed!
       └────┴────┴────┘
            ┌────┬────┬────┐
            │ IF │ ID │ EX │   No stall!
            └────┴────┴────┘

2-Bit Saturating Counter (common predictor):

          Predict Taken                         Predict Not Taken
┌────────┐ Not Taken ┌────────┐ Not Taken ┌────────┐ Not Taken ┌────────┐
│ Strong │ ────────► │  Weak  │ ────────► │  Weak  │ ────────► │ Strong │
│ Taken  │ ◄──────── │ Taken  │ ◄──────── │Not Tkn │ ◄──────── │Not Tkn │
└────────┘   Taken   └────────┘   Taken   └────────┘   Taken   └────────┘
    11                   10                   01                   00
// 2-bit saturating counter branch predictor
class BranchPredictor {
  constructor(tableSize = 256) {
    // 2-bit counters: 0,1 = not taken; 2,3 = taken
    this.table = new Uint8Array(tableSize).fill(2);  // Start weakly taken
    this.stats = { correct: 0, incorrect: 0 };
  }

  predict(address) {
    const index = address % this.table.length;
    return this.table[index] >= 2;  // true = taken
  }

  update(address, actuallyTaken) {
    const index = address % this.table.length;
    const predicted = this.table[index] >= 2;
    // Update counter
    if (actuallyTaken && this.table[index] < 3) this.table[index]++;
    if (!actuallyTaken && this.table[index] > 0) this.table[index]--;
    // Track accuracy
    if (predicted === actuallyTaken) this.stats.correct++;
    else this.stats.incorrect++;
  }

  get accuracy() {
    const total = this.stats.correct + this.stats.incorrect;
    return total ? (this.stats.correct / total * 100).toFixed(1) : 0;
  }
}

// Simulate a loop (predictable pattern)
const predictor = new BranchPredictor();
const loopAddr = 0x1000;
for (let i = 0; i < 1000; i++) {
  for (let j = 0; j < 10; j++) {
    const taken = j < 9;  // Loop 9 times, then exit
    predictor.update(loopAddr, taken);
  }
}
console.log(`Prediction accuracy: ${predictor.accuracy}%`);  // ~90%

Cache Memory (L1, L2, L3)

CPU caches are small, fast SRAM memories that store frequently accessed data and instructions, dramatically reducing average memory access time. L1 is fastest but smallest (~32KB, ~4 cycles), L2 is larger and slower (~256KB-1MB, ~12 cycles), and L3 is shared across cores (~8-64MB, ~40 cycles)—compared to ~100+ cycles for main memory.

Cache Hierarchy in Modern CPU

┌─────────────────────────────────────────────┐
│                  CPU Die                    │
│   ┌───────────┐          ┌───────────┐      │
│   │  Core 0   │          │  Core 1   │      │
│   │ ┌───────┐ │          │ ┌───────┐ │      │
│   │ │ L1-I  │ │  32KB    │ │ L1-I  │ │      │
│   │ │ 4 cyc │ │ per core │ │ 4 cyc │ │      │
│   │ ├───────┤ │          │ ├───────┤ │      │
│   │ │ L1-D  │ │  32KB    │ │ L1-D  │ │      │
│   │ │ 4 cyc │ │ per core │ │ 4 cyc │ │      │
│   │ └───┬───┘ │          │ └───┬───┘ │      │
│   │ ┌───┴───┐ │ 256KB-1MB│ ┌───┴───┐ │      │
│   │ │  L2   │ │ per core │ │  L2   │ │      │
│   │ │12 cyc │ │          │ │12 cyc │ │      │
│   │ └───┬───┘ │          │ └───┬───┘ │      │
│   └─────┼─────┘          └─────┼─────┘      │
│         └───────────┬──────────┘            │
│   ┌─────────────────┴─────────────────┐     │
│   │             L3 Cache              │     │
│   │        8-64MB, ~40 cycles         │     │
│   │       (Shared by all cores)       │     │
│   └─────────────────┬─────────────────┘     │
└─────────────────────┼───────────────────────┘
              ┌───────┴───────┐
              │  Main Memory  │
              │   DDR4/DDR5   │
              │ ~100+ cycles  │
              └───────────────┘
// Cache hit/miss simulation
class CacheSimulator {
  constructor(cacheLines = 64, lineSize = 64) {
    this.cacheLines = cacheLines;
    this.lineSize = lineSize;
    this.cache = new Map();
    this.stats = { hits: 0, misses: 0 };
  }

  access(address) {
    const lineAddr = Math.floor(address / this.lineSize) * this.lineSize;
    const index = (lineAddr / this.lineSize) % this.cacheLines;

    if (this.cache.get(index) === lineAddr) {
      this.stats.hits++;
      return { hit: true, latency: 4 };
    } else {
      this.stats.misses++;
      this.cache.set(index, lineAddr);
      return { hit: false, latency: 100 };
    }
  }

  get hitRate() {
    const total = this.stats.hits + this.stats.misses;
    return (this.stats.hits / total * 100).toFixed(1);
  }
}

// Sequential access (cache-friendly)
const cache = new CacheSimulator();
for (let i = 0; i < 1000; i++) {
  cache.access(i * 4);  // Sequential, reuses cache lines
}
console.log(`Sequential hit rate: ${cache.hitRate}%`);

// Random access (cache-unfriendly)
const cache2 = new CacheSimulator();
for (let i = 0; i < 1000; i++) {
  cache2.access(Math.floor(Math.random() * 100000));
}
console.log(`Random hit rate: ${cache2.hitRate}%`);

Cache Coherence Introduction

In multi-core systems, each core has its own cache, creating the problem of keeping cached copies of the same memory location consistent. The MESI protocol (Modified, Exclusive, Shared, Invalid) is widely used, where cores snoop the bus to detect when they need to update or invalidate their cached copies.

MESI Protocol States

┌───────────┬────────┬────────┬─────────┬───────────────────────┐
│ State     │ Valid? │ Dirty? │ Shared? │ Description           │
├───────────┼────────┼────────┼─────────┼───────────────────────┤
│ Modified  │ Yes    │ Yes    │ No      │ Only copy, changed    │
│ Exclusive │ Yes    │ No     │ No      │ Only copy, clean      │
│ Shared    │ Yes    │ No     │ Yes     │ Multiple clean copies │
│ Invalid   │ No     │ -      │ -       │ Not in cache          │
└───────────┴────────┴────────┴─────────┴───────────────────────┘

Example: Core 0 writes, Core 1 reads same address

Initial:  Core0: Invalid    Core1: Invalid
Step 1:   Core0 reads address X
          Core0: Exclusive  Core1: Invalid
Step 2:   Core0 writes to X
          Core0: Modified   Core1: Invalid
Step 3:   Core1 wants to read X
          - Core0 snoops request, provides data
          - Core0: Shared   Core1: Shared
          - Value written back to memory
// Simplified MESI state machine
class CacheLine {
  constructor() {
    this.state = 'Invalid';
    this.data = null;
  }
}

class MESISimulator {
  constructor() {
    this.caches = [new Map(), new Map()];  // 2 cores
    this.memory = new Map();
  }

  read(core, address) {
    const otherCore = 1 - core;
    const line = this.caches[core].get(address) || new CacheLine();
    const otherLine = this.caches[otherCore].get(address);

    if (line.state !== 'Invalid') {
      return { state: line.state, data: line.data, event: 'Hit' };
    }

    // Cache miss - check other core
    if (otherLine?.state === 'Modified') {
      // Snoop: get data from other core
      this.memory.set(address, otherLine.data);
      otherLine.state = 'Shared';
      line.state = 'Shared';
      line.data = otherLine.data;
    } else if (otherLine?.state === 'Exclusive' || otherLine?.state === 'Shared') {
      otherLine.state = 'Shared';
      line.state = 'Shared';
      line.data = this.memory.get(address);
    } else {
      line.state = 'Exclusive';
      line.data = this.memory.get(address);
    }

    this.caches[core].set(address, line);
    return { state: line.state, data: line.data, event: 'Miss' };
  }

  write(core, address, value) {
    const otherCore = 1 - core;
    const line = this.caches[core].get(address) || new CacheLine();
    const otherLine = this.caches[otherCore].get(address);

    // Invalidate other core's copy
    if (otherLine) otherLine.state = 'Invalid';

    line.state = 'Modified';
    line.data = value;
    this.caches[core].set(address, line);
    return { state: 'Modified', invalidated: !!otherLine };
  }
}

Superscalar Execution

Superscalar processors can execute multiple instructions per clock cycle by having multiple execution units (ALUs, FPUs, load/store units) operating in parallel. The CPU analyzes instruction dependencies and dispatches independent instructions simultaneously, achieving IPC (Instructions Per Cycle) greater than 1—modern CPUs can sustain 4-6 IPC.

Superscalar Execution (Multiple Issue)

Scalar (1 instruction/cycle):
Cycle:   1     2     3     4     5     6
        ADD   SUB   MUL   LOAD  AND   OR

Superscalar (4-wide, can issue 4/cycle if independent):
Cycle:        1                2
      ┌─────────────┐  ┌─────────────┐
      │ ADD (ALU 0) │  │ AND (ALU 0) │
      │ SUB (ALU 1) │  │ OR  (ALU 1) │
      │ MUL (FPU)   │  │ ...         │
      │ LOAD (LSU)  │  │ ...         │
      └─────────────┘  └─────────────┘

Execution Units in Modern CPU:
┌───────────────────────────────────────────────┐
│            Instruction Scheduler              │
└──┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬──┘
   ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ALU 0│ALU 1│ALU 2│ALU 3│FPU 0│FPU 1│Load │Store│
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Modern CPUs: 8-12 execution units
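The dependency analysis described above can be sketched in a few lines. This is a deliberately simplified scheduler, not a model of any real CPU: it issues up to `width` instructions per cycle, holding back any instruction whose source registers are written by an instruction issued earlier that same cycle (later independent instructions are allowed to pass it, as an out-of-order machine would). The instruction format is an assumption for illustration.

```javascript
// Simplified 4-wide issue sketch: greedy, dependency-aware dispatch.
// Real schedulers also track execution-unit availability and latencies.
function scheduleSuperscalar(instructions, width = 4) {
  const cycles = [];
  let pending = [...instructions];
  while (pending.length > 0) {
    const issued = [];
    const writtenThisCycle = new Set();  // dests of instructions issued this cycle
    const remaining = [];
    for (const inst of pending) {
      const dependsOnEarlier = inst.src.some(r => writtenThisCycle.has(r));
      if (!dependsOnEarlier && issued.length < width) {
        issued.push(inst.name);
        writtenThisCycle.add(inst.dest);
      } else {
        remaining.push(inst);  // stalls to a later cycle
      }
    }
    cycles.push(issued);
    pending = remaining;
  }
  return cycles;
}

const program = [
  { name: 'ADD R1,R2,R3', dest: 'R1', src: ['R2', 'R3'] },
  { name: 'SUB R4,R5,R6', dest: 'R4', src: ['R5', 'R6'] },
  { name: 'MUL R7,R1,R4', dest: 'R7', src: ['R1', 'R4'] },  // needs R1 and R4
  { name: 'AND R8,R2,R5', dest: 'R8', src: ['R2', 'R5'] }   // independent
];
console.log(scheduleSuperscalar(program));
// Cycle 1: ADD, SUB, AND issue together; Cycle 2: MUL waits for R1 and R4
```

Four instructions complete in two cycles (IPC of 2) instead of four, which is the essence of the superscalar win: the hardware finds the independence the program order hides.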

CISC vs RISC

CISC (Complex Instruction Set Computing) like x86 offers many specialized instructions of variable length, while RISC (Reduced Instruction Set Computing) like ARM uses simple, fixed-size instructions that execute in one cycle. Modern x86 CPUs internally convert CISC instructions to RISC-like micro-ops, blurring the distinction—the real difference now is legacy compatibility vs clean design.

CISC vs RISC Philosophy

┌─────────────────────┬────────────────────────┐
│ CISC (x86)          │ RISC (ARM/MIPS)        │
├─────────────────────┼────────────────────────┤
│ Variable length     │ Fixed length (32-bit)  │
│ 1-15 bytes          │                        │
├─────────────────────┼────────────────────────┤
│ Complex addressing  │ Simple addressing      │
│ Memory operands     │ Load/Store only        │
├─────────────────────┼────────────────────────┤
│ Fewer registers     │ Many registers (32+)   │
│ (8-16 architectural)│                        │
├─────────────────────┼────────────────────────┤
│ Microcode decode    │ Hardwired decode       │
└─────────────────────┴────────────────────────┘

Example: A = B + C (from memory)

CISC (1 instruction):    RISC (4 instructions):
ADD [A], [B], [C]        LDR R1, [B]
                         LDR R2, [C]
                         ADD R3, R1, R2
                         STR R3, [A]

Modern Reality:
┌────────────────────────────────────────────┐
│               x86 Processor                │
│  ┌──────────────────────────────────────┐  │
│  │    CISC Instructions → Decoder       │  │
│  └──────────────────┬───────────────────┘  │
│                     ▼                      │
│  ┌──────────────────────────────────────┐  │
│  │    RISC-like Micro-ops (μops)        │  │
│  │    (Internally looks like RISC!)     │  │
│  └──────────────────────────────────────┘  │
└────────────────────────────────────────────┘

Microcode

Microcode is a layer of low-level instructions stored in ROM that translates complex machine instructions into sequences of simpler operations the hardware can execute. It allows CPU designers to implement complex instructions in "software" and enables bug fixes via microcode updates—Intel regularly releases patches this way.

Microcode Architecture

High-Level View:
┌────────────────────────────────────────────────────┐
│               Machine Instruction                  │
│                  DIV EAX, EBX                      │
└────────────────────────┬───────────────────────────┘
                         ▼
┌────────────────────────────────────────────────────┐
│               Microcode Sequencer                  │
│  ┌──────────────────────────────────────────────┐  │
│  │               Microcode ROM                  │  │
│  │  ┌────────────────────────────────────────┐  │  │
│  │  │ DIV entry point: μop1, μop2, ... μop40 │  │  │
│  │  └────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────┬───────────────────────────┘
                         ▼
┌────────────────────────────────────────────────────┐
│            Micro-operations (μops)                 │
│    μop1:  Load dividend low                        │
│    μop2:  Load dividend high                       │
│    μop3:  Load divisor                             │
│    ...                                             │
│    μop40: Store quotient                           │
└────────────────────────────────────────────────────┘

Microcode Update Process:
BIOS/OS loads microcode patch at boot
┌─────────────────┐
│ Microcode RAM   │ ← Updated microcode
│ (patches ROM)   │
└─────────────────┘
// Conceptual microcode representation
const microcode = {
  // Complex CISC instruction broken into micro-ops
  'DIV': [
    { op: 'LOAD', dest: 'temp1', src: 'dividend_low' },
    { op: 'LOAD', dest: 'temp2', src: 'dividend_high' },
    { op: 'LOAD', dest: 'temp3', src: 'divisor' },
    { op: 'TEST', src: 'temp3', action: 'check_zero' },
    { op: 'BRANCH', cond: 'zero', target: 'div_by_zero_handler' },
    // ... 30+ more micro-ops for actual division algorithm
    { op: 'STORE', src: 'quotient', dest: 'EAX' },
    { op: 'STORE', src: 'remainder', dest: 'EDX' },
    { op: 'END' }
  ],
  // Simple instruction = single micro-op
  'ADD_REG': [
    { op: 'ADD', dest: 'reg1', src1: 'reg1', src2: 'reg2' }
  ]
};

console.log(`DIV requires ${microcode['DIV'].length} micro-ops`);
console.log(`ADD requires ${microcode['ADD_REG'].length} micro-op`);

CPU Sockets and Packaging

The CPU socket is the interface between the processor and motherboard, with different sockets supporting specific CPU families (Intel LGA 1700, AMD AM5). Modern CPUs use either LGA (Land Grid Array) with pins on the motherboard, or PGA (Pin Grid Array) with pins on the CPU—the package also includes the heat spreader and substrate connecting the die to external pins.

CPU Package Anatomy

Side View (Cross-section):
     Heat Spreader (IHS)
  ═══════════════════════
     Thermal Interface
  ───────────────────────
  ┌─────────────────────┐
  │      CPU Die(s)     │
  └─────────────────────┘
  ┌─────────────────────┐
  │      Substrate      │
  │    (PCB routing)    │
  └──┬────┬────┬────┬───┘
     Contact pads/pins

LGA (Land Grid Array)           PGA (Pin Grid Array)
Intel, newer AMD                Classic AMD
┌─────────────────┐             ┌─────────────────┐
│   CPU Package   │             │   CPU Package   │
│   ░░░░░░░░░░░   │◄ Contacts   │   ▼▼▼▼▼▼▼▼▼▼▼   │◄ Pins on CPU
└─────────────────┘             └─────────────────┘
┌─────────────────┐             ┌─────────────────┐
│   ▲▲▲▲▲▲▲▲▲▲▲   │◄ Pins on    │   ○○○○○○○○○○○   │◄ Holes in
│   Motherboard   │  board      │   Motherboard   │  socket
└─────────────────┘             └─────────────────┘

Common Desktop Sockets (2024):
┌────────────────┬──────┬──────────────────────┐
│ Socket         │ Type │ CPUs                 │
├────────────────┼──────┼──────────────────────┤
│ Intel LGA 1700 │ LGA  │ 12th-14th Gen Core   │
│ AMD AM5        │ LGA  │ Ryzen 7000+          │
│ AMD AM4        │ PGA  │ Ryzen 1000-5000      │
└────────────────┴──────┴──────────────────────┘
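The socket table above is essentially a compatibility lookup, which can be expressed directly in code. The pairings mirror the table; the `isCompatible` helper and the CPU family strings are illustrative assumptions, not a real API.

```javascript
// Illustrative socket compatibility lookup based on the table above.
// Family names are simplified labels, not exact product strings.
const sockets = {
  'LGA 1700': { type: 'LGA', vendor: 'Intel', cpus: ['12th Gen Core', '13th Gen Core', '14th Gen Core'] },
  'AM5':      { type: 'LGA', vendor: 'AMD',   cpus: ['Ryzen 7000'] },  // "Ryzen 7000+" per the table
  'AM4':      { type: 'PGA', vendor: 'AMD',   cpus: ['Ryzen 1000', 'Ryzen 2000', 'Ryzen 3000', 'Ryzen 5000'] }
};

function isCompatible(socketName, cpuFamily) {
  const socket = sockets[socketName];
  return !!socket && socket.cpus.includes(cpuFamily);
}

console.log(isCompatible('AM5', 'Ryzen 7000'));  // true
console.log(isCompatible('AM4', 'Ryzen 7000'));  // false: AM5 parts don't fit AM4
```

The physical incompatibility is the point: pin counts, pad layouts, and keying differ per socket, so a mismatched CPU simply cannot be seated.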