Memory Systems Architecture: From DDR5 to Cache Coherence
Latency is the bottleneck of modern computing. This article deconstructs the memory subsystem, tracing the data path from physical DRAM modules through the complexities of L1-L3 caching, coherence protocols, and advanced memory controllers.
RAM Fundamentals
RAM (Random Access Memory) is volatile high-speed memory that stores currently running programs and data, losing all contents when power is removed. It's called "random access" because any memory location can be accessed in constant time, unlike sequential storage. Modern DDR5 (Double Data Rate 5) operates at 4800-8000+ MT/s (megatransfers per second), with typical desktop systems having 16-64GB, structured as DIMMs (Dual Inline Memory Modules) that connect to the CPU's memory controller.
RAM (DIMM Module)

  ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗
  ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║
  ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║  ← Chips
  ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║
  ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║
  ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝
  ┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴
            ↑ 288 edge-connector pins (DDR5)

MEMORY HIERARCHY (Speed vs Capacity tradeoff):

  Speed           ┌─────┐
    ▲   Fastest   │ L1  │ ← 64KB,  ~1ns
    │             ├─────┤
    │             │ L2  │ ← 512KB, ~3ns
    │             ├─────┤
    │             │ L3  │ ← 32MB,  ~10ns
    │             ├─────┤
    │             │ RAM │ ← 32GB,  ~50ns
    │             ├─────┤
    │             │ SSD │ ← 1TB,   ~100μs
    │             ├─────┤
    ▼   Slowest   │ HDD │ ← 4TB,   ~10ms
                  └─────┘
        Small ◄── Capacity ──► Large
// RAM concepts and calculations
const ramModule = {
  type: 'DDR5',
  capacity: 32,      // GB per module
  speed: 6000,       // MT/s (MegaTransfers per second)
  voltage: 1.1,      // Volts (DDR5 is more power efficient)
  cas_latency: 36,   // CL36 (Column Access Strobe)
  formFactor: 'DIMM'
};

// Calculate theoretical bandwidth
function calculateBandwidth(speedMTs, busWidth = 64) {
  // DDR5 has two 32-bit channels per module (64 bits total)
  const bytesPerTransfer = busWidth / 8; // 8 bytes
  const bandwidthGBs = (speedMTs * 1e6 * bytesPerTransfer) / 1e9;
  return bandwidthGBs;
}
console.log(`Bandwidth: ${calculateBandwidth(6000)}GB/s`); // ~48GB/s per module

// Memory latency calculation
function calculateLatencyNs(speedMTs, casLatency) {
  const clockSpeed = speedMTs / 2; // DDR = double data rate
  const nsPerCycle = 1e9 / (clockSpeed * 1e6);
  return casLatency * nsPerCycle;
}
console.log(`True latency: ${calculateLatencyNs(6000, 36).toFixed(1)}ns`); // ~12ns

// Why dual-channel matters
const singleChannel = calculateBandwidth(6000);
const dualChannel = calculateBandwidth(6000) * 2;
console.log(`Single: ${singleChannel}GB/s, Dual: ${dualChannel}GB/s`);
Memory Hierarchy
Computer memory is organized in a hierarchy trading off speed, size, and cost—registers are fastest but smallest, followed by cache, RAM, and storage. The CPU checks each level sequentially, with most accesses hitting faster, smaller levels due to locality of reference.
Memory Hierarchy Pyramid

 ▲ Faster, Smaller, More Expensive (per byte)
 │
 │             ╱╲
 │            ╱  ╲           Registers:   ~1 cycle,          ~KB
 │           ╱────╲
 │          ╱  L1  ╲         L1 Cache:    ~4 cycles,         32-64KB
 │         ╱────────╲
 │        ╱    L2    ╲       L2 Cache:    ~10 cycles,        256KB-1MB
 │       ╱────────────╲
 │      ╱      L3      ╲     L3 Cache:    ~40 cycles,        8-64MB
 │     ╱────────────────╲
 │    ╱   Main Memory    ╲   RAM:         ~100 cycles,       8GB-128GB
 │   ╱────────────────────╲
 │  ╱  SSD / HDD Storage   ╲ SSD:         ~10,000 cycles,    256GB-10TB+
 │ ╱────────────────────────╲HDD:         ~10,000,000 cycles
 ▼ Slower, Larger, Cheaper
// Memory access time simulation
const memoryHierarchy = [
  { level: 'Register', accessNs: 0.25,   sizeKB: 0.001,    hitRate: 0.90 },
  { level: 'L1 Cache', accessNs: 1,      sizeKB: 64,       hitRate: 0.95 },
  { level: 'L2 Cache', accessNs: 4,      sizeKB: 512,      hitRate: 0.97 },
  { level: 'L3 Cache', accessNs: 10,     sizeKB: 8192,     hitRate: 0.99 },
  { level: 'RAM',      accessNs: 100,    sizeKB: 16777216, hitRate: 0.999 },
  { level: 'SSD',      accessNs: 100000, sizeKB: 1e9,      hitRate: 1.0 }
];

// Calculate effective access time using hit rates
function effectiveAccessTime(hierarchy) {
  let missRate = 1;
  let totalTime = 0;
  hierarchy.forEach(level => {
    const hitAtThisLevel = missRate * level.hitRate;
    totalTime += hitAtThisLevel * level.accessNs;
    missRate *= (1 - level.hitRate);
  });
  return totalTime;
}

console.log(`Effective access time: ${effectiveAccessTime(memoryHierarchy).toFixed(2)} ns`);
SRAM vs DRAM
SRAM (Static RAM) uses flip-flops to store each bit, making it fast but expensive and power-hungry—used for CPU caches. DRAM (Dynamic RAM) stores bits as charges in capacitors, offering higher density and lower cost but requiring constant refresh cycles—used for main memory.
SRAM Cell (6 Transistors)          DRAM Cell (1T-1C)

      VDD                              Bit Line
       │                                  │
   ┌───┴───┐                           ┌──┴──┐
 ──┤ FLIP  ├──                     ────┤  T  │
   │ FLOP  │                           └──┬──┘
   └───┬───┘                           ┌──┴──┐
       │                               │  C  │ ← Capacitor
      GND                              └──┬──┘   (stores bit)
                                         GND
 - Fast (~1ns)                      - Slower (~50ns)
 - No refresh                       - Needs refresh
 - 6 transistors/bit                - 1 transistor/bit
 - Used in caches                   - Used in main RAM
// SRAM vs DRAM characteristics comparison
const memoryTypes = {
  SRAM: {
    transistorsPerBit: 6,
    accessTimeNs: 1,
    needsRefresh: false,
    relativeCost: 20,
    typicalUse: 'CPU Cache'
  },
  DRAM: {
    transistorsPerBit: 1,
    accessTimeNs: 50,
    needsRefresh: true,
    refreshIntervalMs: 64,
    relativeCost: 1,
    typicalUse: 'Main Memory'
  }
};

// DRAM refresh calculation
const dramRefresh = {
  rowCount: 8192,
  refreshIntervalMs: 64,
  get refreshesPerSecond() {
    return 1000 / this.refreshIntervalMs * this.rowCount;
  }
};
console.log(`DRAM performs ${dramRefresh.refreshesPerSecond.toLocaleString()} refreshes/second`);
ROM, PROM, EPROM, EEPROM
These are non-volatile memory types that retain data without power: ROM is factory-programmed, PROM is one-time programmable by users, EPROM can be erased with UV light and reprogrammed, and EEPROM/Flash can be electrically erased and rewritten—Flash memory is what's in your SSD and USB drives.
Non-Volatile Memory Evolution

┌─────────┬──────────┬───────────┬───────────────┬───────────┐
│ Type    │ Writable │ Erasable  │ Method        │ Use Case  │
├─────────┼──────────┼───────────┼───────────────┼───────────┤
│ ROM     │ Factory  │ Never     │ Mask at fab   │ Firmware  │
│ PROM    │ Once     │ Never     │ Burn fuses    │ Prototype │
│ EPROM   │ Multiple │ Entire    │ UV light      │ Dev/Test  │
│ EEPROM  │ Multiple │ Byte-wise │ Electrical    │ Settings  │
│ Flash   │ Multiple │ Block     │ Electrical    │ SSD, USB  │
└─────────┴──────────┴───────────┴───────────────┴───────────┘

EPROM with UV Window:
   ┌─────────────┐
   │   ┌─────┐   │
   │   │ UV  │   │  ← Quartz window
   │   │ WIN │   │    for erasure
   │   └─────┘   │
 ──┤             ├──
 ──┤    EPROM    ├──
 ──┤             ├──
   └─────────────┘
Memory Addressing
Each byte in memory has a unique address, allowing the CPU to read or write specific locations. Addressing can be direct (fixed address), indirect (address in register), indexed (base + offset), or various other modes that give programmers flexibility in accessing data.
// Memory addressing modes demonstration
class MemoryAddressing {
  constructor() {
    this.memory = new Uint8Array(256);
    this.registers = { A: 0, B: 0, X: 0 }; // X is index register
    // Initialize some memory
    this.memory[0x10] = 42;
    this.memory[0x20] = 100;
    this.memory[0x30] = 0x20; // Contains address 0x20
  }

  // Different addressing modes
  immediate(value) { return value; }                              // Value itself
  direct(address) { return this.memory[address]; }                // Direct address
  indirect(address) { return this.memory[this.memory[address]]; } // Address at address
  indexed(base, index) { return this.memory[base + index]; }      // Base + index
  register(reg) { return this.registers[reg]; }                   // Register value

  demo() {
    this.registers.X = 5;
    console.log('Immediate #42:  ', this.immediate(42));                   // 42
    console.log('Direct $10:     ', this.direct(0x10));                    // 42
    console.log('Indirect ($30): ', this.indirect(0x30));                  // 100
    console.log('Indexed $10,X:  ', this.indexed(0x10, this.registers.X)); // mem[0x15]
  }
}

new MemoryAddressing().demo();
Memory Addressing Modes

IMMEDIATE:  MOV A, #42      ; A = 42 (value in instruction)

DIRECT:     MOV A, $1000    ; A = memory[0x1000]
                  │
                  ▼
              ┌───────┐
       $1000: │  42   │
              └───────┘

INDIRECT:   MOV A, ($1000)  ; A = memory[memory[0x1000]]
                  │
                  ▼
              ┌───────┐     ┌───────┐
       $1000: │ $2000 │────▶│  42   │
              └───────┘     └───────┘
                             $2000

INDEXED:    MOV A, $1000,X  ; A = memory[0x1000 + X] (where X = 5)
                  │
                  ▼
              ┌───────┐
       $1005: │  42   │
              └───────┘
Endianness
Endianness determines how multi-byte values are stored in memory: Big-Endian stores the most significant byte first (at lowest address), while Little-Endian stores the least significant byte first. x86 uses Little-Endian; network protocols typically use Big-Endian.
Storing 0x12345678 in Memory

BIG-ENDIAN (Network byte order, PowerPC, SPARC)
"Big end first"

  Address:  $00   $01   $02   $03
          ┌─────┬─────┬─────┬─────┐
          │ 0x12│ 0x34│ 0x56│ 0x78│
          └─────┴─────┴─────┴─────┘
            MSB               LSB

LITTLE-ENDIAN (x86, ARM default)
"Little end first"

  Address:  $00   $01   $02   $03
          ┌─────┬─────┬─────┬─────┐
          │ 0x78│ 0x56│ 0x34│ 0x12│
          └─────┴─────┴─────┴─────┘
            LSB               MSB
// Endianness detection and conversion
function detectEndianness() {
  const buffer = new ArrayBuffer(4);
  const int32 = new Uint32Array(buffer);
  const int8 = new Uint8Array(buffer);
  int32[0] = 0x12345678;
  if (int8[0] === 0x78) return 'Little-Endian';
  if (int8[0] === 0x12) return 'Big-Endian';
  return 'Unknown';
}

// Convert between endianness (>>> 0 keeps the result an unsigned 32-bit value)
function swapEndian32(value) {
  return (((value & 0xFF) << 24) |
          ((value & 0xFF00) << 8) |
          ((value >> 8) & 0xFF00) |
          ((value >> 24) & 0xFF)) >>> 0;
}

console.log(`This system is: ${detectEndianness()}`);
console.log(`0x12345678 swapped: 0x${swapEndian32(0x12345678).toString(16)}`);

// DataView for explicit endianness control
const buffer = new ArrayBuffer(4);
const view = new DataView(buffer);
view.setUint32(0, 0x12345678, true);  // true = little-endian
console.log('LE bytes:', new Uint8Array(buffer)); // [0x78, 0x56, 0x34, 0x12]
view.setUint32(0, 0x12345678, false); // false = big-endian
console.log('BE bytes:', new Uint8Array(buffer)); // [0x12, 0x34, 0x56, 0x78]
Memory Modules (SIMM, DIMM, SO-DIMM)
Memory modules are the physical packages containing RAM chips that plug into motherboards. SIMMs (obsolete) came in 30-pin variants with an 8-bit data path and 72-pin variants with a 32-bit path; DIMMs (desktop, 64-bit path) are standard today, and SO-DIMMs are the smaller form factor used in laptops.
Memory Module Evolution

SIMM (30-pin) - 1980s            SIMM (72-pin) - Early 90s
┌──────────────────┐             ┌────────────────────────────┐
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│             │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│    8-bit path    │             │        32-bit path         │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘             └┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
      30 pins                               72 pins

DIMM (168/240/288-pin) - Desktop
┌──────────────────────────────────────────────────────┐
│ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐   │
│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│   │ ← RAM chips
│ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘   │
│              64-bit path (DDR4/DDR5)                 │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬╔═╦═╗┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
                         ╚═╩═╝ Notch (keying)
                      288 pins (DDR4)

SO-DIMM (Laptop) - Smaller form factor
┌────────────────────────────────┐
│┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐  │
││▓││▓││▓││▓││▓││▓││▓││▓││▓││▓│  │
│└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘  │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
     ~67% length of full DIMM
// DDR memory specifications comparison
const ddrGenerations = [
  { gen: 'DDR1', pins: 184, voltage: 2.5, speeds: '200-400 MT/s',   year: 2000 },
  { gen: 'DDR2', pins: 240, voltage: 1.8, speeds: '400-1066 MT/s',  year: 2003 },
  { gen: 'DDR3', pins: 240, voltage: 1.5, speeds: '800-2133 MT/s',  year: 2007 },
  { gen: 'DDR4', pins: 288, voltage: 1.2, speeds: '1600-3200 MT/s', year: 2014 },
  { gen: 'DDR5', pins: 288, voltage: 1.1, speeds: '4800-8400 MT/s', year: 2020 }
];
console.table(ddrGenerations);

// Calculate theoretical bandwidth
function calculateBandwidth(transferRateMT, busWidthBits = 64) {
  // Bandwidth = Transfer Rate × Bus Width / 8 (bits to bytes)
  return (transferRateMT * 1e6 * busWidthBits / 8) / 1e9; // GB/s
}
console.log(`DDR4-3200 bandwidth: ${calculateBandwidth(3200).toFixed(1)} GB/s`);
console.log(`DDR5-6400 bandwidth: ${calculateBandwidth(6400).toFixed(1)} GB/s`);
DDR Evolution (DDR1-DDR5)
DDR (Double Data Rate) SDRAM transfers data on both rising and falling clock edges, doubling effective bandwidth. Each generation has doubled data rates (DDR4: 3200 MT/s, DDR5: 6400+ MT/s) while reducing voltage for efficiency—DDR5 also adds on-die ECC and dual 32-bit channels per module for improved reliability and bandwidth.
DDR Evolution Timeline

Generation   Year   Voltage   Speed (MT/s)   Bandwidth/module
──────────────────────────────────────────────────────────────
DDR1         2000   2.5V      200-400        3.2 GB/s
DDR2         2003   1.8V      400-1066       8.5 GB/s
DDR3         2007   1.5V      800-2133       17.0 GB/s
DDR4         2014   1.2V      1600-3200      25.6 GB/s
DDR5         2020   1.1V      4800-8400+     67.2 GB/s

DDR Transfer Timing:

Clock  ──┐  ┌──┐  ┌──┐  ┌──┐  ┌──
         └──┘  └──┘  └──┘  └──┘

SDR    ─X────X────X────X────X────X────
        └─ Transfer on rising edge only

DDR    ─X──X──X──X──X──X──X──X──X──X──
        └─ Transfer on BOTH edges = 2x

DDR5 Dual-Channel Per DIMM:
┌────────────────────────────────────────────┐
│                 DDR5 DIMM                  │
│  ┌──────────────────┬──────────────────┐   │
│  │    Channel A     │    Channel B     │   │
│  │     32-bit       │     32-bit       │   │
│  │    (+ 8 ECC)     │    (+ 8 ECC)     │   │
│  └──────────────────┴──────────────────┘   │
└────────────────────────────────────────────┘
vs DDR4: Single 64-bit channel per DIMM
// DDR bandwidth calculation
function calculateDDRBandwidth(transferRate, busWidth = 64, channels = 1) {
  // Bandwidth = Transfer Rate × Bus Width (bytes) × Channels
  return (transferRate * 1e6) * (busWidth / 8) * channels / 1e9;
}

const configs = [
  { name: 'DDR4-2400 Single', rate: 2400, channels: 1 },
  { name: 'DDR4-3200 Dual',   rate: 3200, channels: 2 },
  { name: 'DDR5-6400 Dual',   rate: 6400, channels: 2 },
  { name: 'DDR5-8000 Quad',   rate: 8000, channels: 4 }
];

configs.forEach(c => {
  const bw = calculateDDRBandwidth(c.rate, 64, c.channels);
  console.log(`${c.name}: ${bw.toFixed(1)} GB/s`);
});
Memory Timings and Latency
Memory timings (CAS Latency, tRCD, tRP, tRAS) measure the delays in clock cycles for various memory operations—lower is faster. CAS Latency (CL) is most critical, representing cycles between column address and data availability; DDR5-6400 CL40 has similar absolute latency to DDR4-3200 CL16 because DDR5's faster clock compensates for higher cycle counts.
Memory Timing Parameters

┌──────────────────────────────────────────────────────────────┐
│ DDR4-3200 CL16-18-18-36 (typical gaming RAM)                 │
│                                                              │
│ CL (CAS Latency): 16 cycles - Column address to data         │
│ tRCD:             18 cycles - Row to Column delay            │
│ tRP:              18 cycles - Row Precharge time             │
│ tRAS:             36 cycles - Row Active time                │
└──────────────────────────────────────────────────────────────┘

Memory Access Sequence:

Time ─────────────────────────────────────────────────────────►
     │◄── tRCD ──►│◄───── CL ─────►│
     ▼            ▼                ▼
     ┌────────────┬────────────────┬────────────┐
     │  Row Addr  │  Column Addr   │    DATA    │
     │ (Activate) │    (Read)      │            │
     └────────────┴────────────────┴────────────┘
     │◄─────────────── tRAS ───────────────────►│◄─── tRP ────►│
                                                  (Precharge)

Absolute Latency Calculation:
┌────────────────────────────────────────────┐
│ Latency (ns) = CL / (Transfer Rate / 2000) │
└────────────────────────────────────────────┘
DDR4-3200 CL16: 16 / (3200/2000) = 10.0 ns
DDR5-6400 CL40: 40 / (6400/2000) = 12.5 ns
// Calculate actual memory latency
function calculateLatency(transferRate, casLatency) {
  // Transfer rate is in MT/s (megatransfers)
  // Actual clock = transfer rate / 2 (DDR = double data rate)
  const actualClock = transferRate / 2;   // MHz
  const clockPeriod = 1000 / actualClock; // ns per cycle
  return casLatency * clockPeriod;
}

const modules = [
  { name: 'DDR4-2400 CL14', rate: 2400, cl: 14 },
  { name: 'DDR4-3200 CL16', rate: 3200, cl: 16 },
  { name: 'DDR4-3600 CL18', rate: 3600, cl: 18 },
  { name: 'DDR5-6000 CL36', rate: 6000, cl: 36 },
  { name: 'DDR5-6400 CL40', rate: 6400, cl: 40 }
];

console.log("Module               Latency (ns)");
console.log("─────────────────────────────────");
modules.forEach(m => {
  console.log(`${m.name.padEnd(20)} ${calculateLatency(m.rate, m.cl).toFixed(2)} ns`);
});
Dual/Quad Channel Memory
Multi-channel memory configurations increase bandwidth by allowing simultaneous access to multiple memory modules through independent channels. Dual-channel doubles theoretical bandwidth by interleaving data across two 64-bit channels (128-bit total), while quad-channel (found in HEDT/server platforms) provides 256-bit access—modules must be matched and installed in correct slots.
Memory Channel Configurations

Single Channel:
┌─────────────────────────────────────────┐
│            Memory Controller            │
│                   │                     │
│              [64-bit bus]               │
│                   │                     │
│             ┌─────┴─────┐               │
│             │   DIMM    │               │
│             └───────────┘               │
└─────────────────────────────────────────┘
Bandwidth: 1x (e.g., 25.6 GB/s for DDR4-3200)

Dual Channel:
┌─────────────────────────────────────────┐
│            Memory Controller            │
│              ┌─────┴─────┐              │
│          [64-bit]    [64-bit]           │
│              │           │              │
│         ┌────┴────┐ ┌────┴────┐         │
│         │ DIMM A1 │ │ DIMM B1 │         │
│         └─────────┘ └─────────┘         │
│            Ch A        Ch B             │
└─────────────────────────────────────────┘
Bandwidth: 2x (e.g., 51.2 GB/s for DDR4-3200)

Motherboard Slots (Color-coded):

Slot:   A1     A2     B1     B2
       ┌───┐  ┌───┐  ┌───┐  ┌───┐
       │███│  │░░░│  │███│  │░░░│
       │███│  │░░░│  │███│  │░░░│
       └───┘  └───┘  └───┘  └───┘
      Chan A Chan A Chan B Chan B

For dual-channel: Install in A1 + B1 (same color)
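The address-to-channel mapping makes this concrete. Below is a minimal sketch of cache-line interleaving across channels; the simple modulo mapping and the channelFor helper are illustrative assumptions, since real controllers use vendor-specific, often hashed, mappings.

// Sketch: striping consecutive cache lines across channels
// (modulo mapping is an assumption for illustration)
function channelFor(physicalAddr, channels = 2, lineSize = 64) {
  return Math.floor(physicalAddr / lineSize) % channels;
}

// Sequential 64-byte lines alternate channels, so a streaming
// read keeps both channels busy at once
for (let addr = 0; addr < 4 * 64; addr += 64) {
  console.log(`0x${addr.toString(16).padStart(4, '0')} -> Channel ${channelFor(addr)}`);
}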
ECC Memory
ECC (Error-Correcting Code) memory adds extra bits to detect and correct single-bit errors, critical for servers and workstations where data integrity is paramount. Using a Hamming code variant, 8 ECC bits per 64 data bits can correct any single-bit error and detect two-bit errors—the ~3% performance overhead and higher cost make it uncommon in consumer systems.
ECC Memory Operation

Standard Memory: 64 data bits
┌────────────────────────────────────────────┐
│ D63 D62 D61 ... D2 D1 D0                   │
└────────────────────────────────────────────┘

ECC Memory: 64 data bits + 8 ECC bits = 72 bits
┌────────────────────────────────────────────┬────────┐
│ D63 D62 D61 ... D2 D1 D0                   │ECC bits│
└────────────────────────────────────────────┴────────┘

Error Detection & Correction:

Write:  Data ──────┬──────► Memory
                   │
             ┌─────▼─────┐
             │ Calculate │
             │    ECC    │
             └─────┬─────┘
                   │
             ECC bits ────► Memory

Read:   Memory ───┬───► Data (possibly corrupted)
                  │
        Memory ───┼───► Stored ECC
                  │
            ┌─────▼─────┐
            │ Compare & │
            │  Correct  │──► Corrected Data
            └─────┬─────┘
                  │
      ┌───────────┼────────────────┐
      │           │                │
   No Error   Single-bit Error  Multi-bit Error
              (corrected)       (detected only)
// Simplified ECC demonstration (Hamming-like)
function calculateParity(data, positions) {
  return positions.reduce((parity, pos) => parity ^ ((data >> pos) & 1), 0);
}

// 8-bit data example (real ECC uses 64 bits)
function generateECC(data) {
  // Each parity bit checks a different subset of data bits
  const p1 = calculateParity(data, [0, 1, 3, 4, 6]); // data bits 1,2,4,5,7
  const p2 = calculateParity(data, [0, 2, 3, 5, 6]); // data bits 1,3,4,6,7
  const p4 = calculateParity(data, [1, 2, 3, 7]);    // data bits 2,3,4,8
  const p8 = calculateParity(data, [4, 5, 6, 7]);    // data bits 5,6,7,8
  return (p8 << 3) | (p4 << 2) | (p2 << 1) | p1;
}

function checkECC(data, ecc) {
  const calculated = generateECC(data);
  const syndrome = calculated ^ ecc;
  if (syndrome === 0) return { status: 'OK', data };
  // Syndrome indicates error position
  return { status: 'ERROR', position: syndrome, data };
}

const testData = 0b10110010;
const ecc = generateECC(testData);
console.log(`Data: ${testData.toString(2).padStart(8, '0')}, ECC: ${ecc.toString(2).padStart(4, '0')}`);
console.log(checkECC(testData, ecc));        // OK
console.log(checkECC(testData ^ 0x04, ecc)); // Error detected
Memory Interleaving
Memory interleaving distributes consecutive memory addresses across multiple memory banks or channels, allowing parallel access and reducing wait times when accessing sequential data. Bank interleaving hides row activation latency, while channel interleaving maximizes bandwidth—modern memory controllers automatically manage this for optimal performance.
Memory Interleaving

Without Interleaving (all sequential in one bank):
Address:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bank 0:  [0][1][2][3][4][5][6][7][8][9][A][B][C][D][E][F]
Bank 1:  [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

Access pattern: must wait for each access in the same bank
  ───►───►───►───►───►  (sequential, slow)

With 2-Way Interleaving:
Address:  0  2  4  6  8 10 12 14
Bank 0:  [0][2][4][6][8][A][C][E]  ← Even addresses
Address:  1  3  5  7  9 11 13 15
Bank 1:  [1][3][5][7][9][B][D][F]  ← Odd addresses

Access pattern: banks accessed in parallel!
  Bank 0: ───►───►───►───►
  Bank 1:   ───►───►───►───►  (overlapped, fast)

Channel Interleaving (64-byte cache line example):
  Cache Line 0 (Addr   0-63):  Channel A
  Cache Line 1 (Addr  64-127): Channel B
  Cache Line 2 (Addr 128-191): Channel A
  Cache Line 3 (Addr 192-255): Channel B

Sequential read fetches from alternating channels
= 2x effective bandwidth for streaming
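A small simulation shows why bank interleaving helps: with addresses striped across banks, one bank's row activation overlaps another bank's data transfer. This is a toy timing model with illustrative latencies (35ns activate, 15ns CAS/burst), not datasheet numbers.

// Toy model: time to stream N reads when addresses are striped
// across `banks` banks (latencies are illustrative)
function streamTime(requests, banks, tActivate = 35, tCas = 15) {
  const bankFreeAt = new Array(banks).fill(0); // when each bank can start again
  let busFreeAt = 0;                           // when the shared data bus is free
  for (let i = 0; i < requests; i++) {
    const bank = i % banks;                    // addresses striped across banks
    const start = Math.max(busFreeAt, bankFreeAt[bank]);
    bankFreeAt[bank] = start + tActivate + tCas; // bank busy: activate + read
    busFreeAt = start + tCas;                    // bus busy only during the burst
  }
  return Math.max(...bankFreeAt);
}

console.log(`1 bank:  ${streamTime(16, 1)} ns`); // 800 ns - fully serialized
console.log(`4 banks: ${streamTime(16, 4)} ns`); // 275 ns - activations overlap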
Virtual Memory Basics
Virtual memory provides each process with its own isolated address space, using the CPU's Memory Management Unit (MMU) to translate virtual addresses to physical addresses via page tables. This enables running programs larger than physical RAM (paging to disk), memory protection between processes, and simplified memory allocation for programmers.
Virtual Memory Address Translation

Process A sees:            Process B sees:
┌──────────────┐           ┌──────────────┐
│ 0xFFFFFFFF   │           │ 0xFFFFFFFF   │
│    Stack     │           │    Stack     │
│      ↓       │           │      ↓       │
│              │           │              │
│      ↑       │           │      ↑       │
│    Heap      │           │    Heap      │
│    Code      │           │    Code      │
│ 0x00000000   │           │ 0x00000000   │
└──────┬───────┘           └──────┬───────┘
       │                          │
       │    ┌─────────────────┐   │
       └───►│   Page Tables   │◄──┘
            │       MMU       │
            └────────┬────────┘
                     ▼
Physical RAM:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A  │ B  │ A  │Free│ B  │ A  │ B  │ A  │
│Code│Heap│Heap│    │Stk │Stk │Code│Data│
└────┴────┴────┴────┴────┴────┴────┴────┘

Page Table Entry:
┌────────────────────┬───────────────────┐
│  Physical Frame #  │ P│R│W│U│D│A│...   │
└────────────────────┴───────────────────┘
P=Present, R=Read, W=Write, U=User, D=Dirty, A=Accessed
// Virtual memory simulation
class VirtualMemory {
  constructor(physicalPages = 16, virtualPages = 64) {
    this.pageSize = 4096; // 4KB pages
    this.physicalPages = physicalPages;
    this.virtualPages = virtualPages;
    // Page table: virtual page -> physical frame (or undefined if not present)
    this.pageTable = new Map();
    this.freeFrames = [...Array(physicalPages).keys()];
    this.disk = new Map(); // Swapped pages
  }

  translateAddress(virtualAddr) {
    const virtualPage = Math.floor(virtualAddr / this.pageSize);
    const offset = virtualAddr % this.pageSize;
    let frame = this.pageTable.get(virtualPage);
    if (frame === undefined) {
      // Page fault!
      frame = this.handlePageFault(virtualPage);
    }
    const physicalAddr = frame * this.pageSize + offset;
    return { virtualAddr, physicalAddr, page: virtualPage, frame, offset };
  }

  handlePageFault(virtualPage) {
    console.log(`Page fault on virtual page ${virtualPage}`);
    if (this.freeFrames.length === 0) {
      // Need to swap out a page (naive FIFO victim selection)
      const [victimPage] = this.pageTable.entries().next().value;
      const victimFrame = this.pageTable.get(victimPage);
      this.disk.set(victimPage, victimFrame);
      this.pageTable.delete(victimPage);
      this.freeFrames.push(victimFrame);
      console.log(`  Swapped out page ${victimPage}`);
    }
    const frame = this.freeFrames.pop();
    this.pageTable.set(virtualPage, frame);
    return frame;
  }
}

const vm = new VirtualMemory(4, 16); // 4 physical pages, 16 virtual
console.log(vm.translateAddress(0x1234)); // Page 1  - fault, takes a free frame
console.log(vm.translateAddress(0x5678)); // Page 5  - fault
console.log(vm.translateAddress(0x9ABC)); // Page 9  - fault
console.log(vm.translateAddress(0xDEF0)); // Page 13 - fault, fills the last frame
console.log(vm.translateAddress(0xFFFF)); // Page 15 - fault, swaps out page 1
Memory Mapping
Memory mapping creates a direct correspondence between file contents or device registers and process memory addresses, allowing programs to access files as if they were arrays in memory. This technique (mmap on Unix) is used for loading executables, shared libraries, inter-process communication, and efficient file I/O without explicit read/write system calls.
Memory-Mapped File

Traditional File I/O:
┌─────────┐   read()   ┌─────────┐   copy   ┌─────────┐
│  File   │───────────►│ Kernel  │─────────►│  User   │
│ on Disk │            │ Buffer  │          │ Buffer  │
└─────────┘            └─────────┘          └─────────┘
= 2 copies, system call overhead

Memory-Mapped I/O:
┌─────────┐            ┌───────────────────────────────┐
│  File   │◄──────────►│    Process Address Space      │
│ on Disk │    MMU     │  ┌─────────────────────────┐  │
└─────────┘    maps    │  │     mmap'd region       │  │
     ▲         pages   │  │  (file appears as       │  │
     │                 │  │   memory array)         │  │
┌────┴────┐            │  └─────────────────────────┘  │
│  Page   │            └───────────────────────────────┘
│  Cache  │
└─────────┘
= Zero-copy access, demand paging

Memory-Mapped Devices (MMIO):

Physical Address Space:
┌────────────────────────────────────────────┐
│ 0x00000000 - 0x3FFFFFFF: RAM               │
├────────────────────────────────────────────┤
│ 0x40000000 - 0x4000FFFF: GPU Registers     │ ← Write here
├────────────────────────────────────────────┤   = GPU command
│ 0x40010000 - 0x400100FF: Network Card      │
├────────────────────────────────────────────┤
│ 0xFFFFF000 - 0xFFFFFFFF: BIOS ROM          │
└────────────────────────────────────────────┘
// Node.js memory-mapped file example (conceptual)
// Real implementation would use the 'mmap-io' package
/*
const mmap = require('mmap-io');
const fs = require('fs');

const fd = fs.openSync('data.bin', 'r+');
const size = fs.fstatSync(fd).size;

// Map file into memory
const buffer = mmap.map(size, mmap.PROT_READ | mmap.PROT_WRITE,
                        mmap.MAP_SHARED, fd, 0);

// Now access file as if it were an array!
buffer[0] = 0x42;          // Writes directly to file
const byte = buffer[1000]; // Reads from file without syscall
mmap.sync(buffer);         // Ensure changes are written
*/

// Simulation of the memory-mapped concept
class MemoryMappedFile {
  constructor(size) {
    this.data = new Uint8Array(size);
    this.dirtyPages = new Set();
    this.pageSize = 4096;
  }

  read(offset) {
    // In real mmap, this might trigger a page fault
    // to load data from disk on first access
    return this.data[offset];
  }

  write(offset, value) {
    this.data[offset] = value;
    this.dirtyPages.add(Math.floor(offset / this.pageSize));
  }

  sync() {
    // Write dirty pages back to "disk"
    console.log(`Syncing ${this.dirtyPages.size} dirty pages`);
    this.dirtyPages.clear();
  }
}
Cache Memory (L1, L2, L3)
Cache memory is a small, extremely fast SRAM hierarchy positioned between the CPU and main RAM to reduce memory access latency. L1 cache (32-64KB) is split into instruction and data caches, runs at CPU speed, and has ~4 cycle latency; L2 cache (256KB-1MB) is per-core with ~12 cycle latency; L3 cache (8-64MB) is shared across all cores with ~40 cycle latency and, in non-inclusive designs, acts as a victim cache for lines evicted from L2.
┌──────────────────────────────────────────────────────────┐
│                           CPU                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Core 0  │  │  Core 1  │  │  Core 2  │  │  Core 3  │  │
│  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │
│  ││L1│ │L1│ │  ││L1│ │L1│ │  ││L1│ │L1│ │  ││L1│ │L1│ │  │
│  ││I │ │D │ │  ││I │ │D │ │  ││I │ │D │ │  ││I │ │D │ │  │
│  │└──┘ └──┘ │  │└──┘ └──┘ │  │└──┘ └──┘ │  │└──┘ └──┘ │  │
│  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │  │
│  │ │  L2  │ │  │ │  L2  │ │  │ │  L2  │ │  │ │  L2  │ │  │
│  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       └────────────┬┴──────────┬──┴─────────────┘        │
│                    ┌───────────┐                         │
│                    │ L3 Cache  │ (Shared)                │
│                    │  8-64 MB  │                         │
│                    └─────┬─────┘                         │
└──────────────────────────┼──────────────────────────────┘
                           │
                  ┌────────┴────────┐
                  │   Main Memory   │
                  │  (DDR4/DDR5)    │
                  └─────────────────┘
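These per-level latencies combine into a single expected cost via the standard AMAT (Average Memory Access Time) formula. The sketch below uses illustrative latencies and hit rates, not measurements from a specific CPU.

// AMAT = L1 + miss(L1) × (L2 + miss(L2) × (L3 + miss(L3) × RAM))
function amat(levels, memLatency) {
  // Fold from the bottom up: each level adds its own latency plus
  // the miss-rate-weighted cost of everything below it
  return levels.reduceRight(
    (costBelow, l) => l.latency + (1 - l.hitRate) * costBelow,
    memLatency
  );
}

const levels = [
  { name: 'L1', latency: 4,  hitRate: 0.95 }, // cycles; illustrative values
  { name: 'L2', latency: 12, hitRate: 0.80 },
  { name: 'L3', latency: 40, hitRate: 0.50 }
];
console.log(`AMAT: ${amat(levels, 300).toFixed(1)} cycles`); // 6.5 cycles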
Cache Coherence Introduction
Cache coherence ensures that multiple CPU cores see a consistent view of memory when they cache the same memory locations, preventing scenarios where one core reads stale data after another core has modified it. Without coherence protocols, a write by Core 0 to address X would be invisible to Core 1 which has its own cached copy, leading to data corruption in multithreaded programs.
Problem without coherence:

Time T0: Memory[X] = 100
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Core 0  │   │ Core 1  │   │ Memory  │
│ X = 100 │   │ X = 100 │   │ X = 100 │
└─────────┘   └─────────┘   └─────────┘

Time T1: Core 0 writes X = 200
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Core 0  │   │ Core 1  │◄──│ X = ??? │  ← INCOHERENT!
│ X = 200 │   │ X = 100 │   │ Memory  │
└─────────┘   └─────────┘   └─────────┘
              (stale data)
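A few lines of code reproduce this scenario: two private caches of the same variable with no protocol between them. This is a deliberately naive sketch of the failure mode, not a model of real hardware.

// Two per-core caches of X, no coherence protocol between them
const memory = { X: 100 };
const core0Cache = { X: memory.X }; // both cores cache X at T0
const core1Cache = { X: memory.X };

core0Cache.X = 200; // T1: Core 0 writes its cached copy (not yet written back)

console.log(`Core 0 reads X = ${core0Cache.X}`); // 200
console.log(`Core 1 reads X = ${core1Cache.X}`); // 100 - stale!
// A coherence protocol would invalidate or update Core 1's copy
// before Core 0's write is allowed to complete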
Cache Replacement Policies
Cache replacement policies determine which cache line to evict when the cache is full and a new line must be loaded; common policies include LRU (Least Recently Used) which evicts the oldest accessed line, FIFO (First-In-First-Out) which evicts in arrival order, Random which selects arbitrarily, and pseudo-LRU which approximates LRU with less hardware overhead. Modern CPUs typically use adaptive policies that combine multiple strategies.
# LRU Cache Implementation
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1  # Cache miss
        self.cache.move_to_end(key)  # Mark as recently used
        return self.cache[key]

    def put(self, key: int, value: int):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # Evict LRU item

# Example: 4-way cache with LRU
# Access sequence: A, B, C, D, E (capacity=4)
# After A: [A]
# After B: [A, B]
# After C: [A, B, C]
# After D: [A, B, C, D]  ← Full
# After E: [B, C, D, E]  ← A evicted (LRU)
Cache Associativity
Cache associativity defines how many possible cache locations a memory address can map to: direct-mapped (1-way) means each address maps to exactly one cache line (fast but conflict-prone), fully associative allows any address in any line (flexible but expensive), and N-way set-associative divides cache into sets where each address maps to one set but can occupy any of N ways (practical compromise used in modern CPUs, typically 8-16 way for L1/L2).
Memory Address: 0x12345678
Cache: 64 sets, 8-way associative, 64-byte lines

Address breakdown (64B line, 64 sets):
┌────────────────────┬──────────┬────────┐
│        Tag         │  Index   │ Offset │
│     (20 bits)      │ (6 bits) │(6 bits)│
└────────────────────┴──────────┴────────┘
          │               │         │
          │               │         └─► Byte within 64B line (0-63)
          │               └─► Which set (0-63)
          └─► Identifies unique memory block

Direct-Mapped (1-way):       8-Way Set Associative:
┌─────────────────┐          ┌─────────────────────────────────┐
│ Set 0: [Line]   │          │ Set 0: [W0][W1][W2][W3]...[W7]  │
│ Set 1: [Line]   │          │ Set 1: [W0][W1][W2][W3]...[W7]  │
│ Set 2: [Line]   │          │ Set 2: [W0][W1][W2][W3]...[W7]  │
│ ...             │          │ ...                             │
└─────────────────┘          └─────────────────────────────────┘
  ↑ Conflicts!                 ↑ 8 choices per set
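The tag/index/offset split is just bit slicing, as the sketch below shows; the field widths are derived from the geometry in the diagram (64 sets, 64-byte lines) rather than from any particular CPU.

// Split an address into tag / set index / byte offset for a
// 64-set, 64-byte-line cache (6 index bits, 6 offset bits)
function decompose(addr, sets = 64, lineSize = 64) {
  const offsetBits = Math.log2(lineSize); // 6
  const indexBits = Math.log2(sets);      // 6
  return {
    offset: addr & (lineSize - 1),
    index: (addr >>> offsetBits) & (sets - 1),
    tag: addr >>> (offsetBits + indexBits)
  };
}

const { tag, index, offset } = decompose(0x12345678);
console.log(`tag=0x${tag.toString(16)}, set=${index}, offset=${offset}`);
// tag=0x12345, set=25, offset=56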
Write-Through vs Write-Back
Write-through immediately propagates every write to both cache and main memory, ensuring consistency but consuming memory bandwidth; write-back (write-behind) only writes to cache and marks the line "dirty," deferring memory updates until eviction, which is more efficient for write-intensive workloads but requires careful handling during eviction and coherence operations. Modern CPUs predominantly use write-back for L1/L2 caches due to its bandwidth efficiency.
WRITE-THROUGH:                       WRITE-BACK:

CPU writes X=5                       CPU writes X=5
      │                                    │
      ▼                                    ▼
  ┌───────┐                            ┌───────┐
  │ Cache │──── Write X=5 ────┐        │ Cache │ (dirty bit = 1)
  │ X = 5 │                   │        │ X = 5 │
  └───────┘                   ▼        └───────┘
                         ┌────────┐        │ (only on eviction)
                         │ Memory │        ▼
                         │ X = 5  │    ┌────────┐
                         └────────┘    │ Memory │
                                       │ X = 5  │
Pros: Simple, consistent               └────────┘
Cons: High bandwidth                   Pros: Low bandwidth
                                       Cons: Complex eviction
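The dirty-bit mechanics are easy to see in a toy write-back cache: writes touch only the cache, and memory is updated when a dirty line is evicted. A minimal sketch, assuming a tiny fully-associative cache with FIFO eviction:

// Minimal write-back cache: memory is only written on dirty eviction
class WriteBackCache {
  constructor(memory, capacity = 2) {
    this.memory = memory;
    this.capacity = capacity;
    this.lines = new Map(); // addr -> { value, dirty }
  }

  write(addr, value) {
    this.evictIfNeeded(addr);
    this.lines.set(addr, { value, dirty: true }); // no memory traffic yet
  }

  evictIfNeeded(addr) {
    if (this.lines.has(addr) || this.lines.size < this.capacity) return;
    const [victimAddr, line] = this.lines.entries().next().value; // FIFO victim
    if (line.dirty) this.memory[victimAddr] = line.value; // write back now
    this.lines.delete(victimAddr);
  }
}

const mem = {};
const cache = new WriteBackCache(mem);
cache.write(0x10, 5);
cache.write(0x20, 6);
console.log(mem[0x10]); // undefined - the write is still only in cache
cache.write(0x30, 7);   // evicts 0x10, forcing the deferred write-back
console.log(mem[0x10]); // 5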
Inclusive vs Exclusive Caches
Inclusive caches guarantee that all data in L1 is also present in L2/L3 (simplifies coherence snooping but wastes capacity), while exclusive caches ensure each line exists in only one cache level (maximizes effective capacity but complicates coherence). AMD uses exclusive L3 caches for maximum capacity utilization, while Intel traditionally used inclusive L3 but switched to non-inclusive (NINE) designs in recent architectures to balance both concerns.
INCLUSIVE (Intel traditional):       EXCLUSIVE (AMD):

L1: [A][B][C][D]                     L1: [A][B][C][D]
      ↓ subset                             ↓ disjoint
L2: [A][B][C][D][E][F][G][H]         L2: [E][F][G][H][I][J][K][L]
      ↓ subset                             ↓ disjoint
L3: [A][B][C][D][E][F]...[P]         L3: [M][N][O][P][Q][R][S][T]

Effective capacity:                  Effective capacity:
  L3 size only                         L1 + L2 + L3 (all unique!)

On L1 eviction:                      On L1 eviction:
  Drop from L1 (copy in L2)            Move to L2 (swap)

Coherence snoop:                     Coherence snoop:
  Check L3 only ✓ (simpler)            Must check all levels ✗
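The capacity difference is easy to quantify with a back-of-envelope calculation; the per-core sizes below are illustrative, drawn from the ranges quoted earlier in this article.

// Effective unique capacity under each policy (sizes illustrative, in KB)
const l1 = 64, l2 = 512, l3 = 32768; // per-core L1/L2, shared L3
const cores = 8;

const inclusiveKB = l3;                     // L3 must duplicate every L1/L2 line
const exclusiveKB = cores * (l1 + l2) + l3; // every level holds unique lines

console.log(`Inclusive: ${inclusiveKB} KB unique`); // 32768 KB
console.log(`Exclusive: ${exclusiveKB} KB unique`); // 37376 KB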
MESI/MOESI Protocols
MESI is a cache coherence protocol where each cache line has four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, multiple copies), Invalid (unusable); MOESI adds an Owned state where one cache holds the authoritative dirty copy while others have shared copies, avoiding write-back to memory on sharing. These protocols use bus snooping or directory-based tracking to maintain coherence across cores.
MESI State Transitions:

        ┌─────────┐    Read miss     ┌─────────┐
        │ Invalid │─────────────────►│Exclusive│
        │   (I)   │   (no sharers)   │   (E)   │
        └────┬────┘                  └────┬────┘
             │ ▲                          │
             │ │ Evict/Snoop write        │ Write
             │ │                          ▼
             │ │                     ┌─────────┐
             │ └─────────────────────│Modified │
             │                       │   (M)   │
             │ Read miss             └────┬────┘
             │ (sharers exist)            │ Snoop read
             ▼                            ▼
        ┌─────────┐◄──────────────────────┘
        │ Shared  │
        │   (S)   │
        └─────────┘

MOESI adds:
        ┌─────────┐
        │  Owned  │ ← Has dirty copy, but shares with others
        │   (O)   │   (avoids immediate write-back to memory)
        └─────────┘
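The diagram can be encoded as a small transition table. This is a toy single-line model that ignores bus arbitration and data movement; the event names are assumptions made for the sketch.

// Toy MESI transition table for one cache line
const mesiNext = {
  I: { localRead: 'E_or_S', localWrite: 'M' },                             // miss: fetch line
  E: { localRead: 'E', localWrite: 'M', snoopRead: 'S' },
  S: { localRead: 'S', localWrite: 'M', snoopWrite: 'I' },
  M: { localRead: 'M', localWrite: 'M', snoopRead: 'S', snoopWrite: 'I' }  // write back first
};

function step(state, event, sharersExist = false) {
  const next = (mesiNext[state] || {})[event] || state;
  return next === 'E_or_S' ? (sharersExist ? 'S' : 'E') : next;
}

let line = 'I';
line = step(line, 'localRead', false); // E: no other core holds the line
line = step(line, 'localWrite');       // M: dirty, exclusive
line = step(line, 'snoopRead');        // S: another core read it
console.log(line); // S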
Scratchpad Memory
Scratchpad memory (SPM) is software-managed on-chip SRAM that, unlike cache, has no automatic caching logic or tag comparison hardware—the programmer explicitly controls what data resides there via DMA or direct addressing. Used extensively in embedded systems (ARM Cortex-M), DSPs, and GPUs (shared memory in CUDA), it offers deterministic access latency, lower power consumption, and guaranteed data placement at the cost of increased programming complexity.
Traditional Cache:                 Scratchpad Memory:
┌──────────────────┐               ┌──────────────────┐
│   Tag  │  Data   │               │      Data        │
│  Array │  Array  │               │  (Direct Addr)   │
├────────┴─────────┤               ├──────────────────┤
│ HW-managed       │               │ SW-managed       │
│ Automatic lookup │               │ Explicit DMA     │
│ Unpredictable    │               │ Deterministic    │
└──────────────────┘               └──────────────────┘

// CUDA Shared Memory (Scratchpad) Example
__global__ void matmul(float* A, float* B, float* C) {
    __shared__ float tile_A[16][16]; // Scratchpad allocation
    __shared__ float tile_B[16][16]; // 48KB per SM

    // Explicit load from global to scratchpad
    tile_A[ty][tx] = A[row * N + tx];
    tile_B[ty][tx] = B[ty * N + col];
    __syncthreads(); // Explicit synchronization

    // Fast access from scratchpad (~5 cycles vs ~400 for global)
    for (int k = 0; k < 16; k++)
        sum += tile_A[ty][k] * tile_B[k][tx];
}
Memory Controller Design
The memory controller is the bridge between CPU and DRAM, translating read/write requests into precise electrical signals with correct timing (tCAS, tRAS, tRP). It handles command queuing, address mapping, and ensures timing constraints are met.
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│     CPU     │────▶│ Memory Controller│────▶│    DRAM     │
│ (Requests)  │     │ - Command Queue  │     │  (Storage)  │
└─────────────┘     │ - Timing Engine  │     └─────────────┘
                    │ - Address Mapper │
                    └──────────────────┘
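Concretely, a single CPU read may expand into several DRAM commands depending on the bank's row state. A sketch under assumed DDR4-like timings (the cycle counts come from the timing section above, not a specific datasheet):

// Expand one read into DRAM commands, respecting tRP/tRCD/CL
function commandsForRead(bankState, row, t = { tRP: 18, tRCD: 18, CL: 16 }) {
  const cmds = [];
  if (bankState.openRow !== null && bankState.openRow !== row) {
    cmds.push({ cmd: 'PRE', cycles: t.tRP });  // close the wrong open row
  }
  if (bankState.openRow !== row) {
    cmds.push({ cmd: 'ACT', cycles: t.tRCD }); // activate the target row
  }
  cmds.push({ cmd: 'RD', cycles: t.CL });      // column read; data after CL
  bankState.openRow = row;
  return cmds;
}

const bank = { openRow: 7 };
console.log(commandsForRead(bank, 42));
// [ {cmd:'PRE', cycles:18}, {cmd:'ACT', cycles:18}, {cmd:'RD', cycles:16} ]
console.log(commandsForRead(bank, 42));
// [ {cmd:'RD', cycles:16} ] - row hit: CAS only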
Rank and Bank Organization
DRAM is hierarchically organized: a channel contains ranks (groups of chips acting together for bus width), each rank contains banks (independent memory arrays), enabling parallel access across banks while only one rank drives the bus at a time.
Channel
├── Rank 0 (64-bit wide)
│ ├── Bank 0 ├── Bank 1 ├── Bank 2 ... ├── Bank 15
│ └── Each bank: Rows × Columns of cells
├── Rank 1
│ ├── Bank 0 ├── Bank 1 ├── Bank 2 ... ├── Bank 15
└── ...
// Address mapping example
const physicalAddress = 0x1A2B3C4D;
const channel = (physicalAddress >> 6) & 0x1; // 1 bit
const rank = (physicalAddress >> 7) & 0x1; // 1 bit
const bank = (physicalAddress >> 8) & 0xF; // 4 bits
const row = (physicalAddress >> 12) & 0xFFFF; // 16 bits
const column = physicalAddress & 0x3F; // 6 bits
Row Buffer Management
Each bank has a row buffer (~8KB) acting as a cache; accessing an already-open row (row hit ~15ns) is much faster than opening a new row (row miss ~50ns), making row buffer policies (open-page vs closed-page) critical for performance.
Row Buffer Policies:
┌─────────────────────────────────────────────────────────┐
│ OPEN-PAGE: Keep row open, bet on locality               │
│   Hit:  ████░░░░░░           15ns (fast)                │
│   Miss: ████████████████████ 50ns (precharge+activate)  │
│                                                         │
│ CLOSED-PAGE: Close row immediately after access         │
│   Always: ██████████████     35ns (predictable)         │
└─────────────────────────────────────────────────────────┘
class RowBufferManager {
  constructor(policy = 'open-page') {
    this.openRow = new Map(); // bank -> currently open row (null = closed)
    this.policy = policy;
  }

  access(bank, row) {
    if (this.openRow.get(bank) === row) {
      return { type: 'HIT', latency: 15 }; // CAS only
    }
    // Closed row: activate + CAS (35ns);
    // wrong row open: precharge + activate + CAS (50ns)
    const latency = this.openRow.get(bank) != null ? 50 : 35;
    // Open-page keeps the row open, betting on locality;
    // closed-page precharges immediately after the access
    this.openRow.set(bank, this.policy === 'open-page' ? row : null);
    return { type: 'MISS', latency };
  }
}
Refresh Mechanisms
DRAM cells leak charge and must be refreshed every 64ms (32ms at high temps); the controller periodically issues REF commands, blocking access briefly—distributed refresh spreads this overhead, while modern DDR5 uses per-bank refresh to reduce blocking.
Traditional Refresh (All-Bank):
Time ──────────────────────────────────────▶
      │▓▓▓│         │▓▓▓│    (All banks blocked)

Per-Bank Refresh (DDR5):
Bank0 │▓│     │▓│
Bank1   │▓│     │▓│          (Only 1 bank blocked)
Bank2     │▓│     │▓│
Bank3       │▓│     │▓│

// Refresh timing
const REFRESH_INTERVAL_MS = 64;
const ROWS_PER_BANK = 65536;
// Rows are refreshed 8 at a time, so tREFI = 64ms / (65536/8) ≈ 7.8μs
const tREFI = (REFRESH_INTERVAL_MS * 1e6) / (ROWS_PER_BANK / 8); // ns
Memory Scheduling Algorithms
The scheduler reorders queued memory requests to maximize throughput; FR-FCFS (First-Ready First-Come-First-Served) prioritizes row hits, while modern schedulers also consider fairness, QoS, and inter-thread interference.
class FRFCFSScheduler {
  constructor() {
    this.queue = [];
    this.openRows = new Map(); // bank -> currently open row
  }

  dequeue(req) {
    this.queue.splice(this.queue.indexOf(req), 1);
    return req;
  }

  schedule() {
    // Priority 1: Row hits (first-ready)
    const rowHit = this.queue.find(req =>
      this.openRows.get(req.bank) === req.row
    );
    if (rowHit) return this.dequeue(rowHit);
    // Priority 2: Oldest request (FCFS)
    return this.queue.shift();
  }

  // Advanced: add fairness with per-thread quotas
  scheduleWithFairness(threadQuotas) {
    const eligible = this.queue.filter(r => threadQuotas[r.thread] > 0);
    // Apply FR-FCFS within the eligible subset
  }
}
Scheduling Decision Tree:

        ┌─────────────┐
        │ New Request │
        └──────┬──────┘
               ▼
        ┌─────────────┐
  Yes ◀─┤  Row Hit?   ├─▶ No
        └─────────────┘
   │                      │
   ▼                      ▼
┌──────────┐        ┌──────────┐
│ Schedule │        │  Check   │
│ Immediate│        │ FCFS Age │
└──────────┘        └──────────┘
Near-Memory Processing
NMP places compute logic close to memory (in the buffer chip or memory controller) to reduce data movement energy and bandwidth bottlenecks—ideal for bandwidth-bound workloads like graph analytics, where data moves more than it computes.
Traditional:                  Near-Memory Processing:

┌──────┐ ◀──────▶             ┌──────┐      ┌──────────────────┐
│ CPU  │ (Data moves)         │ CPU  │      │  Memory + Logic  │
└──────┘                      └──────┘      │  ┌────┐ ┌─────┐  │
   ▲                             │          │  │ALU │ │DRAM │  │
   │ 100+ GB/s                   ▼          │  └────┘ └─────┘  │
┌──────┐                     (Commands      │   (TB/s local)   │
│ DRAM │                        only)       └──────────────────┘
└──────┘

// Workload suitability
const nmProcessingBenefit = (computeIntensity, dataSize) => {
  // Low compute/byte ratio = good NMP candidate
  const bytesPerOp = dataSize / computeIntensity;
  return bytesPerOp > 100 ? 'EXCELLENT' :
         bytesPerOp > 10  ? 'GOOD' : 'POOR';
};
Processing-in-Memory (PIM)
PIM embeds compute directly in memory arrays (e.g., Samsung HBM-PIM, UPMEM), enabling massively parallel operations within memory banks—particularly effective for bulk operations, neural network inference, and DNA sequence matching.
┌────────────────────────────────────────────┐
│                HBM-PIM Stack               │
├────────────────────────────────────────────┤
│ DRAM Die 3 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 2 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 1 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 0 │ Bank │ Bank │ + PIM Units     │
├────────────────────────────────────────────┤
│           Base Logic Die (TSV)             │
└────────────────────────────────────────────┘

// PIM operation example (conceptual)
class PIMController {
  executeInMemory(operation, bankMask) {
    // Single command triggers parallel execution across banks
    const ops = {
      'MAC':    'Multiply-Accumulate for ML',  // 1 TFLOPS in-memory
      'COPY':   'Bank-to-bank transfer',       // Avoids bus
      'SEARCH': 'Content-addressable lookup'   // Parallel compare
    };
    // dispatchToBanks is elided: it would issue the PIM command
    // to every bank selected in bankMask
    return this.dispatchToBanks(operation, bankMask);
  }
}
CXL Protocol and Memory Pooling
CXL (Compute Express Link) extends PCIe with cache-coherent memory semantics, enabling disaggregated memory pools that multiple hosts can share—revolutionizing data center efficiency through dynamic memory allocation and tiered memory architectures.
┌─────────────────────────────────────────────────────────────┐
│                     CXL Memory Pooling                      │
│   ┌───────┐       ┌───────┐       ┌───────┐                 │
│   │Host 1 │       │Host 2 │       │Host 3 │                 │
│   └───┬───┘       └───┬───┘       └───┬───┘                 │
│       │               │               │    CXL 2.0 Switch   │
│   ════╪═══════════════╪═══════════════╪═══════════════════  │
│       │               │               │                     │
│   ┌───┴───┬───────────┴───┬───────────┴───┐                 │
│   │Pool 1 │    Pool 2     │    Pool 3     │  Shared Memory  │
│   │ DDR5  │    DDR5       │    PMem       │  Pools          │
│   └───────┴───────────────┴───────────────┘                 │
└─────────────────────────────────────────────────────────────┘

CXL Protocol Types:
┌──────────┬─────────────────────────────────────────────────┐
│ CXL.io   │ PCIe-equivalent I/O (discovery, config)         │
│ CXL.cache│ Device caches host memory (accelerators)        │
│ CXL.mem  │ Host accesses device memory (expansion)         │
└──────────┴─────────────────────────────────────────────────┘
// Conceptual CXL memory pool manager
// (findPool, setupCoherency, and dmaTransfer are elided helpers)
class CXLMemoryPool {
  constructor() {
    this.pools = new Map(); // poolId -> { capacity, allocated, type }
    this.hostAllocations = new Map();
  }

  allocate(hostId, size, type = 'DDR5') {
    // Dynamic allocation from a shared pool
    const pool = this.findPool(size, type);
    const region = {
      baseAddr: pool.allocate(size),
      size,
      coherencyDomain: this.setupCoherency(hostId)
    };
    this.hostAllocations.set(hostId, region);
    return region; // Host sees this as normal memory via CXL.mem
  }

  migrate(hostId, fromPool, toPool) {
    // Live migration between memory tiers (e.g., DDR5 → PMem)
    return this.dmaTransfer(fromPool, toPool, this.hostAllocations.get(hostId));
  }
}