Memory Systems Architecture: From DDR5 to Cache Coherence
Latency is the bottleneck of modern computing. This article deconstructs the memory subsystem, tracing the data path from physical DRAM modules through the complexities of L1-L3 caching, coherence protocols, and advanced memory controllers.
RAM Fundamentals
RAM (Random Access Memory) is volatile high-speed memory that stores currently running programs and data, losing all contents when power is removed. It's called "random access" because any memory location can be accessed in constant time, unlike sequential storage. Modern DDR5 (Double Data Rate 5) operates at 4800-8000+ MT/s (megatransfers per second), with typical desktop systems having 16-64GB, structured as DIMMs (Dual Inline Memory Modules) that connect to the CPU's memory controller.
RAM (DIMM Module)

  ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗
  ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║ ║ C ║
  ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║ ║ H ║  ← Chips
  ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║ ║ I ║
  ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║ ║ P ║
  ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝
  ┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴
            ↑ 288 edge-connector pins (DDR5)

MEMORY HIERARCHY (Speed vs Capacity tradeoff):

  Speed           ┌─────┐
    ▲   Fastest   │ L1  │ ← 64KB,  ~1ns
    │             ├─────┤
    │             │ L2  │ ← 512KB, ~3ns
    │             ├─────┤
    │             │ L3  │ ← 32MB,  ~10ns
    │             ├─────┤
    │             │ RAM │ ← 32GB,  ~50ns
    │             ├─────┤
    │             │ SSD │ ← 1TB,   ~100μs
    │             ├─────┤
    ▼   Slowest   │ HDD │ ← 4TB,   ~10ms
                  └─────┘
        Small ◄── Capacity ──► Large
// RAM concepts and calculations
const ramModule = {
  type: 'DDR5',
  capacity: 32,      // GB per module
  speed: 6000,       // MT/s (MegaTransfers per second)
  voltage: 1.1,      // Volts (DDR5 is more power efficient)
  cas_latency: 36,   // CL36 (Column Access Strobe)
  formFactor: 'DIMM'
};

// Calculate theoretical bandwidth
function calculateBandwidth(speedMTs, busWidth = 64) {
  // DDR5 has two 32-bit channels per module (64 bits total)
  const bytesPerTransfer = busWidth / 8; // 8 bytes
  const bandwidthGBs = (speedMTs * 1e6 * bytesPerTransfer) / 1e9;
  return bandwidthGBs;
}
console.log(`Bandwidth: ${calculateBandwidth(6000)}GB/s`); // ~48GB/s per module

// Memory latency calculation
function calculateLatencyNs(speedMTs, casLatency) {
  const clockSpeed = speedMTs / 2; // DDR = double data rate
  const nsPerCycle = 1e9 / (clockSpeed * 1e6);
  return casLatency * nsPerCycle;
}
console.log(`True latency: ${calculateLatencyNs(6000, 36).toFixed(1)}ns`); // ~12ns

// Why dual-channel matters
const singleChannel = calculateBandwidth(6000);
const dualChannel = calculateBandwidth(6000) * 2;
console.log(`Single: ${singleChannel}GB/s, Dual: ${dualChannel}GB/s`);
Memory Hierarchy
Computer memory is organized in a hierarchy trading off speed, size, and cost—registers are fastest but smallest, followed by cache, RAM, and storage. The CPU checks each level sequentially, with most accesses hitting faster, smaller levels due to locality of reference.
Memory Hierarchy Pyramid

 ▲ Faster, Smaller, More Expensive (per byte)
 │
 │             ╱╲
 │            ╱  ╲           Registers:   ~1 cycle,          ~KB
 │           ╱────╲
 │          ╱  L1  ╲         L1 Cache:    ~4 cycles,         32-64KB
 │         ╱────────╲
 │        ╱    L2    ╲       L2 Cache:    ~10 cycles,        256KB-1MB
 │       ╱────────────╲
 │      ╱      L3      ╲     L3 Cache:    ~40 cycles,        8-64MB
 │     ╱────────────────╲
 │    ╱   Main Memory    ╲   RAM:         ~100 cycles,       8GB-128GB
 │   ╱────────────────────╲
 │  ╱  SSD / HDD Storage   ╲ SSD:         ~10,000 cycles,    256GB-10TB+
 │ ╱────────────────────────╲HDD:         ~10,000,000 cycles
 ▼ Slower, Larger, Cheaper
// Memory access time simulation
const memoryHierarchy = [
  { level: 'Register', accessNs: 0.25,   sizeKB: 0.001,    hitRate: 0.90 },
  { level: 'L1 Cache', accessNs: 1,      sizeKB: 64,       hitRate: 0.95 },
  { level: 'L2 Cache', accessNs: 4,      sizeKB: 512,      hitRate: 0.97 },
  { level: 'L3 Cache', accessNs: 10,     sizeKB: 8192,     hitRate: 0.99 },
  { level: 'RAM',      accessNs: 100,    sizeKB: 16777216, hitRate: 0.999 },
  { level: 'SSD',      accessNs: 100000, sizeKB: 1e9,      hitRate: 1.0 }
];

// Calculate effective access time using hit rates
function effectiveAccessTime(hierarchy) {
  let missRate = 1;
  let totalTime = 0;
  hierarchy.forEach(level => {
    const hitAtThisLevel = missRate * level.hitRate;
    totalTime += hitAtThisLevel * level.accessNs;
    missRate *= (1 - level.hitRate);
  });
  return totalTime;
}

console.log(`Effective access time: ${effectiveAccessTime(memoryHierarchy).toFixed(2)} ns`);
SRAM vs DRAM
SRAM (Static RAM) uses flip-flops to store each bit, making it fast but expensive and power-hungry—used for CPU caches. DRAM (Dynamic RAM) stores bits as charges in capacitors, offering higher density and lower cost but requiring constant refresh cycles—used for main memory.
SRAM Cell (6 Transistors)          DRAM Cell (1T-1C)

      VDD                              Bit Line
       │                                  │
   ┌───┴───┐                           ┌──┴──┐
 ──┤ FLIP  ├──                     ────┤  T  │
   │ FLOP  │                           └──┬──┘
   └───┬───┘                           ┌──┴──┐
       │                               │  C  │ ← Capacitor
      GND                              └──┬──┘   (stores bit)
                                         GND
 - Fast (~1ns)                      - Slower (~50ns)
 - No refresh                       - Needs refresh
 - 6 transistors/bit                - 1 transistor/bit
 - Used in caches                   - Used in main RAM
// SRAM vs DRAM characteristics comparison
const memoryTypes = {
  SRAM: {
    transistorsPerBit: 6,
    accessTimeNs: 1,
    needsRefresh: false,
    relativeCost: 20,
    typicalUse: 'CPU Cache'
  },
  DRAM: {
    transistorsPerBit: 1,
    accessTimeNs: 50,
    needsRefresh: true,
    refreshIntervalMs: 64,
    relativeCost: 1,
    typicalUse: 'Main Memory'
  }
};

// DRAM refresh calculation
const dramRefresh = {
  rowCount: 8192,
  refreshIntervalMs: 64,
  get refreshesPerSecond() {
    return 1000 / this.refreshIntervalMs * this.rowCount;
  }
};
console.log(`DRAM performs ${dramRefresh.refreshesPerSecond.toLocaleString()} refreshes/second`);
ROM, PROM, EPROM, EEPROM
These are non-volatile memory types that retain data without power: ROM is factory-programmed, PROM is one-time programmable by users, EPROM can be erased with UV light and reprogrammed, and EEPROM/Flash can be electrically erased and rewritten—Flash memory is what's in your SSD and USB drives.
Non-Volatile Memory Evolution

┌─────────┬──────────┬───────────┬───────────────┬───────────┐
│ Type    │ Writable │ Erasable  │ Method        │ Use Case  │
├─────────┼──────────┼───────────┼───────────────┼───────────┤
│ ROM     │ Factory  │ Never     │ Mask at fab   │ Firmware  │
│ PROM    │ Once     │ Never     │ Burn fuses    │ Prototype │
│ EPROM   │ Multiple │ Entire    │ UV light      │ Dev/Test  │
│ EEPROM  │ Multiple │ Byte-wise │ Electrical    │ Settings  │
│ Flash   │ Multiple │ Block     │ Electrical    │ SSD, USB  │
└─────────┴──────────┴───────────┴───────────────┴───────────┘

EPROM with UV Window:
   ┌─────────────┐
   │   ┌─────┐   │
   │   │ UV  │   │  ← Quartz window
   │   │ WIN │   │    for erasure
   │   └─────┘   │
 ──┤             ├──
 ──┤    EPROM    ├──
 ──┤             ├──
   └─────────────┘
Memory Addressing
Each byte in memory has a unique address, allowing the CPU to read or write specific locations. Addressing can be direct (fixed address), indirect (address in register), indexed (base + offset), or various other modes that give programmers flexibility in accessing data.
// Memory addressing modes demonstration
class MemoryAddressing {
  constructor() {
    this.memory = new Uint8Array(256);
    this.registers = { A: 0, B: 0, X: 0 }; // X is index register
    // Initialize some memory
    this.memory[0x10] = 42;
    this.memory[0x20] = 100;
    this.memory[0x30] = 0x20; // Contains address 0x20
  }

  // Different addressing modes
  immediate(value) { return value; }                              // Value itself
  direct(address) { return this.memory[address]; }                // Direct address
  indirect(address) { return this.memory[this.memory[address]]; } // Address at address
  indexed(base, index) { return this.memory[base + index]; }      // Base + index
  register(reg) { return this.registers[reg]; }                   // Register value

  demo() {
    this.registers.X = 5;
    console.log('Immediate #42:  ', this.immediate(42));                   // 42
    console.log('Direct $10:     ', this.direct(0x10));                    // 42
    console.log('Indirect ($30): ', this.indirect(0x30));                  // 100
    console.log('Indexed $10,X:  ', this.indexed(0x10, this.registers.X)); // mem[0x15]
  }
}

new MemoryAddressing().demo();
Memory Addressing Modes

IMMEDIATE:  MOV A, #42      ; A = 42 (value in instruction)

DIRECT:     MOV A, $1000    ; A = memory[0x1000]
                  │
                  ▼
              ┌───────┐
       $1000: │  42   │
              └───────┘

INDIRECT:   MOV A, ($1000)  ; A = memory[memory[0x1000]]
                  │
                  ▼
              ┌───────┐     ┌───────┐
       $1000: │ $2000 │────▶│  42   │
              └───────┘     └───────┘
                             $2000

INDEXED:    MOV A, $1000,X  ; A = memory[0x1000 + X] (where X = 5)
                  │
                  ▼
              ┌───────┐
       $1005: │  42   │
              └───────┘
Endianness
Endianness determines how multi-byte values are stored in memory: Big-Endian stores the most significant byte first (at lowest address), while Little-Endian stores the least significant byte first. x86 uses Little-Endian; network protocols typically use Big-Endian.
Storing 0x12345678 in Memory

BIG-ENDIAN (Network byte order, PowerPC, SPARC)
"Big end first"

  Address:  $00   $01   $02   $03
          ┌─────┬─────┬─────┬─────┐
          │ 0x12│ 0x34│ 0x56│ 0x78│
          └─────┴─────┴─────┴─────┘
            MSB               LSB

LITTLE-ENDIAN (x86, ARM default)
"Little end first"

  Address:  $00   $01   $02   $03
          ┌─────┬─────┬─────┬─────┐
          │ 0x78│ 0x56│ 0x34│ 0x12│
          └─────┴─────┴─────┴─────┘
            LSB               MSB
// Endianness detection and conversion
function detectEndianness() {
  const buffer = new ArrayBuffer(4);
  const int32 = new Uint32Array(buffer);
  const int8 = new Uint8Array(buffer);
  int32[0] = 0x12345678;
  if (int8[0] === 0x78) return 'Little-Endian';
  if (int8[0] === 0x12) return 'Big-Endian';
  return 'Unknown';
}

// Convert between endianness (>>> 0 keeps the result an unsigned 32-bit value)
function swapEndian32(value) {
  return (((value & 0xFF) << 24) |
          ((value & 0xFF00) << 8) |
          ((value >> 8) & 0xFF00) |
          ((value >> 24) & 0xFF)) >>> 0;
}

console.log(`This system is: ${detectEndianness()}`);
console.log(`0x12345678 swapped: 0x${swapEndian32(0x12345678).toString(16)}`);

// DataView for explicit endianness control
const buffer = new ArrayBuffer(4);
const view = new DataView(buffer);
view.setUint32(0, 0x12345678, true);  // true = little-endian
console.log('LE bytes:', new Uint8Array(buffer)); // [0x78, 0x56, 0x34, 0x12]
view.setUint32(0, 0x12345678, false); // false = big-endian
console.log('BE bytes:', new Uint8Array(buffer)); // [0x12, 0x34, 0x56, 0x78]
Memory Modules (SIMM, DIMM, SO-DIMM)
Memory modules are the physical packages containing RAM chips that plug into motherboards. SIMMs (obsolete) came in 30-pin variants with an 8-bit data path and 72-pin variants with a 32-bit path; DIMMs (desktop, 64-bit path) are standard today, and SO-DIMMs are the smaller form factor used in laptops.
Memory Module Evolution

SIMM (30-pin) - 1980s            SIMM (72-pin) - Early 90s
┌──────────────────┐             ┌────────────────────────────┐
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│             │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│    8-bit path    │             │        32-bit path         │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘             └┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
      30 pins                               72 pins

DIMM (168/240/288-pin) - Desktop
┌──────────────────────────────────────────────────────┐
│ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐   │
│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│ │▓▓│   │ ← RAM chips
│ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘ └──┘   │
│              64-bit path (DDR4/DDR5)                 │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬╔═╦═╗┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
                         ╚═╩═╝ Notch (keying)
                      288 pins (DDR4)

SO-DIMM (Laptop) - Smaller form factor
┌────────────────────────────────┐
│┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐┌─┐  │
││▓││▓││▓││▓││▓││▓││▓││▓││▓││▓│  │
│└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘└─┘  │
└┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┘
     ~67% length of full DIMM
// DDR memory specifications comparison
const ddrGenerations = [
  { gen: 'DDR1', pins: 184, voltage: 2.5, speeds: '200-400 MT/s',   year: 2000 },
  { gen: 'DDR2', pins: 240, voltage: 1.8, speeds: '400-1066 MT/s',  year: 2003 },
  { gen: 'DDR3', pins: 240, voltage: 1.5, speeds: '800-2133 MT/s',  year: 2007 },
  { gen: 'DDR4', pins: 288, voltage: 1.2, speeds: '1600-3200 MT/s', year: 2014 },
  { gen: 'DDR5', pins: 288, voltage: 1.1, speeds: '4800-8400 MT/s', year: 2020 }
];
console.table(ddrGenerations);

// Calculate theoretical bandwidth
function calculateBandwidth(transferRateMT, busWidthBits = 64) {
  // Bandwidth = Transfer Rate × Bus Width / 8 (bits to bytes)
  return (transferRateMT * 1e6 * busWidthBits / 8) / 1e9; // GB/s
}
console.log(`DDR4-3200 bandwidth: ${calculateBandwidth(3200).toFixed(1)} GB/s`);
console.log(`DDR5-6400 bandwidth: ${calculateBandwidth(6400).toFixed(1)} GB/s`);
DDR Evolution (DDR1-DDR5)
DDR (Double Data Rate) SDRAM transfers data on both rising and falling clock edges, doubling effective bandwidth. Each generation has doubled data rates (DDR4: 3200 MT/s, DDR5: 6400+ MT/s) while reducing voltage for efficiency—DDR5 also adds on-die ECC and dual 32-bit channels per module for improved reliability and bandwidth.
DDR Evolution Timeline

Generation   Year   Voltage   Speed (MT/s)   Bandwidth/module
──────────────────────────────────────────────────────────────
DDR1         2000   2.5V      200-400        3.2 GB/s
DDR2         2003   1.8V      400-1066       8.5 GB/s
DDR3         2007   1.5V      800-2133       17.0 GB/s
DDR4         2014   1.2V      1600-3200      25.6 GB/s
DDR5         2020   1.1V      4800-8400+     67.2 GB/s

DDR Transfer Timing:

Clock  ──┐  ┌──┐  ┌──┐  ┌──┐  ┌──
         └──┘  └──┘  └──┘  └──┘

SDR    ─X────X────X────X────X────X────
        └─ Transfer on rising edge only

DDR    ─X──X──X──X──X──X──X──X──X──X──
        └─ Transfer on BOTH edges = 2x

DDR5 Dual-Channel Per DIMM:
┌────────────────────────────────────────────┐
│                 DDR5 DIMM                  │
│  ┌──────────────────┬──────────────────┐   │
│  │    Channel A     │    Channel B     │   │
│  │     32-bit       │     32-bit       │   │
│  │    (+ 8 ECC)     │    (+ 8 ECC)     │   │
│  └──────────────────┴──────────────────┘   │
└────────────────────────────────────────────┘
vs DDR4: Single 64-bit channel per DIMM
// DDR bandwidth calculation
function calculateDDRBandwidth(transferRate, busWidth = 64, channels = 1) {
  // Bandwidth = Transfer Rate × Bus Width (bytes) × Channels
  return (transferRate * 1e6) * (busWidth / 8) * channels / 1e9;
}

const configs = [
  { name: 'DDR4-2400 Single', rate: 2400, channels: 1 },
  { name: 'DDR4-3200 Dual',   rate: 3200, channels: 2 },
  { name: 'DDR5-6400 Dual',   rate: 6400, channels: 2 },
  { name: 'DDR5-8000 Quad',   rate: 8000, channels: 4 }
];

configs.forEach(c => {
  const bw = calculateDDRBandwidth(c.rate, 64, c.channels);
  console.log(`${c.name}: ${bw.toFixed(1)} GB/s`);
});
Memory Timings and Latency
Memory timings (CAS Latency, tRCD, tRP, tRAS) measure the delays in clock cycles for various memory operations—lower is faster. CAS Latency (CL) is most critical, representing cycles between column address and data availability; DDR5-6400 CL40 has similar absolute latency to DDR4-3200 CL16 because DDR5's faster clock compensates for higher cycle counts.
Memory Timing Parameters

┌──────────────────────────────────────────────────────────────┐
│ DDR4-3200 CL16-18-18-36 (typical gaming RAM)                 │
│                                                              │
│ CL (CAS Latency): 16 cycles - Column address to data         │
│ tRCD:             18 cycles - Row to Column delay            │
│ tRP:              18 cycles - Row Precharge time             │
│ tRAS:             36 cycles - Row Active time                │
└──────────────────────────────────────────────────────────────┘

Memory Access Sequence:

Time ─────────────────────────────────────────────────────────►
     │◄── tRCD ──►│◄───── CL ─────►│
     ▼            ▼                ▼
     ┌────────────┬────────────────┬────────────┐
     │  Row Addr  │  Column Addr   │    DATA    │
     │ (Activate) │    (Read)      │            │
     └────────────┴────────────────┴────────────┘
     │◄─────────────── tRAS ───────────────────►│◄─── tRP ────►│
                                                  (Precharge)

Absolute Latency Calculation:
┌────────────────────────────────────────────┐
│ Latency (ns) = CL / (Transfer Rate / 2000) │
└────────────────────────────────────────────┘
DDR4-3200 CL16: 16 / (3200/2000) = 10.0 ns
DDR5-6400 CL40: 40 / (6400/2000) = 12.5 ns
// Calculate actual memory latency
function calculateLatency(transferRate, casLatency) {
  // Transfer rate is in MT/s (megatransfers)
  // Actual clock = transfer rate / 2 (DDR = double data rate)
  const actualClock = transferRate / 2;   // MHz
  const clockPeriod = 1000 / actualClock; // ns per cycle
  return casLatency * clockPeriod;
}

const modules = [
  { name: 'DDR4-2400 CL14', rate: 2400, cl: 14 },
  { name: 'DDR4-3200 CL16', rate: 3200, cl: 16 },
  { name: 'DDR4-3600 CL18', rate: 3600, cl: 18 },
  { name: 'DDR5-6000 CL36', rate: 6000, cl: 36 },
  { name: 'DDR5-6400 CL40', rate: 6400, cl: 40 }
];

console.log("Module               Latency (ns)");
console.log("─────────────────────────────────");
modules.forEach(m => {
  console.log(`${m.name.padEnd(20)} ${calculateLatency(m.rate, m.cl).toFixed(2)} ns`);
});
Dual/Quad Channel Memory
Multi-channel memory configurations increase bandwidth by allowing simultaneous access to multiple memory modules through independent channels. Dual-channel doubles theoretical bandwidth by interleaving data across two 64-bit channels (128-bit total), while quad-channel (found in HEDT/server platforms) provides 256-bit access—modules must be matched and installed in correct slots.
Memory Channel Configurations

Single Channel:
┌─────────────────────────────────────────┐
│            Memory Controller            │
│                   │                     │
│              [64-bit bus]               │
│                   │                     │
│             ┌─────┴─────┐               │
│             │   DIMM    │               │
│             └───────────┘               │
└─────────────────────────────────────────┘
Bandwidth: 1x (e.g., 25.6 GB/s for DDR4-3200)

Dual Channel:
┌─────────────────────────────────────────┐
│            Memory Controller            │
│              ┌─────┴─────┐              │
│          [64-bit]    [64-bit]           │
│              │           │              │
│         ┌────┴────┐ ┌────┴────┐         │
│         │ DIMM A1 │ │ DIMM B1 │         │
│         └─────────┘ └─────────┘         │
│            Ch A        Ch B             │
└─────────────────────────────────────────┘
Bandwidth: 2x (e.g., 51.2 GB/s for DDR4-3200)

Motherboard Slots (Color-coded):

Slot:   A1     A2     B1     B2
       ┌───┐  ┌───┐  ┌───┐  ┌───┐
       │███│  │░░░│  │███│  │░░░│
       │███│  │░░░│  │███│  │░░░│
       └───┘  └───┘  └───┘  └───┘
      Chan A Chan A Chan B Chan B

For dual-channel: Install in A1 + B1 (same color)
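The address-to-channel mapping makes this concrete. Below is a minimal sketch of cache-line interleaving across channels; the simple modulo mapping and the channelFor helper are illustrative assumptions, since real controllers use vendor-specific, often hashed, mappings.

// Sketch: striping consecutive cache lines across channels
// (modulo mapping is an assumption for illustration)
function channelFor(physicalAddr, channels = 2, lineSize = 64) {
  return Math.floor(physicalAddr / lineSize) % channels;
}

// Sequential 64-byte lines alternate channels, so a streaming
// read keeps both channels busy at once
for (let addr = 0; addr < 4 * 64; addr += 64) {
  console.log(`0x${addr.toString(16).padStart(4, '0')} -> Channel ${channelFor(addr)}`);
}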
ECC Memory
ECC (Error-Correcting Code) memory adds extra bits to detect and correct single-bit errors, critical for servers and workstations where data integrity is paramount. Using a Hamming code variant, 8 ECC bits per 64 data bits can correct any single-bit error and detect two-bit errors—the ~3% performance overhead and higher cost make it uncommon in consumer systems.
ECC Memory Operation

Standard Memory: 64 data bits
┌────────────────────────────────────────────┐
│ D63 D62 D61 ... D2 D1 D0                   │
└────────────────────────────────────────────┘

ECC Memory: 64 data bits + 8 ECC bits = 72 bits
┌────────────────────────────────────────────┬────────┐
│ D63 D62 D61 ... D2 D1 D0                   │ECC bits│
└────────────────────────────────────────────┴────────┘

Error Detection & Correction:

Write:  Data ──────┬──────► Memory
                   │
             ┌─────▼─────┐
             │ Calculate │
             │    ECC    │
             └─────┬─────┘
                   │
             ECC bits ────► Memory

Read:   Memory ───┬───► Data (possibly corrupted)
                  │
        Memory ───┼───► Stored ECC
                  │
            ┌─────▼─────┐
            │ Compare & │
            │  Correct  │──► Corrected Data
            └─────┬─────┘
                  │
      ┌───────────┼────────────────┐
      │           │                │
   No Error   Single-bit Error  Multi-bit Error
              (corrected)       (detected only)
// Simplified ECC demonstration (Hamming-like)
function calculateParity(data, positions) {
  return positions.reduce((parity, pos) => parity ^ ((data >> pos) & 1), 0);
}

// 8-bit data example (real ECC uses 64 bits)
function generateECC(data) {
  // Each parity bit checks a different subset of data bits
  const p1 = calculateParity(data, [0, 1, 3, 4, 6]); // data bits 1,2,4,5,7
  const p2 = calculateParity(data, [0, 2, 3, 5, 6]); // data bits 1,3,4,6,7
  const p4 = calculateParity(data, [1, 2, 3, 7]);    // data bits 2,3,4,8
  const p8 = calculateParity(data, [4, 5, 6, 7]);    // data bits 5,6,7,8
  return (p8 << 3) | (p4 << 2) | (p2 << 1) | p1;
}

function checkECC(data, ecc) {
  const calculated = generateECC(data);
  const syndrome = calculated ^ ecc;
  if (syndrome === 0) return { status: 'OK', data };
  // Syndrome indicates error position
  return { status: 'ERROR', position: syndrome, data };
}

const testData = 0b10110010;
const ecc = generateECC(testData);
console.log(`Data: ${testData.toString(2).padStart(8, '0')}, ECC: ${ecc.toString(2).padStart(4, '0')}`);
console.log(checkECC(testData, ecc));        // OK
console.log(checkECC(testData ^ 0x04, ecc)); // Error detected
Memory Interleaving
Memory interleaving distributes consecutive memory addresses across multiple memory banks or channels, allowing parallel access and reducing wait times when accessing sequential data. Bank interleaving hides row activation latency, while channel interleaving maximizes bandwidth—modern memory controllers automatically manage this for optimal performance.
Memory Interleaving

Without Interleaving (all sequential in one bank):
Address:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Bank 0:  [0][1][2][3][4][5][6][7][8][9][A][B][C][D][E][F]
Bank 1:  [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

Access pattern: must wait for each access in the same bank
  ───►───►───►───►───►  (sequential, slow)

With 2-Way Interleaving:
Address:  0  2  4  6  8 10 12 14
Bank 0:  [0][2][4][6][8][A][C][E]  ← Even addresses
Address:  1  3  5  7  9 11 13 15
Bank 1:  [1][3][5][7][9][B][D][F]  ← Odd addresses

Access pattern: banks accessed in parallel!
  Bank 0: ───►───►───►───►
  Bank 1:   ───►───►───►───►  (overlapped, fast)

Channel Interleaving (64-byte cache line example):
  Cache Line 0 (Addr   0-63):  Channel A
  Cache Line 1 (Addr  64-127): Channel B
  Cache Line 2 (Addr 128-191): Channel A
  Cache Line 3 (Addr 192-255): Channel B

Sequential read fetches from alternating channels
= 2x effective bandwidth for streaming
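A small simulation shows why bank interleaving helps: with addresses striped across banks, one bank's row activation overlaps another bank's data transfer. This is a toy timing model with illustrative latencies (35ns activate, 15ns CAS/burst), not datasheet numbers.

// Toy model: time to stream N reads when addresses are striped
// across `banks` banks (latencies are illustrative)
function streamTime(requests, banks, tActivate = 35, tCas = 15) {
  const bankFreeAt = new Array(banks).fill(0); // when each bank can start again
  let busFreeAt = 0;                           // when the shared data bus is free
  for (let i = 0; i < requests; i++) {
    const bank = i % banks;                    // addresses striped across banks
    const start = Math.max(busFreeAt, bankFreeAt[bank]);
    bankFreeAt[bank] = start + tActivate + tCas; // bank busy: activate + read
    busFreeAt = start + tCas;                    // bus busy only during the burst
  }
  return Math.max(...bankFreeAt);
}

console.log(`1 bank:  ${streamTime(16, 1)} ns`); // 800 ns - fully serialized
console.log(`4 banks: ${streamTime(16, 4)} ns`); // 275 ns - activations overlap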
Virtual Memory Basics
Virtual memory provides each process with its own isolated address space, using the CPU's Memory Management Unit (MMU) to translate virtual addresses to physical addresses via page tables. This enables running programs larger than physical RAM (paging to disk), memory protection between processes, and simplified memory allocation for programmers.
Virtual Memory Address Translation

Process A sees:            Process B sees:
┌──────────────┐           ┌──────────────┐
│ 0xFFFFFFFF   │           │ 0xFFFFFFFF   │
│    Stack     │           │    Stack     │
│      ↓       │           │      ↓       │
│              │           │              │
│      ↑       │           │      ↑       │
│    Heap      │           │    Heap      │
│    Code      │           │    Code      │
│ 0x00000000   │           │ 0x00000000   │
└──────┬───────┘           └──────┬───────┘
       │                          │
       │    ┌─────────────────┐   │
       └───►│   Page Tables   │◄──┘
            │       MMU       │
            └────────┬────────┘
                     ▼
Physical RAM:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A  │ B  │ A  │Free│ B  │ A  │ B  │ A  │
│Code│Heap│Heap│    │Stk │Stk │Code│Data│
└────┴────┴────┴────┴────┴────┴────┴────┘

Page Table Entry:
┌────────────────────┬───────────────────┐
│  Physical Frame #  │ P│R│W│U│D│A│...   │
└────────────────────┴───────────────────┘
P=Present, R=Read, W=Write, U=User, D=Dirty, A=Accessed
// Virtual memory simulation
class VirtualMemory {
  constructor(physicalPages = 16, virtualPages = 64) {
    this.pageSize = 4096; // 4KB pages
    this.physicalPages = physicalPages;
    this.virtualPages = virtualPages;
    // Page table: virtual page -> physical frame (or undefined if not present)
    this.pageTable = new Map();
    this.freeFrames = [...Array(physicalPages).keys()];
    this.disk = new Map(); // Swapped pages
  }

  translateAddress(virtualAddr) {
    const virtualPage = Math.floor(virtualAddr / this.pageSize);
    const offset = virtualAddr % this.pageSize;
    let frame = this.pageTable.get(virtualPage);
    if (frame === undefined) {
      // Page fault!
      frame = this.handlePageFault(virtualPage);
    }
    const physicalAddr = frame * this.pageSize + offset;
    return { virtualAddr, physicalAddr, page: virtualPage, frame, offset };
  }

  handlePageFault(virtualPage) {
    console.log(`Page fault on virtual page ${virtualPage}`);
    if (this.freeFrames.length === 0) {
      // Need to swap out a page (naive FIFO victim selection)
      const [victimPage] = this.pageTable.entries().next().value;
      const victimFrame = this.pageTable.get(victimPage);
      this.disk.set(victimPage, victimFrame);
      this.pageTable.delete(victimPage);
      this.freeFrames.push(victimFrame);
      console.log(`  Swapped out page ${victimPage}`);
    }
    const frame = this.freeFrames.pop();
    this.pageTable.set(virtualPage, frame);
    return frame;
  }
}

const vm = new VirtualMemory(4, 16); // 4 physical pages, 16 virtual
console.log(vm.translateAddress(0x1234)); // Page 1  - fault, takes a free frame
console.log(vm.translateAddress(0x5678)); // Page 5  - fault
console.log(vm.translateAddress(0x9ABC)); // Page 9  - fault
console.log(vm.translateAddress(0xDEF0)); // Page 13 - fault, fills the last frame
console.log(vm.translateAddress(0xFFFF)); // Page 15 - fault, swaps out page 1
Memory Mapping
Memory mapping creates a direct correspondence between file contents or device registers and process memory addresses, allowing programs to access files as if they were arrays in memory. This technique (mmap on Unix) is used for loading executables, shared libraries, inter-process communication, and efficient file I/O without explicit read/write system calls.
Memory-Mapped File

Traditional File I/O:
┌─────────┐   read()   ┌─────────┐   copy   ┌─────────┐
│  File   │───────────►│ Kernel  │─────────►│  User   │
│ on Disk │            │ Buffer  │          │ Buffer  │
└─────────┘            └─────────┘          └─────────┘
= 2 copies, system call overhead

Memory-Mapped I/O:
┌─────────┐            ┌───────────────────────────────┐
│  File   │◄──────────►│    Process Address Space      │
│ on Disk │    MMU     │  ┌─────────────────────────┐  │
└─────────┘    maps    │  │     mmap'd region       │  │
     ▲         pages   │  │  (file appears as       │  │
     │                 │  │   memory array)         │  │
┌────┴────┐            │  └─────────────────────────┘  │
│  Page   │            └───────────────────────────────┘
│  Cache  │
└─────────┘
= Zero-copy access, demand paging

Memory-Mapped Devices (MMIO):

Physical Address Space:
┌────────────────────────────────────────────┐
│ 0x00000000 - 0x3FFFFFFF: RAM               │
├────────────────────────────────────────────┤
│ 0x40000000 - 0x4000FFFF: GPU Registers     │ ← Write here
├────────────────────────────────────────────┤   = GPU command
│ 0x40010000 - 0x400100FF: Network Card      │
├────────────────────────────────────────────┤
│ 0xFFFFF000 - 0xFFFFFFFF: BIOS ROM          │
└────────────────────────────────────────────┘
// Node.js memory-mapped file example (conceptual)
// Real implementation would use the 'mmap-io' package
/*
const mmap = require('mmap-io');
const fs = require('fs');

const fd = fs.openSync('data.bin', 'r+');
const size = fs.fstatSync(fd).size;

// Map file into memory
const buffer = mmap.map(size, mmap.PROT_READ | mmap.PROT_WRITE,
                        mmap.MAP_SHARED, fd, 0);

// Now access file as if it were an array!
buffer[0] = 0x42;          // Writes directly to file
const byte = buffer[1000]; // Reads from file without syscall
mmap.sync(buffer);         // Ensure changes are written
*/

// Simulation of the memory-mapped concept
class MemoryMappedFile {
  constructor(size) {
    this.data = new Uint8Array(size);
    this.dirtyPages = new Set();
    this.pageSize = 4096;
  }

  read(offset) {
    // In real mmap, this might trigger a page fault
    // to load data from disk on first access
    return this.data[offset];
  }

  write(offset, value) {
    this.data[offset] = value;
    this.dirtyPages.add(Math.floor(offset / this.pageSize));
  }

  sync() {
    // Write dirty pages back to "disk"
    console.log(`Syncing ${this.dirtyPages.size} dirty pages`);
    this.dirtyPages.clear();
  }
}
Cache Memory (L1, L2, L3)
Cache memory is a small, extremely fast SRAM hierarchy positioned between the CPU and main RAM to reduce memory access latency. L1 cache (32-64KB) is split into instruction and data caches, runs at CPU speed, and has ~4 cycle latency; L2 cache (256KB-1MB) is per-core with ~12 cycle latency; L3 cache (8-64MB) is shared across all cores with ~40 cycle latency and, in non-inclusive designs, acts as a victim cache for lines evicted from L2.
┌──────────────────────────────────────────────────────────┐
│                           CPU                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Core 0  │  │  Core 1  │  │  Core 2  │  │  Core 3  │  │
│  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │┌──┐ ┌──┐ │  │
│  ││L1│ │L1│ │  ││L1│ │L1│ │  ││L1│ │L1│ │  ││L1│ │L1│ │  │
│  ││I │ │D │ │  ││I │ │D │ │  ││I │ │D │ │  ││I │ │D │ │  │
│  │└──┘ └──┘ │  │└──┘ └──┘ │  │└──┘ └──┘ │  │└──┘ └──┘ │  │
│  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │  │
│  │ │  L2  │ │  │ │  L2  │ │  │ │  L2  │ │  │ │  L2  │ │  │
│  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       └────────────┬┴──────────┬──┴─────────────┘        │
│                    ┌───────────┐                         │
│                    │ L3 Cache  │ (Shared)                │
│                    │  8-64 MB  │                         │
│                    └─────┬─────┘                         │
└──────────────────────────┼──────────────────────────────┘
                           │
                  ┌────────┴────────┐
                  │   Main Memory   │
                  │  (DDR4/DDR5)    │
                  └─────────────────┘
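These per-level latencies combine into a single expected cost via the standard AMAT (Average Memory Access Time) formula. The sketch below uses illustrative latencies and hit rates, not measurements from a specific CPU.

// AMAT = L1 + miss(L1) × (L2 + miss(L2) × (L3 + miss(L3) × RAM))
function amat(levels, memLatency) {
  // Fold from the bottom up: each level adds its own latency plus
  // the miss-rate-weighted cost of everything below it
  return levels.reduceRight(
    (costBelow, l) => l.latency + (1 - l.hitRate) * costBelow,
    memLatency
  );
}

const levels = [
  { name: 'L1', latency: 4,  hitRate: 0.95 }, // cycles; illustrative values
  { name: 'L2', latency: 12, hitRate: 0.80 },
  { name: 'L3', latency: 40, hitRate: 0.50 }
];
console.log(`AMAT: ${amat(levels, 300).toFixed(1)} cycles`); // 6.5 cycles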
Cache Coherence Introduction
Cache coherence ensures that multiple CPU cores see a consistent view of memory when they cache the same memory locations, preventing scenarios where one core reads stale data after another core has modified it. Without coherence protocols, a write by Core 0 to address X would be invisible to Core 1 which has its own cached copy, leading to data corruption in multithreaded programs.
Problem without coherence:

Time T0: Memory[X] = 100
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Core 0  │   │ Core 1  │   │ Memory  │
│ X = 100 │   │ X = 100 │   │ X = 100 │
└─────────┘   └─────────┘   └─────────┘

Time T1: Core 0 writes X = 200
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Core 0  │   │ Core 1  │◄──│ X = ??? │  ← INCOHERENT!
│ X = 200 │   │ X = 100 │   │ Memory  │
└─────────┘   └─────────┘   └─────────┘
              (stale data)
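A few lines of code reproduce this scenario: two private caches of the same variable with no protocol between them. This is a deliberately naive sketch of the failure mode, not a model of real hardware.

// Two per-core caches of X, no coherence protocol between them
const memory = { X: 100 };
const core0Cache = { X: memory.X }; // both cores cache X at T0
const core1Cache = { X: memory.X };

core0Cache.X = 200; // T1: Core 0 writes its cached copy (not yet written back)

console.log(`Core 0 reads X = ${core0Cache.X}`); // 200
console.log(`Core 1 reads X = ${core1Cache.X}`); // 100 - stale!
// A coherence protocol would invalidate or update Core 1's copy
// before Core 0's write is allowed to complete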
Cache Replacement Policies
Cache replacement policies determine which cache line to evict when the cache is full and a new line must be loaded; common policies include LRU (Least Recently Used) which evicts the oldest accessed line, FIFO (First-In-First-Out) which evicts in arrival order, Random which selects arbitrarily, and pseudo-LRU which approximates LRU with less hardware overhead. Modern CPUs typically use adaptive policies that combine multiple strategies.
# LRU Cache Implementation
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1  # Cache miss
        self.cache.move_to_end(key)  # Mark as recently used
        return self.cache[key]

    def put(self, key: int, value: int):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # Evict LRU item

# Example: 4-way cache with LRU
# Access sequence: A, B, C, D, E (capacity=4)
# After A: [A]
# After B: [A, B]
# After C: [A, B, C]
# After D: [A, B, C, D]  ← Full
# After E: [B, C, D, E]  ← A evicted (LRU)
Cache Associativity
Cache associativity defines how many possible cache locations a memory address can map to: direct-mapped (1-way) means each address maps to exactly one cache line (fast but conflict-prone), fully associative allows any address in any line (flexible but expensive), and N-way set-associative divides cache into sets where each address maps to one set but can occupy any of N ways (practical compromise used in modern CPUs, typically 8-16 way for L1/L2).
Memory Address: 0x12345678
Cache: 64 sets, 8-way associative, 64-byte lines

Address breakdown (64B line, 64 sets):
┌────────────────────┬──────────┬────────┐
│        Tag         │  Index   │ Offset │
│     (20 bits)      │ (6 bits) │(6 bits)│
└────────────────────┴──────────┴────────┘
          │               │         │
          │               │         └─► Byte within 64B line (0-63)
          │               └─► Which set (0-63)
          └─► Identifies unique memory block

Direct-Mapped (1-way):       8-Way Set Associative:
┌─────────────────┐          ┌─────────────────────────────────┐
│ Set 0: [Line]   │          │ Set 0: [W0][W1][W2][W3]...[W7]  │
│ Set 1: [Line]   │          │ Set 1: [W0][W1][W2][W3]...[W7]  │
│ Set 2: [Line]   │          │ Set 2: [W0][W1][W2][W3]...[W7]  │
│ ...             │          │ ...                             │
└─────────────────┘          └─────────────────────────────────┘
  ↑ Conflicts!                 ↑ 8 choices per set
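The tag/index/offset split is just bit slicing, as the sketch below shows; the field widths are derived from the geometry in the diagram (64 sets, 64-byte lines) rather than from any particular CPU.

// Split an address into tag / set index / byte offset for a
// 64-set, 64-byte-line cache (6 index bits, 6 offset bits)
function decompose(addr, sets = 64, lineSize = 64) {
  const offsetBits = Math.log2(lineSize); // 6
  const indexBits = Math.log2(sets);      // 6
  return {
    offset: addr & (lineSize - 1),
    index: (addr >>> offsetBits) & (sets - 1),
    tag: addr >>> (offsetBits + indexBits)
  };
}

const { tag, index, offset } = decompose(0x12345678);
console.log(`tag=0x${tag.toString(16)}, set=${index}, offset=${offset}`);
// tag=0x12345, set=25, offset=56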
Write-Through vs Write-Back
Write-through immediately propagates every write to both cache and main memory, ensuring consistency but consuming memory bandwidth; write-back (write-behind) only writes to cache and marks the line "dirty," deferring memory updates until eviction, which is more efficient for write-intensive workloads but requires careful handling during eviction and coherence operations. Modern CPUs predominantly use write-back for L1/L2 caches due to its bandwidth efficiency.
WRITE-THROUGH:                       WRITE-BACK:

CPU writes X=5                       CPU writes X=5
      │                                    │
      ▼                                    ▼
  ┌───────┐                            ┌───────┐
  │ Cache │──── Write X=5 ────┐        │ Cache │ (dirty bit = 1)
  │ X = 5 │                   │        │ X = 5 │
  └───────┘                   ▼        └───────┘
                         ┌────────┐        │ (only on eviction)
                         │ Memory │        ▼
                         │ X = 5  │    ┌────────┐
                         └────────┘    │ Memory │
                                       │ X = 5  │
Pros: Simple, consistent               └────────┘
Cons: High bandwidth                   Pros: Low bandwidth
                                       Cons: Complex eviction
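The dirty-bit mechanics are easy to see in a toy write-back cache: writes touch only the cache, and memory is updated when a dirty line is evicted. A minimal sketch, assuming a tiny fully-associative cache with FIFO eviction:

// Minimal write-back cache: memory is only written on dirty eviction
class WriteBackCache {
  constructor(memory, capacity = 2) {
    this.memory = memory;
    this.capacity = capacity;
    this.lines = new Map(); // addr -> { value, dirty }
  }

  write(addr, value) {
    this.evictIfNeeded(addr);
    this.lines.set(addr, { value, dirty: true }); // no memory traffic yet
  }

  evictIfNeeded(addr) {
    if (this.lines.has(addr) || this.lines.size < this.capacity) return;
    const [victimAddr, line] = this.lines.entries().next().value; // FIFO victim
    if (line.dirty) this.memory[victimAddr] = line.value; // write back now
    this.lines.delete(victimAddr);
  }
}

const mem = {};
const cache = new WriteBackCache(mem);
cache.write(0x10, 5);
cache.write(0x20, 6);
console.log(mem[0x10]); // undefined - the write is still only in cache
cache.write(0x30, 7);   // evicts 0x10, forcing the deferred write-back
console.log(mem[0x10]); // 5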
Inclusive vs Exclusive Caches
Inclusive caches guarantee that all data in L1 is also present in L2/L3 (simplifies coherence snooping but wastes capacity), while exclusive caches ensure each line exists in only one cache level (maximizes effective capacity but complicates coherence). AMD uses exclusive L3 caches for maximum capacity utilization, while Intel traditionally used inclusive L3 but switched to non-inclusive (NINE) designs in recent architectures to balance both concerns.
INCLUSIVE (Intel traditional):       EXCLUSIVE (AMD):

L1: [A][B][C][D]                     L1: [A][B][C][D]
      ↓ subset                             ↓ disjoint
L2: [A][B][C][D][E][F][G][H]         L2: [E][F][G][H][I][J][K][L]
      ↓ subset                             ↓ disjoint
L3: [A][B][C][D][E][F]...[P]         L3: [M][N][O][P][Q][R][S][T]

Effective capacity:                  Effective capacity:
  L3 size only                         L1 + L2 + L3 (all unique!)

On L1 eviction:                      On L1 eviction:
  Drop from L1 (copy in L2)            Move to L2 (swap)

Coherence snoop:                     Coherence snoop:
  Check L3 only ✓ (simpler)            Must check all levels ✗
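The capacity difference is easy to quantify with a back-of-envelope calculation; the per-core sizes below are illustrative, drawn from the ranges quoted earlier in this article.

// Effective unique capacity under each policy (sizes illustrative, in KB)
const l1 = 64, l2 = 512, l3 = 32768; // per-core L1/L2, shared L3
const cores = 8;

const inclusiveKB = l3;                     // L3 must duplicate every L1/L2 line
const exclusiveKB = cores * (l1 + l2) + l3; // every level holds unique lines

console.log(`Inclusive: ${inclusiveKB} KB unique`); // 32768 KB
console.log(`Exclusive: ${exclusiveKB} KB unique`); // 37376 KB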
MESI/MOESI Protocols
MESI is a cache coherence protocol where each cache line has four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, multiple copies), Invalid (unusable); MOESI adds an Owned state where one cache holds the authoritative dirty copy while others have shared copies, avoiding write-back to memory on sharing. These protocols use bus snooping or directory-based tracking to maintain coherence across cores.
MESI State Transitions:

        ┌─────────┐    Read miss     ┌─────────┐
        │ Invalid │─────────────────►│Exclusive│
        │   (I)   │   (no sharers)   │   (E)   │
        └────┬────┘                  └────┬────┘
             │ ▲                          │
             │ │ Evict/Snoop write        │ Write
             │ │                          ▼
             │ │                     ┌─────────┐
             │ └─────────────────────│Modified │
             │                       │   (M)   │
             │ Read miss             └────┬────┘
             │ (sharers exist)            │ Snoop read
             ▼                            ▼
        ┌─────────┐◄──────────────────────┘
        │ Shared  │
        │   (S)   │
        └─────────┘

MOESI adds:
        ┌─────────┐
        │  Owned  │ ← Has dirty copy, but shares with others
        │   (O)   │   (avoids immediate write-back to memory)
        └─────────┘
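The diagram can be encoded as a small transition table. This is a toy single-line model that ignores bus arbitration and data movement; the event names are assumptions made for the sketch.

// Toy MESI transition table for one cache line
const mesiNext = {
  I: { localRead: 'E_or_S', localWrite: 'M' },                             // miss: fetch line
  E: { localRead: 'E', localWrite: 'M', snoopRead: 'S' },
  S: { localRead: 'S', localWrite: 'M', snoopWrite: 'I' },
  M: { localRead: 'M', localWrite: 'M', snoopRead: 'S', snoopWrite: 'I' }  // write back first
};

function step(state, event, sharersExist = false) {
  const next = (mesiNext[state] || {})[event] || state;
  return next === 'E_or_S' ? (sharersExist ? 'S' : 'E') : next;
}

let line = 'I';
line = step(line, 'localRead', false); // E: no other core holds the line
line = step(line, 'localWrite');       // M: dirty, exclusive
line = step(line, 'snoopRead');        // S: another core read it
console.log(line); // S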
Scratchpad Memory
Scratchpad memory (SPM) is software-managed on-chip SRAM that, unlike cache, has no automatic caching logic or tag comparison hardware—the programmer explicitly controls what data resides there via DMA or direct addressing. Used extensively in embedded systems (ARM Cortex-M), DSPs, and GPUs (shared memory in CUDA), it offers deterministic access latency, lower power consumption, and guaranteed data placement at the cost of increased programming complexity.
Traditional Cache:                 Scratchpad Memory:
┌──────────────────┐               ┌──────────────────┐
│   Tag  │  Data   │               │      Data        │
│  Array │  Array  │               │  (Direct Addr)   │
├────────┴─────────┤               ├──────────────────┤
│ HW-managed       │               │ SW-managed       │
│ Automatic lookup │               │ Explicit DMA     │
│ Unpredictable    │               │ Deterministic    │
└──────────────────┘               └──────────────────┘

// CUDA Shared Memory (Scratchpad) Example
__global__ void matmul(float* A, float* B, float* C) {
    __shared__ float tile_A[16][16]; // Scratchpad allocation
    __shared__ float tile_B[16][16]; // 48KB per SM

    // Explicit load from global to scratchpad
    tile_A[ty][tx] = A[row * N + tx];
    tile_B[ty][tx] = B[ty * N + col];
    __syncthreads(); // Explicit synchronization

    // Fast access from scratchpad (~5 cycles vs ~400 for global)
    for (int k = 0; k < 16; k++)
        sum += tile_A[ty][k] * tile_B[k][tx];
}
Memory Controller Design
The memory controller is the bridge between CPU and DRAM, translating read/write requests into precise electrical signals with correct timing (tCAS, tRAS, tRP). It handles command queuing, address mapping, and ensures timing constraints are met.
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│     CPU     │────▶│ Memory Controller│────▶│    DRAM     │
│ (Requests)  │     │ - Command Queue  │     │  (Storage)  │
└─────────────┘     │ - Timing Engine  │     └─────────────┘
                    │ - Address Mapper │
                    └──────────────────┘
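Concretely, a single CPU read may expand into several DRAM commands depending on the bank's row state. A sketch under assumed DDR4-like timings (the cycle counts come from the timing section above, not a specific datasheet):

// Expand one read into DRAM commands, respecting tRP/tRCD/CL
function commandsForRead(bankState, row, t = { tRP: 18, tRCD: 18, CL: 16 }) {
  const cmds = [];
  if (bankState.openRow !== null && bankState.openRow !== row) {
    cmds.push({ cmd: 'PRE', cycles: t.tRP });  // close the wrong open row
  }
  if (bankState.openRow !== row) {
    cmds.push({ cmd: 'ACT', cycles: t.tRCD }); // activate the target row
  }
  cmds.push({ cmd: 'RD', cycles: t.CL });      // column read; data after CL
  bankState.openRow = row;
  return cmds;
}

const bank = { openRow: 7 };
console.log(commandsForRead(bank, 42));
// [ {cmd:'PRE', cycles:18}, {cmd:'ACT', cycles:18}, {cmd:'RD', cycles:16} ]
console.log(commandsForRead(bank, 42));
// [ {cmd:'RD', cycles:16} ] - row hit: CAS only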
Rank and Bank Organization
DRAM is hierarchically organized: a channel contains ranks (groups of chips acting together for bus width), each rank contains banks (independent memory arrays), enabling parallel access across banks while only one rank drives the bus at a time.
Channel
├── Rank 0 (64-bit wide)
│ ├── Bank 0 ├── Bank 1 ├── Bank 2 ... ├── Bank 15
│ └── Each bank: Rows × Columns of cells
├── Rank 1
│ ├── Bank 0 ├── Bank 1 ├── Bank 2 ... ├── Bank 15
└── ...
// Address mapping example
const physicalAddress = 0x1A2B3C4D;
const channel = (physicalAddress >> 6) & 0x1; // 1 bit
const rank = (physicalAddress >> 7) & 0x1; // 1 bit
const bank = (physicalAddress >> 8) & 0xF; // 4 bits
const row = (physicalAddress >> 12) & 0xFFFF; // 16 bits
const column = physicalAddress & 0x3F; // 6 bits
Row Buffer Management
Each bank has a row buffer (~8KB) acting as a cache; accessing an already-open row (row hit ~15ns) is much faster than opening a new row (row miss ~50ns), making row buffer policies (open-page vs closed-page) critical for performance.
Row Buffer Policies:
┌─────────────────────────────────────────────────────────┐
│ OPEN-PAGE: Keep row open, bet on locality               │
│   Hit:  ████░░░░░░           15ns (fast)                │
│   Miss: ████████████████████ 50ns (precharge+activate)  │
│                                                         │
│ CLOSED-PAGE: Close row immediately after access         │
│   Always: ██████████████     35ns (predictable)         │
└─────────────────────────────────────────────────────────┘
class RowBufferManager {
  constructor(policy = 'open-page') {
    this.openRow = new Map(); // bank -> currently open row (null = closed)
    this.policy = policy;
  }

  access(bank, row) {
    if (this.openRow.get(bank) === row) {
      return { type: 'HIT', latency: 15 }; // CAS only
    }
    // Closed row: activate + CAS (35ns);
    // wrong row open: precharge + activate + CAS (50ns)
    const latency = this.openRow.get(bank) != null ? 50 : 35;
    // Open-page keeps the row open, betting on locality;
    // closed-page precharges immediately after the access
    this.openRow.set(bank, this.policy === 'open-page' ? row : null);
    return { type: 'MISS', latency };
  }
}
Refresh Mechanisms
DRAM cells leak charge and must be refreshed every 64ms (32ms at high temps); the controller periodically issues REF commands, blocking access briefly—distributed refresh spreads this overhead, while modern DDR5 uses per-bank refresh to reduce blocking.
Traditional Refresh (All-Bank):
Time ──────────────────────────────────────▶
      │▓▓▓│         │▓▓▓│    (All banks blocked)

Per-Bank Refresh (DDR5):
Bank0 │▓│     │▓│
Bank1   │▓│     │▓│          (Only 1 bank blocked)
Bank2     │▓│     │▓│
Bank3       │▓│     │▓│

// Refresh timing
const REFRESH_INTERVAL_MS = 64;
const ROWS_PER_BANK = 65536;
// Rows are refreshed 8 at a time, so tREFI = 64ms / (65536/8) ≈ 7.8μs
const tREFI = (REFRESH_INTERVAL_MS * 1e6) / (ROWS_PER_BANK / 8); // ns
Memory Scheduling Algorithms
The scheduler reorders queued memory requests to maximize throughput; FR-FCFS (First-Ready First-Come-First-Served) prioritizes row hits, while modern schedulers also consider fairness, QoS, and inter-thread interference.
class FRFCFSScheduler {
  constructor() {
    this.queue = [];
    this.openRows = new Map(); // bank -> currently open row
  }

  dequeue(req) {
    this.queue.splice(this.queue.indexOf(req), 1);
    return req;
  }

  schedule() {
    // Priority 1: Row hits (first-ready)
    const rowHit = this.queue.find(req =>
      this.openRows.get(req.bank) === req.row
    );
    if (rowHit) return this.dequeue(rowHit);
    // Priority 2: Oldest request (FCFS)
    return this.queue.shift();
  }

  // Advanced: add fairness with per-thread quotas
  scheduleWithFairness(threadQuotas) {
    const eligible = this.queue.filter(r => threadQuotas[r.thread] > 0);
    // Apply FR-FCFS within the eligible subset
  }
}
Scheduling Decision Tree:

        ┌─────────────┐
        │ New Request │
        └──────┬──────┘
               ▼
        ┌─────────────┐
  Yes ◀─┤  Row Hit?   ├─▶ No
        └─────────────┘
   │                      │
   ▼                      ▼
┌──────────┐        ┌──────────┐
│ Schedule │        │  Check   │
│ Immediate│        │ FCFS Age │
└──────────┘        └──────────┘
Near-Memory Processing
NMP places compute logic close to memory (in the buffer chip or memory controller) to reduce data movement energy and bandwidth bottlenecks—ideal for bandwidth-bound workloads like graph analytics, where data moves more than it computes.
Traditional:                  Near-Memory Processing:

┌──────┐ ◀──────▶             ┌──────┐      ┌──────────────────┐
│ CPU  │ (Data moves)         │ CPU  │      │  Memory + Logic  │
└──────┘                      └──────┘      │  ┌────┐ ┌─────┐  │
   ▲                             │          │  │ALU │ │DRAM │  │
   │ 100+ GB/s                   ▼          │  └────┘ └─────┘  │
┌──────┐                     (Commands      │   (TB/s local)   │
│ DRAM │                        only)       └──────────────────┘
└──────┘

// Workload suitability
const nmProcessingBenefit = (computeIntensity, dataSize) => {
  // Low compute/byte ratio = good NMP candidate
  const bytesPerOp = dataSize / computeIntensity;
  return bytesPerOp > 100 ? 'EXCELLENT' :
         bytesPerOp > 10  ? 'GOOD' : 'POOR';
};
Processing-in-Memory (PIM)
PIM embeds compute directly in memory arrays (e.g., Samsung HBM-PIM, UPMEM), enabling massively parallel operations within memory banks—particularly effective for bulk operations, neural network inference, and DNA sequence matching.
┌────────────────────────────────────────────┐
│                HBM-PIM Stack               │
├────────────────────────────────────────────┤
│ DRAM Die 3 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 2 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 1 │ Bank │ Bank │ + PIM Units     │
│ DRAM Die 0 │ Bank │ Bank │ + PIM Units     │
├────────────────────────────────────────────┤
│           Base Logic Die (TSV)             │
└────────────────────────────────────────────┘

// PIM operation example (conceptual)
class PIMController {
  executeInMemory(operation, bankMask) {
    // Single command triggers parallel execution across banks
    const ops = {
      'MAC':    'Multiply-Accumulate for ML',  // 1 TFLOPS in-memory
      'COPY':   'Bank-to-bank transfer',       // Avoids bus
      'SEARCH': 'Content-addressable lookup'   // Parallel compare
    };
    // dispatchToBanks is elided: it would issue the PIM command
    // to every bank selected in bankMask
    return this.dispatchToBanks(operation, bankMask);
  }
}
CXL Protocol and Memory Pooling
CXL (Compute Express Link) extends PCIe with cache-coherent memory semantics, enabling disaggregated memory pools that multiple hosts can share—revolutionizing data center efficiency through dynamic memory allocation and tiered memory architectures.
┌─────────────────────────────────────────────────────────────┐
│                     CXL Memory Pooling                      │
│   ┌───────┐       ┌───────┐       ┌───────┐                 │
│   │Host 1 │       │Host 2 │       │Host 3 │                 │
│   └───┬───┘       └───┬───┘       └───┬───┘                 │
│       │               │               │    CXL 2.0 Switch   │
│   ════╪═══════════════╪═══════════════╪═══════════════════  │
│       │               │               │                     │
│   ┌───┴───┬───────────┴───┬───────────┴───┐                 │
│   │Pool 1 │    Pool 2     │    Pool 3     │  Shared Memory  │
│   │ DDR5  │    DDR5       │    PMem       │  Pools          │
│   └───────┴───────────────┴───────────────┘                 │
└─────────────────────────────────────────────────────────────┘

CXL Protocol Types:
┌──────────┬─────────────────────────────────────────────────┐
│ CXL.io   │ PCIe-equivalent I/O (discovery, config)         │
│ CXL.cache│ Device caches host memory (accelerators)        │
│ CXL.mem  │ Host accesses device memory (expansion)         │
└──────────┴─────────────────────────────────────────────────┘
// Conceptual CXL memory pool manager
// (findPool, setupCoherency, and dmaTransfer are elided helpers)
class CXLMemoryPool {
  constructor() {
    this.pools = new Map(); // poolId -> { capacity, allocated, type }
    this.hostAllocations = new Map();
  }

  allocate(hostId, size, type = 'DDR5') {
    // Dynamic allocation from a shared pool
    const pool = this.findPool(size, type);
    const region = {
      baseAddr: pool.allocate(size),
      size,
      coherencyDomain: this.setupCoherency(hostId)
    };
    this.hostAllocations.set(hostId, region);
    return region; // Host sees this as normal memory via CXL.mem
  }

  migrate(hostId, fromPool, toPool) {
    // Live migration between memory tiers (e.g., DDR5 → PMem)
    return this.dmaTransfer(fromPool, toPool, this.hostAllocations.get(hostId));
  }
}