# **N3XT 3D MOSAIC:**

# **3D Thermal Scaffolding, Multi-Chip Illusion**

### Subhasish Mitra



Department of EE and Department of CS

Stanford University

### Abundant-Data Computing: e.g., AI/ML

### **Deadly combination!**



### Computing Today



# N3XT 3D: Computation immersed in Memory

#### **Nano-Engineered Computing Systems Technology**



### N3XT 3D MOSAIC

### Monolithic / Stacked / Assembled IC



#### Inter-chip integration *continuum*

Collaborators: Prof. H.-S.P. Wong (Stanford) + others

### Dense 3D Connections: Large Benefits



#### Monolithic 3D: only way today

Apps: Crypto, Graph, Genomics, Sparse matrix, Neural nets.

# N3XT 3D: Many Technologies

#### **BEOL-compatible:** ≤ 400°C fabrication



# Many Activities On-going

#### Lab-to-fab

#### **EMD** collaboration

#### **Industry fabs: many firsts**





- **1.** Carbon nanotube FETs (CNFETs)
- **2.** U.S. foundry Resistive RAM (RRAM)
- **3.** Ultra-dense monolithic 3D: CNFET + RRAM + silicon CMOS
- **4.** Product development: e.g., Analog Devices



**Inference-RAM:** 3D, dense, quick & low-energy read, write ability sufficient for KV cache

Courtesy: S. Mukund (EMD Electronics)

# N3XT 3D Thermal



[Rich DAC 23] Copper boiling heatsink = $10^6$ W/m<sup>2</sup>/K [C. Zhang Adv. Funct. Mater. 18]
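As a rough illustration of what a heat transfer coefficient of this magnitude implies, Newton's law of cooling gives the temperature rise across the heatsink interface as $\Delta T = q'' / h$. The sketch below is a back-of-the-envelope check, not the thermal model used in the cited work. Only the $10^6$ W/m<sup>2</sup>/K value comes from the slide; the heat flux and the function name are hypothetical.

```python
def interface_temp_rise(heat_flux_w_per_m2: float, h_w_per_m2_k: float) -> float:
    """Temperature rise (K) across a convective interface: delta_T = q'' / h."""
    if h_w_per_m2_k <= 0:
        raise ValueError("heat transfer coefficient must be positive")
    return heat_flux_w_per_m2 / h_w_per_m2_k


H_BOILING = 1e6  # W/m^2/K, copper boiling heatsink value quoted on the slide

# Hypothetical heat flux (assumption): 100 W/cm^2, converted to W/m^2.
heat_flux = 100 * 1e4

delta_t = interface_temp_rise(heat_flux, H_BOILING)
print(f"Interface temperature rise: {delta_t:.2f} K")
```

Under these assumptions the heatsink interface itself contributes very little temperature rise, which suggests the heat path inside the 3D stack is the part that needs attention.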

# **3D** Thermal Scaffolding



[Rich DAC 23]

# Polycrystalline Diamond Thermal Dielectric



[Lyu IEDM 2024]; Prof. S. Chowdhury (Stanford); BEOL = back-end-of-line

# **Co-Placement Matters:** 3D Thermal Scaffolding Results



# only 5.5% extra footprint area

[Rich ICCAD 24]

### N3XT 3D MOSAIC

### Monolithic / Stacked / Assembled IC



#### Inter-chip integration *continuum*

### "Dream" Chip: All Memory + Compute On-Chip

### Infeasible, Moving Target



# **Illusion** Multi-Chip System

Target: within 10% end-to-end EDP of "Dream" chip
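The target can be made concrete with the usual definition of energy-delay product, EDP = energy × execution time. The sketch below is a minimal illustration under that definition; the function names and the sample numbers are hypothetical and are not measurements from the talk.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product in joule-seconds."""
    return energy_j * delay_s


def edp_overhead(illusion_energy_j: float, illusion_delay_s: float,
                 dream_energy_j: float, dream_delay_s: float) -> float:
    """Fractional EDP overhead of the multi-chip system relative to the Dream chip."""
    return edp(illusion_energy_j, illusion_delay_s) / edp(dream_energy_j, dream_delay_s) - 1.0


def meets_target(overhead: float, target: float = 0.10) -> bool:
    """True if the overhead is within the stated target (10% by default)."""
    return overhead <= target


# Hypothetical numbers (assumption): Dream chip normalized to 1 J and 1 s.
overhead = edp_overhead(1.04, 1.05, 1.0, 1.0)
print(f"EDP overhead: {overhead:.1%}, within target: {meets_target(overhead)}")
```

Because EDP is a product, overheads compound: a system that is 5% worse in both energy and execution time has an EDP overhead of about 10.25% and would miss a 10% EDP target.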



[Radway Nature Electronics 21] Example network: 2D mesh shown for simplicity

# 1. "Enough" Memory per Chip, Message Costs

### Must achieve target Message Costs



(Left) ResNet-18 1.1.Conv2. Parallelism [Shao Micro 19], no sync cost. (Right) 16-chip 187 MByte ResNet
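A first-order capacity check for the "enough memory per chip" requirement can be sketched as follows. It only divides model size by chip count and ignores activations, buffers, and any weight replication, so it gives a lower bound rather than a sizing rule. The 187 MByte and 16-chip figures come from the caption above; the function names and the 12 MByte per-chip capacity are hypothetical.

```python
import math


def min_memory_per_chip(model_mbytes: float, chips: int) -> float:
    """Lower bound on per-chip memory (MByte) if weights are split evenly."""
    if chips <= 0:
        raise ValueError("chips must be positive")
    return model_mbytes / chips


def min_chips(model_mbytes: float, per_chip_mbytes: float) -> int:
    """Fewest chips whose combined memory can hold the model weights."""
    if per_chip_mbytes <= 0:
        raise ValueError("per-chip memory must be positive")
    return math.ceil(model_mbytes / per_chip_mbytes)


# Figures from the caption: a 187 MByte ResNet mapped onto 16 chips.
per_chip = min_memory_per_chip(187, 16)
print(f"At least {per_chip:.4f} MByte per chip")

# Hypothetical per-chip capacity (assumption): 12 MByte.
print(f"Chips needed at 12 MByte each: {min_chips(187, 12)}")
```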

# 2. Spatiotemporal Fine-grained Power-Gating

# Idle energy overheads must be ~0

(validated on our MINOTAUR multi-chip system hardware)
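Why idle energy matters so much can be seen from a simple model in which chips take turns being active and every other chip draws idle power while it waits. This is an illustrative sketch under that sequential-execution assumption, not the MINOTAUR power model; the function name and all power and timing values are hypothetical.

```python
def idle_energy_overhead(active_times_s, p_active_w: float, p_idle_w: float) -> float:
    """Ratio of idle energy to active energy when chips run one after another.

    Each chip is active for its own time slot and idles for the rest of the run.
    """
    total_time = sum(active_times_s)
    active_energy = p_active_w * total_time
    idle_energy = p_idle_w * sum(total_time - t for t in active_times_s)
    return idle_energy / active_energy


# Hypothetical workload (assumption): 8 chips, each active for 1 ms at 100 mW.
slots = [1e-3] * 8

leaky = idle_energy_overhead(slots, p_active_w=0.1, p_idle_w=0.01)
gated = idle_energy_overhead(slots, p_active_w=0.1, p_idle_w=0.0)
print(f"Idle at 10% of active power: {leaky:.0%} overhead")
print(f"Fully power-gated: {gated:.0%} overhead")
```

In this model each chip idles for (N-1)/N of the run, so even a small idle power is multiplied by nearly the full chip count, which is why the overhead must be driven toward zero as systems scale.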



# **Illusion Mapping**



# Illusion MIQP vs. Existing Approaches

|  | Shao<br>MICRO '19 | Narayanan<br>SOSP '19 | Unger<br>OSDI '22 | Tarnawski<br>NeurIPS '20 | Wang<br>ICML '24 | Ours |
|---|---|---|---|---|---|---|
| Method | Pre-defined communication patterns | Dynamic programming (DP) | Graph search | Mixed-integer linear programming (MILP) | MILP + DP | MIQP |
| True minimum | N/A | No | No | Yes | No | Yes |
| Computational cost model | Measured | Performance profiler | Performance profiler | Architectural model | Architectural model | Cycle-accurate simulation + emulation + HW validation |
| Target | Latency / throughput | Training speed / throughput | Training speed / throughput | Latency / throughput | Latency / throughput | Energy-delay product, energy, latency, or throughput |
| Interconnect topology | Fixed | Fixed | Variable | Fixed | Fixed | Arbitrary |
| Runtime | N/A | Profiling time | Minutes | Seconds | Hours | Minutes for 64× larger models |
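To show why a mapping problem of this kind is quadratic, the toy sketch below assigns layers to chips and charges a message cost for every edge whose endpoints land on different chips. That cost depends on pairs of assignment decisions, which is what makes the objective quadratic in the binary assignment variables. This is not the Illusion MIQP formulation: it uses brute-force enumeration in place of a solver, a single communication cost term, and a per-chip memory limit as the only constraint. All names and numbers are hypothetical.

```python
from itertools import product


def best_mapping(weights_mb, edges, hops, chip_capacity_mb):
    """Exhaustively find the layer-to-chip assignment with minimum message cost.

    weights_mb: memory needed by each layer (MByte)
    edges: (src_layer, dst_layer, traffic) tuples
    hops: hops[a][b] is the hop count between chips a and b
    Returns (cost, assignment), or None if no assignment fits in memory.
    """
    n_layers, n_chips = len(weights_mb), len(hops)
    best = None
    for assign in product(range(n_chips), repeat=n_layers):
        # Memory constraint: the layers placed on a chip must fit in its memory.
        used = [0.0] * n_chips
        for layer, chip in enumerate(assign):
            used[chip] += weights_mb[layer]
        if any(u > chip_capacity_mb for u in used):
            continue
        # Pairwise (quadratic) term: traffic times distance between the two chips.
        cost = sum(traffic * hops[assign[i]][assign[j]] for i, j, traffic in edges)
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Hypothetical 4-layer graph on 2 chips that can each hold 2 layers.
weights = [4.0, 4.0, 4.0, 4.0]
edges = [(0, 1, 1.0), (1, 2, 5.0), (2, 3, 1.0)]
hops = [[0, 1], [1, 0]]

cost, assignment = best_mapping(weights, edges, hops, chip_capacity_mb=8.0)
print(cost, assignment)
```

In this example the cheapest mapping keeps the heaviest edge (layers 1 and 2) on one chip and cuts the two light edges instead. Enumeration grows exponentially with the number of layers, so anything beyond toy sizes needs a solver and the kind of heuristics discussed later in the talk.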

### MINOTAUR: Transformer NVM Edge AI Inference & Training



### Illusion *In Hardware*

#### **MINOTAUR Illusion:** 96-MByte Transformers



| Chips      | 8 MINOTAUR            |
|------------|-----------------------|
| Total RRAM | Up to 96 MB           |
| Total SRAM | Up to 16 MB           |
| C2C Links  | 4 TX/RX per chip      |
| Networks   | CNNs,<br>Transformers |

Within 10% of Dream chip energy and execution time. Demonstrated on MINOTAUR with BERT scaled from 1 to 8 chips.

# Traditional Parallel vs. Illusion System



# Illusion via Emulation (Beyond Hardware Demos)

#### Many thanks to Cadence!



#### Cycle-accurate results in *minutes not days*

# Emulation Challenges for Illusion: Electrical Aspects



#### **Clock Tree**

#### Nonzero off power



# 0.56 mW of always-on power

Meticulous hardware calibration essential

(multi-chip MINOTAUR system in our case)

# Illusion Demonstrated In Hardware and Emulation

#### Illusion demonstrations agree within 5%

Figure legend: Dream, Hardware, Emulation (Uncalibrated), Emulation



### Heuristics Essential to MIQP Scalability

# **128X Larger ResNet Still Tractable With MIQP**

| Model<br>Size | Nodes | Edges | Chips | Variables | Constraints | Illusion EDP<br>Overhead<br>vs. Dream | Solve Time |
|---------------|-------|-------|-------|-----------|-------------|---------------------------------------|------------|
|               |       |       |       |           |             |                                       |            |

#### **Also explored:**

- Highly parallel models (64 branches)
- Fine-grained Transformer parallelism
- Different chip and network configurations

. . .

### Illusion Scaleup: On-going



#### **Multiplicative effect**

Maintain Illusion despite exponential workload growth over fixed time

[Radway IEDM 21]

Linear improvements in electrical links through tight integration [das Sharma, Nature Electronics 24]

# Conclusion

#### N3XT 3D MOSAIC

- Overcome memory wall & miniaturization wall, successful lab-to-fab
- Large system-level Energy Delay Product benefits for AI/ML

#### **3D** Thermal Scaffolding: high-power compute in 3D

Co-design: thermal dielectric + 3D architecture + 3D physical design

#### Multi-chip Illusion: large AI/ML workloads

Hardware results demonstrate effectiveness and superiority over traditional parallel systems