Groq AI Architecture Deep Dive: LPU Design Explained
While most AI hardware focuses on raw FLOPS, Groq's Language Processing Unit (LPU) prioritizes deterministic execution and memory efficiency. This technical deep dive explores the architectural innovations that enable Groq to deliver 750+ tokens/second with sub-100ms latency — fundamentally rethinking how transformer models execute on silicon.
🧠 Core Philosophy: "Compile once, execute perfectly." Groq shifts complexity from runtime to compile-time, eliminating the unpredictability that plagues GPU-based inference. [[11]]
Architecture Overview: The Big Picture
Groq's LPU is a spatial architecture designed specifically for transformer inference. Unlike GPUs that use time-multiplexed compute units, the LPU dedicates hardware resources to specific operations with fixed data paths.
Host Interface
PCIe Gen4 x16 for prompt/response I/O
Compiler
Static scheduling, memory planning, kernel fusion
SRAM Array
230 MB on-chip, 128 banks, 15 TB/s bandwidth
Compute Fabric
1,000+ MAC units with dedicated data paths
Key Differentiator: No runtime scheduling, no cache coherency protocols, no dynamic memory allocation. Every operation executes at a pre-determined clock cycle. [[14]]
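The idea of "every operation executes at a pre-determined clock cycle" can be sketched in a few lines. This is an illustrative toy, not Groq's actual ISA or compiler: a statically scheduled program is just a fixed list of (cycle, operation) pairs computed ahead of time, and the "runtime" replays it with no dispatch decisions. The op names and latencies below are invented for the example.

```python
# Illustrative sketch (not Groq's real compiler): assign each op a fixed
# start cycle from its known latency, so the full timeline is determined
# before execution begins.

def compile_schedule(ops):
    """ops: list of (name, latency_cycles). Returns (schedule, total_cycles)."""
    schedule, cycle = [], 0
    for name, latency in ops:
        schedule.append((cycle, name))  # this op's fixed start cycle
        cycle += latency                # next op starts when this one ends
    return schedule, cycle              # total runtime known at compile time

ops = [("embed", 2), ("qkv", 4), ("attention", 8), ("mlp", 6), ("norm", 1)]
schedule, total = compile_schedule(ops)
print(schedule)  # [(0, 'embed'), (2, 'qkv'), (6, 'attention'), (14, 'mlp'), (20, 'norm')]
print(total)     # 21
```

Because the schedule is fixed, latency is identical on every run; there is nothing analogous to a cache miss or a warp stall to perturb it.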
Memory Hierarchy: SRAM-First Design
Traditional GPUs rely on multi-level cache hierarchies with unpredictable hit rates. Groq's LPU uses a flat, compiler-managed SRAM architecture that eliminates memory bottlenecks.
| Component | Groq LPU | NVIDIA A100 GPU | Advantage |
|---|---|---|---|
| On-Chip Memory | 230 MB SRAM | 40 MB L2 Cache | 5.75× larger |
| Memory Bandwidth | 15 TB/s | 2 TB/s (HBM2e) | 7.5× higher |
| Access Latency | 1 cycle | 400+ cycles | 400× lower |
| Memory Management | Compiler-static | Runtime dynamic | Zero overhead |
| KV Cache Storage | Dedicated SRAM banks | HBM with eviction | No misses |
Why SRAM Matters for Transformers
Transformer inference is memory-bound, not compute-bound. 60-80% of GPU inference time is spent waiting for weights and KV cache from HBM. By keeping everything in SRAM, Groq keeps compute units fed 100% of the time — achieving 10-18× higher effective throughput. [[17]]
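The memory-bound claim above is easy to sanity-check with back-of-envelope arithmetic. In bandwidth-bound decoding, every generated token requires streaming the model's weights once, so throughput is roughly bandwidth divided by model size. The numbers below are assumptions for illustration (8B parameters at FP16, and the aggregate SRAM bandwidth figure from the table), not measured results.

```python
# Rough upper bound on decode throughput when inference is
# memory-bandwidth-bound: one full weight read per generated token.

def tokens_per_sec(model_bytes, bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 8e9 * 2                     # 8B params, FP16 (2 bytes each)
hbm  = tokens_per_sec(model_bytes, 2e12)  # ~2 TB/s HBM2e
sram = tokens_per_sec(model_bytes, 15e12) # ~15 TB/s aggregate SRAM
print(round(hbm), round(sram))  # 125 938
```

This crude model ignores KV cache traffic and batching, but it shows why the bandwidth column in the table, not the FLOPS count, dominates single-stream token rate.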
Compute Fabric: Spatial Dataflow Architecture
Instead of shared streaming multiprocessors (SMs), Groq uses a spatial architecture where data flows through dedicated functional units like an assembly line:
# Conceptual dataflow for a transformer layer
Input Tokens
↓
[Embedding Unit] → Fixed latency: 2 cycles
↓
[QKV Projection Array] → 128 parallel MAC units
↓
[Attention Compute Grid] → Dedicated softmax hardware
↓
[MLP Array] → Fused GeLU + projection
↓
[Layer Norm Unit] → Single-cycle normalization
↓
Output + KV Cache Update
# Each bracket = dedicated hardware with fixed latency
# No routing decisions, no arbitration, no stalls
Key Innovation: The compiler determines the exact cycle when each operation executes. At runtime, the chip simply follows the schedule, with no dynamic dispatch overhead. [[12]]
Compiler Stack: The Secret Sauce
Groq's compiler is arguably more important than the hardware itself. It performs several critical optimizations:
- Graph Lowering: Converts PyTorch/TensorFlow ops into Groq's intermediate representation (IR)
- Memory Planning: Allocates every tensor to a specific SRAM bank with zero runtime address calculation
- Instruction Scheduling: Generates a cycle-accurate timeline where each operation has a fixed slot
- Kernel Fusion: Combines multiple operations (e.g., QKV + softmax + output projection) into single instructions
- Precision Optimization: Automatically selects INT8/FP16 where possible without quality loss
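One of the fusions listed above, combining the Q, K, and V projections, can be demonstrated concretely. This is a conceptual numpy sketch of the idea, not Groq's IR: packing the three weight matrices into one means a single pass over the input instead of three.

```python
import numpy as np

# Sketch of QKV fusion: three separate projections read the activations
# three times; concatenating the weights yields one matmul that reads
# them once and produces all three outputs.

d_model, d_head = 64, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((8, d_model))  # 8 tokens of activations
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

# Unfused: three passes over x
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Fused: one pass with concatenated weights, then split the result
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)   # (d_model, 3*d_head)
q2, k2, v2 = np.split(x @ W_qkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

On a bandwidth-limited machine, cutting activation reads from three to one is a direct throughput win, which is why fusion belongs at compile time rather than runtime.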
💡 Pro Tip: Compilation takes 5-15 minutes initially, but the resulting binary can be reused indefinitely. For rapid iteration, use Groq's "fast compile" mode that skips some optimizations for quicker testing. [[1]]
Execution Model: Deterministic Pipeline
Once compiled, inference execution is completely deterministic:
| Phase | Duration | Description |
|---|---|---|
| Prompt Loading | ~10 ms | Input tokens copied to input SRAM buffer |
| Embedding + Positional | ~8 cycles | Token embeddings + rotary position encoding |
| Layer Execution | ~32 cycles/layer | QKV → Attention → MLP → LayerNorm (fused) |
| LM Head + Sampling | ~15 cycles | Final projection + softmax + token selection |
| Output Streaming | ~2 cycles/token | Generated token sent to host via PCIe |
Result: For Llama 3.1 8B (32 layers), first token arrives in ~90ms, then 750+ tokens/second thereafter. [[18]]
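The two figures above (TTFT and steady-state rate) combine into a simple end-to-end latency model. The function below is an illustrative formula using the article's headline numbers as defaults, not a measurement:

```python
# Wall-clock time to stream n tokens, given time-to-first-token (TTFT)
# and a steady decode rate: the first token costs TTFT, each later token
# adds 1/rate seconds.

def total_latency_ms(n_tokens, ttft_ms=90.0, tokens_per_sec=750.0):
    return ttft_ms + (n_tokens - 1) / tokens_per_sec * 1000.0

print(round(total_latency_ms(1)))    # 90   (just the first token)
print(round(total_latency_ms(751)))  # 1090 (750 more tokens ~= 1 second)
```

Because execution is deterministic, this estimate holds to within microseconds on repeated runs, unlike GPU inference where tail latency varies with batching and cache behavior.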
Scalability: Multi-LPU Configurations
For models whose weights and KV cache exceed a single LPU's on-chip SRAM capacity, Groq supports model parallelism across multiple LPUs:
- Tensor Parallelism: Split large matrix multiplications across LPUs with synchronized execution
- Pipeline Parallelism: Assign different transformer layers to different LPUs
- Zero Runtime Overhead: Inter-LPU communication is also compiler-scheduled, avoiding runtime synchronization costs
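The tensor-parallelism bullet above can be sketched in a few lines. This is a conceptual numpy model, not Groq's multi-LPU API: the weight matrix is split column-wise across n "devices", each computes an independent slice of the output, and concatenating the slices recovers the full result.

```python
import numpy as np

# Minimal tensor-parallelism sketch: column-shard W across n devices,
# compute local partial outputs, then "all-gather" by concatenation.

def parallel_matmul(x, W, n_devices):
    shards = np.split(W, n_devices, axis=1)  # each device holds W[:, slice]
    partials = [x @ s for s in shards]       # independent local matmuls
    return np.concatenate(partials, axis=1)  # reassemble the full output

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 32))
W = rng.standard_normal((32, 64))
assert np.allclose(parallel_matmul(x, W, 4), x @ W)
```

On real hardware the concatenation step is inter-chip communication; because Groq schedules those transfers at compile time, they overlap with compute on a fixed timeline instead of paying runtime synchronization costs.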
Multi-LPU Performance
2x LPU: ~1,400 tokens/sec for Mixtral 8x7B (quantized)
4x LPU: ~2,500 tokens/sec for Llama 3.1 70B (4-bit)
Latency: Still sub-150ms TTFT, since inter-LPU parallelism is scheduled at compile time
Frequently Asked Questions
Can the LPU accelerate workloads beyond LLMs?
Yes — the LPU accelerates any workload expressible as matrix multiplications and element-wise functions. This includes vision transformers (ViT), speech models (Whisper), and scientific computing with transformer components. Pure CNNs or RNNs may see less benefit. [[4]]
How does a static execution model handle dynamic control flow?
Transformer inference has minimal dynamic control flow (mostly token sampling). For workloads requiring complex branching, Groq falls back to host CPU for those operations. Most LLM inference fits Groq's static execution model perfectly. [[11]]
How much of the LPU's design is publicly documented?
Groq has published high-level architecture details and benchmarks, but low-level microarchitecture (exact MAC count, SRAM bank organization) remains proprietary. The compiler IR and API are well-documented for developers. [[25]]
Related Groq Deep Dives
Explore our complete Groq series for hardware walkthroughs, benchmarks, and real-world applications.
Read: How Groq Chip Works Step by Step →