
Groq AI Architecture Deep Dive: LPU Design Explained

Prashant Lalwani · 2026-04-13 · NeuraPulse · 18 min read
Tags: LPU, Technical, Deep Dive

While most AI hardware focuses on raw FLOPS, Groq's Language Processing Unit (LPU) prioritizes deterministic execution and memory efficiency. This technical deep dive explores the architectural innovations that enable Groq to deliver 750+ tokens/second with sub-100ms latency — fundamentally rethinking how transformer models execute on silicon.

🧠 Core Philosophy: "Compile once, execute perfectly." Groq shifts complexity from runtime to compile-time, eliminating the unpredictability that plagues GPU-based inference. [[11]]

Architecture Overview: The Big Picture

Groq's LPU is a spatial architecture designed specifically for transformer inference. Unlike GPUs, which time-multiplex shared compute units across kernels, the LPU dedicates hardware resources to specific operations and connects them with fixed data paths.

  • Host Interface: PCIe Gen4 x16 for prompt/response I/O
  • Compiler: Static scheduling, memory planning, kernel fusion
  • SRAM Array: 230 MB on-chip, 128 banks, 15 TB/s bandwidth
  • Compute Fabric: 1,000+ MAC units with dedicated data paths

Key Differentiator: No runtime scheduling, no cache coherency protocols, no dynamic memory allocation. Every operation executes at a pre-determined clock cycle. [[14]]

Memory Hierarchy: SRAM-First Design

Traditional GPUs rely on multi-level cache hierarchies with unpredictable hit rates. Groq's LPU uses a flat, compiler-managed SRAM architecture that eliminates memory bottlenecks.

| Component | Groq LPU | NVIDIA A100 GPU | Advantage |
| --- | --- | --- | --- |
| On-Chip Memory | 230 MB SRAM | 40 MB L2 cache | 5.75× larger |
| Memory Bandwidth | 15 TB/s | 2 TB/s (HBM2e) | 7.5× higher |
| Access Latency | 1 cycle | 400+ cycles | ~400× lower |
| Memory Management | Compiler-static | Runtime dynamic | Zero overhead |
| KV Cache Storage | Dedicated SRAM banks | HBM with eviction | No misses |

Why SRAM Matters for Transformers

Transformer inference is memory-bound, not compute-bound: 60-80% of GPU inference time is spent waiting for weights and the KV cache to arrive from HBM. By keeping everything in SRAM, Groq keeps its compute units fed 100% of the time, achieving 10-18× higher effective throughput. [[17]]
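
A back-of-envelope roofline calculation shows why bandwidth is the ceiling: in the decode phase, every generated token must stream the full weight set through the compute units at least once. The sketch below is illustrative, assuming an 8B-parameter model at INT8 and ignoring KV-cache traffic and compute limits:

```python
# Back-of-envelope: memory bandwidth caps decode throughput, since each
# generated token streams all model weights at least once.
# Assumed: 8B parameters at INT8 (1 byte/param); KV-cache traffic ignored.

def decode_ceiling(bandwidth_tb_s: float, params_billions: float,
                   bytes_per_param: float = 1.0) -> float:
    """Upper bound on tokens/sec for a purely memory-bound decoder."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

for name, bw_tb_s in [("Groq LPU SRAM (15 TB/s)", 15.0),
                      ("A100 HBM2e (2 TB/s)", 2.0)]:
    print(f"{name}: <= {decode_ceiling(bw_tb_s, 8):,.0f} tokens/sec")
# Groq LPU SRAM (15 TB/s): <= 1,875 tokens/sec
# A100 HBM2e (2 TB/s): <= 250 tokens/sec
```

The 7.5× bandwidth gap alone explains most of the throughput difference; the rest of the 10-18× figure plausibly comes from eliminating cache misses and scheduling overhead.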

Compute Fabric: Spatial Dataflow Architecture

Instead of shared streaming multiprocessors (SMs), Groq uses a spatial architecture where data flows through dedicated functional units like an assembly line:

```
# Conceptual dataflow for a transformer layer
Input Tokens
      ↓
[Embedding Unit]         → Fixed latency: 2 cycles
      ↓
[QKV Projection Array]   → 128 parallel MAC units
      ↓
[Attention Compute Grid] → Dedicated softmax hardware
      ↓
[MLP Array]              → Fused GeLU + projection
      ↓
[Layer Norm Unit]        → Single-cycle normalization
      ↓
Output + KV Cache Update

# Each bracket = dedicated hardware with fixed latency
# No routing decisions, no arbitration, no stalls
```

Key Innovation: The compiler determines the exact cycle when each operation executes. At runtime, the chip simply follows the schedule — no dynamic dispatch overhead. [[12]]
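
One way to picture this is as a schedule table emitted at compile time and replayed verbatim by the chip. The sketch below is a hypothetical illustration (unit names and cycle numbers are invented, not Groq's actual IR):

```python
# Hypothetical illustration of static scheduling: the compiled "program"
# is a table of (start_cycle, unit, op). Runtime just replays the table,
# so there is no dispatch, no arbitration, and no timing variance between runs.

SCHEDULE = [  # (start_cycle, functional unit, operation) -- invented values
    (0,  "embedding_unit", "embed(tokens)"),
    (2,  "qkv_array",      "qkv_projection(x)"),
    (10, "attention_grid", "attention(q, k, v)"),
    (18, "mlp_array",      "gelu_mlp(x)"),
    (30, "layernorm_unit", "layer_norm(x)"),
]

for start_cycle, unit, op in SCHEDULE:
    # In hardware, `unit` fires at exactly `start_cycle`; nothing is decided here.
    print(f"cycle {start_cycle:>3}: {unit:<15} {op}")
```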

Compiler Stack: The Secret Sauce

Groq's compiler is arguably more important than the hardware itself. It performs several critical optimizations:

  • Graph Lowering: Converts PyTorch/TensorFlow ops into Groq's intermediate representation (IR)
  • Memory Planning: Allocates every tensor to a specific SRAM bank with zero runtime address calculation (see the toy sketch after this list)
  • Instruction Scheduling: Generates a cycle-accurate timeline where each operation has a fixed slot
  • Kernel Fusion: Combines multiple operations (e.g., QKV + softmax + output projection) into single instructions
  • Precision Optimization: Automatically selects INT8/FP16 where possible without quality loss
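
As a toy illustration of the memory-planning step above, the sketch below binds every tensor to a fixed (bank, offset) pair at "compile time", so the runtime performs zero address calculation. The bank size and tensor names are invented for the example:

```python
# Toy static memory planner: each tensor gets a fixed (bank, offset) at
# compile time via greedy first-fit, so runtime does no address math.
# Bank size and tensor shapes are invented for illustration.

BANK_SIZE = 1_800_000  # bytes per SRAM bank (hypothetical)

def plan(tensors):
    """First-fit placement of (name, nbytes) tensors into SRAM banks."""
    banks, placement = [0], {}
    for name, nbytes in tensors:
        for i, used in enumerate(banks):
            if used + nbytes <= BANK_SIZE:
                placement[name] = (i, used)  # fixed bank + offset
                banks[i] += nbytes
                break
        else:  # no existing bank fits: open a new one
            placement[name] = (len(banks), 0)
            banks.append(nbytes)
    return placement

layout = plan([("wq", 1_048_576), ("wk", 1_048_576), ("kv_cache", 524_288)])
print(layout)  # {'wq': (0, 0), 'wk': (1, 0), 'kv_cache': (0, 1048576)}
```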

💡 Pro Tip: Compilation takes 5-15 minutes initially, but the resulting binary can be reused indefinitely. For rapid iteration, use Groq's "fast compile" mode that skips some optimizations for quicker testing. [[1]]

Execution Model: Deterministic Pipeline

Once compiled, inference execution is completely deterministic:

| Phase | Duration | Description |
| --- | --- | --- |
| Prompt Loading | ~10 ms | Input tokens copied to input SRAM buffer |
| Embedding + Positional | ~8 cycles | Token embeddings + rotary position encoding |
| Layer Execution | ~32 cycles/layer | QKV → Attention → MLP → LayerNorm (fused) |
| LM Head + Sampling | ~15 cycles | Final projection + softmax + token selection |
| Output Streaming | ~2 cycles/token | Generated token sent to host via PCIe |

Result: For Llama 3.1 8B (32 layers), the first token arrives in ~90 ms, with 750+ tokens/second sustained thereafter. [[18]]
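
Those two numbers give a quick end-to-end estimate: total time ≈ TTFT + tokens ÷ streaming rate. A sanity check using the figures above:

```python
# End-to-end generation time from the figures quoted above:
# ~90 ms to first token, then a steady 750 tokens/sec.

def generation_time(n_tokens: int, ttft_s: float = 0.090,
                    tok_per_s: float = 750.0) -> float:
    return ttft_s + n_tokens / tok_per_s

for n in (128, 512, 2048):
    print(f"{n:>4} tokens: {generation_time(n):.2f} s")
#  128 tokens: 0.26 s
#  512 tokens: 0.77 s
# 2048 tokens: 2.82 s
```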

Scalability: Multi-LPU Configurations

For models whose weights and KV cache exceed a single chip's 230 MB of on-chip SRAM, Groq supports model parallelism across multiple LPUs:

  • Tensor Parallelism: Split large matrix multiplications across LPUs with synchronized execution (see the NumPy sketch after this list)
  • Pipeline Parallelism: Assign different transformer layers to different LPUs
  • Zero Runtime Overhead: Inter-LPU communication is also compiler-scheduled, avoiding runtime synchronization costs
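
To make the tensor-parallel case concrete, here is a minimal NumPy sketch of one matrix multiplication split column-wise across two devices; on real LPUs the final gather is replaced by compiler-scheduled interconnect traffic. Shapes are illustrative:

```python
import numpy as np

# Minimal tensor-parallelism sketch: split a weight matrix column-wise
# across two "LPUs", compute partial outputs in parallel, then concatenate.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))      # activations (batch = 1)
w = rng.standard_normal((4096, 8192))   # full weight matrix

w0, w1 = np.hsplit(w, 2)                # each device holds half the columns
y0, y1 = x @ w0, x @ w1                 # executed independently per device
y = np.concatenate([y0, y1], axis=1)    # gather (compiler-scheduled on LPUs)

assert np.allclose(y, x @ w)            # identical result to single-device
```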

Multi-LPU Performance

2x LPU: ~1,400 tokens/sec for Mixtral 8x7B (quantized)
4x LPU: ~2,500 tokens/sec for Llama 3.1 70B (4-bit)
Latency: Still sub-150 ms TTFT, because inter-LPU communication is scheduled at compile time

Frequently Asked Questions

Q: Can Groq run non-transformer models?

Yes — the LPU accelerates any workload expressible as matrix multiplications and element-wise functions. This includes vision transformers (ViT), speech models (Whisper), and scientific computing with transformer components. Pure CNNs or RNNs may see less benefit. [[4]]

Q: How does Groq handle dynamic control flow?

Transformer inference has minimal dynamic control flow (mostly token sampling). For workloads requiring complex branching, Groq falls back to host CPU for those operations. Most LLM inference fits Groq's static execution model perfectly. [[11]]

Q: Is the architecture public?

Groq has published high-level architecture details and benchmarks, but low-level microarchitecture (exact MAC count, SRAM bank organization) remains proprietary. The compiler IR and API are well-documented for developers. [[25]]

🔗 Continue Learning

Related Groq Deep Dives

Explore our complete Groq series for hardware walkthroughs, benchmarks, and real-world applications.

Read: How Groq Chip Works Step by Step →
