Groq AI Architecture Deep Dive: LPU Design Explained
While most AI hardware focuses on raw FLOPS, Groq's Language Processing Unit (LPU) prioritizes deterministic execution and memory efficiency. This technical deep dive explores the architectural innovations that enable Groq to deliver 750+ tokens/second with sub-100ms latency — fundamentally rethinking how transformer models execute on silicon.
🧠 Core Philosophy: "Compile once, execute perfectly." Groq shifts complexity from runtime to compile-time, eliminating the unpredictability that plagues GPU-based inference. [[11]]
Architecture Overview: The Big Picture
Groq's LPU is a spatial architecture designed specifically for transformer inference. Unlike GPUs that use time-multiplexed compute units, the LPU dedicates hardware resources to specific operations with fixed data paths.
Host Interface
PCIe Gen4 x16 for prompt/response I/O
Compiler
Static scheduling, memory planning, kernel fusion
SRAM Array
230 MB on-chip, 128 banks, 15 TB/s bandwidth
Compute Fabric
1,000+ MAC units with dedicated data paths
Key Differentiator: No runtime scheduling, no cache coherency protocols, no dynamic memory allocation. Every operation executes at a pre-determined clock cycle. [[14]]
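The idea of "every operation executes at a pre-determined clock cycle" can be sketched in a few lines. This is an illustrative toy, not Groq's actual ISA or compiler: a statically scheduled program is just a fixed list of (cycle, operation) pairs computed ahead of time, and the "runtime" replays it with no dispatch decisions. The op names and latencies below are invented for the example.

```python
# Illustrative sketch (not Groq's real compiler): assign each op a fixed
# start cycle from its known latency, so the full timeline is determined
# before execution begins.

def compile_schedule(ops):
    """ops: list of (name, latency_cycles). Returns (schedule, total_cycles)."""
    schedule, cycle = [], 0
    for name, latency in ops:
        schedule.append((cycle, name))  # this op's fixed start cycle
        cycle += latency                # next op starts when this one ends
    return schedule, cycle              # total runtime known at compile time

ops = [("embed", 2), ("qkv", 4), ("attention", 8), ("mlp", 6), ("norm", 1)]
schedule, total = compile_schedule(ops)
print(schedule)  # [(0, 'embed'), (2, 'qkv'), (6, 'attention'), (14, 'mlp'), (20, 'norm')]
print(total)     # 21
```

Because the schedule is fixed, latency is identical on every run; there is nothing analogous to a cache miss or a warp stall to perturb it.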
Memory Hierarchy: SRAM-First Design
Traditional GPUs rely on multi-level cache hierarchies with unpredictable hit rates. Groq's LPU uses a flat, compiler-managed SRAM architecture that eliminates memory bottlenecks.
| Component | Groq LPU | NVIDIA A100 GPU | Advantage |
|---|---|---|---|
| On-Chip Memory | 230 MB SRAM | 40 MB L2 Cache | 5.75× larger |
| Memory Bandwidth | 15 TB/s | 2 TB/s (HBM2e) | 7.5× higher |
| Access Latency | 1 cycle | 400+ cycles | 400× lower |
| Memory Management | Compiler-static | Runtime dynamic | Zero overhead |
| KV Cache Storage | Dedicated SRAM banks | HBM with eviction | No misses |
Why SRAM Matters for Transformers
Transformer inference is memory-bound, not compute-bound. 60-80% of GPU inference time is spent waiting for weights and KV cache from HBM. By keeping everything in SRAM, Groq keeps compute units fed 100% of the time — achieving 10-18× higher effective throughput. [[17]]
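The memory-bound claim above is easy to sanity-check with back-of-envelope arithmetic. In bandwidth-bound decoding, every generated token requires streaming the model's weights once, so throughput is roughly bandwidth divided by model size. The numbers below are assumptions for illustration (8B parameters at FP16, and the aggregate SRAM bandwidth figure from the table), not measured results.

```python
# Rough upper bound on decode throughput when inference is
# memory-bandwidth-bound: one full weight read per generated token.

def tokens_per_sec(model_bytes, bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 8e9 * 2                     # 8B params, FP16 (2 bytes each)
hbm  = tokens_per_sec(model_bytes, 2e12)  # ~2 TB/s HBM2e
sram = tokens_per_sec(model_bytes, 15e12) # ~15 TB/s aggregate SRAM
print(round(hbm), round(sram))  # 125 938
```

This crude model ignores KV cache traffic and batching, but it shows why the bandwidth column in the table, not the FLOPS count, dominates single-stream token rate.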
Compute Fabric: Spatial Dataflow Architecture
Instead of shared streaming multiprocessors (SMs), Groq uses a spatial architecture where data flows through dedicated functional units like an assembly line:
# Conceptual dataflow for a transformer layer
Input Tokens
↓
[Embedding Unit] → Fixed latency: 2 cycles
↓
[QKV Projection Array] → 128 parallel MAC units
↓
[Attention Compute Grid] → Dedicated softmax hardware
↓
[MLP Array] → Fused GeLU + projection
↓
[Layer Norm Unit] → Single-cycle normalization
↓
Output + KV Cache Update
# Each bracket = dedicated hardware with fixed latency
# No routing decisions, no arbitration, no stalls
Key Innovation: The compiler determines the exact cycle when each operation executes. At runtime, the chip simply follows the schedule, with no dynamic dispatch overhead. [[12]]
Compiler Stack: The Secret Sauce
Groq's compiler is arguably more important than the hardware itself. It performs several critical optimizations:
- Graph Lowering: Converts PyTorch/TensorFlow ops into Groq's intermediate representation (IR)
- Memory Planning: Allocates every tensor to a specific SRAM bank with zero runtime address calculation
- Instruction Scheduling: Generates a cycle-accurate timeline where each operation has a fixed slot
- Kernel Fusion: Combines multiple operations (e.g., QKV + softmax + output projection) into single instructions
- Precision Optimization: Automatically selects INT8/FP16 where possible without quality loss
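One of the fusions listed above, combining the Q, K, and V projections, can be demonstrated concretely. This is a conceptual numpy sketch of the idea, not Groq's IR: packing the three weight matrices into one means a single pass over the input instead of three.

```python
import numpy as np

# Sketch of QKV fusion: three separate projections read the activations
# three times; concatenating the weights yields one matmul that reads
# them once and produces all three outputs.

d_model, d_head = 64, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((8, d_model))  # 8 tokens of activations
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

# Unfused: three passes over x
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Fused: one pass with concatenated weights, then split the result
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)   # (d_model, 3*d_head)
q2, k2, v2 = np.split(x @ W_qkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

On a bandwidth-limited machine, cutting activation reads from three to one is a direct throughput win, which is why fusion belongs at compile time rather than runtime.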
💡 Pro Tip: Compilation takes 5-15 minutes initially, but the resulting binary can be reused indefinitely. For rapid iteration, use Groq's "fast compile" mode that skips some optimizations for quicker testing. [[1]]
Execution Model: Deterministic Pipeline
Once compiled, inference execution is completely deterministic:
| Phase | Duration | Description |
|---|---|---|
| Prompt Loading | ~10 ms | Input tokens copied to input SRAM buffer |
| Embedding + Positional | ~8 cycles | Token embeddings + rotary position encoding |
| Layer Execution | ~32 cycles/layer | QKV → Attention → MLP → LayerNorm (fused) |
| LM Head + Sampling | ~15 cycles | Final projection + softmax + token selection |
| Output Streaming | ~2 cycles/token | Generated token sent to host via PCIe |
Result: For Llama 3.1 8B (32 layers), first token arrives in ~90ms, then 750+ tokens/second thereafter. [[18]]
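The two figures above (TTFT and steady-state rate) combine into a simple end-to-end latency model. The function below is an illustrative formula using the article's headline numbers as defaults, not a measurement:

```python
# Wall-clock time to stream n tokens, given time-to-first-token (TTFT)
# and a steady decode rate: the first token costs TTFT, each later token
# adds 1/rate seconds.

def total_latency_ms(n_tokens, ttft_ms=90.0, tokens_per_sec=750.0):
    return ttft_ms + (n_tokens - 1) / tokens_per_sec * 1000.0

print(round(total_latency_ms(1)))    # 90   (just the first token)
print(round(total_latency_ms(751)))  # 1090 (750 more tokens ~= 1 second)
```

Because execution is deterministic, this estimate holds to within microseconds on repeated runs, unlike GPU inference where tail latency varies with batching and cache behavior.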
Scalability: Multi-LPU Configurations
For models whose weights and KV cache exceed a single LPU's on-chip SRAM capacity, Groq supports model parallelism across multiple LPUs:
- Tensor Parallelism: Split large matrix multiplications across LPUs with synchronized execution
- Pipeline Parallelism: Assign different transformer layers to different LPUs
- Zero Runtime Overhead: Inter-LPU communication is also compiler-scheduled, avoiding runtime synchronization costs
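The tensor-parallelism bullet above can be sketched in a few lines. This is a conceptual numpy model, not Groq's multi-LPU API: the weight matrix is split column-wise across n "devices", each computes an independent slice of the output, and concatenating the slices recovers the full result.

```python
import numpy as np

# Minimal tensor-parallelism sketch: column-shard W across n devices,
# compute local partial outputs, then "all-gather" by concatenation.

def parallel_matmul(x, W, n_devices):
    shards = np.split(W, n_devices, axis=1)  # each device holds W[:, slice]
    partials = [x @ s for s in shards]       # independent local matmuls
    return np.concatenate(partials, axis=1)  # reassemble the full output

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 32))
W = rng.standard_normal((32, 64))
assert np.allclose(parallel_matmul(x, W, 4), x @ W)
```

On real hardware the concatenation step is inter-chip communication; because Groq schedules those transfers at compile time, they overlap with compute on a fixed timeline instead of paying runtime synchronization costs.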
Multi-LPU Performance
2x LPU: ~1,400 tokens/sec for Mixtral 8x7B (quantized)
4x LPU: ~2,500 tokens/sec for Llama 3.1 70B (4-bit)
Latency: Still sub-150ms TTFT, since inter-LPU parallelism is scheduled at compile time
Frequently Asked Questions
Can the LPU accelerate workloads beyond LLMs?
Yes — the LPU accelerates any workload expressible as matrix multiplications and element-wise functions. This includes vision transformers (ViT), speech models (Whisper), and scientific computing with transformer components. Pure CNNs or RNNs may see less benefit. [[4]]
How does a static execution model handle dynamic control flow?
Transformer inference has minimal dynamic control flow (mostly token sampling). For workloads requiring complex branching, Groq falls back to host CPU for those operations. Most LLM inference fits Groq's static execution model perfectly. [[11]]
How much of the LPU's design is publicly documented?
Groq has published high-level architecture details and benchmarks, but low-level microarchitecture (exact MAC count, SRAM bank organization) remains proprietary. The compiler IR and API are well-documented for developers. [[25]]
Related Groq Deep Dives
Explore our complete Groq series for hardware walkthroughs, benchmarks, and real-world applications.
Read: How Groq Chip Works Step by Step →