
What Is the Groq Chip and How It Works: LPU Architecture Explained

Prashant Lalwani 2026-04-09 · NeuraPulse · neuraplus-ai.github.io
14 min read · Tags: Groq Chip, LPU Architecture
[Infographic: GroqChip™ (LPU-1) architecture overview: an array of Tensor Streaming Processors (TSPs) connected to a 230 MB on-chip SRAM memory fabric, a deterministic dataflow engine with compiler-scheduled execution, PCIe 5.0 I/O, and chip-to-chip links.]

Key specs: TSP architecture · TSMC 14nm process · 230 MB on-chip SRAM · 80 TB/s memory bandwidth · ~800 tok/s on LLaMA-3 70B · deterministic latency · up to 576 chips per system · no DRAM (SRAM only). Highlights: no DRAM bottleneck (SRAM = zero wait), compiler-first deterministic flow, single-function design (inference only = fast).

In the race to accelerate artificial intelligence, most companies are competing on the same track — building bigger, faster GPUs. Groq took a different road entirely. The Groq chip is not a GPU — it is an entirely new class of processor called a Language Processing Unit (LPU), designed from first principles for one specific task: running large language models as fast as physically possible.

🔬 The Core Insight: GPUs are general-purpose parallel processors repurposed for AI. Groq's LPU is a single-purpose inference accelerator with a radically different memory architecture, execution model, and design philosophy — optimized exclusively to eliminate the bottlenecks that slow GPU-based AI inference.

230 MB — On-chip SRAM per chip
80 TB/s — Memory bandwidth
576 — Max chips per GroqRack

What Is Groq?

Groq is an AI chip company founded in 2016 by Jonathan Ross — the engineer who created Google's first TPU (Tensor Processing Unit). The company's flagship product is the GroqChip, powered by the LPU (Language Processing Unit) architecture. Groq makes this hardware available through a cloud API — meaning anyone can access its extraordinary speed without purchasing hardware. At its fastest, Groq processes Llama-3 70B at 800 tokens/second — roughly 14x faster than an A100 GPU. To understand why these speeds matter in practice, see our Groq inference speed vs GPU comparison.

What Is an LPU (Language Processing Unit)?

An LPU is a processor architecture specifically designed to run autoregressive language model inference — the sequential token-by-token generation process that underlies GPT, Llama, Mistral, and all modern LLMs. Unlike a GPU that handles many different computational tasks (graphics, training, inference, scientific computing), an LPU does one thing: generate text as fast as possible.

The key hardware insight: autoregressive LLM inference is memory-bandwidth bound, not compute-bound. The bottleneck is not how fast you can do matrix multiplications — it is how fast you can move model weights between memory and compute. The LPU eliminates this bottleneck by putting all memory on-chip.
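To make the bound concrete, here is a back-of-the-envelope sketch (illustrative round numbers, not measurements): each generated token must stream every weight through the compute units once, so decode throughput is capped at bandwidth divided by model size.

```python
def max_tokens_per_sec(weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed: every token requires
    streaming the full weight footprint once, so throughput is capped
    at bandwidth / model size."""
    return bandwidth_bytes_per_sec / weight_bytes

LLAMA3_70B_FP16 = 140e9   # ~70B params x 2 bytes each
HBM3_GPU = 3.35e12        # H100-class off-chip DRAM, bytes/s
GROQ_SRAM = 80e12         # Groq's quoted on-chip fabric, bytes/s

print(f"GPU bound: {max_tokens_per_sec(LLAMA3_70B_FP16, HBM3_GPU):.0f} tok/s")
print(f"LPU bound: {max_tokens_per_sec(LLAMA3_70B_FP16, GROQ_SRAM):.0f} tok/s")
```

GPUs beat the single-stream bound in practice by batching many requests so one weight load serves the whole batch; the point of the sketch is that bandwidth, not FLOPs, sets the ceiling.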

LPU Architecture Deep Dive

Tensor Streaming Processors (TSPs)

The LPU contains an array of Tensor Streaming Processors: specialized compute units optimized for the matrix-multiply operations that dominate LLM inference. Unlike a GPU's CUDA cores, TSPs have a fixed, deterministic dataflow path; each TSP knows exactly what data it will receive and when, with no runtime scheduling overhead.

The SRAM Memory Fabric

Each Groq chip contains 230MB of on-chip SRAM — organized as a distributed memory fabric directly connected to the TSP array. This is the critical differentiator. Instead of external DRAM (like a GPU's HBM3 or GDDR6), the LPU stores model weights and activations in SRAM right next to the compute units. No memory bus. No DRAM latency. No bandwidth bottleneck. For small to medium models, the entire model fits in SRAM simultaneously.
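A quick capacity check shows why larger models must span many chips. This is a sketch with rounded numbers; `chips_needed` is a hypothetical helper, and real deployments also need SRAM for activations and KV cache, so it is only a lower bound.

```python
import math

def chips_needed(weight_bytes: float, sram_per_chip_bytes: float = 230e6) -> int:
    """Lower bound on the number of chips whose combined 230MB of SRAM
    can hold the model weights alone (activations need room too)."""
    return math.ceil(weight_bytes / sram_per_chip_bytes)

print(chips_needed(70e9 * 2))  # Llama-3 70B at FP16 (2 bytes/param) -> 609 chips
print(chips_needed(70e9 * 1))  # same model quantized to 8 bits     -> 305 chips
```

Small models fit comfortably (a 1B-parameter model at 8 bits needs only a handful of chips); 70B-class models need a substantial slice of a GroqRack, which is why Groq systems scale to hundreds of linked chips.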

[Diagram: GPU architecture (CUDA cores fed from 80 GB of off-chip HBM3 DRAM at ~3 TB/s, waiting on weights from DRAM, 55–90 tok/s) vs LPU architecture (TSP cores fed from 230 MB of on-die SRAM at 80 TB/s, weights always on-chip, ~800 tok/s).]
GPU DRAM bottleneck vs Groq LPU on-chip SRAM advantage — the core architectural difference

The SRAM Advantage — Why It Matters

SRAM (Static Random Access Memory) is the same type of memory used in CPU caches: fast and low-latency, but expensive and power-hungry per bit compared to DRAM. GPUs use DRAM because it is cheap and dense; you can fit 80GB on an H100 at reasonable cost. The trade-off: HBM3's ~3.35 TB/s is roughly 24x less bandwidth than the 80 TB/s Groq achieves with SRAM.

For LLM inference, this trade-off is catastrophic. Every forward pass requires loading all of the model's weights from DRAM into the compute units. At 70 billion parameters in FP16 (2 bytes per parameter), Llama-3 70B amounts to ~140GB of weights. On an A100 with 2TB/s of bandwidth, loading these weights takes ~70ms per forward pass, directly limiting tokens per second. On Groq's SRAM fabric at 80TB/s, with the weights sharded across chips, the same operation takes microseconds.
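The arithmetic above can be checked directly. This is a sketch using the round bandwidth figures quoted in this article; the per-chip line assumes the weights are sharded so each chip only streams its own 230MB slice.

```python
def load_time(weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Seconds to stream a weight footprint once at a given bandwidth."""
    return weight_bytes / bandwidth_bytes_per_sec

# Full 140GB model over A100-class DRAM at ~2TB/s: ~70ms per forward pass.
print(f"{load_time(140e9, 2e12) * 1e3:.0f} ms")

# One chip's 230MB SRAM shard at 80TB/s: single-digit microseconds.
print(f"{load_time(230e6, 80e12) * 1e6:.2f} us")
```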

Compiler-Scheduled Execution — The Other Half

The SRAM architecture explains much of Groq's speed, but the compiler-scheduled execution model explains its consistency. GPUs use runtime dynamic scheduling — every clock cycle, the hardware decides which compute unit gets which data. This flexibility is powerful but expensive in cycles and energy.

Groq's compiler resolves all scheduling decisions at compile time. When you deploy a model on Groq, the Groq compiler (MLIR-based GroqFlow) analyzes the entire computational graph and creates a deterministic schedule — every TSP knows exactly what operation it will perform at every clock cycle, in what order, for every token generation step. Zero runtime decision overhead. Deterministic, predictable, consistent latency every time.
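As a toy illustration of the contrast (the op names, two "units", and round-robin placement below are invented for the sketch; Groq's real compiler schedules individual clock cycles across the TSP array), a statically scheduled design resolves every placement before execution, so running the program is just deterministic replay:

```python
def compile_schedule(ops: list[str], num_units: int = 2) -> list[tuple[int, int, str]]:
    """Resolve every placement ahead of time: op i runs on unit i % num_units
    at cycle i // num_units. The 'hardware' below makes zero runtime decisions."""
    return [(i // num_units, i % num_units, op) for i, op in enumerate(ops)]

def run(schedule: list[tuple[int, int, str]]) -> list[str]:
    """Deterministic replay: the same schedule yields the same trace, and
    hence the same latency, on every invocation."""
    return [f"cycle {c}: unit {u} runs {op}" for c, u, op in schedule]

trace = run(compile_schedule(["load_w", "matmul", "add_bias", "softmax"]))
print("\n".join(trace))
```

A GPU, by contrast, would decide these placements at runtime, trading cycles and energy for the flexibility to handle arbitrary workloads. That flexibility is exactly what an inference-only processor can discard.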

LPU vs GPU: Design Philosophy

| Design Aspect | GPU (H100) | Groq LPU |
|---|---|---|
| Primary purpose | General-purpose parallel compute | LLM inference only |
| Memory type | 80GB HBM3 DRAM (off-chip) | 230MB SRAM (on-chip) |
| Memory bandwidth | 3.35 TB/s | ~80 TB/s |
| Scheduling | Runtime dynamic | Compiler static |
| Model training | Full support | Not supported |
| Model variety | Any PyTorch model | Groq-compiled only |
| Inference speed (70B) | 55–90 tok/s | ~800 tok/s |
| Latency consistency | Variable | Deterministic |
📖 Related Reading

Groq vs Nvidia for AI Inference 2026

Now that you understand the architecture, see the complete competitive comparison — use cases, costs, and which platform wins for different workloads.

Read Full Comparison →

Frequently Asked Questions

Q: What does LPU stand for?

LPU stands for Language Processing Unit — a term coined by Groq to describe their purpose-built AI inference processor. Unlike a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), an LPU is designed exclusively to run large language model inference at maximum speed. The name reflects the processor's singular focus on language model workloads.

Q: How many parameters can Groq's LPU handle?

A single Groq chip has 230MB of SRAM — far too small to hold most production LLMs alone. Groq systems therefore connect many chips with high-bandwidth chip-to-chip links. The GroqRack supports up to 576 chips, providing enough combined SRAM capacity for very large models. The public Groq API currently supports models up to 70B parameters (Llama-3 70B, Mixtral 8x7B).

Q: Can Groq train AI models?

No — Groq's LPU is an inference-only processor. Model training requires a fundamentally different computational pattern (backpropagation, gradient accumulation, optimizer states) that the LPU's deterministic compiler-scheduled architecture is not designed for. Training still requires GPUs (Nvidia A100/H100) or specialized training TPUs. Groq's value is in deploying trained models at maximum inference speed.

Q: Who founded Groq?

Groq was founded in 2016 by Jonathan Ross, who was previously at Google where he led the team that created the first Google TPU (Tensor Processing Unit). Ross left Google to build a chip architecture specifically for LLM inference, predicting that the memory bandwidth problem would become the dominant bottleneck for AI as models scaled. The company is headquartered in Mountain View, California.

Found this useful? Share it! 🚀