How Groq Chip Works Step by Step: LPU Architecture Explained
Unlike traditional GPUs that rely on cache hierarchies and complex scheduling, Groq's Language Processing Unit (LPU) uses a deterministic, compiler-driven architecture to achieve sub-100ms time-to-first-token. This step-by-step guide breaks down exactly how the Groq chip executes transformer models — from compilation to token generation.
🎯 Key Insight: Groq eliminates runtime scheduling overhead by compiling the entire model into a fixed execution schedule. Every operation has a pre-determined time slot — no dynamic dispatch, no cache misses, no stalls. [[11]]
Step 1: Model Compilation (Offline)
Before any inference occurs, the Groq compiler performs a complete static analysis of your model:
- Graph Lowering: Converts PyTorch/TensorFlow operations into Groq's intermediate representation (IR)
- Memory Planning: Allocates every tensor to a specific SRAM bank with zero runtime address calculation
- Instruction Scheduling: Generates a fixed timeline where each operation executes at a precise clock cycle
- Kernel Fusion: Combines multiple operations (e.g., QKV projection + softmax) into single instructions
Compilation Output: Deterministic Binary
The result is a single binary file containing: (1) weight data pre-loaded into SRAM layout, (2) instruction stream with cycle-accurate timing, (3) I/O mapping for host communication. No runtime decisions required.
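The core idea of a fixed, cycle-accurate schedule can be sketched in a few lines. This toy example is not Groq's actual compiler interface (the op names and cycle counts are invented for illustration): every operation is assigned a start cycle ahead of time, so execution needs only a cycle counter, never a runtime scheduler.

```python
# Toy illustration of compile-time static scheduling (not Groq's real
# compiler API): every op gets a fixed start cycle up front, so the
# runtime has no dynamic dispatch and no stalls.
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    start_cycle: int   # fixed time slot chosen at compile time
    op: str            # e.g. "qkv_projection", "attention"
    duration: int      # known statically for a deterministic datapath

def compile_schedule(ops):
    """Assign each op a start cycle so ops run back-to-back."""
    schedule, cycle = [], 0
    for op, duration in ops:
        schedule.append(Instruction(cycle, op, duration))
        cycle += duration  # next op starts exactly when this one ends
    return schedule

# A miniature transformer layer as (op, duration-in-cycles) pairs:
layer_ops = [("qkv_projection", 12), ("attention", 16), ("mlp", 20)]
schedule = compile_schedule(layer_ops)
for instr in schedule:
    print(instr.start_cycle, instr.op)
```

Because every duration is known statically, the compiler can prove the whole timeline at build time; that is what makes the binary "deterministic".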
Step 2: Weight Loading (Initialization)
At inference startup, weights are transferred from host DRAM to the LPU's on-chip SRAM:
Critical Advantage: Once loaded, weights never leave SRAM during inference. This eliminates the #1 bottleneck in GPU inference: memory bandwidth contention. [[12]]
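The load-once pattern described above can be sketched as follows. This is purely illustrative: a Python list stands in for on-chip SRAM banks, and the transfer counter makes the "weights never leave SRAM" property explicit.

```python
# Sketch of the load-once weight pattern (illustrative; "SRAM" here is
# just a Python list standing in for on-chip memory banks).
class ResidentWeights:
    def __init__(self, host_weights):
        # One-time transfer at startup: host DRAM -> on-chip SRAM.
        self.sram = list(host_weights)   # copy into "on-chip" storage
        self.transfers = 1               # counts host->chip copies

    def infer(self, x):
        # Every inference step reads weights from SRAM only; no further
        # host transfers, so self.transfers never increases here.
        return sum(w * x for w in self.sram)

model = ResidentWeights([0.5, -1.0, 2.0])
for _ in range(1000):                    # many tokens, zero re-loads
    y = model.infer(1.0)
assert model.transfers == 1
```

On a GPU, by contrast, weights stream from HBM on every decode step, which is the bandwidth contention the paragraph above refers to.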
Step 3: Token Processing Pipeline
When a prompt arrives, execution follows the pre-compiled schedule with zero runtime overhead:
```
# Simplified execution timeline (conceptual)
Cycle 0:      Load prompt tokens → Input SRAM
Cycle 1-12:   Embedding lookup + positional encoding
Cycle 13-45:  Layer 1: QKV projection → Attention → MLP
Cycle 46-78:  Layer 2: QKV projection → Attention → MLP
...
Cycle N-10:   Final layer norm + LM head
Cycle N-5:    Softmax + token sampling
Cycle N:      Output token → Host
# Next token begins at Cycle N+1 with KV cache reuse
```
KV Cache Optimization: Attention keys/values are stored in dedicated SRAM banks with direct address mapping: no hash lookups, no eviction policies. [[14]]
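A directly-addressed KV cache can be sketched like this (sizes and layout are invented for illustration, not Groq's actual memory plan): one slot is preallocated per (layer, position), so lookup is pure array indexing with no hashing and no eviction.

```python
# Sketch of a directly-addressed KV cache (illustrative sizes): one
# preallocated slot per (layer, position), so lookup is plain indexing,
# with no hash table and no eviction policy.
NUM_LAYERS, MAX_SEQ, HEAD_DIM = 2, 8, 4

# Preallocate every slot up front, as a static memory plan would.
kv_cache = [[None] * MAX_SEQ for _ in range(NUM_LAYERS)]

def store_kv(layer, position, kv):
    kv_cache[layer][position] = kv      # direct address: O(1), no probing

def load_kv(layer, position):
    return kv_cache[layer][position]    # same fixed address on reuse

store_kv(0, 0, [1.0] * HEAD_DIM)
store_kv(1, 3, [2.0] * HEAD_DIM)
assert load_kv(1, 3) == [2.0, 2.0, 2.0, 2.0]
```

Because the compiler knows the maximum sequence length, every cache address can be resolved at compile time, which is what removes the runtime lookup machinery a GPU cache needs.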
Step 4: Streaming Output
As soon as the first token is computed (~90ms after prompt receipt), it's streamed to the host via PCIe while subsequent tokens continue processing:
- Time-To-First-Token (TTFT): ~90ms for Llama 3.1 8B
- Token Generation Rate: 750+ tokens/second
- End-to-End Latency: ≈ 90 ms TTFT + (output tokens ÷ 750 tokens/s)
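Plugging in the figures quoted above gives a quick back-of-envelope estimate; for example, a 500-token reply would take roughly three quarters of a second end to end.

```python
# End-to-end latency from the figures quoted above: ~90 ms TTFT plus
# output tokens generated at ~750 tokens/second.
TTFT_S = 0.090          # time to first token, in seconds
RATE_TPS = 750          # steady-state generation rate, tokens/second

def end_to_end_latency(output_tokens):
    return TTFT_S + output_tokens / RATE_TPS

# A 500-token reply: 0.090 + 500/750 ~= 0.757 seconds
print(round(end_to_end_latency(500), 3))
```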
💡 Why This Matters: Traditional GPUs spend 60-80% of inference time waiting for memory. Groq's SRAM-only design keeps compute units fed 100% of the time — achieving 10-18× higher throughput. [[17]]
Key Architectural Innovations
Groq LPU vs. Traditional GPU

| Aspect | Groq LPU | Traditional GPU |
| --- | --- | --- |
| Memory | 230 MB on-chip SRAM | 80 GB HBM with high latency |
| Scheduling | Compiler-determined static schedule | Runtime dynamic scheduling |
| Compute | 1,000+ MAC units with dedicated data paths | Shared SM resources |
| Precision | INT8/FP16 optimized for inference | FP32 training focus |
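The memory row in the comparison above drives the throughput gap, and a back-of-envelope calculation shows why. The GPU figures below are illustrative assumptions (an 8B-parameter INT8 model on an H100-class HBM bandwidth), not measured numbers: during decoding, every generated token must stream all weights from HBM at least once, which puts a hard bandwidth-bound ceiling on tokens per second.

```python
# Back-of-envelope view of why weight traffic bounds GPU decoding.
# Figures are illustrative assumptions, not measured numbers.
PARAMS = 8e9              # 8B-parameter model
BYTES_PER_PARAM = 1       # INT8 weights
HBM_BANDWIDTH = 3.35e12   # bytes/s, roughly an H100-class part

# Each decoded token streams the full weight set from HBM once:
bytes_per_token = PARAMS * BYTES_PER_PARAM
t_per_token = bytes_per_token / HBM_BANDWIDTH   # bandwidth-bound floor
print(f"~{1 / t_per_token:.0f} tokens/s ceiling from HBM traffic alone")
```

Keeping the weights resident in on-chip SRAM removes this term entirely, which is the effect the 10-18× throughput claim above attributes to the LPU.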
Frequently Asked Questions
Which models can run on a Groq LPU?
Groq supports models that fit within its SRAM capacity (~80 MB for weights). This includes Llama 3.1 8B, Mixtral 8x7B (quantized), and most fine-tuned variants. Larger models require model parallelism across multiple LPUs. [[25]]
How long does compilation take?
Initial compilation takes 5-15 minutes depending on model size, but the resulting binary can be reused indefinitely. For rapid iteration, Groq provides a "fast compile" mode that skips some optimizations for quicker testing. [[1]]
Is the LPU limited to language models?
No. The LPU accelerates any operation expressible as matrix multiplications and element-wise functions. This includes vision transformers, speech models, and scientific computing workloads with transformer components. [[4]]
Related Groq Deep Dives
Explore our complete Groq series for architecture details, benchmarks, and real-world applications.
Read: Groq AI Architecture Deep Dive →