Running Llama 3.1 locally with Ollama is fast on the right hardware. This guide benchmarks real tokens-per-second performance across GPU and CPU setups, explains how quantization affects quality and speed, and gives you the exact Ollama configuration flags to maximize throughput.
- **120 T/s** — on an RTX 4090
- **8B** — best model size for most users
- **Q4_K_M** — recommended quantization
Install and Pull

1. **Install Ollama** — one command on macOS/Linux; a Windows installer is available at ollama.com.
2. **Pull Llama 3.1 8B** — `ollama pull llama3.1`: a 4.7GB download, Q4_K_M quantized, the best quality-to-size ratio.
3. **Quick speed test** — `ollama run llama3.1 "Write 200 words about AI"`: watch the tokens stream and note the speed.
Shell — Install, Pull, Run

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1       # 8B — 4.7GB, best for most
ollama pull llama3.1:70b   # 70B — needs 40GB+ VRAM/RAM
ollama list                # see all local models
```
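To get a hard number instead of eyeballing the stream, add `--verbose` to `ollama run` to print timing stats, or read `eval_count` and `eval_duration` from the `/api/generate` response. `eval_duration` is reported in nanoseconds; the sample values below are made up for illustration:

```shell
# Tokens/sec from Ollama's API stats: eval_count tokens generated
# over eval_duration nanoseconds. Sample values are hypothetical.
eval_count=523
eval_duration=4400000000   # 4.4 s, in nanoseconds
awk -v c="$eval_count" -v d="$eval_duration" \
    'BEGIN { printf "%.1f tokens/s\n", c / (d / 1e9) }'
```

The same two fields appear in every `/api/generate` response, so you can log real throughput per request rather than relying on a one-off stopwatch reading.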
Hardware Benchmark Results
| Hardware | VRAM / RAM | Model | T/s (Q4_K_M) | First Token |
|---|---|---|---|---|
| RTX 4090 | 24GB VRAM | Llama 3.1 8B | 118–122 | 0.6s |
| RTX 4080 | 16GB VRAM | Llama 3.1 8B | 88–94 | 0.8s |
| Apple M3 Max | 128GB UMem | Llama 3.1 8B | 62–68 | 1.1s |
| RTX 3080 | 10GB VRAM | Llama 3.1 8B | 44–52 | 1.4s |
| Apple M2 Pro | 32GB UMem | Llama 3.1 8B | 27–33 | 2.1s |
| CPU only | 32GB RAM | Llama 3.1 8B | 6–10 | 8.0s |
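These numbers largely track memory bandwidth: single-stream decoding reads every weight once per token, so a rough ceiling is bandwidth divided by model size. A back-of-envelope sketch (the bandwidth figures are approximate spec-sheet values, not measurements):

```shell
# Decode ceiling ~ memory bandwidth (GB/s) / model size (GB).
# Bandwidth figures are approximate published specs, not measurements.
model_gb=4.7   # Llama 3.1 8B at Q4_K_M
for entry in "RTX_4090:1008" "M3_Max:400"; do
  name=${entry%%:*}; bw=${entry##*:}
  awk -v n="$name" -v b="$bw" -v m="$model_gb" \
      'BEGIN { printf "%-9s ceiling ~%.0f tokens/s\n", n, b / m }'
done
```

Real throughput lands below the ceiling: the RTX 4090's measured 118–122 T/s is roughly 55% of its ~214 T/s ceiling, while the M3 Max's 62–68 is closer to 75–80% of its ~85.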
Quantization Guide
- Q4_K_M — 4.7GB, best balance of quality and speed. Use this for most applications.
- Q5_K_M — 5.7GB, slightly better quality. Choose if you have extra VRAM headroom.
- Q8_0 — 8.5GB, near full precision. Best quality, needs 12GB+ VRAM.
- Q2_K — 2.7GB, noticeable quality loss. Only for very constrained hardware.
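The download sizes above follow from average bits per weight: size ≈ parameter count × bits ÷ 8. The bits-per-weight figures below are rough estimates for each llama.cpp scheme (K-quants mix block types, so the averages are approximate, not official):

```shell
# Estimate GGUF file size: params (billions) x avg bits/weight / 8 = ~GB.
# Bits-per-weight values are approximate, not official figures.
params_b=8.03   # Llama 3.1 8B
for entry in "Q4_K_M:4.85" "Q5_K_M:5.69" "Q8_0:8.50" "Q2_K:2.63"; do
  q=${entry%%:*}; bpw=${entry##*:}
  awk -v q="$q" -v b="$bpw" -v p="$params_b" \
      'BEGIN { printf "%-7s ~%.1f GB\n", q, p * b / 8 }'
done
```

The same arithmetic scales to 70B: at ~4.85 bits/weight that's ~43GB of weights alone, which is why the 70B pull needs 40GB+ of VRAM or RAM.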
Performance Tuning
Shell — Key Environment Variables
```shell
# Offload all layers to GPU (the default tries this automatically)
export OLLAMA_NUM_GPU=999

# Allow 2 models loaded simultaneously
export OLLAMA_MAX_LOADED_MODELS=2

# Set context + options per request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": {"num_ctx": 4096, "temperature": 0.7}
}'
```
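To avoid passing `options` on every request, the same settings can be baked into a custom tag with a Modelfile (`llama3.1-4k` below is an example name, not a standard tag):

```
# Modelfile — persist options in a custom model tag
FROM llama3.1
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
```

Build and run it with `ollama create llama3.1-4k -f Modelfile`, then `ollama run llama3.1-4k` — every client that pulls that tag inherits the options.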
Apple Silicon Tip
On Apple Silicon, Ollama uses unified memory shared by CPU and GPU. Close other apps to free memory. With 16GB: Llama 3.1 8B runs well. With 32GB+: 13B–30B models are comfortable.
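"Comfortable" here means the weights plus the KV cache fit in memory with headroom. For Llama 3.1 8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache — figures from the published architecture), the cache costs about 128 KiB per token of context:

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (fp16)
layers=32; kv_heads=8; head_dim=128; num_ctx=4096
awk -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v c="$num_ctx" \
    'BEGIN {
       per_tok = 2 * l * h * d * 2
       printf "%d KiB/token -> %.2f GiB at %d-token context\n",
              per_tok / 1024, per_tok * c / 2^30, c
     }'
```

So on a 16GB machine, the 4.7GB Q4_K_M weights plus ~0.5GiB of cache at 4,096 context leave room for the OS; the cache grows linearly with `num_ctx`, which is what squeezes longer contexts on smaller machines.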