Running Llama 3.1 locally with Ollama is fast on the right hardware. This guide benchmarks real tokens-per-second performance across GPU and CPU setups, explains how quantization affects quality and speed, and gives you the exact Ollama configuration flags to maximize throughput.
- **120 T/s** — on an RTX 4090
- **8B** — best model size for most users
- **Q4_K_M** — recommended quantization
Install and Pull

1. **Install Ollama** — one command on macOS/Linux; a Windows installer is available at ollama.com.
2. **Pull Llama 3.1 8B** — `ollama pull llama3.1`: a 4.7GB download, Q4_K_M quantized, the best quality-to-size ratio.
3. **Quick speed test** — `ollama run llama3.1 "Write 200 words about AI"`: watch the tokens stream and note the speed.
Shell — Install, Pull, Run

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1       # 8B — 4.7GB, best for most
ollama pull llama3.1:70b   # 70B — needs 40GB+ VRAM/RAM
ollama list                # see all local models
```
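To get a hard number instead of eyeballing the stream, add `--verbose` to `ollama run` to print timing stats, or read `eval_count` and `eval_duration` from the `/api/generate` response. `eval_duration` is reported in nanoseconds; the sample values below are made up for illustration:

```shell
# Tokens/sec from Ollama's API stats: eval_count tokens generated
# over eval_duration nanoseconds. Sample values are hypothetical.
eval_count=523
eval_duration=4400000000   # 4.4 s, in nanoseconds
awk -v c="$eval_count" -v d="$eval_duration" \
    'BEGIN { printf "%.1f tokens/s\n", c / (d / 1e9) }'
```

The same two fields appear in every `/api/generate` response, so you can log real throughput per request rather than relying on a one-off stopwatch reading.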
Hardware Benchmark Results
| Hardware | VRAM / RAM | Model | T/s (Q4_K_M) | First Token |
|---|---|---|---|---|
| RTX 4090 | 24GB VRAM | Llama 3.1 8B | 118–122 | 0.6s |
| RTX 4080 | 16GB VRAM | Llama 3.1 8B | 88–94 | 0.8s |
| Apple M3 Max | 128GB UMem | Llama 3.1 8B | 62–68 | 1.1s |
| RTX 3080 | 10GB VRAM | Llama 3.1 8B | 44–52 | 1.4s |
| Apple M2 Pro | 32GB UMem | Llama 3.1 8B | 27–33 | 2.1s |
| CPU only | 32GB RAM | Llama 3.1 8B | 6–10 | 8.0s |
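These numbers largely track memory bandwidth: single-stream decoding reads every weight once per token, so a rough ceiling is bandwidth divided by model size. A back-of-envelope sketch (the bandwidth figures are approximate spec-sheet values, not measurements):

```shell
# Decode ceiling ~ memory bandwidth (GB/s) / model size (GB).
# Bandwidth figures are approximate published specs, not measurements.
model_gb=4.7   # Llama 3.1 8B at Q4_K_M
for entry in "RTX_4090:1008" "M3_Max:400"; do
  name=${entry%%:*}; bw=${entry##*:}
  awk -v n="$name" -v b="$bw" -v m="$model_gb" \
      'BEGIN { printf "%-9s ceiling ~%.0f tokens/s\n", n, b / m }'
done
```

Real throughput lands below the ceiling: the RTX 4090's measured 118–122 T/s is roughly 55% of its ~214 T/s ceiling, while the M3 Max's 62–68 is closer to 75–80% of its ~85.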
Quantization Guide
- Q4_K_M — 4.7GB, best balance of quality and speed. Use this for most applications.
- Q5_K_M — 5.7GB, slightly better quality. Choose if you have extra VRAM headroom.
- Q8_0 — 8.5GB, near full precision. Best quality, needs 12GB+ VRAM.
- Q2_K — 2.7GB, noticeable quality loss. Only for very constrained hardware.
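The download sizes above follow from average bits per weight: size ≈ parameter count × bits ÷ 8. The bits-per-weight figures below are rough estimates for each llama.cpp scheme (K-quants mix block types, so the averages are approximate, not official):

```shell
# Estimate GGUF file size: params (billions) x avg bits/weight / 8 = ~GB.
# Bits-per-weight values are approximate, not official figures.
params_b=8.03   # Llama 3.1 8B
for entry in "Q4_K_M:4.85" "Q5_K_M:5.69" "Q8_0:8.50" "Q2_K:2.63"; do
  q=${entry%%:*}; bpw=${entry##*:}
  awk -v q="$q" -v b="$bpw" -v p="$params_b" \
      'BEGIN { printf "%-7s ~%.1f GB\n", q, p * b / 8 }'
done
```

The same arithmetic scales to 70B: at ~4.85 bits/weight that's ~43GB of weights alone, which is why the 70B pull needs 40GB+ of VRAM or RAM.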
Performance Tuning
Shell — Key Environment Variables
```shell
# Offload all layers to GPU (the default tries this automatically)
export OLLAMA_NUM_GPU=999

# Allow 2 models loaded simultaneously
export OLLAMA_MAX_LOADED_MODELS=2

# Set context + options per request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": {"num_ctx": 4096, "temperature": 0.7}
}'
```
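To avoid passing `options` on every request, the same settings can be baked into a custom tag with a Modelfile (`llama3.1-4k` below is an example name, not a standard tag):

```
# Modelfile — persist options in a custom model tag
FROM llama3.1
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
```

Build and run it with `ollama create llama3.1-4k -f Modelfile`, then `ollama run llama3.1-4k` — every client that pulls that tag inherits the options.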
Apple Silicon Tip
On Apple Silicon, Ollama uses unified memory shared by CPU and GPU. Close other apps to free memory. With 16GB: Llama 3.1 8B runs well. With 32GB+: 13B–30B models are comfortable.
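"Comfortable" here means the weights plus the KV cache fit in memory with headroom. For Llama 3.1 8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache — figures from the published architecture), the cache costs about 128 KiB per token of context:

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (fp16)
layers=32; kv_heads=8; head_dim=128; num_ctx=4096
awk -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v c="$num_ctx" \
    'BEGIN {
       per_tok = 2 * l * h * d * 2
       printf "%d KiB/token -> %.2f GiB at %d-token context\n",
              per_tok / 1024, per_tok * c / 2^30, c
     }'
```

So on a 16GB machine, the 4.7GB Q4_K_M weights plus ~0.5GiB of cache at 4,096 context leave room for the OS; the cache grows linearly with `num_ctx`, which is what squeezes longer contexts on smaller machines.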