
Run Llama 3 with Ollama Locally: Performance Guide 2026

Prashant Lalwani
April 16, 2026 · 11 min read
Local LLM · Benchmarks · Hardware
[Chart] Llama 3 Local Performance — tokens/sec by hardware, Llama 3.1 8B, Q4_K_M, 512-token output, local inference: RTX 4090 ~120 · RTX 4080 ~90 · M3 Max ~65 · RTX 3080 ~48 · M2 Pro ~30 · CPU only ~8

Running Llama 3 locally with Ollama is fast on the right hardware. This guide benchmarks real token-per-second performance across GPU and CPU setups, explains how quantization affects quality and speed, and gives you the exact Ollama configuration flags to maximize throughput.

  • 120 — tokens/sec on an RTX 4090
  • 8B — best model size for most users
  • Q4_K_M — recommended quantization

Step-by-Step: Install, Pull, Run

1. Install Ollama — one command on macOS/Linux; a Windows installer is available at ollama.com.

2. Pull Llama 3.1 8B — ollama pull llama3.1 is a 4.7GB download, Q4_K_M quantized, the best quality-to-size ratio.

3. Run a quick speed test — ollama run llama3.1 "Write 200 words about AI" — watch the tokens stream and note the speed.

Shell — Install, Pull, Run

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1          # 8B — 4.7GB, best for most users
ollama pull llama3.1:70b      # 70B — needs 40GB+ VRAM/RAM
ollama list                   # list all local models
```
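For a quick speed reading without any scripting, the CLI's --verbose flag prints timing statistics after each response, including the eval rate — the same tokens/sec figure used in the benchmarks below:

```shell
# --verbose appends stats after the response, e.g.
# "eval rate: 118.42 tokens/s" — the generation speed
ollama run llama3.1 --verbose "Write 200 words about AI"
```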

Hardware Benchmark Results

| Hardware | VRAM / RAM | Model | T/s (Q4_K_M) | First Token |
|---|---|---|---|---|
| RTX 4090 | 24GB VRAM | Llama 3.1 8B | 118–122 | 0.6s |
| RTX 4080 | 16GB VRAM | Llama 3.1 8B | 88–94 | 0.8s |
| Apple M3 Max | 128GB unified memory | Llama 3.1 8B | 62–68 | 1.1s |
| RTX 3080 | 10GB VRAM | Llama 3.1 8B | 44–52 | 1.4s |
| Apple M2 Pro | 32GB unified memory | Llama 3.1 8B | 27–33 | 2.1s |
| CPU only | 32GB RAM | Llama 3.1 8B | 6–10 | 8.0s |
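The numbers above are easy to reproduce on your own hardware: a non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), so tokens/sec is just their ratio. A minimal sketch — the sed-based field extraction is our own shortcut to avoid a jq dependency; use jq if you have it:

```shell
# Ask for a single JSON object instead of a token stream
RESP=$(curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write 200 words about AI",
  "stream": false
}')

# Pull the two counters out of the response JSON
EVAL_COUNT=$(echo "$RESP" | sed -n 's/.*"eval_count":\([0-9]*\).*/\1/p')
EVAL_NS=$(echo "$RESP" | sed -n 's/.*"eval_duration":\([0-9]*\).*/\1/p')

# tokens/sec = tokens / (nanoseconds / 1e9)
awk -v c="$EVAL_COUNT" -v ns="$EVAL_NS" \
  'BEGIN { printf "%.1f tokens/s\n", c / (ns / 1e9) }'
```

On an RTX 4090 this should land in the 118–122 range from the table; run it a few times, since the first call also pays model-load time.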

Quantization Guide

  • Q4_K_M — 4.7GB, best balance of quality and speed. Use this for most applications.
  • Q5_K_M — 5.7GB, slightly better quality. Choose if you have extra VRAM headroom.
  • Q8_0 — 8.5GB, near full precision. Best quality, needs 12GB+ VRAM.
  • Q2_K — 2.7GB, noticeable quality loss. Only for very constrained hardware.
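The default llama3.1 tag ships Q4_K_M; the other quantizations are separate tags on the model page. The tag names below follow the library's usual convention — double-check the exact spelling under ollama.com/library/llama3.1/tags before pulling:

```shell
ollama pull llama3.1:8b-instruct-q8_0    # near full precision, ~8.5GB
ollama pull llama3.1:8b-instruct-q5_K_M  # slightly better than Q4, ~5.7GB
ollama pull llama3.1:8b-instruct-q2_K    # smallest, noticeable quality loss
```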

Performance Tuning

Shell — Key Environment Variables

```shell
# Offload all layers to GPU (the default tries this automatically)
export OLLAMA_NUM_GPU=999

# Allow 2 models loaded simultaneously
export OLLAMA_MAX_LOADED_MODELS=2

# Set context size + sampling options per request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": {"num_ctx": 4096, "temperature": 0.7}
}'
```
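Per-request options work for ad-hoc calls, but if every client should get a larger context window, you can bake it into a named variant with a Modelfile (the variant name llama3.1-8k below is our own choice):

```shell
# A Modelfile sets defaults once, at model-creation time
cat > Modelfile <<'EOF'
FROM llama3.1
PARAMETER num_ctx 8192
EOF

# Register the variant, then use it like any other model
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k "Summarize this long document: ..."
```

Larger num_ctx costs VRAM for the KV cache, so bump it only as far as your measured first-token latency and memory headroom allow.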
Apple Silicon Tip

On Apple Silicon, Ollama uses unified memory shared by CPU and GPU. Close other apps to free memory. With 16GB: Llama 3.1 8B runs well. With 32GB+: 13B–30B models are comfortable.
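Whatever the hardware, ollama ps shows where a loaded model actually landed — if the PROCESSOR column reads anything less than 100% GPU, some layers spilled to CPU and throughput drops sharply:

```shell
# Lists loaded models with their size and CPU/GPU split,
# e.g. "llama3.1:latest  6.7 GB  100% GPU"
ollama ps
```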