How to Deploy AI Models on CoreWeave: Step-by-Step 2026 Guide
CoreWeave runs on Kubernetes, which makes AI model deployment structured and reproducible. This step-by-step guide takes you from zero to a running, auto-scaling AI model endpoint on CoreWeave infrastructure.
About CoreWeave: CoreWeave is a specialised GPU cloud provider and NVIDIA strategic partner, offering H100, A100, and L40S GPU infrastructure purpose-built for AI workloads. Apply for access at coreweave.com.
Prerequisites
- CoreWeave account (apply at coreweave.com — approval typically takes 1-2 business days)
- kubectl CLI installed locally
- Docker installed for containerising your model
- Basic Kubernetes knowledge (pods, deployments, services)
Step 1: Set Up kubectl Access
```bash
# Download your kubeconfig from the CoreWeave Cloud UI:
# Settings → API Access → Download kubeconfig

# Point kubectl at it
export KUBECONFIG=~/coreweave-kubeconfig.yaml

# Verify the connection
kubectl get nodes
# Should list your CoreWeave cluster nodes
```
Step 2: Containerise Your Model
```dockerfile
# Example Dockerfile for a FastAPI model server
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install fastapi uvicorn transformers torch accelerate

COPY model_server.py /app/model_server.py
WORKDIR /app

CMD ["uvicorn", "model_server:app", "--host", "0.0.0.0", "--port", "8000"]
```
```python
# model_server.py
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
Step 3: Push to CoreWeave Container Registry
```bash
# Build and push your container
docker build -t registry.coreweave.com/your-namespace/llama-server:v1 .
docker push registry.coreweave.com/your-namespace/llama-server:v1
```
Step 4: Create Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      containers:
        - name: llama-server
          image: registry.coreweave.com/your-namespace/llama-server:v1
          resources:
            requests:
              nvidia.com/gpu: "1"  # Request 1 GPU
            limits:
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8000
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["H100_80GB_SXM"]  # Specify GPU type
```
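One detail worth checking before applying the manifest: Kubernetes treats `nvidia.com/gpu` as an extended resource, so GPUs cannot be overcommitted. If a request is specified at all, it must equal the limit (omitting the request makes it default to the limit). A small sanity check, sketched in Python over the container spec expressed as a plain dict (the helper name is illustrative, not part of any Kubernetes tooling):

```python
# Check that a container spec satisfies the extended-resource rule:
# a nvidia.com/gpu request, if present, must equal the limit.

def gpu_request_matches_limit(container: dict) -> bool:
    resources = container.get("resources", {})
    gpu = "nvidia.com/gpu"
    limit = resources.get("limits", {}).get(gpu)
    # A missing request defaults to the limit, which is always valid.
    request = resources.get("requests", {}).get(gpu, limit)
    return request == limit

container = {
    "name": "llama-server",
    "resources": {
        "requests": {"nvidia.com/gpu": "1"},
        "limits": {"nvidia.com/gpu": "1"},
    },
}
print(gpu_request_matches_limit(container))  # True
```

A mismatched pair (for example a request of `"2"` against a limit of `"1"`) would be rejected by the API server at apply time, so catching it in review saves a failed rollout.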
Step 5: Expose with a Service
```bash
kubectl apply -f deployment.yaml

# Create a LoadBalancer service for external access
kubectl expose deployment llama-inference \
  --type=LoadBalancer --port=80 --target-port=8000

# Get your external IP
kubectl get service llama-inference
# Use the EXTERNAL-IP (e.g. 192.168.x.x) to call your model
```
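With the external IP in hand, you can call the endpoint. Note that because `prompt` and `max_tokens` are declared as bare function parameters in `model_server.py`, FastAPI reads them from the query string, not from a JSON body. A minimal stdlib client sketch (the IP below is a placeholder; substitute whatever `kubectl get service` reports):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_url(base_url: str, prompt: str, max_tokens: int = 512) -> str:
    # FastAPI reads bare function parameters from the query string,
    # so the arguments go in the URL rather than the request body.
    return f"{base_url}/generate?{urlencode({'prompt': prompt, 'max_tokens': max_tokens})}"

def generate(base_url: str, prompt: str, max_tokens: int = 512) -> str:
    url = build_url(base_url, prompt, max_tokens)
    with urlopen(url, data=b"") as resp:  # data=b"" makes urlopen issue a POST
        return json.loads(resp.read())["text"]

# Placeholder IP; replace with your service's EXTERNAL-IP before running:
# print(generate("http://192.0.2.10", "Explain GPUs in one sentence."))
```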
Step 6: Add Autoscaling
```bash
# Scale on CPU utilisation (a simple HPA; note --cpu-percent targets CPU, not GPU)
kubectl autoscale deployment llama-inference --min=1 --max=10 --cpu-percent=80

# Or use KEDA for GPU-metric-based autoscaling
# (more precise for AI workloads)
```
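The horizontal pod autoscaler created above follows a simple documented rule: desiredReplicas = ceil(currentReplicas × currentMetric ÷ targetMetric), clamped to the `--min`/`--max` bounds. Sketched in Python:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 10) -> int:
    # Kubernetes HPA scaling rule:
    # desired = ceil(current * currentMetric / targetMetric), clamped to [min, max]
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 95% CPU against an 80% target -> scale up to 4
print(desired_replicas(3, 95, 80))  # 4
```

The same ratio applies whatever metric drives it, which is why swapping CPU percentage for a GPU-utilisation metric (via KEDA or a custom metrics adapter) changes only the signal, not the scaling logic.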
Cost tip: CoreWeave charges by the second. Run `kubectl scale deployment llama-inference --replicas=0` to shut down when not in use, and combine with KEDA autoscaling to scale to zero automatically after idle periods.
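Per-second billing makes the arithmetic of scaling to zero easy to estimate. A hedged illustration (the hourly rate below is hypothetical, not a quoted CoreWeave price):

```python
def idle_cost(hourly_rate: float, idle_hours_per_day: float, days: int = 30) -> float:
    # With per-second billing, a replica scaled to zero while idle costs
    # nothing; one left running accrues the full hourly rate the whole time.
    return round(hourly_rate * idle_hours_per_day * days, 2)

# Hypothetical $4.00/hr GPU left idle 16 h/day for a 30-day month:
print(idle_cost(4.00, 16))  # 1920.0
```

In other words, a single always-on GPU that only serves traffic half the day can cost more in idle time than in useful work, which is the case for scale-to-zero.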