
How to Deploy AI Models on CoreWeave: Step-by-Step 2026 Guide

Prashant Lalwani 2026-04-24 · 15 min read
[Infographic: the six-step CoreWeave deployment flow, from kubectl setup and containerising the model through registry push, Kubernetes deployment, service exposure, and autoscaling.]

CoreWeave runs on Kubernetes, which makes AI model deployment structured and reproducible. This step-by-step guide takes you from zero to a running, auto-scaling AI model endpoint on CoreWeave infrastructure.

About CoreWeave: CoreWeave is a specialised GPU cloud provider and NVIDIA strategic partner, offering H100, A100, and L40S GPU infrastructure purpose-built for AI workloads. Apply for access at coreweave.com.

Prerequisites

You'll need an active CoreWeave account, kubectl and Docker installed on your local machine, and push access to a container registry namespace. The steps below cover the CoreWeave-specific setup as it comes up.

Step 1: Set Up kubectl Access

# Download your kubeconfig from the CoreWeave Cloud UI
# Settings → API Access → Download kubeconfig

# Set kubeconfig
export KUBECONFIG=~/coreweave-kubeconfig.yaml

# Verify connection
kubectl get nodes
# Should show your CoreWeave cluster nodes
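
Before moving on, it can be handy to see which GPU classes your cluster actually exposes, since you'll pin one of them via node affinity in Step 4. A quick check, assuming your nodes carry the gpu.nvidia.com/class label used later in the manifest:

# Show each node's GPU class as an extra column
kubectl get nodes -L gpu.nvidia.com/class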

Step 2: Containerise Your Model

# Example Dockerfile for a FastAPI model server
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install fastapi uvicorn transformers torch accelerate

COPY model_server.py /app/model_server.py
WORKDIR /app

CMD ["uvicorn", "model_server:app", "--host", "0.0.0.0", "--port", "8000"]

# model_server.py
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}

Step 3: Push to CoreWeave Container Registry

# Build and push your container
docker build -t registry.coreweave.com/your-namespace/llama-server:v1 .
docker push registry.coreweave.com/your-namespace/llama-server:v1
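
If the push is rejected with an authentication error, log in to the registry first with your CoreWeave credentials. You can also smoke-test the image on any machine with an NVIDIA GPU and the NVIDIA Container Toolkit before pushing; treat this as a sketch rather than CoreWeave's exact registry workflow.

# Authenticate against the registry if the push is rejected
docker login registry.coreweave.com

# Optional: smoke-test locally before pushing (needs a local NVIDIA GPU)
docker run --rm --gpus all -p 8000:8000 \
  registry.coreweave.com/your-namespace/llama-server:v1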

Step 4: Create Kubernetes Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-inference
  template:
    metadata:
      labels:
        app: llama-inference
    spec:
      containers:
      - name: llama-server
        image: registry.coreweave.com/your-namespace/llama-server:v1
        resources:
          requests:
            nvidia.com/gpu: "1"    # Request 1 GPU
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values: ["H100_80GB_SXM"]  # Specify GPU type
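
One practical note: the Llama 3.1 weights are gated on Hugging Face, so if you keep that exact model the pod needs a token at startup. A common pattern is to inject it as an environment variable from a Kubernetes Secret; a minimal sketch to add under the llama-server container above (the Secret name hf-token is hypothetical):

        # Extra entries under the llama-server container in deployment.yaml
        env:
        - name: HF_TOKEN                # picked up by huggingface_hub / transformers
          valueFrom:
            secretKeyRef:
              name: hf-token            # hypothetical Secret, created beforehand
              key: token

The Secret itself can be created with kubectl create secret generic hf-token --from-literal=token=<your-hf-token>.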

Step 5: Expose with a Service

kubectl apply -f deployment.yaml

# Create a LoadBalancer service for external access
kubectl expose deployment llama-inference \
  --type=LoadBalancer \
  --port=80 \
  --target-port=8000

# Get your external IP
kubectl get service llama-inference
# The EXTERNAL-IP column shows the public address for calling your model
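
Once the EXTERNAL-IP is assigned, you can call the model directly. Because prompt and max_tokens are plain scalar parameters in the FastAPI handler, they arrive as query parameters; a quick check (the address is a placeholder):

# Wait for the rollout to finish, then hit the endpoint
kubectl rollout status deployment/llama-inference

curl -X POST "http://<EXTERNAL-IP>/generate?prompt=Hello%20world&max_tokens=64"
# {"text": "..."}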

Step 6: Add Autoscaling

# Scale on CPU utilisation with the built-in HorizontalPodAutoscaler
kubectl autoscale deployment llama-inference \
  --min=1 --max=10 \
  --cpu-percent=80

# Or use KEDA for GPU-metric-based autoscaling
# (more precise for AI workloads)
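
A KEDA setup might look like the ScaledObject below, scaling on GPU utilisation reported by NVIDIA's DCGM exporter via Prometheus. This is a sketch under the assumption that KEDA, Prometheus, and the DCGM exporter are already installed in the cluster; the serverAddress and query are placeholders you'd adapt to your own monitoring stack.

# keda-scaledobject.yaml (sketch: assumes KEDA + Prometheus + DCGM exporter)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-inference-scaler
spec:
  scaleTargetRef:
    name: llama-inference
  minReplicaCount: 0                  # scale to zero when idle (see cost tip below)
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llama-inference.*"})
      threshold: "70"                 # target average GPU utilisation (%)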

Cost tip: CoreWeave charges by the second. Use kubectl scale deployment llama-inference --replicas=0 to shut down when not in use. Combine with KEDA autoscaling to scale to zero automatically after idle periods.