
# Ollama — Run LLMs Locally on Your Cluster

Ollama lets you run large language models (LLaMA 3, Mistral, Phi-3, Gemma, etc.) as an API endpoint directly on your cluster. No OpenAI account, no API costs, no data leaving your network.


## Why Run LLMs on Bare Metal

- ✔ Zero API cost (models run locally)
- ✔ Data never leaves your infrastructure
- ✔ No rate limits
- ✔ Usable in CI pipelines, internal tools, and SaaS apps
- ✔ 24 cores + 48 GiB RAM is enough for 7B–13B-parameter models

## What You Can Run

| Model | Parameters | RAM Required | Quality |
|---|---|---|---|
| Phi-3 Mini | 3.8B | ~4 GiB | Fast, good for coding |
| Mistral 7B | 7B | ~6 GiB | Excellent general purpose |
| LLaMA 3 8B | 8B | ~7 GiB | Best open-source general |
| LLaMA 3 70B | 70B | ~48 GiB | GPT-4 level (exceeds a single node's RAM; Ollama can't shard a model across nodes) |
| Gemma 2 9B | 9B | ~8 GiB | Strong reasoning |
| CodeLlama 13B | 13B | ~10 GiB | Code generation |

Your cluster (48 GiB total) can run multiple 7B–13B models simultaneously across nodes.
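The RAM column roughly tracks the 4-bit quantization Ollama applies by default: about half a byte per parameter for weights, plus runtime overhead for the KV cache and buffers. A back-of-envelope sketch (the `overhead_gib` constant is a rough assumption, not a measured figure):

```python
def est_ram_gib(params_billion, bits_per_weight=4, overhead_gib=1.5):
    """Ballpark RAM for a quantized model: weight bytes + runtime overhead.

    At 4-bit quantization a 7B model needs ~3.3 GiB for weights alone;
    KV cache and buffers (overhead_gib, a guess) push it toward ~5 GiB.
    """
    weights_gib = params_billion * 1e9 * bits_per_weight / 8 / 2**30
    return weights_gib + overhead_gib

print(f"7B:  ~{est_ram_gib(7):.1f} GiB")
print(f"13B: ~{est_ram_gib(13):.1f} GiB")
```

Overhead grows with context length, so real usage can run higher than this estimate; treat the table's figures as the practical guide.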


## Deploy Ollama in k3s

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  type: LoadBalancer
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```
```bash
kubectl create namespace ai
kubectl apply -f ollama.yaml

# Watch the pod come up
kubectl get pods -n ai -w
```

## Pull and Run a Model

```bash
# Get a shell in the pod
kubectl exec -n ai -it deploy/ollama -- /bin/bash

# Pull a model
ollama pull mistral
ollama pull llama3

# List models
ollama list

# Chat (interactive)
ollama run mistral
```

## Use the REST API

Ollama exposes its own REST API on port 11434 (the OpenAI-compatible endpoint is covered in the next section):

```bash
# From inside the cluster or via Tailscale/MetalLB
curl http://10.0.0.200:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "Explain Kubernetes in 3 sentences",
    "stream": false
  }'
```
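With `"stream": true` (the default when `stream` is omitted), `/api/generate` returns newline-delimited JSON instead of a single object: each line carries a `response` fragment, and the final line has `"done": true`. A small helper for reassembling the text (the sample chunks shown in the test are synthetic, not real model output):

```python
import json

def collect_stream(lines):
    """Join the incremental "response" fragments from a streamed
    /api/generate call (one JSON object per line) into the full text."""
    parts = []
    for line in lines:
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(parts)
```

In practice you would feed it something like `requests.post(url, json=payload, stream=True).iter_lines()`.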

## OpenAI-Compatible Endpoint

Ollama's `/v1` endpoint is a drop-in replacement for the OpenAI API, so the official SDK works unchanged:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.0.200:11434/v1",
    api_key="ollama",  # any string works; Ollama doesn't check it
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
```
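Note that the chat endpoint is stateless: every request must resend the full conversation, so multi-turn chat means accumulating messages client-side. A minimal sketch of that bookkeeping (the `Conversation` helper is illustrative, not part of any SDK):

```python
class Conversation:
    """Accumulates the message list an OpenAI-compatible chat API expects.

    The server keeps no session state, so every call to
    client.chat.completions.create() must receive all prior turns.
    """

    def __init__(self, system=None):
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        self.messages.append({"role": "assistant", "content": text})

conv = Conversation(system="Answer in one sentence.")
conv.add_user("What is Kubernetes?")
# reply = client.chat.completions.create(model="mistral",
#                                        messages=conv.messages)
# conv.add_assistant(reply.choices[0].message.content)
```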

## Deploy Open-WebUI (Chat Interface)

A full ChatGPT-like browser UI connected to your Ollama instance:

```bash
kubectl apply -n ai -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: http://ollama:11434
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: ai
spec:
  type: LoadBalancer
  selector:
    app: open-webui
  ports:
    - port: 80
      targetPort: 8080
EOF
```

Access at http://10.0.0.201 — your private ChatGPT.


## Done When

- ✔ Ollama pod Running with persistent storage
- ✔ At least one model pulled (mistral or llama3)
- ✔ REST API responding on MetalLB IP
- ✔ Open-WebUI accessible in browser
- ✔ Python client connecting successfully
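Most of the checklist can be scripted against the native API: `GET /api/tags` lists the pulled models. A hedged sketch (assumes the MetalLB IP used earlier):

```python
import json
import urllib.request

OLLAMA_URL = "http://10.0.0.200:11434"  # MetalLB IP assumed from above

def installed_models(tags_body):
    """Parse model names out of a GET /api/tags response body."""
    return [m["name"] for m in json.loads(tags_body).get("models", [])]

def smoke_test():
    # A 200 on /api/tags proves the API is reachable; a non-empty
    # model list proves at least one `ollama pull` succeeded.
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=10) as r:
        names = installed_models(r.read())
    assert names, "API up, but no models pulled yet"
    print("OK, models:", ", ".join(names))
```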