Kubeflow — Full ML Platform on Kubernetes

Kubeflow is a Kubernetes-native ML platform. It orchestrates end-to-end ML pipelines — data prep, training, evaluation, serving — as reproducible, versioned DAGs running directly on your cluster.


What Kubeflow Provides

Component          Purpose
Pipelines          DAG-based ML workflow orchestration
Notebooks          JupyterHub — collaborative notebooks in the browser
Training Operator  Distributed training (PyTorch, TensorFlow, JAX)
KServe             Model serving with autoscaling
Katib              Hyperparameter tuning (AutoML)
TensorBoard        Training visualization

Architecture

Data Scientist (browser)
        ↓
Kubeflow Central Dashboard (k3s)
├── Notebooks (JupyterHub)
├── Pipelines UI (DAG editor)
└── Models (KServe endpoints)
        ↓
Pipeline Run (k8s pods)
├── Step 1: Data ingestion (pod)
├── Step 2: Feature engineering (pod)
├── Step 3: Training (pod — uses all available CPUs)
├── Step 4: Evaluation (pod)
└── Step 5: Model registration → MLflow

Install Kubeflow

Using kustomize (official method)

# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/

# Clone the manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install (takes 5–10 minutes)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."; sleep 10
done

Verify all components

kubectl get pods -n kubeflow --watch

Wait for all pods to be Running (may take 10+ minutes on first install).


Access the Dashboard

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Open: http://localhost:8080

Default credentials: user@example.com / 12341234


Create a Pipeline (Python SDK)

import kfp
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.11")
def load_data(data: Output[Dataset]):
    import json
    # Write to the artifact path KFP provides, so the next step
    # (which runs in a separate pod) can read the file.
    with open(data.path, "w") as f:
        json.dump({"X": [[1, 2], [3, 4]], "y": [0, 1]}, f)

@dsl.component(base_image="python:3.11",
               packages_to_install=["scikit-learn"])
def train_model(data: Input[Dataset]) -> float:
    import json
    from sklearn.linear_model import LogisticRegression
    with open(data.path) as f:
        d = json.load(f)
    model = LogisticRegression()
    model.fit(d["X"], d["y"])
    return float(model.score(d["X"], d["y"]))  # training accuracy

@dsl.pipeline(name="simple-ml-pipeline")
def ml_pipeline():
    data_task = load_data()
    train_model(data=data_task.outputs["data"])

# Compile and submit
client = kfp.Client(host="http://localhost:8080/pipeline")
client.create_run_from_pipeline_func(ml_pipeline, arguments={})

Each pipeline step runs in its own Kubernetes pod, so every run is reproducible and auditable.
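Under the hood, Pipelines resolves the step DAG and launches each step only after its upstream outputs exist. A minimal stdlib sketch of that dependency-ordered execution (the step names mirror the example above; `run` is a hypothetical stand-in for launching a pod, not KFP API):

```python
from graphlib import TopologicalSorter

# Each step lists the steps it depends on.
steps = {
    "load_data": [],
    "train_model": ["load_data"],
}

def run(step: str) -> str:
    # Stand-in for launching a pod; returns a fake status.
    return f"{step}: done"

# static_order() yields steps in an order that respects dependencies.
order = list(TopologicalSorter(steps).static_order())
results = [run(s) for s in order]
print(order)  # ['load_data', 'train_model']
```

A real orchestrator additionally runs independent steps in parallel; the topological order only constrains which steps must wait for which.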


Distributed Training

Run PyTorch training across all 3 nodes simultaneously:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command: ["python", "train.py", "--distributed"]
              resources:
                requests:
                  cpu: "6"
                  memory: "12Gi"
    Worker:
      replicas: 2  # fast-skunk + fast-heron
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command: ["python", "train.py", "--distributed"]
              resources:
                requests:
                  cpu: "6"
                  memory: "12Gi"

This requests 18 cores and 36 GiB of RAM in total: three pods (one master, two workers) at 6 CPUs and 12 GiB each, one per node.
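The 18-core / 36 GiB figure falls straight out of the replica spec — one master plus two workers, each requesting 6 CPUs and 12 GiB. A quick sketch of the arithmetic:

```python
# Replica counts and per-pod requests from the PyTorchJob above.
replicas = {"Master": 1, "Worker": 2}
cpu_per_pod, mem_gib_per_pod = 6, 12

total_pods = sum(replicas.values())          # 3 pods
total_cpu = total_pods * cpu_per_pod         # 18 cores
total_mem = total_pods * mem_gib_per_pod     # 36 GiB
print(total_cpu, total_mem)  # 18 36
```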


Katib — Hyperparameter Tuning

Automatically search for the best hyperparameters:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hp-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: learning_rate
      - name: batchSize
        reference: batch_size
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-training-image:latest
                command:
                  - python
                  - train.py
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
            restartPolicy: Never

Katib runs 3 trials at a time (up to 12 total), uses Bayesian optimization to choose each new parameter set from previous results, and stops early once a trial reaches the 0.99 accuracy goal.
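Conceptually, an experiment like this samples points from the feasible space, scores each trial, and keeps the best. A toy random-search loop in plain Python (Katib's Bayesian optimization is smarter about where it samples next, but the trial loop has the same shape; the `objective` function is a made-up stand-in for a real training run):

```python
import random

random.seed(0)

def objective(lr: float, batch_size: int) -> float:
    # Stand-in for a real training run's accuracy; peaks near lr=0.01.
    return 1.0 - abs(lr - 0.01) * 5 - abs(batch_size - 64) / 1000

best = None
for _ in range(12):  # maxTrialCount
    trial = {
        "learning_rate": random.uniform(0.001, 0.1),  # feasibleSpace
        "batch_size": random.randint(16, 128),
    }
    acc = objective(trial["learning_rate"], trial["batch_size"])
    if best is None or acc > best[0]:
        best = (acc, trial)

print(best)
```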


Resource Requirements

Component             CPU                     RAM
Kubeflow core         4 cores                 8 GiB
Istio service mesh    2 cores                 4 GiB
JupyterHub notebooks  1–4 cores per user      2–8 GiB per user
Training jobs         up to cluster capacity  up to cluster capacity

Kubeflow is the heaviest component in this stack — ensure Longhorn storage is set up first.
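Before installing, it helps to sanity-check the headroom left for notebooks and training after the baseline components. A quick sketch, assuming a 3-node cluster at 6 cores / 12 GiB per node (matching the distributed-training example above):

```python
# Assumed cluster: 3 nodes, 6 cores / 12 GiB each.
nodes = 3
node_cpu, node_mem_gib = 6, 12
capacity_cpu = nodes * node_cpu          # 18 cores
capacity_mem = nodes * node_mem_gib      # 36 GiB

# Baseline requests from the table above (Kubeflow core + Istio).
baseline_cpu, baseline_mem = 4 + 2, 8 + 4

free_cpu = capacity_cpu - baseline_cpu   # cores left for notebooks/jobs
free_mem = capacity_mem - baseline_mem   # GiB left
print(free_cpu, free_mem)  # 12 24
```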


Done When

✔ All Kubeflow pods Running
✔ Dashboard accessible
✔ First pipeline submitted and completed
✔ JupyterHub notebook spawning successfully