# Kubeflow — Full ML Platform on Kubernetes
Kubeflow is the Kubernetes-native ML platform. It orchestrates end-to-end ML pipelines — data prep, training, evaluation, serving — as reproducible, versioned DAGs running directly on your cluster.
## What Kubeflow Provides
| Component | Purpose |
|---|---|
| Pipelines | DAG-based ML workflow orchestration |
| Notebooks | JupyterHub — collaborative notebooks in the browser |
| Training Operator | Distributed training (PyTorch, TensorFlow, JAX) |
| KServe | Model serving with autoscaling |
| Katib | Hyperparameter tuning (AutoML) |
| TensorBoard | Training visualization |
## Architecture

```
Data Scientist (browser)
        │
        ▼
Kubeflow Central Dashboard (k3s)
 ├── Notebooks (JupyterHub)
 ├── Pipelines UI (DAG editor)
 └── Models (KServe endpoints)
        │
        ▼
Pipeline Run (k8s pods)
 ├── Step 1: Data ingestion (pod)
 ├── Step 2: Feature engineering (pod)
 ├── Step 3: Training (pod — uses all available CPUs)
 ├── Step 4: Evaluation (pod)
 └── Step 5: Model registration → MLflow
```
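The pipeline run above is a DAG: each step starts only after its upstream steps complete. As a minimal illustration of that scheduling guarantee, here is the same dependency chain resolved in plain Python (step names are hypothetical stand-ins for the five steps in the diagram):

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on (hypothetical names
# mirroring the five pipeline steps above).
dag = {
    "ingest": [],
    "features": ["ingest"],
    "train": ["features"],
    "evaluate": ["train"],
    "register": ["evaluate"],
}

# static_order() yields steps in an order that respects every
# dependency — exactly the guarantee a pipeline engine provides.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest', 'features', 'train', 'evaluate', 'register']
```

In Kubeflow, each node of this graph becomes its own pod, and independent branches run in parallel.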
## Install Kubeflow

### Using kustomize (official method)

```bash
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/

# Clone the manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install (takes 5–10 minutes; the retry loop is needed because some
# resources depend on CRDs created earlier in the same apply)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."; sleep 10
done
```
### Verify all components

```bash
kubectl get pods -n kubeflow --watch
```

Wait for all pods to reach `Running` (may take 10+ minutes on first install).
## Access the Dashboard

```bash
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```

Open: http://localhost:8080

Default credentials: `user@example.com` / `12341234`. Change these before exposing the dashboard beyond localhost.
## Create a Pipeline (Python SDK)

```python
import kfp
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.11")
def load_data(data: Output[Dataset]):
    import json
    # Write to the artifact path KFP provisions; artifacts (not raw
    # /tmp paths) are how files move between steps, since each step
    # runs in its own pod.
    with open(data.path, "w") as f:
        json.dump({"X": [[1, 2], [3, 4]], "y": [0, 1]}, f)

@dsl.component(base_image="python:3.11",
               packages_to_install=["scikit-learn"])
def train_model(data: Input[Dataset]) -> float:
    import json
    from sklearn.linear_model import LogisticRegression
    with open(data.path) as f:
        d = json.load(f)
    model = LogisticRegression()
    model.fit(d["X"], d["y"])
    return float(model.score(d["X"], d["y"]))  # training accuracy

@dsl.pipeline(name="simple-ml-pipeline")
def ml_pipeline():
    data_task = load_data()
    train_model(data=data_task.outputs["data"])

# Compile and submit
client = kfp.Client(host="http://localhost:8080/pipeline")
client.create_run_from_pipeline_func(ml_pipeline, arguments={})
```
The pipeline creates Kubernetes pods for each step — fully reproducible and auditable.
## Distributed Training

Run PyTorch training across all 3 nodes simultaneously:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command: ["python", "train.py", "--distributed"]
              resources:
                requests:
                  cpu: "6"
                  memory: "12Gi"
    Worker:
      replicas: 2  # fast-skunk + fast-heron
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
              command: ["python", "train.py", "--distributed"]
              resources:
                requests:
                  cpu: "6"
                  memory: "12Gi"
```

This requests 18 cores and 36 GiB of RAM (3 replicas × 6 cores / 12 GiB each) across all 3 nodes in parallel.
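Inside each replica, the Training Operator injects rendezvous environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) that `train.py` uses to join the process group. A stdlib-only sketch of reading them — the torch call is shown as a comment, and the default values are illustrative, for a dry run outside the cluster:

```python
import os

# The operator sets these on every pod; the defaults below only
# take effect when running outside the cluster (illustrative values).
os.environ.setdefault("MASTER_ADDR", "distributed-training-master-0")
os.environ.setdefault("MASTER_PORT", "23456")
os.environ.setdefault("WORLD_SIZE", "3")  # 1 master + 2 workers
os.environ.setdefault("RANK", "0")        # master is rank 0

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"rank {rank}/{world_size}, rendezvous at "
      f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")

# In train.py this is where you would initialize the process group:
# torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
```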
## Katib — Hyperparameter Tuning

Automatically search for the best hyperparameters:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hp-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: learning_rate
      - name: batchSize
        reference: batch_size
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training
                image: my-training-image:latest
                # Trial parameters must be substituted into the command,
                # otherwise every trial trains with the same settings.
                command:
                  - python
                  - train.py
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
```

Katib runs 3 trials in parallel and uses Bayesian optimization over previous results to propose the next hyperparameters, stopping after 12 trials or once accuracy reaches 0.99.
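What the Experiment automates can be pictured as a search loop over the same feasible space. The naive stdlib sketch below uses random search instead of Bayesian optimization, with a made-up `objective` standing in for a real training run:

```python
import random

random.seed(0)

def objective(lr, batch_size):
    # Stand-in for "train a model, report accuracy": a smooth bump
    # peaking near lr=0.01, batch_size=64 (purely illustrative).
    return 1.0 - abs(lr - 0.01) * 5 - abs(batch_size - 64) / 500

# 12 trials over the feasible space from the Experiment above.
trials = [
    {"lr": random.uniform(0.001, 0.1), "batch_size": random.randint(16, 128)}
    for _ in range(12)
]
best = max(trials, key=lambda p: objective(p["lr"], p["batch_size"]))
print(best)
```

Katib replaces this blind sampling with a model of the objective, so later trials concentrate on promising regions of the space instead of being drawn uniformly.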
## Resource Requirements
| Component | CPU | RAM |
|---|---|---|
| Kubeflow core | 4 cores | 8 GiB |
| Istio service mesh | 2 cores | 4 GiB |
| JupyterHub notebooks | 1–4 cores per user | 2–8 GiB per user |
| Training jobs | Up to cluster capacity | Up to cluster capacity |
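As a quick sanity check, the baseline overhead from the table can be subtracted from total cluster capacity. The arithmetic below assumes the 3-node, 6-core/12-GiB-per-node cluster implied by the Distributed Training example:

```python
# Baseline platform overhead (from the table above).
baseline_cpu = 4 + 2   # Kubeflow core + Istio, in cores
baseline_ram = 8 + 4   # in GiB

# Assumed cluster: 3 nodes x 6 cores / 12 GiB each.
cluster_cpu = 3 * 6
cluster_ram = 3 * 12

free_cpu = cluster_cpu - baseline_cpu
free_ram = cluster_ram - baseline_ram
print(f"headroom for notebooks and jobs: {free_cpu} cores, {free_ram} GiB")
```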
Kubeflow is the heaviest component in this stack — ensure Longhorn storage is set up first.
## Done When
✔ All Kubeflow pods Running
✔ Dashboard accessible
✔ First pipeline submitted and completed
✔ JupyterHub notebook spawning successfully