# ClickHouse — Columnar Analytics Warehouse

ClickHouse is a column-oriented database built for real-time analytics. It can ingest millions of rows per second from Kafka, execute complex aggregations in milliseconds, and serve as the single source of truth for analytical workloads.
## Why ClickHouse
| Capability | PostgreSQL | ClickHouse |
|---|---|---|
| 1B row aggregation | ~60s | ~0.5s |
| Kafka native ingestion | Via Debezium | Built-in Kafka table engine |
| Columnar storage | ❌ | ✅ (10-50x better compression) |
| Real-time + batch | Batch only | Both |
| SQL dialect | Standard | Extended (array functions, time-series) |
| k8s support | Zalando/CNPG operator | Altinity Operator |
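The columnar numbers in the table come down to I/O: a row store must touch every field of every record to aggregate one column, while a column store scans a single contiguous array. A toy Python illustration of the layouts (not a benchmark — the values are made up):

```python
# Row layout: one record per dict; summing "amount" still walks every field.
rows = [
    {"order_id": "o1", "user_id": "u1", "amount": 10.0, "status": "paid"},
    {"order_id": "o2", "user_id": "u2", "amount": 25.0, "status": "paid"},
    {"order_id": "o3", "user_id": "u1", "amount": 5.0,  "status": "refunded"},
]
row_total = sum(r["amount"] for r in rows)

# Columnar layout: each column is its own array; the aggregation reads
# only "amount" — the other columns are never touched (and compress well).
columns = {
    "order_id": ["o1", "o2", "o3"],
    "user_id":  ["u1", "u2", "u1"],
    "amount":   [10.0, 25.0, 5.0],
    "status":   ["paid", "paid", "refunded"],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 40.0
```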
## Install Altinity ClickHouse Operator

```bash
# Install the operator
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

# Verify
kubectl get pods -n kube-system | grep clickhouse
```
## Deploy ClickHouse Cluster
```yaml
# clickhouse-cluster.yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: clickhouse
  namespace: data-warehouse
spec:
  configuration:
    clusters:
      - name: data
        layout:
          shardsCount: 1
          replicasCount: 2      # 2 replicas for HA (Replicated* tables additionally need ClickHouse Keeper)
    users:
      admin/password: ""        # set via secret
      admin/networks/ip: "::/0"
    settings:
      max_memory_usage: 10000000000   # 10 GB per query
      max_concurrent_queries: 100
      background_pool_size: 8
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.1
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "4"
                  memory: "8Gi"
    volumeClaimTemplates:
      - name: data-volume
        spec:
          accessModes:
            - ReadWriteOnce
          storageClassName: longhorn
          resources:
            requests:
              storage: 200Gi
```
```bash
kubectl apply -f clickhouse-cluster.yaml

# Wait for pods
kubectl get pods -n data-warehouse -w

# Connect to ClickHouse
kubectl exec -it -n data-warehouse chi-clickhouse-data-0-0-0 -- \
  clickhouse-client --user admin --password $ADMIN_PASSWORD
```
## Create Databases and Tables
```sql
-- Create analytics databases
CREATE DATABASE IF NOT EXISTS analytics;
CREATE DATABASE IF NOT EXISTS raw;

-- Raw events table (ingested from Kafka)
CREATE TABLE raw.kafka_orders
(
    order_id String,
    user_id String,
    amount Float64,
    status String,
    created_at DateTime,
    _kafka_ts DateTime DEFAULT now()
)
ENGINE = MergeTree()
ORDER BY (created_at, order_id)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 90 DAY;  -- auto-expire after 90 days

-- Aggregated orders mart (populated by dbt)
CREATE TABLE analytics.mart_orders
(
    date Date,
    hour UInt8,
    total_orders UInt64,
    total_revenue Float64,
    avg_order_value Float64,
    unique_users UInt64
)
ENGINE = ReplacingMergeTree()  -- dedupes rows sharing (date, hour) during merges
ORDER BY (date, hour);
```
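ReplacingMergeTree deduplicates rows that share the same `ORDER BY` key, keeping the last-inserted row — but only during background merges, so a `SELECT` can still see duplicates until a merge (or `FINAL`) runs. A minimal Python sketch of that semantics, useful when reasoning about dbt re-runs into `analytics.mart_orders` (the example rows are hypothetical):

```python
def replacing_merge(rows, key_fields):
    """Keep the last-inserted row per ORDER BY key (ReplacingMergeTree-style)."""
    merged = {}
    for row in rows:  # dict insertion order models arrival order
        merged[tuple(row[f] for f in key_fields)] = row
    return list(merged.values())

# A dbt re-run inserts a corrected row for the same (date, hour) key:
inserts = [
    {"date": "2024-05-01", "hour": 9, "total_orders": 100, "total_revenue": 900.0},
    {"date": "2024-05-01", "hour": 9, "total_orders": 101, "total_revenue": 910.0},
]
survivors = replacing_merge(inserts, ["date", "hour"])
assert len(survivors) == 1 and survivors[0]["total_orders"] == 101
```

Until the merge happens, query with `FINAL` (or aggregate) if you need exact results.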
## Kafka Table Engine — Real-Time Ingestion

ClickHouse can read directly from Kafka topics without a separate ETL job:
```sql
-- Step 1: Kafka source table (reads from topic)
CREATE TABLE raw.kafka_orders_queue
(
    order_id String,
    user_id String,
    amount Float64,
    status String,
    created_at String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'redpanda-0.redpanda.data-platform.svc:9093',
    kafka_topic_list = 'orders.order.created',
    kafka_group_name = 'clickhouse-consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 4;

-- Step 2: Materialized view to move rows into the real table
CREATE MATERIALIZED VIEW raw.kafka_orders_mv TO raw.kafka_orders AS
SELECT
    order_id,
    user_id,
    amount,
    status,
    parseDateTimeBestEffort(created_at) AS created_at
FROM raw.kafka_orders_queue;
```
Once created, ClickHouse continuously consumes from Kafka — no Airflow job needed for real-time data.
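Because the queue table is declared with `kafka_format = 'JSONEachRow'`, each Kafka message must be one JSON object whose keys match the queue table's columns. A sketch of what a producer would put on `orders.order.created` (the field values are hypothetical examples):

```python
import json

# One Kafka message = one JSON object matching raw.kafka_orders_queue.
event = {
    "order_id": "ord-123",                 # hypothetical example values
    "user_id": "usr-42",
    "amount": 19.99,
    "status": "created",
    "created_at": "2024-05-01T09:15:00Z",  # string; parseDateTimeBestEffort handles it
}
message = json.dumps(event)  # this string is the Kafka message payload
```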
## Useful ClickHouse Queries
```sql
-- Orders per hour (last 24h)
SELECT
    toStartOfHour(created_at) AS hour,
    count() AS orders,
    sum(amount) AS revenue
FROM raw.kafka_orders
WHERE created_at >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

-- Top users by spend (last 30 days)
SELECT
    user_id,
    count() AS order_count,
    sum(amount) AS total_spent
FROM raw.kafka_orders
WHERE created_at >= today() - 30
GROUP BY user_id
ORDER BY total_spent DESC
LIMIT 20;

-- Check Kafka consumer state (table available in recent ClickHouse releases)
SELECT * FROM system.kafka_consumers;
```
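`system.kafka_consumers` reports committed offsets per partition; lag is the distance between the broker's latest offset (high watermark) and the committed offset. A toy sketch of that arithmetic with made-up offsets:

```python
# Consumer lag per partition = broker high watermark - committed offset.
high_watermarks = {0: 1500, 1: 1480, 2: 1510}  # hypothetical latest offsets
committed       = {0: 1500, 1: 1200, 2: 1505}  # hypothetical committed offsets

lag = {p: high_watermarks[p] - committed[p] for p in high_watermarks}
total_lag = sum(lag.values())
# partition 1 is 280 messages behind; partitions 0 and 2 are caught up or close
```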
## ClickHouse HTTP Interface (for Superset)
```bash
# Test HTTP interface
curl -u admin:$ADMIN_PASSWORD \
  "http://clickhouse-clickhouse.data-warehouse.svc:8123/?query=SELECT+count()+FROM+raw.kafka_orders"
```
```bash
# Create Superset user with read-only access
# (-i without -t: a TTY can't be allocated when stdin is a heredoc)
kubectl exec -i -n data-warehouse chi-clickhouse-data-0-0-0 -- \
  clickhouse-client --user admin --password "$ADMIN_PASSWORD" --multiquery << 'EOF'
CREATE USER superset
IDENTIFIED WITH sha256_password BY 'SupersetPass123!'
HOST ANY;
GRANT SELECT ON analytics.* TO superset;
GRANT SELECT ON raw.* TO superset;
EOF
```
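To keep the plaintext password out of manifests and shell history, ClickHouse also accepts a precomputed digest via `IDENTIFIED WITH sha256_hash BY '<hex>'`. A sketch of computing it (the password is the example one above; ClickHouse's docs show the hex in uppercase):

```python
import hashlib

# SHA-256 hex digest of the password, usable with:
#   CREATE USER superset IDENTIFIED WITH sha256_hash BY '<digest>';
digest = hashlib.sha256(b"SupersetPass123!").hexdigest().upper()
print(digest)  # 64 uppercase hex characters
```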
## Backup with clickhouse-backup
```bash
# Install clickhouse-backup as a sidecar
# (backs up to MinIO, already running for Velero)

# Manual backup
kubectl exec -n data-warehouse chi-clickhouse-data-0-0-0 -- \
  clickhouse-backup create --tables "analytics.*"

# List backups
kubectl exec -n data-warehouse chi-clickhouse-data-0-0-0 -- \
  clickhouse-backup list
```
## Ingress / UI Access

ClickHouse ships with a built-in Play UI (at `/play`) on the HTTP port 8123:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: clickhouse-ui
  namespace: data-warehouse
spec:
  ingressClassName: nginx
  rules:
    - host: clickhouse.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: clickhouse-clickhouse
                port:
                  number: 8123
```
## Done When

- ✔ 2-replica ClickHouse cluster running with Longhorn storage
- ✔ Kafka materialized view consuming `orders.order.created` in real time
- ✔ `raw.kafka_orders` table growing with events
- ✔ `analytics.mart_orders` table exists for dbt output
- ✔ `superset` user created with SELECT grants
- ✔ HTTP interface accessible at clickhouse.yourdomain.com