Data Layer — Complete Enterprise Data Platform

The data layer transforms raw events from your platform into actionable business intelligence. It covers the full chain: event ingestion → storage → transformation → orchestration → visualization → governance.


Complete Data Chain

┌─────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ Microservices │ Databases (CDC) │ Logs │ k8s Metrics │
└────────┬────────┴────────┬──────────┴───┬────┴────────┬─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ INGESTION (Kafka / Redpanda) │
│ Topics: events.orders events.users db.changes platform.logs │
└──────────────────────────────┬──────────────────────────────────────┘

┌────────────────┴────────────────┐
▼ ▼
┌─────────────────────────┐ ┌──────────────────────────┐
│ Stream Processing │ │ Batch Loading │
│ (Kafka Streams / KSQL) │ │ (Airflow → ClickHouse) │
└────────────┬────────────┘ └─────────────┬────────────┘
│ │
└──────────────┬──────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ STORAGE (ClickHouse) │
│ raw_events │ orders │ users │ metrics │ audit_logs │
└──────────────────────────┬──────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────┐
│ TRANSFORMATION (dbt) │
│ staging → intermediate → marts (finance, product, ops) │
└──────────────────────────┬──────────────────────────────────────────┘

┌────────────┴────────────┐
▼ ▼
┌─────────────────────────┐ ┌────────────────────────────┐
│ VISUALIZATION │ │ GOVERNANCE │
│ Apache Superset │ │ OpenMetadata │
│ Dashboards / Alerts │ │ Catalog / Lineage │
└─────────────────────────┘ └────────────────────────────┘

Stack Components

| Component | Role | Why This One |
| --- | --- | --- |
| Kafka / Redpanda | Event streaming backbone + CDC | Industry standard; Redpanda is Kafka-compatible without the JVM |
| ClickHouse | Columnar analytics warehouse | 100-1000x faster than Postgres for analytics; native Kubernetes support |
| dbt | SQL transformation layer | Version-controlled, tested SQL; model dependency DAGs |
| Apache Airflow | Pipeline orchestration | Already in Phase 16; reused for data pipeline scheduling |
| Apache Superset | Self-hosted BI & dashboards | 40+ chart types; OIDC login via Keycloak |
| OpenMetadata | Data catalog, lineage, quality | Unified governance; auto-discovers ClickHouse, dbt, and Airflow |

Data Namespace Layout

kubectl get namespaces | grep data
# data-platform    kafka, redpanda, schema-registry
# data-warehouse   clickhouse
# data-transform   dbt (k8s Job / Airflow tasks)
# data-viz         superset
# data-catalog     openmetadata
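The namespace-to-component mapping above can be expressed as a small lookup, useful for deploy scripts or checks. The mapping itself comes from the layout shown; the helper function is a hypothetical sketch, not part of the platform:

```python
# Mapping mirrors the namespace layout documented above.
namespaces = {
    "data-platform":  ["kafka", "redpanda", "schema-registry"],
    "data-warehouse": ["clickhouse"],
    "data-transform": ["dbt"],
    "data-viz":       ["superset"],
    "data-catalog":   ["openmetadata"],
}

def namespace_for(component: str) -> str:
    """Return the namespace that hosts a given component (illustrative helper)."""
    return next(ns for ns, comps in namespaces.items() if component in comps)

print(namespace_for("clickhouse"))  # data-warehouse
```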

Kafka Topic Naming Convention

<domain>.<entity>.<event-type>

Examples:
orders.order.created
orders.order.fulfilled
payments.payment.processed
users.user.registered
platform.k8s.pod-started
db.postgres.changes ← CDC via Debezium
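A small helper can enforce the `<domain>.<entity>.<event-type>` convention before topics are created. This is an illustrative sketch; the function name and the lowercase/hyphen character rule are assumptions, not a platform API:

```python
import re

# Assumed rule: each segment is lowercase alphanumerics or hyphens.
TOPIC_PATTERN = re.compile(r"^[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$")

def topic_name(domain: str, entity: str, event_type: str) -> str:
    """Compose a Kafka topic name following the naming convention."""
    name = f"{domain}.{entity}.{event_type}"
    if not TOPIC_PATTERN.match(name):
        raise ValueError(f"invalid topic name: {name}")
    return name

print(topic_name("orders", "order", "created"))      # orders.order.created
print(topic_name("platform", "k8s", "pod-started"))  # platform.k8s.pod-started
```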

Data Flow: Order Processing Example

1. Order service publishes to Kafka topic: orders.order.created
2. ClickHouse Kafka engine ingests rows in real time
3. Airflow triggers dbt daily run at 06:00 UTC
4. dbt builds mart_orders (aggregated, enriched)
5. Superset dashboard "Orders KPIs" auto-refreshes
6. OpenMetadata shows lineage: kafka → clickhouse → dbt → superset
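Step 1 of the flow above can be sketched as the order service serializing an event for the `orders.order.created` topic. The envelope fields (`event_id`, `event_type`, `occurred_at`, `payload`) and the field names inside `payload` are hypothetical, shown only to illustrate the shape of a message that the ClickHouse Kafka engine would then ingest:

```python
import json
import uuid
from datetime import datetime, timezone

def order_created_event(order_id: str, user_id: str, total_cents: int) -> str:
    """Serialize a hypothetical orders.order.created message as JSON."""
    event = {
        "event_id": str(uuid.uuid4()),          # unique per event
        "event_type": "orders.order.created",   # matches the topic name
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "order_id": order_id,
            "user_id": user_id,
            "total_cents": total_cents,
        },
    }
    return json.dumps(event)

message = order_created_event("o-123", "u-456", 2599)
```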

Infrastructure Requirements

| Component | CPU | Memory | Storage | Notes |
| --- | --- | --- | --- | --- |
| Redpanda (3 brokers) | 2 CPU each | 4 Gi each | 50 Gi SSD | Use Longhorn storage class |
| ClickHouse | 4 CPU | 8 Gi | 200 Gi | ReplicatedMergeTree for HA |
| dbt | 0.5 CPU | 512 Mi | — | Runs as k8s Job / Airflow task |
| Superset | 1 CPU | 2 Gi | 10 Gi | Postgres for metadata |
| OpenMetadata | 2 CPU | 4 Gi | 20 Gi | Elasticsearch + MySQL backend |
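For capacity planning, the table's per-replica requests can be totaled in a few lines (dbt's 512 Mi is converted to 0.5 Gi; the three Redpanda brokers are counted individually):

```python
# Per-replica requests from the requirements table: (replicas, cpu, mem_gi).
requests = {
    "redpanda":     (3, 2.0, 4.0),
    "clickhouse":   (1, 4.0, 8.0),
    "dbt":          (1, 0.5, 0.5),  # 512 Mi = 0.5 Gi
    "superset":     (1, 1.0, 2.0),
    "openmetadata": (1, 2.0, 4.0),
}

total_cpu = sum(n * cpu for n, cpu, _ in requests.values())
total_mem = sum(n * mem for n, _, mem in requests.values())
print(f"total: {total_cpu} CPU, {total_mem} Gi")  # total: 13.5 CPU, 26.5 Gi
```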

Done When

✔ Kafka topics receive events from at least one microservice
✔ ClickHouse ingests from Kafka in real time
✔ dbt models transform raw → mart tables on schedule
✔ Superset dashboard shows live order/user metrics
✔ OpenMetadata catalogs all datasets with lineage