
OpenMetadata — Data Catalog, Lineage & Governance

OpenMetadata is the platform's unified data catalog: it auto-discovers data assets (ClickHouse tables, dbt models, Airflow pipelines, Kafka topics), builds the lineage graph automatically, tracks data quality, and gives the entire data team a searchable catalog.


What OpenMetadata Provides

| Feature | Description |
| --- | --- |
| Asset Discovery | Auto-crawls ClickHouse, Kafka, dbt, Airflow, Superset |
| Data Lineage | Visual graph: Kafka → ClickHouse → dbt → Superset |
| Data Quality | Define and run tests on table columns |
| Glossary | Business terms linked to physical columns |
| Ownership | Every table has an owner team and contact |
| PII Classification | Auto-tags columns containing PII data |
| Data Contracts | Define expectations for producer/consumer agreements |

Install OpenMetadata

```bash
# Add Helm repo
helm repo add open-metadata https://helm.open-metadata.org
helm repo update
```

values-openmetadata.yaml

```yaml
# values-openmetadata.yaml
openmetadata:
  config:
    authentication:
      provider: "custom-oidc"
      publicKeyUrls:
        - "https://keycloak.yourdomain.com/realms/platform/protocol/openid-connect/certs"
      authority: "https://keycloak.yourdomain.com/realms/platform"
      clientId: "openmetadata"
      callbackUrl: "https://catalog.yourdomain.com/callback"

    authorizer:
      className: "org.openmetadata.service.security.DefaultAuthorizer"
      containerRequestFilter: "org.openmetadata.service.security.JwtFilter"
      initialAdmins:
        - "platform-admin@yourdomain.com"

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: catalog.yourdomain.com
      paths:
        - path: /
          pathType: Prefix

# OpenMetadata requires MySQL + Elasticsearch
mysql:
  enabled: true
  auth:
    rootPassword: "OpenMetaRoot123!"
    database: openmetadata_db
    username: openmetadata
    password: "OpenMetaPass123!"

elasticsearch:
  enabled: true
  replicas: 1
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "3Gi"
```

```bash
kubectl create namespace data-catalog

helm upgrade --install openmetadata open-metadata/openmetadata \
  --namespace data-catalog \
  --values values-openmetadata.yaml \
  --wait --timeout 15m
```

Connect Data Sources (Ingestion Connectors)

ClickHouse Connector

```yaml
# Via OpenMetadata UI: Settings → Services → Databases → Add New Service
# Or via API:

source:
  type: clickhouse
  serviceName: clickhouse-warehouse
  serviceConnection:
    config:
      type: Clickhouse
      hostPort: "clickhouse-clickhouse.data-warehouse.svc:8123"
      username: superset
      password: "{{ env('CLICKHOUSE_PASSWORD') }}"
      database: analytics

  sourceConfig:
    config:
      type: DatabaseMetadata
      markDeletedTables: true
      includeTables: true
      includeViews: true
      schemaFilterPattern:
        includes:
          - analytics
          - raw

sink:
  type: metadata-rest
  config: {}

workflowConfig:
  openMetadataServerConfig:
    hostPort: "http://openmetadata.data-catalog.svc:8585/api"
    authProvider: openmetadata
    securityConfig:
      jwtToken: "{{ env('OM_JWT_TOKEN') }}"
```

Kafka/Redpanda Connector

```yaml
source:
  type: redpanda
  serviceName: redpanda-platform
  serviceConnection:
    config:
      type: Redpanda
      bootstrapServers: "redpanda-0.redpanda.data-platform.svc:9093"
      schemaRegistryURL: "http://redpanda-schema-registry.data-platform.svc:8081"

  sourceConfig:
    config:
      type: MessagingMetadata
      topicFilterPattern:
        includes:
          - "orders.*"
          - "payments.*"
          - "users.*"
```

dbt Connector

```yaml
source:
  type: dbt
  serviceName: dbt-clickhouse
  serviceConnection:
    config:
      type: dbt
      dbtConfigSource:
        dbtConfigType: local
        dbtCatalogFilePath: /dbt/target/catalog.json
        dbtManifestFilePath: /dbt/target/manifest.json
        dbtRunResultsFilePath: /dbt/target/run_results.json
```
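The three artifact files above are produced by dbt itself: `dbt docs generate` writes `catalog.json`, and `dbt run`/`dbt build` write `manifest.json` and `run_results.json` into `target/`. A quick way to sanity-check which models the connector will pick up is to read the `nodes` section of the manifest (keys shaped `model.<project>.<name>` are standard dbt; the helper and the synthetic manifest below are illustrative, not part of OpenMetadata):

```python
# Sketch: list the dbt models OpenMetadata's dbt connector will see.
# dbt_model_names is a hypothetical helper; the manifest dict below is
# a tiny synthetic stand-in for target/manifest.json.
import json


def dbt_model_names(manifest: dict) -> list[str]:
    """Return model names from a parsed dbt manifest (skips tests, seeds, etc.)."""
    return sorted(
        node["name"]
        for key, node in manifest.get("nodes", {}).items()
        if key.startswith("model.")
    )


manifest = {
    "nodes": {
        "model.analytics.stg_orders": {"name": "stg_orders"},
        "model.analytics.mart_orders": {"name": "mart_orders"},
        "test.analytics.not_null_orders_id": {"name": "not_null_orders_id"},
    }
}
print(dbt_model_names(manifest))  # ['mart_orders', 'stg_orders']
```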

Airflow Connector

```yaml
source:
  type: airflow
  serviceName: airflow-platform
  serviceConnection:
    config:
      type: Airflow
      hostPort: "http://airflow-webserver.automation.svc:8080"
      numberOfStatus: 10
      connection:
        type: Backend
```

Run Ingestion Pipelines

```bash
# Run a connector once via the OpenMetadata ingestion image
# (assumes the workflow config is mounted at /configs, e.g. from a ConfigMap)
kubectl run om-ingestion \
  --image=openmetadata/ingestion:1.3.0 \
  --namespace=data-catalog \
  --restart=Never \
  --rm -it \
  -- metadata ingest -c /configs/clickhouse-ingestion.yaml
```

Or schedule via the OpenMetadata UI:

Settings → Services → clickhouse-warehouse
→ Ingestion → Add Ingestion
→ Type: Metadata Ingestion
→ Schedule: Every 6 hours
→ Deploy
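If you prefer schedules in Git rather than the UI, the same one-off `kubectl run` invocation can be wrapped in a Kubernetes CronJob. A sketch, assuming the workflow config lives in a ConfigMap named `om-ingestion-configs` (that name and the mount path are assumptions):

```yaml
# Sketch: scheduled ingestion as a CronJob instead of a UI-deployed pipeline.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: om-clickhouse-ingestion
  namespace: data-catalog
spec:
  schedule: "0 */6 * * *"   # every 6 hours, matching the UI schedule above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ingestion
              image: openmetadata/ingestion:1.3.0
              args: ["metadata", "ingest", "-c", "/configs/clickhouse-ingestion.yaml"]
              volumeMounts:
                - name: configs
                  mountPath: /configs
          volumes:
            - name: configs
              configMap:
                name: om-ingestion-configs   # assumed ConfigMap holding the YAML
```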

Automatic Data Lineage

After ingesting ClickHouse + dbt + Superset, OpenMetadata auto-builds lineage:

```
[Kafka Topic] orders.order.created
      │
      ▼ (Kafka Engine)
[ClickHouse Table] raw.kafka_orders
      │
      ▼ (dbt stg_orders)
[dbt Model] analytics.stg_orders
      │
      ▼ (dbt int_order_enriched)
[dbt Model] analytics.int_order_enriched
      │
      ▼ (dbt mart_orders)
[dbt Model] analytics.mart_orders
      │
      ▼ (Superset Dataset)
[Superset Dashboard] Orders KPIs
```

Visible at: catalog.yourdomain.com → Explore → mart_orders → Lineage


Data Quality Tests

```yaml
# Via UI: Table → Profiler → Add Test

testSuite:
  name: "orders-quality-suite"
  executableEntityReference: "analytics.mart_orders"

testCases:
  - name: "total_revenue_non_negative"
    testDefinitionName: columnValuesToBeBetween
    columnName: total_revenue
    parameterValues:
      - name: minValue
        value: "0"

  - name: "orders_freshness"
    testDefinitionName: tableRowCountToBeBetween
    parameterValues:
      - name: minValue
        value: "1"
    description: "At least 1 row ingested today"
```
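The suite can also run headlessly through the ingestion framework as a TestSuite workflow. A sketch following the shape of the connector configs above; the fully qualified entity name (service-prefixed) and the `orm-test-runner` processor are assumptions to verify against your OpenMetadata version:

```yaml
# Sketch: run orders-quality-suite via the ingestion framework.
source:
  type: TestSuite
  serviceName: orders-quality-suite
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: "clickhouse-warehouse.analytics.mart_orders"  # assumed FQN

processor:
  type: "orm-test-runner"
  config: {}

sink:
  type: metadata-rest
  config: {}

workflowConfig:
  openMetadataServerConfig:
    hostPort: "http://openmetadata.data-catalog.svc:8585/api"
    authProvider: openmetadata
    securityConfig:
      jwtToken: "{{ env('OM_JWT_TOKEN') }}"
```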

Business Glossary

Link business terms to physical columns:

Glossary → Platform Glossary → + Term
→ Name: "Revenue"
→ Description: "Sum of fulfilled order amounts in USD, excluding refunds"
→ Tag: Finance
→ Related Terms: GMV, Net Revenue

Then: analytics.mart_orders → total_revenue column → Add Glossary Term: "Revenue"
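The same link can be made via the REST API, which applies JSON Patch documents to table entities. A sketch that only builds the payload; the column index, the table-id placeholder, and the term FQN `PlatformGlossary.Revenue` are hypothetical values for illustration:

```python
# Sketch: JSON Patch payload to attach a glossary term to a table column.
# glossary_patch is a hypothetical helper, not part of any OpenMetadata SDK.
import json


def glossary_patch(column_index: int, term_fqn: str) -> list[dict]:
    """Build a JSON Patch that appends a glossary tag to one table column."""
    return [{
        "op": "add",
        "path": f"/columns/{column_index}/tags/-",
        "value": {"tagFQN": term_fqn, "source": "Glossary"},
    }]


patch = glossary_patch(3, "PlatformGlossary.Revenue")
print(json.dumps(patch))
# Send with: PATCH /api/v1/tables/{table_id}
#            Content-Type: application/json-patch+json
```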

PII Auto-Classification

OpenMetadata auto-tags columns matching PII patterns:

| Column | Auto-applied tag |
| --- | --- |
| user_id | PII.NonSensitive (identifier) |
| email | PII.Sensitive |
| ip_address | PII.Sensitive |
| amount | Financial |

These tags feed into OPA policies that restrict direct SELECT access to PII columns.
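The decision those policies encode can be sketched in a few lines: deny direct SELECT on columns tagged `PII.Sensitive` unless the requester holds an exempting role. The tag names follow OpenMetadata's PII classification; the `pii-reader` role is an assumption, and a real deployment would express this in Rego inside OPA rather than Python:

```python
# Illustrative policy logic only, not the actual OPA policy.
def allow_select(column_tags: list[str], user_roles: set[str]) -> bool:
    """Allow direct SELECT unless the column is sensitive PII and the
    user lacks the (assumed) pii-reader role."""
    if "PII.Sensitive" in column_tags:
        return "pii-reader" in user_roles
    return True


print(allow_select(["PII.Sensitive"], {"analyst"}))                # False
print(allow_select(["PII.NonSensitive"], {"analyst"}))             # True
print(allow_select(["PII.Sensitive"], {"analyst", "pii-reader"}))  # True
```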


Done When

✔ OpenMetadata running at catalog.yourdomain.com
✔ Keycloak SSO login working
✔ ClickHouse, Kafka, dbt, Airflow all ingested
✔ Full lineage visible: Kafka → ClickHouse → dbt → Superset
✔ Data quality tests passing for mart_orders
✔ Business glossary terms linked to mart columns
✔ PII columns auto-tagged across all datasets