ve, offering a GraphQL API and CloudEvents support.
Implementation: Use a TypeScript-based ingestion orchestrator to manage metadata flows. This avoids brittle shell scripts and provides type safety.
// orchestrator.ts
import { MetadataClient } from '@openmetadata/client';
import { SnowflakeConnector } from './connectors/snowflake';
import { DbtConnector } from './connectors/dbt';

interface IngestionConfig {
  catalogUrl: string;
  authSecret: string;
  sources: string[];
}

export class MetadataOrchestrator {
  private client: MetadataClient;

  constructor(config: IngestionConfig) {
    this.client = new MetadataClient({
      baseUrl: config.catalogUrl,
      auth: { token: config.authSecret }
    });
  }

  async executeIngestion(sourceName: string): Promise<void> {
    const connector = this.resolveConnector(sourceName);
    const metadata = await connector.extract();
    // Batch upsert to reduce API load
    await this.client.bulkUpsertEntities(metadata);
    console.log(`Ingested ${metadata.length} entities from ${sourceName}`);
  }

  private resolveConnector(source: string) {
    if (source.includes('snowflake')) return new SnowflakeConnector();
    if (source.includes('dbt')) return new DbtConnector();
    throw new Error(`Unsupported source: ${source}`);
  }
}
Rationale: A typed orchestrator allows you to swap connectors and handle errors gracefully. It centralizes authentication and retry logic, which is critical when ingesting from dozens of sources.
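The retry logic mentioned above could be centralized in a small helper. The following is a minimal sketch, assuming exponential backoff with a fixed base delay; `RetryPolicy`, `backoffDelays`, and `withRetry` are hypothetical names for illustration and are not part of any catalog client.

```typescript
// Hypothetical retry helper; names and defaults are illustrative only.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // wait before the second attempt
}

// Delay schedule for exponential backoff: base, 2*base, 4*base, ...
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let i = 0; i < policy.maxAttempts - 1; i++) {
    delays.push(policy.baseDelayMs * Math.pow(2, i));
  }
  return delays;
}

// Runs `operation`, retrying on any thrown error until attempts are exhausted.
// `sleep` is injectable so the helper can be exercised without real waiting.
async function withRetry<T>(
  operation: () => Promise<T>,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  const delays = backoffDelays(policy);
  let lastError: unknown = new Error('maxAttempts must be at least 1');
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

Inside the orchestrator, a call such as `withRetry(() => connector.extract(), { maxAttempts: 3, baseDelayMs: 500 })` would wrap each source in the same policy, so individual connectors need no retry code of their own.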
2. Integrate the Lineage Primitive
Lineage should not be coupled to the catalog's ingestion schedule. Use Marquez as the reference implementation of the OpenLineage spec. This allows any tool (Airflow, dbt, Spark, Flink) to emit lineage events directly to a central store.
Implementation: Emit lineage events from your orchestration layer using the OpenLineage TypeScript SDK.
// lineage-emitter.ts
import { OpenLineage, RunEvent, Dataset } from 'openlineage-typescript';

export class LineageEmitter {
  private transport: OpenLineage;

  constructor(endpoint: string) {
    this.transport = new OpenLineage({ endpoint });
  }

  emitTableLineage(
    runId: string,
    upstreamTable: string,
    downstreamTable: string,
    jobName: string
  ): void {
    const event: RunEvent = {
      eventType: 'COMPLETE',
      eventTime: new Date().toISOString(),
      run: { runId },
      job: { namespace: 'data-pipeline', name: jobName },
      inputs: [{ namespace: 'snowflake', name: upstreamTable }],
      outputs: [{ namespace: 'snowflake', name: downstreamTable }]
    };
    this.transport.emit(event);
  }
}
Rationale: OpenLineage is an emerging open standard for lineage metadata. By emitting events directly from your pipelines, you ensure lineage is captured in real time and survives tool migrations. Marquez stores this graph efficiently, and catalogs like OpenMetadata or DataHub can consume these events to populate their lineage views.
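If no SDK is available, the event can also be assembled as plain JSON. The sketch below builds a run event with the fields the OpenLineage spec requires on every event, including `producer` and `schemaURL`, and checks that `runId` is a UUID. The `buildRunEvent` function, the producer URI, and the pinned spec version in `schemaURL` are assumptions for illustration; pin the version you actually target.

```typescript
// Plain-object sketch of an OpenLineage run event; not an SDK type.
interface DatasetRef {
  namespace: string;
  name: string;
}

interface RunEventPayload {
  eventType: 'START' | 'COMPLETE' | 'FAIL';
  eventTime: string; // ISO 8601 timestamp
  producer: string;  // URI identifying the emitting code
  schemaURL: string; // URI of the spec schema the event conforms to
  run: { runId: string };
  job: { namespace: string; name: string };
  inputs: DatasetRef[];
  outputs: DatasetRef[];
}

const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function buildRunEvent(
  runId: string,
  jobName: string,
  upstreamTable: string,
  downstreamTable: string,
  eventTime: Date = new Date()
): RunEventPayload {
  // The spec defines runId as a UUID; reject anything else early
  if (!UUID_PATTERN.test(runId)) {
    throw new Error(`runId must be a UUID, got: ${runId}`);
  }
  return {
    eventType: 'COMPLETE',
    eventTime: eventTime.toISOString(),
    // Hypothetical producer URI; replace with your own repository or service
    producer: 'https://example.com/data-platform/lineage-emitter',
    // Assumed spec version; adjust to the version you target
    schemaURL: 'https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent',
    run: { runId },
    job: { namespace: 'data-pipeline', name: jobName },
    inputs: [{ namespace: 'snowflake', name: upstreamTable }],
    outputs: [{ namespace: 'snowflake', name: downstreamTable }]
  };
}
```

The resulting object can be serialized with `JSON.stringify` and sent as an HTTP POST to Marquez's `/api/v1/lineage` endpoint. Emitting a `START` event with the same `runId` before the `COMPLETE` event lets the backend record run duration.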
3. Enforce Storage Governance with Unity Catalog
If your architecture relies on Apache Iceberg or spans multiple clouds, Unity Catalog provides a unified governance plane. Open-sourced by Databricks in June 2024, it offers a portable REST API and native Iceberg support.
Implementation: Manage grants programmatically to ensure access policies are version-controlled.
// governance-manager.ts
import { UnityCatalogClient } from '@unity-catalog/client';

export class GovernanceManager {
  private ucClient: UnityCatalogClient;

  constructor(host: string, token: string) {
    this.ucClient = new UnityCatalogClient({ host, token });
  }

  async applyTableAccess(
    catalog: string,
    schema: string,
    table: string,
    principal: string,
    privilege: 'SELECT' | 'MODIFY'
  ): Promise<void> {
    // Identifiers are interpolated directly; validate them before calling this method
    const sql = `GRANT ${privilege} ON TABLE ${catalog}.${schema}.${table} TO ${principal}`;
    await this.ucClient.executeSql(sql);
    // Sync to catalog for visibility
    await this.syncAccessMetadata(catalog, schema, table, principal, privilege);
  }

  // Parameters mirror the call in applyTableAccess
  private async syncAccessMetadata(
    catalog: string,
    schema: string,
    table: string,
    principal: string,
    privilege: 'SELECT' | 'MODIFY'
  ): Promise<void> {
    // Push access metadata to OpenMetadata/DataHub
  }
}
Rationale: Unity Catalog decouples governance from the compute engine. You can govern tables across Snowflake, Databricks, and BigQuery using a single API. This is essential for multi-cloud strategies and ensures that access controls are consistent regardless of where data is processed.
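Because the GRANT statement above is assembled by string interpolation, a malformed catalog, schema, or table name could change the statement. A minimal sketch of a safer builder follows, assuming Databricks-style backtick quoting and a conservative identifier pattern; `quoteIdentifier` and `buildGrantStatement` are hypothetical helpers, not part of any Unity Catalog client.

```typescript
type Privilege = 'SELECT' | 'MODIFY';

// Conservative pattern (an assumption): letters, digits, underscores only
const IDENTIFIER_PATTERN = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Validates a single name part and wraps it in backticks
function quoteIdentifier(part: string): string {
  if (!IDENTIFIER_PATTERN.test(part)) {
    throw new Error(`Invalid identifier: ${part}`);
  }
  return '`' + part + '`';
}

function buildGrantStatement(
  catalog: string,
  schema: string,
  table: string,
  principal: string,
  privilege: Privilege
): string {
  const target = [catalog, schema, table].map(quoteIdentifier).join('.');
  // Principals may be emails or group names, so escape embedded backticks
  // by doubling them instead of restricting the character set
  const quotedPrincipal = '`' + principal.replace(/`/g, '``') + '`';
  return `GRANT ${privilege} ON TABLE ${target} TO ${quotedPrincipal}`;
}
```

Keeping statement construction in a pure function also makes it straightforward to review the generated grants in version control before they are executed.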
4. Enable Agent-Native Access via MCP Federation
Traditional catalogs assume a human user navigating a UI. AI agents require programmatic access to metadata. Implement a federation layer using the Model Context Protocol (MCP) to expose metadata tools to agents.
Implementation: Define an MCP tool that resolves entity queries across multiple catalogs.
// mcp-tools.ts
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

export class CatalogFederationTool {
  private server: McpServer;

  constructor() {
    this.server = new McpServer({ name: 'metadata-federation', version: '1.0.0' });
    this.registerTools();
  }

  private registerTools(): void {
    this.server.tool(
      'resolve_data_entity',
      'Finds data entities across catalogs based on semantic queries.',
      // The SDK expects a Zod schema shape for arguments, not type-name strings
      { query: z.string(), catalog_filter: z.string().optional() },
      async (args) => {
        const results = await this.federatedSearch(args.query, args.catalog_filter);
        return {
          content: [{ type: 'text', text: JSON.stringify(results) }]
        };
      }
    );
  }

  private async federatedSearch(query: string, filter?: string): Promise<unknown[]> {
    // Parallel search across OpenMetadata, DataHub, Unity Catalog
    // Apply Reciprocal Rank Fusion (RRF) to rank results
    // Return unified entity schema
    return []; // Placeholder until the per-catalog searches are implemented
  }
}
Rationale: An MCP tool allows agents in environments like Claude Code, Cursor, or custom workflows to query metadata without parsing HTML or using brittle APIs. Federation ensures the agent can search across all catalogs, providing a unified view of the data estate.
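The Reciprocal Rank Fusion step named in the `federatedSearch` comments can be sketched as a pure function. RRF scores each entity as the sum of `1 / (k + rank)` over every result list it appears in, with `k = 60` as the commonly used constant. The sketch assumes each catalog returns a ranked list of entity IDs that are already normalized across catalogs and unique within a list; `reciprocalRankFusion` is a hypothetical name.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Fuses several ranked ID lists (best match first) into one ranking.
function reciprocalRankFusion(rankedLists: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: FusedResult[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest score first; ties are broken by ID for a deterministic order
  return fused.sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Example: one ranked list per catalog (illustrative IDs)
const fusedRanking = reciprocalRankFusion([
  ['orders', 'customers', 'events'],
  ['customers', 'payments', 'orders']
]);
console.log(fusedRanking.map((r) => r.id).join(', '));
// prints "customers, orders, payments, events"
```

RRF only uses rank positions, so it needs no score calibration between catalogs whose relevance scores are not comparable.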
Pitfall Guide
1. The "Big Bang" Metadata Import
Explanation: Attempting to ingest all metadata from all sources in a single run overwhelms the catalog's search index and database, leading to timeouts and inconsistent states.
Fix: Implement phased ingestion. Start with critical assets (e.g., gold tables, key dashboards). Use incremental syncs and batch upserts to manage load.
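As an illustration of phased ingestion, the sketch below orders assets by a priority tier and splits them into fixed-size batches. The `tier` field and its values are assumptions for the example; the fix above only says to start with critical assets.

```typescript
// Hypothetical asset shape; the tier labels are illustrative
interface Asset {
  name: string;
  tier: 'gold' | 'silver' | 'bronze';
}

// Splits a list into consecutive batches of at most `size` items
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Orders assets so critical tiers are ingested first, then batches them
function planIngestion(assets: Asset[], batchSize: number): Asset[][] {
  const priority = { gold: 0, silver: 1, bronze: 2 };
  const ordered = assets
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => priority[a.tier] - priority[b.tier]);
  return chunk(ordered, batchSize);
}
```

Each batch can then be passed to a bulk upsert call in sequence, which keeps the load on the catalog's database and search index bounded.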
2. Lineage Graph Explosion
Explanation: Storing every column-level lineage event without aggregation creates a graph too large to query efficiently. This degrades performance in both Marquez and the catalog.
Fix: Apply lineage pruning strategies. Aggregate lineage at the table level for high-volume pipelines and retain column-level detail only for curated datasets. Use OpenLineage's columnLineage feature judiciously.
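One possible pruning step is to collapse column-level edges into table-level edges before storing them. The sketch below assumes a simple edge shape with table and column names; `ColumnEdge`, `TableEdge`, and `aggregateToTableLevel` are hypothetical and do not correspond to Marquez or OpenLineage types.

```typescript
// Hypothetical edge shapes for illustration
interface ColumnEdge {
  fromTable: string;
  fromColumn: string;
  toTable: string;
  toColumn: string;
}

interface TableEdge {
  fromTable: string;
  toTable: string;
  columnPairs: number; // how many column-level edges were collapsed
}

// Collapses column-level edges into one edge per (source, target) table pair
function aggregateToTableLevel(edges: ColumnEdge[]): TableEdge[] {
  const byTablePair = new Map<string, TableEdge>();
  for (const edge of edges) {
    // JSON encoding avoids key collisions from separators inside names
    const key = JSON.stringify([edge.fromTable, edge.toTable]);
    const existing = byTablePair.get(key);
    if (existing) {
      existing.columnPairs += 1;
    } else {
      byTablePair.set(key, {
        fromTable: edge.fromTable,
        toTable: edge.toTable,
        columnPairs: 1
      });
    }
  }
  const result: TableEdge[] = [];
  byTablePair.forEach((tableEdge) => result.push(tableEdge));
  return result;
}
```

Keeping the `columnPairs` count preserves a hint of how much detail was dropped, which can help decide which pipelines deserve full column-level lineage.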
3. RBAC Desynchronization
Explanation: Permissions in the catalog drift from the actual warehouse permissions. Users see access in the catalog that doesn't exist in Snowflake, or vice versa.
Fix: Automate RBAC sync. Use Unity Catalog or custom jobs to periodically reconcile catalog ownership and access metadata with the source of truth in the data warehouse.
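The reconciliation job reduces to a set difference between the grants the warehouse reports and the grants the catalog displays. A minimal sketch, assuming both sides can be exported as flat lists; the `Grant` shape and `diffGrants` are hypothetical names for illustration.

```typescript
// Hypothetical flat grant record
interface Grant {
  principal: string;
  table: string;
  privilege: string;
}

interface GrantDiff {
  missingInCatalog: Grant[]; // exists in the warehouse, not shown in the catalog
  staleInCatalog: Grant[];   // shown in the catalog, no longer in the warehouse
}

function grantKey(grant: Grant): string {
  return JSON.stringify([grant.principal, grant.table, grant.privilege]);
}

// Treats the warehouse as the source of truth and reports catalog drift
function diffGrants(warehouse: Grant[], catalog: Grant[]): GrantDiff {
  const warehouseKeys = new Set(warehouse.map(grantKey));
  const catalogKeys = new Set(catalog.map(grantKey));
  return {
    missingInCatalog: warehouse.filter((g) => !catalogKeys.has(grantKey(g))),
    staleInCatalog: catalog.filter((g) => !warehouseKeys.has(grantKey(g)))
  };
}
```

A scheduled job could apply `missingInCatalog` as additions and `staleInCatalog` as removals in the catalog, and alert when either list is unexpectedly large.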
4. Ignoring Search Index Tuning
Explanation: Open-source catalogs rely on Elasticsearch or OpenSearch. Default configurations often result in slow search queries or poor ranking, frustrating users.
Fix: Tune the search index. Configure analyzers for your domain-specific terminology. Implement custom ranking algorithms that boost frequently accessed or certified assets.
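One way to boost certified or frequently accessed assets is an Elasticsearch or OpenSearch `function_score` query. The sketch below builds such a query body as a plain object. The field names `certified`, `usageCount`, `name`, `displayName`, and `description` are assumptions about the index mapping, and the weights are starting points to tune rather than recommended values.

```typescript
// Builds a function_score query body; field names are hypothetical
function buildBoostedSearchQuery(userQuery: string) {
  return {
    query: {
      function_score: {
        // Base relevance: text match with name fields weighted higher
        query: {
          multi_match: {
            query: userQuery,
            fields: ['name^3', 'displayName^2', 'description']
          }
        },
        functions: [
          // Baseline of 1 so unboosted assets keep their text score
          { weight: 1 },
          // Fixed bonus for certified assets
          { filter: { term: { certified: true } }, weight: 2 },
          // Dampened popularity signal; missing counts are treated as 0
          { field_value_factor: { field: 'usageCount', modifier: 'log1p', missing: 0 } }
        ],
        score_mode: 'sum',     // add the function values together
        boost_mode: 'multiply' // then scale the text relevance score
      }
    }
  };
}
```

With `score_mode: 'sum'` and `boost_mode: 'multiply'`, an uncertified asset with no recorded usage keeps its plain text score, while certification and usage scale it upward.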
5. Agent Hallucination on Stale Metadata
Explanation: AI agents consuming metadata may act on outdated information if the catalog is not refreshed frequently, leading to incorrect queries or access attempts.
Fix: Implement freshness checks. Expose metadata freshness timestamps via the MCP tool. Agents should validate freshness before acting and trigger re-ingestion if data is stale.
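A freshness check of this kind can be a small pure function that an MCP tool calls before returning an entity. The sketch assumes each entity carries an ISO 8601 `lastIngestedAt` timestamp; that field name, `CatalogEntity`, and `assessFreshness` are hypothetical.

```typescript
// Hypothetical entity shape; lastIngestedAt is an assumed field name
interface CatalogEntity {
  name: string;
  lastIngestedAt: string; // ISO 8601 timestamp of the last successful sync
}

interface FreshnessReport {
  name: string;
  ageMs: number;
  stale: boolean;
}

// Flags an entity as stale when its metadata is older than maxAgeMs.
// `now` is injectable so the check is deterministic in tests.
function assessFreshness(
  entity: CatalogEntity,
  maxAgeMs: number,
  now: Date = new Date()
): FreshnessReport {
  const ingestedAt = Date.parse(entity.lastIngestedAt);
  if (isNaN(ingestedAt)) {
    // An unparseable timestamp is treated as stale rather than trusted
    return { name: entity.name, ageMs: Infinity, stale: true };
  }
  const ageMs = now.getTime() - ingestedAt;
  return { name: entity.name, ageMs, stale: ageMs > maxAgeMs };
}
```

Returning the report alongside the entity lets the agent decide whether to proceed, warn the user, or request re-ingestion.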
6. Treating Marquez as a Full Catalog
Explanation: Marquez is a lineage store, not a discovery tool. It lacks glossaries, business metadata, and user-facing search.
Fix: Always pair Marquez with a governance catalog. Use Marquez for lineage storage and OpenMetadata/DataHub for discovery and governance.
7. Underestimating Operational Complexity
Explanation: While open-source tools reduce license costs, they increase operational overhead. Managing Postgres, Elasticsearch, Kafka, and multiple services requires DevOps expertise.
Fix: Use managed infrastructure where possible. Leverage Helm charts and Kubernetes operators for deployment. Invest in monitoring and alerting for the metadata stack.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Iceberg/Multi-Cloud Governance | Unity Catalog | Native Iceberg support; unified grants across clouds. | Low infra cost; high governance value. |
| Engineering-Led / Streaming Data | DataHub | Strong GraphQL API; Kafka integration for streaming lineage. | Medium infra cost; high dev flexibility. |
| Analyst Discovery / 90+ Connectors | OpenMetadata | Mature ecosystem; broad connector support; glossary features. | Medium infra cost; high user adoption. |
| AI Agent Integration | MCP Federation | Cross-catalog search; agent-native tools; RRF ranking. | Low infra cost; high AI readiness. |
| Lineage-Critical Pipelines | Marquez + OpenLineage | Standardized lineage; real-time events; tool-agnostic. | Low infra cost; high lineage accuracy. |
Configuration Template
Use this template to define a federated metadata stack configuration. This structure supports multi-catalog ingestion and agent access.
# metadata-stack-config.yaml
stack:
  governance:
    type: openmetadata
    version: "2.0"
    endpoints:
      api: "https://om.internal/api/v1"
      search: "https://es.internal"
    auth:
      type: jwt
      secret_ref: "vault://metadata/om-token"
  lineage:
    type: marquez
    version: "0.30"
    endpoints:
      api: "https://marquez.internal/api/v1"
      # Marquez accepts OpenLineage events at /api/v1/lineage
      openlineage: "https://marquez.internal/api/v1/lineage"
    transport:
      type: kafka
      brokers: ["kafka-1:9092", "kafka-2:9092"]
  storage_governance:
    type: unity_catalog
    version: "0.2"
    endpoints:
      api: "https://uc.internal/api/2.1"
    auth:
      type: bearer
      secret_ref: "vault://metadata/uc-token"
  agent_access:
    type: mcp_federation
    version: "1.0"
    tools:
      - name: resolve_entity
        description: "Search across catalogs"
        ranking: rrf
        eval_suite: "golden_queries_v2"
    endpoints:
      stdio: true
      http: "https://mcp.internal/tools"
ingestion:
  schedule: "0 */6 * * *"
  batch_size: 500
  retry_policy:
    max_attempts: 3
    backoff: exponential
Quick Start Guide
- Deploy Core Services: Use Helm to deploy OpenMetadata and Marquez to your Kubernetes cluster. Apply the configuration template.
- Install Lineage Agents: Add the OpenLineage agent to your Airflow or dbt environment. Point the agent to the Marquez endpoint.
- Ingest Initial Metadata: Run the ingestion orchestrator for your primary data warehouse. Verify entities appear in the catalog.
- Validate Lineage: Trigger a pipeline and confirm lineage events appear in Marquez and propagate to the catalog.
- Test Agent Access: Connect an MCP client to the federation layer. Query for a known entity and verify the response includes metadata from all configured catalogs.