
Data Catalog

Last updated: 2026-04-15

A data catalog is a searchable inventory of an organization's data assets – datasets, tables, columns, metrics, dashboards, pipelines, and the relationships between them. It answers the questions every analyst asks before starting work: "Does this data exist, where is it, and can I trust it?"

Without a catalog, data discovery depends on tribal knowledge. Someone knows which table holds customer records. Someone else knows that the "revenue" column in the finance schema is more reliable than the one in the sales schema. A new hire spends their first month asking around. This pattern breaks at scale. When an organization has hundreds of tables across multiple warehouses, with dozens of dashboards built on top, no single person holds the full map.

What a data catalog indexes

A catalog's value depends on the breadth of what it tracks. Useful catalogs go beyond tables:

  • Datasets and tables – the raw building blocks. Schema, column types, row counts, sample values.
  • Metrics and KPIs – governed business definitions, ideally pulled from the semantic layer.
  • Dashboards and reports – which visualizations exist, who built them, and what data they consume.
  • Pipelines and transformations – dbt models, ETL jobs, and scheduled queries that produce the data.
  • APIs and external sources – upstream systems that feed data into the warehouse.
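As a rough illustration of that breadth, the sketch below models a tiny inventory and reports which asset types are not yet covered. The AssetKind and CatalogAsset names, and the example asset names, are hypothetical and not taken from any particular catalog product.

```python
from dataclasses import dataclass
from enum import Enum


class AssetKind(Enum):
    """Asset types from the list above (names are illustrative)."""
    TABLE = "table"
    METRIC = "metric"
    DASHBOARD = "dashboard"
    PIPELINE = "pipeline"
    EXTERNAL_SOURCE = "external_source"


@dataclass(frozen=True)
class CatalogAsset:
    kind: AssetKind
    name: str            # e.g. a fully qualified table name
    description: str = ""


def group_by_kind(assets):
    """Return {AssetKind: [asset names]}, with an empty list for unused kinds."""
    grouped = {kind: [] for kind in AssetKind}
    for asset in assets:
        grouped[asset.kind].append(asset.name)
    return grouped


# Hypothetical inventory: four asset types are indexed, one is not.
inventory = [
    CatalogAsset(AssetKind.TABLE, "finance.revenue"),
    CatalogAsset(AssetKind.METRIC, "customer_retention"),
    CatalogAsset(AssetKind.DASHBOARD, "Retention Overview"),
    CatalogAsset(AssetKind.PIPELINE, "dbt.retention_daily"),
]

coverage = group_by_kind(inventory)
missing_kinds = [kind.value for kind, names in coverage.items() if not names]
print(missing_kinds)  # ['external_source']
```

A catalog that only ever populates the TABLE kind is closer to a schema browser than to the inventory described here.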

Three types of metadata

The metadata a catalog stores falls into three categories, each serving a different audience.

Technical metadata. Schema definitions, column data types, partitioning strategies, storage format, database engine. This serves data engineers who need to understand the physical structure.

Operational metadata. Freshness timestamps, query frequency, pipeline run status, row count trends, error rates. This serves anyone who needs to know whether the data is current and reliable. A table that is expected to refresh daily but hasn't been updated in 14 days is technically valid but operationally stale.

Business metadata. Human-written descriptions, ownership assignments, domain tags, sensitivity classifications, and certification status. This serves business users who need to understand what the data means rather than how it's stored. A column named mrr_adj_net is meaningless without a description explaining it's "Monthly recurring revenue, adjusted for discounts and refunds, net of churn."

Most catalogs start with technical metadata because it's easy to extract automatically. Business metadata requires human effort – someone has to write the descriptions and assign the owners. The catalogs that actually get used are the ones where that effort has been invested.
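One way to picture the three categories is as separate parts of a single catalog entry. The sketch below is a minimal model under assumed names (TableEntry, expected_refresh, documentation_gaps are all hypothetical). It shows a staleness check driven by operational metadata and a check for the business metadata that only people can supply.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class TechnicalMetadata:
    columns: dict                 # column name -> data type
    storage_format: str = "unknown"


@dataclass
class OperationalMetadata:
    last_updated: datetime
    expected_refresh: timedelta   # assumed field: how often the table should update


@dataclass
class BusinessMetadata:
    description: str = ""
    owner: Optional[str] = None
    tags: list = field(default_factory=list)


@dataclass
class TableEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)


def is_stale(entry, now):
    """True when the table is older than its expected refresh interval."""
    age = now - entry.operational.last_updated
    return age > entry.operational.expected_refresh


def documentation_gaps(entry):
    """List the human-written fields that are still missing."""
    gaps = []
    if not entry.business.description.strip():
        gaps.append("description")
    if entry.business.owner is None:
        gaps.append("owner")
    return gaps


# Hypothetical entry: schema extracted automatically, nothing documented yet.
entry = TableEntry(
    name="finance.revenue",
    technical=TechnicalMetadata(columns={"month": "date", "mrr_adj_net": "numeric"}),
    operational=OperationalMetadata(
        last_updated=datetime(2026, 1, 1, tzinfo=timezone.utc),
        expected_refresh=timedelta(days=1),
    ),
)
now = datetime(2026, 1, 15, tzinfo=timezone.utc)  # 14 days after the last update

print(is_stale(entry, now))        # True
print(documentation_gaps(entry))   # ['description', 'owner']
```

The technical and operational parts can be filled by a crawler. The gaps reported by documentation_gaps are the part that needs an owner to sit down and write.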

Why discovery matters for self-service

Self-service analytics fails without discovery. An organization can invest heavily in BI tools, semantic layers, and training programs, but if users can't find the right dataset for their question, they either ask the data team (defeating self-service) or use the wrong dataset (producing wrong answers).

A catalog closes that gap. A product manager searching for "customer retention" finds the governed retention metric, the table it's built on, the dashboard that displays it, and the pipeline that refreshes it daily. They can assess freshness, check ownership, and start their analysis with confidence that they're working from the right source.

Key capabilities

Search. Full-text search across table names, column names, descriptions, and tags. This is the baseline feature: without it, the catalog is just a schema browser.
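A minimal sketch of that kind of search is shown below, assuming catalog entries are plain dictionaries. The field weights and sample entries are illustrative assumptions; a production catalog would typically rely on a dedicated search engine rather than a linear scan.

```python
import re

# Assumed weights: a match in the name counts more than one in the description.
FIELD_WEIGHTS = {"name": 3, "columns": 2, "tags": 2, "description": 1}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(catalog, query):
    """Return entry names ranked by weighted count of matching query terms."""
    terms = tokenize(query)
    results = []
    for entry in catalog:
        score = 0
        for field_name, weight in FIELD_WEIGHTS.items():
            value = entry.get(field_name, "")
            text = " ".join(value) if isinstance(value, list) else value
            score += weight * len(terms & tokenize(text))
        if score > 0:
            results.append((score, entry["name"]))
    # Highest score first; ties broken alphabetically for stable output.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in results]


# Hypothetical catalog entries.
catalog = [
    {
        "name": "analytics.customer_retention",
        "columns": ["customer_id", "cohort_month", "retained"],
        "tags": ["certified"],
        "description": "Monthly customer retention by signup cohort.",
    },
    {
        "name": "sales.orders",
        "columns": ["order_id", "customer_id", "amount"],
        "tags": [],
        "description": "One row per order.",
    },
    {
        "name": "finance.revenue",
        "columns": ["month", "mrr_adj_net"],
        "tags": ["financial"],
        "description": "Monthly recurring revenue.",
    },
]

print(search(catalog, "customer retention"))
# ['analytics.customer_retention', 'sales.orders']
```

Note that sales.orders only matches because of its customer_id column, which is the kind of result a description-free schema browser could not rank sensibly.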

Lineage. Visual mapping of how data flows from source to warehouse to transformation to dashboard. Lineage answers "where did this number come from?" and "what breaks if I change this table?"
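Both lineage questions reduce to graph traversal. The sketch below assumes lineage is stored as a list of (upstream, downstream) edges; the asset names are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream asset, downstream asset).
EDGES = [
    ("crm.accounts", "warehouse.raw_accounts"),
    ("warehouse.raw_accounts", "dbt.customer_retention"),
    ("warehouse.raw_orders", "dbt.customer_retention"),
    ("dbt.customer_retention", "dashboard.retention_overview"),
]


def build_graph(edges, reverse=False):
    """Adjacency map; reverse=True points from each asset to its inputs."""
    graph = defaultdict(set)
    for upstream, downstream in edges:
        if reverse:
            graph[downstream].add(upstream)
        else:
            graph[upstream].add(downstream)
    return graph


def reachable(graph, start):
    """Breadth-first traversal returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def upstream_of(asset):
    """Where did this number come from?"""
    return reachable(build_graph(EDGES, reverse=True), asset)


def downstream_of(asset):
    """What breaks if I change this table?"""
    return reachable(build_graph(EDGES), asset)


print(sorted(downstream_of("warehouse.raw_accounts")))
# ['dashboard.retention_overview', 'dbt.customer_retention']
```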

Usage tracking. Which datasets get queried most, by whom, and how recently. Usage signals help teams prioritize documentation and deprecate unused assets.
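To make the idea concrete, the sketch below aggregates a hypothetical query log into per-table counts, users, and last-access dates, then flags tables with no recent queries. The log format and the 90-day idle threshold are assumptions chosen for the example.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical query log: (table, user, day the query ran).
QUERY_LOG = [
    ("analytics.customer_retention", "ana", date(2026, 4, 10)),
    ("analytics.customer_retention", "ben", date(2026, 4, 12)),
    ("sales.orders", "ana", date(2026, 4, 1)),
    ("legacy.orders_v1", "cho", date(2025, 11, 3)),
]

ALL_TABLES = [
    "analytics.customer_retention",
    "sales.orders",
    "legacy.orders_v1",
    "tmp.scratch_export",  # never queried
]


def usage_summary(log):
    """Return (query counts, users per table, most recent query date per table)."""
    counts = Counter()
    users = {}
    last_seen = {}
    for table, user, day in log:
        counts[table] += 1
        users.setdefault(table, set()).add(user)
        last_seen[table] = max(day, last_seen.get(table, day))
    return counts, users, last_seen


def deprecation_candidates(all_tables, last_seen, today, max_idle_days=90):
    """Tables never queried, or not queried within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        table
        for table in all_tables
        if table not in last_seen or last_seen[table] < cutoff
    )


counts, users, last_seen = usage_summary(QUERY_LOG)
print(counts.most_common(1))
# [('analytics.customer_retention', 2)]
print(deprecation_candidates(ALL_TABLES, last_seen, today=date(2026, 4, 15)))
# ['legacy.orders_v1', 'tmp.scratch_export']
```

The most-queried tables are the ones worth documenting first; the idle ones are candidates for review, not automatic deletion.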

Tagging and classification. Domain labels, sensitivity tags (PII, financial, public), and certification badges that indicate trust level.
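Sensitivity tags are often seeded by simple rules and then reviewed by a person. The sketch below is one such heuristic, assuming tags can be suggested from column names alone. The patterns are illustrative and far from exhaustive, and anything unmatched is left "unclassified" rather than marked public, since public is a decision a human should make.

```python
import re

# Hypothetical name-based rules; real classification also inspects values.
SENSITIVITY_RULES = [
    ("PII", re.compile(r"email|phone|ssn|birth|address")),
    ("financial", re.compile(r"revenue|mrr|salary|amount|price")),
]


def classify_column(column_name):
    """Suggest sensitivity tags for one column based on its name."""
    lowered = column_name.lower()
    return [tag for tag, pattern in SENSITIVITY_RULES if pattern.search(lowered)]


def classify_table(columns):
    """A table inherits every tag suggested for any of its columns."""
    tags = set()
    for column in columns:
        tags.update(classify_column(column))
    return sorted(tags) if tags else ["unclassified"]


print(classify_table(["customer_id", "email", "mrr_adj_net"]))
# ['PII', 'financial']
print(classify_table(["order_id", "status"]))
# ['unclassified']
```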

A data catalog works best when connected to the business glossary – the glossary defines what terms mean, and the catalog shows where those terms map to physical data assets. Together they bridge the gap between business language and warehouse schema. That gap is where data wrangling effort gets wasted when users can't find clean, well-documented sources.

The Holistics Perspective

Holistics embeds catalog-like functionality into its semantic layer. Metric descriptions, dimension documentation, and dataset metadata are defined alongside the analytical models in AML, Holistics' Analytics Modeling Language. Users discover data through the exploration interface rather than a separate catalog tool.
