What Is Agentic Query Language? | The Next Paradigm of AI Analytics

AI analytics needs a purpose-built output language. We compare SQL, metric DSLs, and semantic SQL, then explain why we designed AQL for composable, governed, verifiable analytical reasoning.

June 26, 2026 · 14 min read · Huy Nguyen

The missing design question in AI analytics

By now, the analytics industry has largely converged on one principle: AI should be grounded in a governed semantic layer, not raw database schemas. Grounding the AI this way keeps its answers consistent with business definitions and governance policies.

But a semantic layer only answers one question:

What can the AI talk about?

It does not answer a second, equally important question:

What language should the AI use to express its analytical reasoning?

That distinction matters.

LLMs are probabilistic systems. Even when they are grounded in the right business context, they still need to generate some artifact that represents the analysis: a SQL query, a structured metric request, a semantic SQL statement, or something else.

That artifact is not just an implementation detail. It determines whether the result is governed, whether a human can inspect it, and whether the system can express complex analytical reasoning without falling back to fragile generated SQL.

This is the missing design question in AI analytics. Once the model understands the business, what language should it use to reason about data?

This question led us to design Agentic Query Language (AQL), Holistics' purpose-built query language for agentic analytics. This article explains the design tradeoffs behind that decision.

What makes a good AI query language

To be practical for AI analytics, an output language needs to satisfy three requirements.

First, it should be governed. The AI should work with business concepts and definitions that already exist in the semantic layer, rather than recreating logic from raw tables and columns every time.

Second, it should be verifiable for business intent. This is a specific claim worth unpacking. "Verifiable" does not mean "a human can read the syntax." SQL is perfectly readable to an analyst. But reading 30 lines of SQL to determine whether the revenue calculation matches finance's definition, whether the time window is correct, and whether the join logic is sound is a fundamentally different task from reading a few lines that say revenue | where(status == "refunded") | compare(previous_quarter). Verifiability is about the distance between the generated artifact and the business question that produced it. The shorter that distance, the faster a human can confirm whether the AI understood the question correctly.

Third, it should be expressive. The language should support a broad range of analytical questions, including new combinations of metrics, filters, dimensions, transformations, and intermediate calculations – without falling back to SQL for anything beyond basic lookups.

Together, these properties make AI-generated analytics more reliable. They make queries easier for models to generate correctly, easier for humans to audit, and easier for systems to execute consistently.

Language vs. schema: the core distinction

Before comparing specific approaches, it helps to understand a distinction that shapes everything else.

When an analytical system defines a metric, it can represent that metric in two fundamentally different ways.

The first is metric-as-string. The metric has a name and a SQL expression attached to it. When the metric is used in a query, the system substitutes the SQL string into the appropriate place. When one metric references another, the system expands the inner SQL into the outer SQL. Composition is string expansion.

The second is metric-as-language-construct. The metric is a first-class object in a language that understands what it represents. When one metric builds on another, the language composes them at the semantic level – not by pasting SQL text, but by constructing a larger expression tree. Composition is structural.

This may sound like an implementation detail. It is not.

When composition is string expansion, complexity compounds. Each layer of composition embeds more SQL text inside more SQL text. The generated artifact grows in ways that are hard for humans to inspect and hard for LLMs to generate correctly.

When composition is structural, complexity stays manageable. Each layer adds meaning, not syntax. The generated artifact remains close to business intent regardless of how many concepts it combines.
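To make the contrast concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not AQL's actual internals: the node types Sum, Where, and Ratio, the helper depends_on, and the single amount column are all hypothetical. The string version is opaque text, while the tree version can be inspected and answers questions such as "is this metric built on revenue?"

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical node types for illustration only; not AQL's real implementation.

@dataclass(frozen=True)
class Sum:
    expr: str  # a column expression, e.g. "amount"

@dataclass(frozen=True)
class Where:
    metric: "Metric"   # the metric being filtered
    condition: str

@dataclass(frozen=True)
class Ratio:
    numerator: "Metric"
    denominator: "Metric"

Metric = Union[Sum, Where, Ratio]

def depends_on(metric: Metric, target: Metric) -> bool:
    """Walk the expression tree to check whether `metric` is built from `target`."""
    if metric == target:
        return True
    if isinstance(metric, Where):
        return depends_on(metric.metric, target)
    if isinstance(metric, Ratio):
        return (depends_on(metric.numerator, target)
                or depends_on(metric.denominator, target))
    return False

# Metric-as-string: the system sees only text, with no notion of "revenue" inside it.
refund_rate_sql = (
    "SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) "
    "/ NULLIF(SUM(amount), 0)"
)

# Metric-as-language-construct: each layer references the layer below it.
revenue = Sum("amount")
refunded_revenue = Where(revenue, "status = 'refunded'")
refund_rate = Ratio(refunded_revenue, revenue)

print(depends_on(refund_rate, revenue))  # True
```

In the tree version, refunded_revenue holds a reference to revenue rather than a copy of its text, so adding a layer adds one node instead of another nested SQL fragment.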

This distinction – language vs. schema – is the lens through which the rest of this article should be read.

How AI generates analytical queries today

Today, AI analytics systems typically generate one of three outputs: raw SQL, SQL with semantic references, or a metric query DSL.

Each represents a different design tradeoff.

Output language	Governed	Verifiable	Expressive
SQL	No	No	Yes
SQL + semantic references	Yes	Partially	Yes
Metric query DSL	Yes	Yes	Limited

To show where these tradeoffs matter, we will use two examples. The first is simple:

"What's our refund rate over the past few months?"

The second is the kind of question that real business users ask regularly:

"Show me each product category's quarter-over-quarter revenue growth, but only for categories where the refund rate exceeds 5%."

The simple question works fine in every approach. The second question is where designs diverge.

SQL: the AI writes everything

For the simple question, the LLM generates:

SELECT
  DATE_TRUNC('month', o.created_at) AS month,
  SUM(CASE WHEN o.status = 'refunded'
      THEN oi.quantity * p.price ELSE 0 END)
    / NULLIF(SUM(oi.quantity * p.price), 0) AS refund_rate
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
GROUP BY 1
ORDER BY 1;

This gives the AI maximum freedom. SQL can express almost any analytical question.

But the AI is responsible for every implementation detail: joins, aggregation logic, date truncation, CASE expressions, divide-by-zero guards. Even for a simple question, the generated SQL exposes implementation instead of business intent. Every additional line is another opportunity for the LLM to introduce a subtle mistake.

Now consider the harder question: "Show me each product category's quarter-over-quarter revenue growth, but only for categories where the refund rate exceeds 5%." The LLM must generate:

WITH current_q AS (
  SELECT
    p.category,
    SUM(oi.quantity * p.price) AS revenue,
    SUM(CASE WHEN o.status = 'refunded'
        THEN oi.quantity * p.price ELSE 0 END)
      / NULLIF(SUM(oi.quantity * p.price), 0)
        AS refund_rate
  FROM orders o
  JOIN order_items oi ON o.id = oi.order_id
  JOIN products p ON oi.product_id = p.id
  WHERE o.created_at
    >= DATE_TRUNC('quarter', CURRENT_DATE)
  GROUP BY p.category
),
previous_q AS (
  SELECT
    p.category,
    SUM(oi.quantity * p.price) AS revenue
  FROM orders o
  JOIN order_items oi ON o.id = oi.order_id
  JOIN products p ON oi.product_id = p.id
  WHERE o.created_at
    >= DATE_TRUNC('quarter', CURRENT_DATE)
       - INTERVAL '3 months'
    AND o.created_at
    < DATE_TRUNC('quarter', CURRENT_DATE)
  GROUP BY p.category
)
SELECT
  c.category,
  c.revenue AS current_revenue,
  p.revenue AS previous_revenue,
  (c.revenue - p.revenue)
    / NULLIF(p.revenue, 0) AS growth_rate,
  c.refund_rate
FROM current_q c
LEFT JOIN previous_q p
  ON c.category = p.category
WHERE c.refund_rate > 0.05
ORDER BY growth_rate DESC;

Over 30 lines of SQL. The LLM must correctly construct two CTEs, duplicate the revenue logic with different time windows, join the two CTEs on category, handle divide-by-zero in two places, and apply the refund rate filter to the column from the right CTE.

A business user reviewing this output has no fast way to verify that this SQL actually implements "quarter-over-quarter revenue growth, filtered by refund rate above 5%."

The problem isn't that SQL is wrong. It's that SQL exposes implementation instead of business intent. As questions get more complex, the gap between what the user asked and what the AI generated widens.

Metric query DSL: the AI fills out a form

Instead of writing SQL, the LLM fills out a predefined query schema.

For the simple question:

metrics:
  - refund_rate

dimensions:
  - field: orders.created_at
    granularity: month

The semantic layer then compiles this into SQL.

This makes the generated query remarkably easy to inspect. A human can immediately see that the AI intends to calculate monthly refund rate, without reading any SQL.

The tradeoff is that the language is defined by a fixed schema. The LLM can only reference metrics and operations that the schema supports. If refund_rate hasn't already been modeled, the AI cannot simply invent it as part of the query. It must either rely on the semantic layer to provide that metric, or fall back to SQL.
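The "fill out a form, then compile" flow can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's compiler: the CATALOG dictionary, the compile_request function, and the single orders table with an amount column are assumptions made for the example.

```python
# Hypothetical governed catalog: metric name -> SQL expression and source table.
CATALOG = {
    "refund_rate": {
        "sql": ("SUM(CASE WHEN orders.status = 'refunded' THEN orders.amount ELSE 0 END)"
                " / NULLIF(SUM(orders.amount), 0)"),
        "table": "orders",
    },
}

def compile_request(request: dict) -> str:
    """Compile a form-style metric request into one SQL string."""
    # The schema is fixed: anything not already modeled is rejected.
    unknown = [m for m in request["metrics"] if m not in CATALOG]
    if unknown:
        raise KeyError(f"metrics not modeled in the semantic layer: {unknown}")

    select_parts = []
    for dim in request.get("dimensions", []):
        table, column = dim["field"].split(".", 1)
        granularity = dim["granularity"]
        select_parts.append(
            f"DATE_TRUNC('{granularity}', {table}.{column}) AS {granularity}"
        )
    n_dims = len(select_parts)

    for name in request["metrics"]:
        select_parts.append(f"{CATALOG[name]['sql']} AS {name}")

    # Simplification: take the source table from the first metric.
    from_table = CATALOG[request["metrics"][0]]["table"]
    sql = "SELECT " + ", ".join(select_parts) + f" FROM {from_table}"
    if n_dims:
        sql += " GROUP BY " + ", ".join(str(i) for i in range(1, n_dims + 1))
    return sql

request = {
    "metrics": ["refund_rate"],
    "dimensions": [{"field": "orders.created_at", "granularity": "month"}],
}
print(compile_request(request))
```

Requesting a metric that is missing from the catalog raises an error instead of producing SQL, which is the fixed-schema tradeoff in miniature.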

Some systems address this with ad-hoc metrics, where the AI defines custom metrics as part of the query. For the harder question, this might look like:

dimensions:
  - field: products.category

custom_metrics:
  - name: revenue
    sql: "SUM(oi.quantity * p.price)"
  - name: refunded_revenue
    sql: "SUM(CASE WHEN o.status = 'refunded'
              THEN oi.quantity * p.price
              ELSE 0 END)"
  - name: refund_rate
    sql: "${refunded_revenue}
          / NULLIF(${revenue}, 0)"
  - name: previous_revenue
    sql: "SUM(CASE WHEN o.created_at
              >= DATE_TRUNC('quarter', CURRENT_DATE)
                 - INTERVAL '3 months'
              AND o.created_at
              < DATE_TRUNC('quarter', CURRENT_DATE)
              THEN oi.quantity * p.price
              ELSE 0 END)"
  - name: growth_rate
    sql: "(${revenue} - ${previous_revenue})
          / NULLIF(${previous_revenue}, 0)"

metrics:
  - growth_rate
  - refund_rate

filters:
  - metric: refund_rate
    operator: ">"
    value: 0.05

The top-level structure looks clean. But look at the metric definitions. Each one is ultimately a SQL string. The period comparison (previous_revenue) leaks into raw SQL with hardcoded time-window logic. The growth rate definition pastes SQL strings together via ${...} substitution.

The semantic layer knows that growth_rate depends on revenue and previous_revenue. But it doesn't understand the computation itself. From its perspective, the definitions are opaque text.

This is the metric-as-string pattern described earlier. As analytical logic grows more sophisticated, composition increasingly becomes string expansion rather than language composition. The DSL's clean structure at the top level masks SQL complexity at the definition level.
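A small Python sketch shows what this string expansion looks like mechanically. It assumes a naive recursive substitution of ${...} references; the expand function is hypothetical, the previous_revenue definition is shortened from the example above, and real systems may resolve references differently.

```python
import re

# Definitions shaped like the custom_metrics example (previous_revenue is shortened).
custom_metrics = {
    "revenue": "SUM(oi.quantity * p.price)",
    "previous_revenue": (
        "SUM(CASE WHEN o.created_at < DATE_TRUNC('quarter', CURRENT_DATE) "
        "THEN oi.quantity * p.price ELSE 0 END)"
    ),
    "growth_rate": "(${revenue} - ${previous_revenue}) / NULLIF(${previous_revenue}, 0)",
}

REF = re.compile(r"\$\{(\w+)\}")

def expand(name: str, metrics: dict) -> str:
    """Recursively replace each ${ref} with the referenced SQL text.

    No cycle detection: this sketch assumes the references form a DAG.
    """
    return REF.sub(lambda m: expand(m.group(1), metrics), metrics[name])

expanded = expand("growth_rate", custom_metrics)
print(expanded)
print(len(custom_metrics["growth_rate"]), "->", len(expanded), "characters")
```

The expanded text contains the previous_revenue SQL twice, because each reference is replaced by a full copy. Every additional layer of metrics built on growth_rate would embed all of this again.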

SQL with semantic references: better grounding, same output

A more recent direction is Open Semantic Interchange (OSI), an open standard introduced by Snowflake and now supported by Databricks and other vendors.

Instead of forcing the LLM to rediscover business definitions from database schemas, OSI lets SQL reference governed metrics directly:

SELECT
  DATE_TRUNC('month', o.created_at) AS month,
  MEASURE(refund_rate)
FROM orders o
GROUP BY 1
ORDER BY 1;

Here, refund_rate is no longer defined inside the query. It is resolved by the semantic layer, where its business definition lives and can be shared across tools.

This is an important improvement over raw SQL. The AI no longer invents the definition of refund_rate, so business logic becomes more consistent and easier to govern.

But the harder question reveals the limitation. To get quarter-over-quarter revenue growth filtered by refund rate, the LLM still needs CTEs, self-joins, and time-window arithmetic. MEASURE(revenue) resolves the metric definition, but the structural complexity of period comparison and cross-metric filtering remains. OSI improves the vocabulary. It doesn't simplify the grammar.

There is a reasonable counterargument here: SQL's universality is a feature, not a bug. SQL runs everywhere. Every analyst reads it. Every database executes it. These are real advantages.

But universality and reliability are different properties. SQL being well-understood doesn't mean that 30 lines of AI-generated SQL are easy to verify against the original business question. The question isn't whether a human can read the SQL. The question is whether they can quickly confirm it captures the right business intent. For simple queries, the answer is yes. For multi-step analytical questions involving period comparisons, cross-metric filters, and compositional logic, the answer is much less clear.

From our perspective, the key distinction is: a semantic layer determines the vocabulary available to the AI. The output language determines how the AI expresses its reasoning. OSI solves the grounding problem. It does not solve the output language problem.

AQL: the AI composes instead of inventing

AQL starts from a different idea.

Instead of asking the AI to invent SQL, or to fill out a predefined query schema, AQL asks it to compose analytical logic from reusable business concepts.

For the simple question:

metric revenue = sum(order_items.quantity * products.price)

metric refunded_revenue =
  revenue | where(orders.status == "refunded")

metric refund_rate =
  safe_divide(refunded_revenue, revenue)

dimensions {
  orders.created_at | month()
}
measures {
  refund_rate
}

Notice that refunded_revenue never repeats the definition of revenue. It simply builds on it. refund_rate is defined in terms of refunded_revenue and revenue, without expanding into a larger SQL expression.

Now the harder question:

metric revenue = sum(order_items.quantity * products.price)

metric refunded_revenue =
  revenue | where(orders.status == "refunded")

metric refund_rate =
  safe_divide(refunded_revenue, revenue)

metric revenue_growth =
  percent_change(revenue, vs: previous_quarter)

dimensions { products.category }
measures { revenue_growth, refund_rate }
filters { refund_rate > 0.05 }

Compare this to the 30+ lines of SQL or the DSL with embedded SQL strings.

The period comparison is percent_change(revenue, vs: previous_quarter) – a language-level operation, not a SQL pattern the model must construct. The cross-metric filter (refund_rate > 0.05 applied while displaying revenue_growth) just works. No CTEs. No self-joins. No duplicate revenue logic with different time windows.
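To see what a language-level period comparison saves the model from writing, here is a hedged Python sketch of the arithmetic that has to happen somewhere. The functions quarter_start, previous_quarter_start, safe_divide, and percent_change are hypothetical helpers written for this illustration; they are not AQL's compiler and do not claim to match its exact semantics.

```python
from datetime import date
from typing import Optional

def quarter_start(d: date) -> date:
    """First day of the calendar quarter containing d."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1)

def previous_quarter_start(d: date) -> date:
    """First day of the quarter before the one containing d."""
    start = quarter_start(d)
    if start.month == 1:
        # Q1 rolls back to Q4 of the previous year.
        return date(start.year - 1, 10, 1)
    return date(start.year, start.month - 3, 1)

def safe_divide(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of raising when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator

def percent_change(current: float, previous: float) -> Optional[float]:
    """Relative change from previous to current, guarded against divide-by-zero."""
    return safe_divide(current - previous, previous)

today = date(2026, 6, 26)
print(previous_quarter_start(today), quarter_start(today))  # 2026-01-01 2026-04-01
print(percent_change(120.0, 100.0))  # 0.2
print(percent_change(120.0, 0.0))    # None
```

In the SQL version, the model has to spell out these window boundaries and the zero guard inside the query each time. With a language-level operation, that logic is written once in the compiler.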

The AI composed existing concepts into a larger analytical program. It didn't reinvent the implementation.

If you've written software before, this idea should feel familiar. Programming languages don't build larger programs by copying source code from one function into another. They compose functions, expressions, and abstractions. AQL applies the same principle to analytics.

That changes the role of the AI.

The model no longer has to reconstruct implementation details every time it answers a question. It doesn't write joins. It doesn't redefine revenue. It doesn't remember to guard against divide-by-zero. Those concerns belong to the language and its compiler.

Instead, the generated program expresses only the reasoning behind the analysis: Revenue → Refunded Revenue → Refund Rate, Revenue → Growth Rate, filter by refund rate. The AI's output stays close to business intent regardless of complexity.

This is why AQL can optimize for all three design goals at once.

Governance comes from building on existing business definitions instead of recreating them. Verifiability comes from expressing business intent rather than implementation details. And because AQL is a language rather than a fixed query schema, it remains expressive enough to represent new analytical logic through composition.

Programming languages should be designed for their programmers. SQL was designed for humans. AQL is designed for AI.

But LLMs already know SQL

The obvious objection: LLMs have been trained on billions of lines of SQL. AQL has a comparatively tiny corpus. Why would the model generate correct AQL when it has seen so much more SQL?

The answer is that volume of training data and reliability of generation are different things.

LLMs "know" SQL well in the sense that they can produce syntactically valid SQL for a wide range of questions. But syntactically valid is not the same as semantically correct. The model can write a perfectly valid SQL query that uses the wrong revenue definition, the wrong time window, or the wrong join condition – and nothing in the output signals the error.

AQL's advantage is not that the model has seen more of it. AQL's advantage is that it gives the model fewer decisions to make.

Consider the harder example again. In SQL, the model must decide: how to structure the CTEs, how to compute the time windows, how to self-join, how to handle nulls, how to apply the refund rate filter, and how to avoid duplicating revenue logic across CTEs. Each decision is an opportunity for error.

In AQL, the model's decisions are: which metrics to reference, which comparison operation to use, and which filter to apply. The joins, time windows, null handling, and compilation to SQL are handled by the language compiler – deterministically and correctly.

Fewer degrees of freedom in the output means a smaller space of possible mistakes.
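A back-of-the-envelope model makes the intuition concrete. Assume, purely for illustration, that each decision is made correctly with the same independent probability; neither the independence nor the 95% figure is a measurement. The decision counts come from the two lists above: six for SQL, three for AQL.

```python
def chance_all_correct(per_decision_accuracy: float, decisions: int) -> float:
    """Probability that every decision is right, assuming independent decisions."""
    return per_decision_accuracy ** decisions

# Illustrative assumption, not a measured value.
ACCURACY = 0.95

sql_decisions = 6  # CTE structure, time windows, join, nulls, filter placement, duplicated logic
aql_decisions = 3  # which metrics, which comparison, which filter

print(round(chance_all_correct(ACCURACY, sql_decisions), 3))  # 0.735
print(round(chance_all_correct(ACCURACY, aql_decisions), 3))  # 0.857
```

Under these toy assumptions, halving the number of decisions lifts the chance of a fully correct query from roughly 74% to roughly 86%. Real decisions are neither independent nor equally hard, so the numbers show only the direction of the effect.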

This is how code-generation models work with any well-specified DSL. The model doesn't need millions of training examples. It needs a grammar specification, representative examples, and clear semantic context. The semantic layer provides the vocabulary. Agentic Query Language provides a constrained, well-defined grammar for using it.

There is a secondary benefit: AQL programs are shorter. Fewer output tokens mean fewer opportunities for the model to drift. The SQL query for the harder question, at more than 30 lines, has roughly six times the token surface area of the equivalent AQL program. Each token is a point where the model can introduce an error. Compression alone is a meaningful reliability improvement.

We have observed this directly in Holistics. The model generates correct AQL more reliably than it generates correct SQL for equivalent questions, particularly as question complexity increases. We attribute this primarily to the reduced decision surface, not to training data volume.

In practice

AQL is the query language that powers agentic and AI analytics in Holistics. When users ask analytical questions in natural language, the AI generates AQL, which can be inspected, verified, and compiled into SQL for execution against the underlying data warehouse.

We didn't set out to invent a new language for its own sake. We built AQL after repeatedly running into the tradeoffs described in this article. SQL gave the model complete freedom, but made its reasoning difficult to inspect. Metric query DSLs made generated queries easy to verify, but constrained what the model could express. Semantic layers solved the grounding problem, but not the output language problem.

AQL is our attempt to find a better point in that design space. The shorter output programs, the reduced decision surface for the model, and the structural composition model have made AI-generated analytics meaningfully more reliable in our system.

If you'd like to learn more about the language, including its syntax, compiler, and execution model, you can find the documentation at holistics.io/aql.

Closing thoughts

Over the past few years, the AI community has invested heavily in helping models understand data better. Semantic layers, metadata, and retrieval have all made AI systems more grounded.

We think there is another design question that is just as important.

Once a model understands the business, what language should it use to express its reasoning?

That choice shapes what the model is likely to generate, what humans can verify, and what downstream systems can execute. The output language is not an implementation detail. It is part of the system's design.

As AI becomes a primary author of analytical programs, the languages those programs are written in will matter as much as the models that write them. We believe AQL represents a step toward a better design – one that optimizes for the reliability that organizations actually need from AI analytics.

Huy Nguyen

Data Engineer turned Product; writes SQL for a living.