Engineering · 10 min read · Jan 2026

Building Ground Truth: The Hardest Problem Nobody Talks About

Ground truth isn't a dataset. It's an engineering discipline. Here's how to build it so your AI systems tell the truth.

Every AI system makes a claim about reality. A medical summariser claims it accurately represents patient records. A document search engine claims it found the most relevant passage. An AI agent claims it completed the task correctly.

The question that determines whether these claims are trustworthy: what are you evaluating against, and what data is the system reasoning over?

That reference — the curated dataset you use to benchmark quality, and the verified knowledge base your LLM retrieves from — is your ground truth. And for most enterprise AI systems, it's either missing, outdated, or quietly wrong.

What Ground Truth Actually Is

Ground truth is not "a big dataset." It's not "all the PDFs in our S3 bucket." It's not the vector store you populated eighteen months ago and haven't audited since.

Ground truth is a curated, validated, continuously maintained dataset that represents what "correct" looks like for your specific system. It's the standard against which you measure everything else.

Building it is unglamorous, painstaking work. It requires domain expertise, not just engineering talent. And it's the single most important investment you can make in your AI programme.

"You aren't just moving bytes from storage to an API call. You are defining what the system considers to be reality. If your pipeline isn't treating data with that level of gravity, you aren't building intelligence — you're automating noise."

Ground Truth in the Age of Foundation Models

A few years ago, ground truth meant labelled training data — thousands of annotated examples used to train custom models. That era isn't over, but it's no longer the dominant paradigm.

Today, most enterprises use frontier LLMs (GPT-5, Claude, Gemini) via API. They're not training models from scratch. They're building systems on top of models: RAG pipelines, AI agents, document processing workflows, and intelligent search.

This shift changes what ground truth is for — but it doesn't make it less important. If anything, it makes it more important.

Ground truth now serves two critical functions:

1. Evaluation benchmarks. When you're using an LLM you didn't train, you need to know whether it's performing well on your data, in your domain. A curated set of inputs and verified correct outputs — your golden dataset — is the only way to measure this systematically. Without it, you're relying on vibe checks.

2. RAG knowledge bases. When your LLM retrieves documents to reason over, the quality of those documents determines the quality of the output. If your knowledge base contains contradictory information, outdated procedures, or poorly structured content, the model will faithfully synthesise garbage into confident-sounding answers. This is the data foundations problem.
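The first function — benchmarking against a golden dataset — can be sketched as a scoring loop over input/reference pairs. This is a minimal sketch with illustrative names (`GoldenExample`, `benchmarkAccuracy`); real pipelines would replace exact-match comparison with semantic similarity or an LLM judge.

```typescript
// Hypothetical golden-dataset entry: an input paired with a verified reference output.
interface GoldenExample {
  input: string;
  reference: string;
}

// Toy scorer: exact-match accuracy of a system against the golden set.
function benchmarkAccuracy(
  golden: GoldenExample[],
  system: (input: string) => string,
): number {
  if (golden.length === 0) return 0;
  const correct = golden.filter((ex) => system(ex.input) === ex.reference).length;
  return correct / golden.length;
}
```

Even a toy scorer like this beats vibe checks: run it on every model or prompt change and you get a number you can track over time.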

In both cases, the requirement is the same: curated, validated, domain-specific data that represents what "correct" looks like. The difference is that you're no longer feeding it to a training loop — you're using it to evaluate and ground a system that's already capable but needs to be trusted.
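On the knowledge-base side, one hedged way to keep retrieval grounded is to gate documents on review freshness before they are indexed. The sketch below assumes a `lastReviewedAt` field on each document; the type and function names are illustrative, not a prescribed schema.

```typescript
interface KbDocument {
  id: string;
  text: string;
  lastReviewedAt: Date;
}

// Only documents reviewed within the freshness window are eligible for indexing;
// everything else is routed back to the curation queue.
function partitionForIndexing(
  docs: KbDocument[],
  maxAgeDays: number,
  now: Date = new Date(),
): { index: KbDocument[]; recurate: KbDocument[] } {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return {
    index: docs.filter((d) => d.lastReviewedAt.getTime() >= cutoff),
    recurate: docs.filter((d) => d.lastReviewedAt.getTime() < cutoff),
  };
}
```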

The Three Layers of Ground Truth Engineering

We build ground truth pipelines around three concentric evaluation layers. Every piece of data must pass through all three before it enters the ground truth corpus.

Layer 1: Structural Validation

The first layer is mechanical. Does the data conform to the expected schema? Are required fields present? Are formats consistent? Is the data deduplicated?

This catches the obvious garbage — corrupted files, misencoded text, missing metadata. It's necessary but nowhere near sufficient.
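A minimal sketch of this layer, assuming records with `id` and `text` fields (names are illustrative): enforce required fields, then deduplicate.

```typescript
interface RawRecord {
  id?: string;
  text?: string;
}

// Layer 1 sketch: drop records missing required fields, then deduplicate by id.
function structuralFilter(records: RawRecord[]): { id: string; text: string }[] {
  const seen = new Set<string>();
  const out: { id: string; text: string }[] = [];
  for (const r of records) {
    if (!r.id || !r.text) continue; // required fields present?
    if (seen.has(r.id)) continue;   // already seen this id?
    seen.add(r.id);
    out.push({ id: r.id, text: r.text });
  }
  return out;
}
```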

Layer 2: Semantic Verification

The second layer asks a harder question: does the data make sense?

Semantic verification uses domain-specific rules and lightweight models to check for internal consistency. A medical record that lists a patient's age as 340 is structurally valid (it's a number in the age field) but semantically wrong. A financial report where the line items don't sum to the stated total is internally inconsistent.

This layer is where most of the real quality improvement happens. It catches the errors that structural checks miss — the ones that are well-formed but factually incorrect.
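The two examples above translate directly into domain rules. A minimal sketch (thresholds and function names are assumptions, not a fixed rule set):

```typescript
// A structurally valid number can still be semantically wrong: an age of 340
// passes the schema but fails this plausibility rule.
function ageIsPlausible(age: number): boolean {
  return Number.isInteger(age) && age >= 0 && age <= 130;
}

// Internal-consistency rule: line items must sum to the stated total,
// within a small tolerance for rounding.
function lineItemsConsistent(
  items: number[],
  statedTotal: number,
  tolerance = 0.01,
): boolean {
  const sum = items.reduce((a, b) => a + b, 0);
  return Math.abs(sum - statedTotal) <= tolerance;
}
```

In practice these rules are maintained by domain experts, not engineers guessing at plausible ranges.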

Layer 3: Adversarial Probing

The third layer is deliberately paranoid. It uses an ensemble of specialised models to probe the data for anomalies that the first two layers couldn't detect.

Has a data source been contaminated with synthetic content? Are there subtle inconsistencies across documents from different departments? Is there a category of edge case that's systematically underrepresented?

// Simplified three-layer evaluation router
// (domainRules and SEMANTIC_THRESHOLD are assumed to be defined elsewhere)
async function evaluateBatch(batch: DataBatch): Promise<EvalResult> {
  // Layer 1: Structural
  const structural = await validateSchema(batch);
  if (!structural.valid) {
    return { status: 'rejected', reason: structural.errors };
  }

  // Layer 2: Semantic
  const semantic = await verifySemantics(batch, domainRules);
  if (semantic.score < SEMANTIC_THRESHOLD) {
    return { status: 'flagged', layer: 'semantic', score: semantic.score };
  }

  // Layer 3: Adversarial
  const adversarial = await probeAnomalies(batch);
  if (adversarial.anomalies.length > 0) {
    return {
      status: 'escalated',
      anomalies: adversarial.anomalies,
      requiresExpert: true,
    };
  }

  return { status: 'accepted', confidence: semantic.score };
}

Only data that passes all three layers — structure, semantics, and adversarial probing — enters our ground truth corpus. Everything else is either corrected or discarded.

Expert-in-the-Loop, Not Just Human-in-the-Loop

"Human-in-the-loop" has become a buzzword that means very little. Putting a human in the loop only works if that human has the domain expertise to make meaningful judgements.

We draw a sharp distinction:

  • Human-in-the-loop: Any person reviewing AI output. Better than nothing.
  • Expert-in-the-loop: A domain specialist — a clinician, a financial analyst, a legal professional — reviewing AI output within their area of expertise.

The difference matters enormously. When Layer 3 flags an anomaly in a medical knowledge base, we don't route it to a general reviewer. We route it to a medical professional who can determine whether the flagged content is genuinely anomalous or a valid edge case.
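That routing decision can be made explicit in code. A minimal sketch, assuming anomalies carry a domain tag (the queue names here are illustrative):

```typescript
type Domain = 'medical' | 'financial' | 'legal';

interface Anomaly {
  domain: Domain;
  description: string;
}

// Route each escalated anomaly to the matching specialist queue,
// never to a general review pool.
function routeToExpert(anomaly: Anomaly): string {
  const queues: Record<Domain, string> = {
    medical: 'clinician-review',
    financial: 'analyst-review',
    legal: 'counsel-review',
  };
  return queues[anomaly.domain];
}
```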

This is slower and more expensive than automated-only validation. It's also the only approach that produces ground truth you can actually trust.

How Ground Truth Compounds

Here's the part that most teams miss: ground truth isn't a one-time project. It's a flywheel.

Every time your AI system encounters a new edge case in production — a document format it hasn't seen, a query it handles poorly, an answer it gets wrong — that edge case should flow back into your ground truth pipeline. It gets validated, corrected, and added to both your evaluation benchmarks and your knowledge base.

Over time, your ground truth becomes an increasingly accurate representation of the real world your AI operates in. Your evaluation benchmarks become more rigorous. Your RAG knowledge base becomes more comprehensive. Quality improves not because you swapped to a better model, but because you gave the same model better data to retrieve and better benchmarks to be measured against.

This compounding effect is the real competitive advantage of ground truth engineering. Models are commodities. Frontier capabilities are accessible to anyone with an API key. But a curated, domain-specific ground truth corpus — built over years of careful engineering — is genuinely defensible.

The Real Cost of "Close Enough"

To many of the enterprise teams we work with, a 0.3% error rate in ground truth data sounds negligible. It isn't.

At scale, a 0.3% error rate in a medical AI system means thousands of patient records with incorrect information. In financial services, it means audit findings. In legal, it means liability.

And here's the subtler problem: errors in ground truth propagate. If your evaluation benchmark contains incorrect reference answers, you'll optimise your system toward the wrong target. If your RAG knowledge base contains inaccurate information, every answer built on that information inherits the error. Ground truth errors don't cause isolated mistakes — they create systematic, compounding failures. This is why most enterprise AI projects fail before they start.

There is no shortcut to ground truth. There are faster and slower ways to build it, smarter and less smart architectures for validating it, more and less efficient ways to maintain it. But there's no substitute for the disciplined, unglamorous work of defining what reality looks like for your AI system.

That's what we do. And we think it's the most important problem in AI that nobody's talking about.
