
The evaluation harness: engineering the layer between your AI and production

Most AI systems don't fail because the model is wrong. They fail because nobody built the harness: the evaluation layer that catches drift, enforces guardrails, and tells you when the model is lying. Here's how to engineer one that survives production.

[Diagram: The Evaluation Harness. Layer 1 Input Validation and Layer 2 Output Guardrails run synchronously in the hot path; Layer 3 Continuous Metrics, Layer 4 Drift Detection, and Layer 5 Regression Testing run asynchronously for observability.]

Most AI teams ship the model. Very few ship the harness. The model is the part that generates answers. The harness is the part that decides whether those answers should reach a human.

This is the single biggest gap in production AI engineering today. Teams spend months on prompt engineering, fine-tuning, and retrieval — then deploy with a try/except block and a Slack alert. The result is predictable: the model drifts, the guardrails don't exist, and the first real incident is discovered by a customer, not by your system.

This post walks through how we build evaluation harnesses at Dilr.ai — the architecture, the five layers, the metrics that matter, and the patterns we've learned shipping 40+ voice agents into production.

What is an evaluation harness?

An evaluation harness is the infrastructure layer that sits between your AI model and your users. It does four things:

  1. Validates every output before it reaches the user
  2. Measures quality continuously, not just at deploy time
  3. Detects drift before it becomes an incident
  4. Enforces guardrails that the model cannot override

Think of it as the test suite that runs on every single inference — not just in CI, but in production, on every call, every response, every tool invocation.

The harness doesn't make the model better. It makes the system honest about when the model is bad.

Why most teams skip it

Three reasons:

  • Eval feels like testing, and testing feels like slowing down. Teams under pressure to ship a demo don't build eval infrastructure. They ship the demo, it works, and the eval never gets built.
  • Offline eval gives false confidence. A model that scores 94% on a held-out test set can score 60% on real traffic. The distribution shifts. The harness catches the shift; the test set doesn't.
  • Nobody owns it. The ML engineer owns the model. The backend engineer owns the API. The product manager owns the feature. Nobody owns the harness. It falls between the cracks.

This is why our Execution Office engagements always start with the harness — before the model, before the integration, before the first call.

The five layers

Every production harness we build has five layers. Each layer runs independently. Each layer can halt the pipeline.

Layer 1: Input validation

Before the model sees anything, validate the input. For voice AI, this means:

  • Audio quality gate: Is the signal-to-noise ratio above threshold? Is the sample rate correct? Is the audio actually speech or is it silence/music/DTMF?
  • Language detection: Does the detected language match the expected language? If not, route to the correct model or escalate.
  • PII screening: Scan the transcript for credit card numbers, SSNs, or health identifiers before they enter the LLM context. Redact or flag.

This layer is fast (sub-10ms) and deterministic. No model involved. Pure engineering.
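A minimal sketch of this layer, assuming the audio pipeline already provides an SNR estimate and a detected language. The thresholds and PII patterns are illustrative placeholders, not a production-grade redaction list:

```python
import re

# Hypothetical thresholds -- tune per deployment.
MIN_SNR_DB = 15.0
EXPECTED_LANG = "en"

# Simple PII patterns (illustrative, not exhaustive).
PII_PATTERNS = {
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def validate_input(snr_db: float, detected_lang: str, transcript: str):
    """Return (ok, redacted_transcript, reason). Deterministic, no model calls."""
    if snr_db < MIN_SNR_DB:
        return False, transcript, "audio_quality"
    if detected_lang != EXPECTED_LANG:
        return False, transcript, "language_mismatch"
    # Redact PII before the transcript ever reaches the LLM context.
    redacted = transcript
    for label, pattern in PII_PATTERNS.items():
        redacted = pattern.sub(f"[REDACTED:{label}]", redacted)
    return True, redacted, ""
```

Because nothing here calls a model, this check stays deterministic and cheap enough to run on every turn.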

Layer 2: Output guardrails

After the model generates a response, but before TTS converts it to speech:

  • Hard-stop classifier: Does the response contain content that must never be spoken? Suicide/self-harm references, medical diagnoses (if not a medical agent), financial advice (if not a financial agent), profanity, competitor names.
  • Factual grounding check: Did the model cite a fact? Is that fact in the knowledge base or CRM? If it's not grounded, replace with a safe fallback: "Let me check that and have someone get back to you."
  • Tone/sentiment guardrail: Is the response empathetic when the caller is distressed? Is it professional when the caller is hostile? Tone mismatches are the fastest way to lose trust.

This layer adds 20–50ms of latency. Worth it.
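A sketch of the hard-stop and grounding checks, assuming upstream code extracts the facts a response cites. The blocklist, fallback string, and function names are hypothetical; a real hard-stop check would be a trained classifier rather than substring matching:

```python
# Hypothetical blocklist and fallback -- adapt per vertical and compliance needs.
HARD_STOP_TERMS = {"diagnosis", "dosage", "investment advice"}
SAFE_FALLBACK = "Let me check that and have someone get back to you."

def guard_output(response: str, cited_facts: list[str], knowledge_base: set[str]) -> str:
    """Run the response through hard-stop and grounding checks before TTS."""
    lowered = response.lower()
    # Hard stop: blocked content is never spoken; replace wholesale.
    if any(term in lowered for term in HARD_STOP_TERMS):
        return SAFE_FALLBACK
    # Grounding: every cited fact must exist in the knowledge base or CRM.
    if any(fact not in knowledge_base for fact in cited_facts):
        return SAFE_FALLBACK
    return response
```

Note the design choice: a failed check replaces the whole response rather than trying to patch it, because a partially edited answer can still carry the ungrounded claim.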

Layer 3: Continuous metrics

Every call, every turn, every response — measure:

  • Containment rate: Did the agent resolve the call without human handoff?
  • Escalation sentiment: When the agent did hand off, was the caller angry?
  • Resolution accuracy: Did the action the agent took actually succeed? (Appointment booked? Payment processed? Ticket opened?)
  • Latency percentiles: P50, P95, P99 for time-to-first-token and total response time.
  • Hallucination rate: Percentage of responses containing claims not grounded in the knowledge base.

These metrics feed a dashboard. But more importantly, they feed alerting thresholds. When containment drops below 70%, someone gets paged. Not emailed. Paged.
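The paging threshold can be sketched as a rolling-window monitor. The window size, threshold, and minimum-sample guard below are assumed values; wire `should_page()` to your actual pager, not email:

```python
from collections import deque

class ContainmentMonitor:
    """Rolling containment rate over the last N calls; pages below threshold."""

    def __init__(self, window: int = 500, threshold: float = 0.70):
        self.outcomes = deque(maxlen=window)  # True = contained, False = handed off
        self.threshold = threshold

    def record(self, contained: bool) -> None:
        self.outcomes.append(contained)

    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def should_page(self, min_samples: int = 100) -> bool:
        # Don't page on thin data; wait for a meaningful sample first.
        return len(self.outcomes) >= min_samples and self.rate() < self.threshold
```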

Layer 4: Drift detection

Models don't degrade suddenly. They drift. The input distribution shifts — new accents, new intents, new products the agent doesn't know about. Drift detection catches this before your metrics crater.

We track three drift signals:

  • Intent distribution shift: Are callers asking for things they didn't ask for last week? A KL-divergence threshold on the intent classifier output.
  • Confidence decay: Is the model's average confidence score declining over a 7-day rolling window? If yes, the input distribution is moving away from the training distribution.
  • Escalation pattern change: Are escalations clustering around a new topic or time of day? This often indicates a business change (new product launch, pricing change) that the agent doesn't know about.

Drift detection runs as a batch job every 6 hours. It doesn't halt the pipeline — it alerts the owner.
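The intent-distribution check can be sketched as a KL-divergence comparison between this week's and last week's intent frequencies. The alert threshold here is a placeholder; as noted below, it should be calibrated against roughly 30 days of baseline data:

```python
import math

def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-6) -> float:
    """KL(p || q) over intent labels, smoothing unseen intents with eps."""
    labels = set(p) | set(q)
    return sum(
        p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
        for k in labels
    )

def drift_alert(this_week: dict[str, float], last_week: dict[str, float],
                threshold: float = 0.1) -> bool:
    # Hypothetical threshold -- calibrate against your own baseline window.
    return kl_divergence(this_week, last_week) > threshold
```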

Layer 5: Regression testing

Every change to the model, the prompt, the knowledge base, or the tool configuration triggers a regression suite:

  • Golden set: 200+ curated call transcripts with known-correct responses. The model must match or exceed the baseline score before deployment.
  • Red team corpus: Adversarial inputs designed to trigger failures — prompt injection, jailbreak attempts, edge-case intents, multilingual switching mid-sentence.
  • A/B gate: New versions run on 5% of traffic for 48 hours before full rollout. If any metric regresses by more than 2 standard deviations, the rollout is automatically halted.

This layer runs in CI/CD and in production. It's the reason we can deploy prompt changes with confidence.
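The 2-standard-deviation A/B gate reduces to a small check, sketched here for a "higher is better" metric such as containment. The function name and sigma default are assumptions; lower-is-better metrics like latency need the inequality flipped:

```python
import statistics

def regression_gate(baseline_samples: list[float], candidate_value: float,
                    n_sigma: float = 2.0) -> bool:
    """True if the candidate metric passes; False halts the rollout.

    Applies to 'higher is better' metrics: the candidate fails when it
    drops more than n_sigma below the baseline mean.
    """
    mean = statistics.mean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples)
    return candidate_value >= mean - n_sigma * stdev
```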

Architecture

The harness sits in the inference pipeline, not beside it. Every request flows through it. Layers 1 and 2 are synchronous — they add latency but they're in the hot path. Layers 3–5 are asynchronous — they observe but don't block.
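One way to sketch that split: the hot path runs validation and guardrails inline, then drops an event onto a queue for the asynchronous layers to consume. All names here are hypothetical stand-ins, and the Layer 1/2 stubs are trivial on purpose:

```python
import queue

events: "queue.Queue[dict]" = queue.Queue()  # Layers 3-5 consume off the hot path

def validate_input(transcript: str) -> bool:       # Layer 1 stub (hypothetical)
    return bool(transcript.strip())

def guard_output(response: str) -> str:            # Layer 2 stub (hypothetical)
    return response if "diagnosis" not in response.lower() else "Let me check on that."

def handle_turn(transcript: str, model) -> str:
    """Synchronous hot path: validate -> generate -> guard. Observers never block."""
    if not validate_input(transcript):
        return "Sorry, I didn't catch that."
    response = guard_output(model(transcript))
    events.put({"transcript": transcript, "response": response})  # fire-and-forget
    return response

def metrics_worker() -> None:
    """Asynchronous side: drain events into metrics, drift baselines, golden sets."""
    while True:
        event = events.get()
        # ... update dashboards, check drift, collect regression candidates ...
        events.task_done()
```

In practice `metrics_worker` runs in a separate thread or process, so a slow dashboard write can never add latency to a live call.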

The metrics that matter

Out of the dozens of metrics you could track, these six are the ones that actually predict production health:

| Metric | Target | Why it matters |
| --- | --- | --- |
| Containment rate | >70% inbound, >55% outbound | Below this, the agent creates work |
| Hallucination rate | <2% | Above this, trust erodes fast |
| P95 latency | <800ms to first token | Above this, callers talk over the agent |
| Escalation sentiment | <20% negative | Above this, the agent is holding too long |
| Resolution accuracy | >85% | Below this, the agent promises but doesn't deliver |
| Drift alert frequency | <2 per week | Above this, the input distribution is unstable |

Track these six. Ignore everything else for the first 90 days.

What breaks without a harness

We've seen it. Every failure mode below has happened in a real deployment:

  • The confident hallucination: The agent tells a caller their appointment is at 3pm. It's actually at 4pm. The agent sounded confident. The caller showed up at 3pm. The clinic lost a patient.
  • The slow drift: Over three weeks, the model's intent classifier gradually misclassifies "cancel my appointment" as "reschedule." Nobody notices until a patient complaint.
  • The guardrail gap: The agent is asked about drug interactions. It answers — helpfully, confidently, and incorrectly. There was no guardrail preventing medical advice from a non-medical agent.
  • The regression surprise: A prompt change improves booking accuracy by 12% but breaks the handoff flow. No regression test caught it because handoff scenarios weren't in the golden set.

Every one of these is preventable with the five-layer harness.

How to start

You don't need all five layers on day one. Start with this sequence:

  1. Week 1: Layer 2 (output guardrails) — the highest-impact, lowest-effort layer. A regex + classifier that catches the worst outputs.
  2. Week 2: Layer 3 (continuous metrics) — instrument containment, latency, and escalation sentiment. Build the dashboard.
  3. Week 3: Layer 1 (input validation) — add PII screening and audio quality gates.
  4. Week 4: Layer 5 (regression testing) — build the golden set from your first 200 real calls.
  5. Month 2+: Layer 4 (drift detection) — you need 30 days of baseline data before drift detection is meaningful.

This is the sequence we run in every Execution Office engagement. By day 30, all five layers are live. By day 90, the harness is the most valuable piece of infrastructure in the stack — more valuable than the model itself.

The uncomfortable truth

The model is replaceable. GPT-4o today, Claude tomorrow, an open-weight model next quarter. The model will change.

The harness is not replaceable. It's the institutional knowledge of what "good" looks like for your specific use case, your specific callers, your specific compliance requirements. It's the accumulated red-team corpus. It's the drift baselines. It's the golden set that encodes your definition of quality.

Build the harness first. Then pick the model.

If you're shipping AI into production and don't have an evaluation harness — or if you have one but it's a Jupyter notebook that someone runs manually before deploys — book a call. We'll walk through what a production harness looks like for your specific use case.

Tags: evaluation harness, engineering, production, mlops, guardrails, drift
