A post I wrote called Hallucination Is Physics generated nearly 200 comments on LinkedIn. Engineers, architects, CTOs, compliance officers, and founders all weighed in. The debate was fascinating. But one thread ran through almost every response, whether people agreed with the physics framing or not:

Everyone is trying to engineer entropy out of the model.

I understand the impulse. Hallucination is costly. In regulated industries — legal, financial services, healthcare — a confident, fluent, completely fabricated AI output isn't an inconvenience. It's a liability event. So people reach for tools: better prompts, retrieval-augmented generation, guardrails, bidirectional validation frameworks, attention mechanism improvements.

All of it real. None of it addressing the actual constraint.

You do not own the model.

You cannot call Anthropic, OpenAI, or Google and ask them to recalibrate the probability distributions for your specific use case. You get the output. That is your interface. Everything else — the training data, the sampling mechanics, the architecture — belongs to someone else, running on infrastructure you will never touch.

So the real question isn't how to fix the model. It's what you can actually control from where you stand.

What You Don't Control (And Why That Matters)

A VP of Business Development put it plainly in the comment thread: LLMs are always hallucinating. The outputs that look accurate are hallucinations that happen to align with ground truth. The ones that don't are the ones you catch — if you catch them at all.

This isn't pessimism. It's an accurate description of the system you're working with. A large language model generates text one token at a time, sampling from probability distributions. Over short sequences, this works remarkably well. Over long sequences — a contract review, a compliance analysis, a financial variance report — uncertainty compounds with every prediction. The model doesn't decide to drift. It drifts because that is what probabilistic generation does over extended sequences.
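The compounding effect is easy to see with toy arithmetic. This is an illustration, not a model of any real system: treat each token as staying "on track" with some independent probability, and watch what happens as sequences grow.

```python
# Illustrative arithmetic only: if each token stays "on track" with
# independent probability p, the chance an n-token sequence never
# drifts is p**n. Real models are not this simple, but the
# compounding shape is the point.
def stay_on_track(p: float, n: int) -> float:
    return p ** n

# Even 99.9% per-token reliability collapses over long sequences.
short_answer = stay_on_track(0.999, 100)      # a paragraph
long_analysis = stay_on_track(0.999, 10_000)  # a 50-page review
```

At 99.9% per-token reliability, a 100-token answer stays grounded about 90% of the time; a 10,000-token analysis, essentially never. The numbers are made up; the exponential decay is not.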

The engineers arguing in that thread that RAG and attention mechanisms reduce hallucination are correct. They do reduce it. They do not eliminate probabilistic generation. They narrow the probability space. The model still samples. It still predicts. It still drifts.

The distinction matters because it determines where you invest your governance effort.

The structural layer that addresses this is Intent Engineering — the governance architecture that sits between human intent and AI output.

For a concrete example of what governance failure looks like in practice, see MedVI Isn't a Billion-Dollar AI Success Story.

What You Do Control

You control three things: the surface area you expose, the structure you impose, and the verification layer you put between AI output and human decision.

Surface area is how much room you give the model to drift. An untethered prompt — "review this contract" or "analyze this financial statement" — hands a probabilistic system maximum sequence length and maximum latitude. One senior AI architect in the thread described the practical response: break long tasks into short, bounded segments with explicit retrieval anchors between each step. Don't ask for a 50-page analysis in one pass. Ask for section-by-section verification with defined source documents at each step. Shorter sequences mean less accumulated uncertainty.
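The bounded-segment pattern described above can be sketched in a few lines. Everything here is illustrative: `call_model` is a hypothetical stand-in for whatever LLM API you use, and the prompt wording is an assumption. The structure is the point, not the call.

```python
# Sketch of bounded, anchored review: each segment gets a short prompt
# plus an explicit source anchor, instead of one long untethered pass.
# `call_model` is a hypothetical stand-in for a real LLM API.
from typing import Callable

def bounded_review(
    sections: dict[str, str],          # section name -> source text
    call_model: Callable[[str], str],  # hypothetical model call
) -> dict[str, str]:
    results = {}
    for name, source in sections.items():
        prompt = (
            f"Review section '{name}'. Cite only the source below; "
            f"answer 'not in source' for anything it does not support.\n"
            f"SOURCE:\n{source}"
        )
        results[name] = call_model(prompt)  # short sequence, one anchor
    return results

# Usage with a stub model, just to show the shape:
stub = lambda prompt: f"reviewed ({len(prompt)} chars of bounded context)"
out = bounded_review({"Termination": "Either party may terminate...",
                      "Liability": "Liability is capped at..."}, stub)
```

Each call sees one section and one source. Drift that does occur is contained to a segment you can trace, instead of being smeared across a 50-page pass.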

Structure is what you impose on the model's operating environment before it generates. One commenter made a distinction that stuck with me: the difference between supervisory control and structural control. Guardrails govern behavior until they don't. Architecture governs what's possible in the first place. Rules can be violated. Structure constrains the space of possible outputs before the model ever starts generating. Structured intermediate checks, approved policy libraries, retrieval boundaries — these are structural. A system prompt saying "don't hallucinate" is not.
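A toy contrast makes the supervisory-versus-structural distinction concrete. A rule is a string the model can ignore; a structure makes the ungrounded path impossible to take. All names below are illustrative, not from any real system.

```python
# Toy contrast between a rule and a structure: this prompt builder
# cannot produce a prompt that isn't grounded in an approved policy
# library. The library contents and names are illustrative.
APPROVED_POLICIES = {
    "retention": "Records are retained for seven years.",
    "access": "Access requires a documented business need.",
}

def build_grounded_prompt(question: str, policy_ids: list[str]) -> str:
    # An unapproved policy id raises KeyError before anything is generated.
    passages = [APPROVED_POLICIES[p] for p in policy_ids]
    if not passages:
        raise ValueError("refusing to build a prompt with no approved sources")
    sources = "\n".join(f"- {p}" for p in passages)
    return f"Answer only from these sources:\n{sources}\n\nQ: {question}"
```

The failure happens before generation, not after. That is the difference between asking the model to behave and constraining what it can be asked.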

Verification is the piece most organizations skip entirely because it requires admitting that the first two controls are insufficient on their own. A CTO working on AI compliance frameworks in the thread described it correctly: design a process, not a prompt. Make governance measurable — thresholds, audit trails, explicit sign-off points. That only works if you have independent measurement of whether the AI's output aligns with source truth.

This is where the conversation gets uncomfortable. Because independent verification means something specific: not asking the AI to check its own work, not running a second prompt over the output, but having a separate system compare each claim in the AI's output against the source documents it was supposed to be grounded in.
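A minimal sketch of what "separate system" means in code. The word-overlap check below is a deliberately crude proxy; production systems would use entailment models or span matching against the source. The function name and threshold are assumptions for illustration.

```python
# Minimal sketch of independent verification: a separate check compares
# each claim against the source it should be grounded in. Word overlap
# is a crude stand-in for real entailment checking; the 0.5 threshold
# is illustrative, not tuned.
def flag_ungrounded(claims: list[str], source: str,
                    min_overlap: float = 0.5) -> list[str]:
    source_words = set(source.lower().split())
    flagged = []
    for claim in claims:
        words = set(claim.lower().split())
        overlap = len(words & source_words) / max(len(words), 1)
        if overlap < min_overlap:
            flagged.append(claim)  # route to a human; never auto-accept
    return flagged
```

The mechanism is crude, but the architecture is the point: the verifier never asks the model whether the model was right.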

The Sign Most People Miss

The most technically valuable comment in the entire thread came from a founder who has been instrumenting LLMs in production for years. His observation: drift is not visible in the final answer. It shows up in the trajectory of generation — the system moves from stable patterns into progressively noisier ones before the hallucination ever surfaces in the output.

By the time you see a wrong answer, the drift already happened upstream.

This matters practically. It means that reviewing outputs after the fact — which is what most organizations call "human oversight" — is already too late in the process. You're reading the result of drift, not catching it in motion. The governance question isn't just "was this output correct?" It's "at what point in this generation did the system lose its grounding, and would we have caught it before it reached a decision?"

Most organizations cannot answer that question. Not for lack of intent, but because they have no instrumentation between the AI's generation and the output it delivers.

The Threshold Question Nobody Is Asking

A CIO in financial services made an observation that reframed the whole conversation for me: AI systems don't just age; they drift, at pace. That's a different governance challenge than legacy systems, which degrade predictably. AI output quality can vary dramatically between runs on the same input, depending on sequence length, context structure, and the accumulated uncertainty of each generation.

That means governance can't be binary. You can't decide that AI output is either trustworthy or not. The real question is: at what confidence threshold does your organization require independent verification before an AI output drives a decision?

That threshold will be different for a first draft of internal marketing copy and a regulatory filing. It will be different for a low-stakes customer response and a clinical recommendation. The organizations that will navigate AI governance well are the ones that define those thresholds explicitly — not by gut feeling, but by the actual cost of being wrong in each context.

A simple starting framework:

  • Low stakes, high speed: AI generates, human spot-checks. Acceptable for internal content, brainstorming, first drafts where errors are caught in review.
  • Medium stakes: AI generates, structured verification against known sources, human reviews flagged items only. Appropriate for customer-facing content, research summaries, operational analyses.
  • High stakes, regulated: every claim independently verified against source documents before any output reaches a decision-maker. Required for legal analysis, financial reporting, clinical documentation, compliance review.
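One way to make these thresholds explicit rather than gut feeling is to write them down as policy in code. The context names and tier assignments below are illustrative; the point is that the mapping is declared, auditable, and defaults to strict.

```python
# The tiers above, written down as explicit policy. Context names and
# tier assignments are illustrative — what matters is that the threshold
# is declared and auditable, not decided ad hoc per request.
TIER_CONTROLS = {
    "low":    "human spot-check",
    "medium": "structured verification; human reviews flagged items",
    "high":   "every claim independently verified before use",
}

CONTEXT_TIER = {
    "internal_draft":    "low",
    "customer_response": "medium",
    "regulatory_filing": "high",
    "clinical_note":     "high",
}

def required_control(context: str) -> str:
    # Unknown contexts default to the strictest tier, not the loosest.
    tier = CONTEXT_TIER.get(context, "high")
    return TIER_CONTROLS[tier]
```

The default matters most: work that nobody classified gets high-stakes treatment until someone explicitly decides otherwise, which inverts how most organizations currently operate.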

Most organizations are running high-stakes work at low-stakes governance levels. Not because they're reckless — because the governance infrastructure for the third tier doesn't exist in most workflows yet.

Key Takeaways
  • You do not own the model. Your interface is the output.
  • Drift is not visible in the final answer — it shows up in the trajectory of generation before the wrong answer appears.
  • Governance cannot be binary. Define confidence thresholds by the actual cost of being wrong in each context.
  • The gap is instrumentation — knowing when drift happened, where, and whether it crossed the threshold that matters.

What This Means Practically

You are not going to fix the model. The model will continue to drift over extended sequences because probabilistic generation is what it is. That is not a criticism of the model. It is an accurate description of how the technology works.

What you can do is reduce the surface area you expose, impose structural constraints on the operating environment, and build independent verification between AI output and human decision. Not as a compliance checkbox, but as the actual mechanism by which you know whether the AI's output is trustworthy enough for the decision you're about to make.

The practitioners in that thread who are already doing this — batching short prompts, anchoring to retrieval sources, building structured intermediate checks — are ahead of most organizations. The gap is instrumentation: knowing when drift happened, where it happened, and whether it crossed the threshold that matters for this specific decision.

That gap is the governance problem. And it won't be closed by a better model.


VertixIQ builds independent verification infrastructure for AI output in regulated industries. If you're working through what governance looks like for your organization, start at vertixiq.com.