Methodology · 14 May 2026 · 12 min read

The Three-Layer Pattern: LLMs Propose, Algorithms Decide, Humans Validate

An architectural principle for agentic AI when outputs must be defensible. The LLM is the evidence-finder. A deterministic algorithm is the decision-maker. A human is the validator.

There is a recurring failure mode we have come to recognise in agentic AI systems built for high-stakes domains. The system runs end-to-end. The outputs look credible. The audit log shows reasoning chains. And yet, when you trace any consequential decision back through the trail and ask the foundational governance question — who decided this? — the honest answer is that a large language model decided.

Maybe it was prompted to “act as a senior reviewer”. Maybe it was wrapped in a structured-output schema and given the title “judge”. Maybe two LLMs debated and a third synthesised. The mechanics vary; the substance does not. In each case, the load-bearing decision was made by a system whose outputs cannot reliably be replicated, whose reasoning cannot reliably be inspected, and whose judgement cannot be held to account.

For an LLM helping a journalist draft an article or a developer write a function, this is fine. The stakes are bounded, the human in the loop is real, and the LLM-as-judge pattern is genuinely useful. For an LLM helping prepare evidence for a regulatory submission, a clinical guideline update, or a health technology assessment, it is not fine. The stakes are not bounded, the human in the loop is doing something else, and the LLM-as-judge pattern produces outputs that fail the most basic question a regulator will ask: who decided this, and on what authority?

Over eighteen months of building Axelium — an agentic platform for systematic review and meta-analysis — we have arrived at an architectural principle we call the three-layer pattern. The LLM is the evidence-finder. A deterministic algorithm is the decision-maker. A human is the validator. Each role is bounded by what it is reliable for. None of them, in isolation, is the system’s authority on any consequential output.

This post is a defence of that principle and an argument that — for any agentic system operating in a regulated, audited, or otherwise consequential domain — it is the right primary architecture. It is not the only viable architecture. It is, we will argue, the one most likely to survive contact with a serious methodologist or a careful regulator.

“LLM-as-judge” is a category error in regulated contexts

The default architecture for agentic AI today is roughly: extract evidence, ask the LLM to judge, return the result. Versions of this pattern dominate the published landscape, from chatbot frameworks to scientific extraction tools. It is fast to implement, demos beautifully, and produces outputs that look credible to anyone not actively looking for the failure modes.

It rests on a category error. An LLM is an inference engine, not an authority. It produces plausible continuations of contexts; it does not — and, by the current best evidence, cannot reliably — execute decision procedures whose correctness must be defensible against an external standard.

The literature on this points consistently in one direction. Huang et al. (ICLR 2024) found that LLMs cannot reliably self-correct their reasoning without an external feedback signal; intrinsic self-correction, in fact, often degrades performance on hard problems. Anthropic’s own work on chain-of-thought faithfulness showed that Claude 3.7 Sonnet acknowledged using injected hints in only 25% of cases. The reasoning trace is evidence about what the model said it did, not evidence about what it did. The MAST taxonomy (Cemri et al., 2025) catalogued fourteen failure modes in multi-agent systems, with conformity bias prominent: when several LLMs share architecture or training data, “consensus” often reduces to correlated error.

In a domain where outputs are reviewed by methodologists trained to re-derive every consequential judgement from primary sources, the gap between plausible and defensible is the gap that matters. A regulator does not ask “is this output credible?” — a regulator asks “show me how you arrived at this judgement, and demonstrate that the procedure is appropriate.” The default LLM-as-judge architecture has nothing meaningful to show. It has a transcript of plausible reasoning that, per the faithfulness literature, may have no causal relationship to the actual answer the model produced.

The regulatory frameworks that have taken AI seriously have all converged on the same structural answer. The principle of augmentation, not replacement — that there must be a capable and informed human in the loop — has become a near-universal requirement across recent national position statements on AI in evidence generation. Cochrane’s 2025 joint statement with Campbell, JBI, and CEE requires that authors remain accountable for content regardless of AI use. The EU AI Act’s high-risk system provisions require records that demonstrate how decisions were made, not just what the decisions were. Every credible framework points the same way: the LLM cannot be the decision-maker.

The category error is obvious once stated. We have just not been treating it as foundational.

The three layers and what each is reliable for

The three-layer pattern is the architectural answer.

Figure: The three-layer pattern. Each role is bounded by what it is reliable for.

Layer 1: the LLM as evidence-finder

The LLM reads the source material — papers, trial reports, guidelines — and proposes structured evidence: “the trial randomised 248 participants 1:1”; “the primary outcome was overall survival at 36 months”; “the article states the analysis was intention-to-treat”. Each proposal is bound to a source: a page reference, a verbatim quote, ideally a specific table cell. The LLM is doing exactly what LLMs are demonstrably good at — extracting structured information from semi-structured text, with the recall and breadth that hand-extraction cannot match.

What the LLM is not doing in this layer is making a judgement. It is not deciding whether a trial is at low risk of bias. It is not deciding whether two outcomes are clinically comparable. It is not deciding whether a meta-analysis should be conducted. It is finding evidence and presenting it for someone else to decide on.
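The binding between a proposal and its source can be made mechanical. The following is a minimal sketch, not Axelium’s implementation: the `EvidenceProposal` type, its field names, and the `is_grounded` check are all hypothetical, and the check assumes the source has already been split into plain text per page. It shows one way to reject any proposal whose verbatim quote does not appear on the page it cites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceProposal:
    """One LLM-proposed fact bound to its source. All field names are hypothetical."""
    item: str    # what is being claimed, e.g. "n_randomised"
    value: str   # the proposed value, kept as text
    page: int    # 1-based page of the source document
    quote: str   # verbatim supporting text copied from that page


def normalise(text: str) -> str:
    """Collapse whitespace and case so line breaks do not cause false mismatches."""
    return " ".join(text.split()).lower()


def is_grounded(proposal: EvidenceProposal, pages: dict[int, str]) -> bool:
    """True only if a non-empty quote literally appears on the cited page."""
    page_text = pages.get(proposal.page, "")
    quote = normalise(proposal.quote)
    return bool(quote) and quote in normalise(page_text)


# Illustrative data only: one page of a made-up trial report.
pages = {3: "A total of 248 participants were\nrandomised 1:1 to treatment or placebo."}
good = EvidenceProposal("n_randomised", "248", 3, "248 participants were randomised 1:1")
bad = EvidenceProposal("n_randomised", "284", 3, "284 participants were randomised")

print(is_grounded(good, pages), is_grounded(bad, pages))
```

A check like this does not establish that the proposed value is right. It only establishes that the quote exists where the LLM says it does, which gives the human validator a concrete place to look.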

Layer 2: the algorithm as decision-maker

The algorithm — deterministic, code-resident, version-controlled — takes the LLM’s evidence proposals and applies a published decision procedure. For risk-of-bias assessment, that procedure is the Cochrane RoB 2.0 algorithm, which the Cochrane Methods Group has published as a series of decision tables. For certainty rating, it is the GRADE Working Group’s downgrade rules. For meta-analysis inclusion, it is the protocol’s pre-registered eligibility criteria. In every case, the algorithm executes a method that pre-exists the software, that has been peer-reviewed in the methodological literature, and that can be cited rather than explained.

The algorithm has properties the LLM does not. It is deterministic — given the same inputs, it produces the same outputs. It is inspectable — the procedure is human-readable code that mirrors a published methodological canon. It is replayable — months later, the same algorithm against the same inputs reproduces the same judgement. It is citable: when challenged, it points to a section of a methods manual rather than to a “model card”.

What it cannot do is interpret novel evidence, recognise implicit information, or understand context the way an LLM can. That is precisely why the LLM has to find the evidence first. The algorithm does the load-bearing decision because the LLM has done the load-bearing perception.
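To make “deterministic, inspectable, replayable” concrete, here is a small sketch of what a code-resident decision table might look like. It is a toy rule over two signalling answers, invented for illustration; it is not the RoB 2 algorithm or any published procedure, and `RULESET_VERSION`, `judge_domain`, and `decide` are hypothetical names.

```python
import hashlib
import json

RULESET_VERSION = "toy-rules-0.1"  # hypothetical identifier, not a real canon

YES = {"Y", "PY"}  # yes / probably yes
NO = {"N", "PN"}   # no / probably no


def judge_domain(answers: dict[str, str]) -> str:
    """Toy decision table over two signalling answers (illustrative only)."""
    q1 = answers.get("q1", "NI")  # "NI" = no information
    q2 = answers.get("q2", "NI")
    if q1 in YES and q2 in YES:
        return "low"
    if q1 in NO or q2 in NO:
        return "high"
    return "some concerns"  # neither all-yes nor any no, e.g. an "NI" answer


def decide(answers: dict[str, str]) -> dict[str, str]:
    """Return the judgement together with what is needed to replay it later."""
    canonical = json.dumps(answers, sort_keys=True)  # key order must not matter
    return {
        "judgement": judge_domain(answers),
        "ruleset": RULESET_VERSION,
        "input_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


print(decide({"q1": "Y", "q2": "NI"}))
```

Storing the ruleset version and an input hash alongside the judgement is one way to support replay: months later, the same version run on inputs with the same hash should reproduce the same result.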

Layer 3: the human as validator

The human looks at the LLM’s evidence proposals and the algorithm’s derived judgement, and either accepts, edits, or overrides. The human is not re-extracting from scratch — the LLM has done that work. The human is not re-deriving the algorithm — that work is canonical. The human is exercising the one thing humans are uniquely positioned to do in this loop: catching the cases where the LLM’s evidence is wrong in a way the algorithm cannot detect, or where the canon’s procedure is being applied to a case it was not designed for, or where some contextual judgement is required that no published procedure has anticipated.

This is what augmentation, not replacement looks like when it is structurally enforced rather than rhetorically asserted. The boundary is architectural, not aspirational.
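One way to read “enforced by code path” is that no judgement can leave the system without a validation record attached. The sketch below assumes that reading; `Action`, `Validation`, and `finalise` are hypothetical names, and a real system would also need authentication and persistence, which are omitted here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ACCEPT = "accept"
    EDIT = "edit"          # evidence corrected, algorithm re-run by the caller
    OVERRIDE = "override"  # human replaces the algorithm's judgement


@dataclass(frozen=True)
class Validation:
    reviewer_id: str
    action: Action
    rationale: str = ""
    final_judgement: Optional[str] = None  # required for OVERRIDE


def finalise(algorithm_judgement: str, validation: Optional[Validation]) -> str:
    """Release a judgement only when a human validation record exists."""
    if validation is None:
        raise PermissionError("no human validation: judgement cannot be released")
    if validation.action is Action.ACCEPT:
        return algorithm_judgement
    if not validation.rationale.strip():
        raise ValueError("edits and overrides need a written rationale")
    if validation.action is Action.OVERRIDE:
        if validation.final_judgement is None:
            raise ValueError("an override must state the replacement judgement")
        return validation.final_judgement
    # EDIT: assumes the algorithm was already re-run on the corrected evidence
    return algorithm_judgement


print(finalise("low", Validation("reviewer-17", Action.ACCEPT)))
```

The design choice worth noting is that the unvalidated path raises rather than returning a default, so there is no code path on which the algorithm’s output, or the LLM’s, becomes final on its own.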

Common antipatterns

The pattern becomes clearer once juxtaposed with the antipatterns it replaces. We have watched — and built, and rebuilt — systems that demonstrated each of the following failure modes.

Free-form LLM judgement

The LLM reads the paper and returns “low risk of bias because…” with a paragraph of plausible-sounding reasoning.

Inspectable but not auditable. It fails the regulator’s first question: what procedure was applied? The honest answer is that the LLM’s general impression was applied. The output may well be correct much of the time. It is not defensible.

Multi-agent debate as decision

Two LLMs argue; a third synthesises. Surface validity high; structural validity low.

When agents share architecture, debate amplifies shared priors rather than cancelling them. When they don’t, the resolution rule becomes the actual decision-maker, hidden one layer down. The answer to “who decided this?” is “the rhetorical structure of the debate did”, which is not an answer.

Intrinsic self-correction

“Are you sure? Check your work.” The system loops the LLM back over its own output.

Without an external feedback signal, the literature is consistent: this can make outputs worse, not better. The LLM evaluating its own previous output is not a verification step — it is another inference step with the same failure profile.

Confidence threshold as governance

“If the model’s confidence is below 0.X, route to human review.” LLM confidence acts as the governance gate.

LLM confidence is often poorly calibrated. Worse, this pattern makes the LLM the decision-maker above the threshold: the human sees only the cases the system has self-classified as uncertain. The cases most in need of human review are the confidently wrong ones, which are precisely the cases the system thinks it handled well.
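The blind spot is easy to see in a few lines. In this sketch the item list, the confidence values, and the 0.8 threshold are all invented for illustration; nothing here is measured data.

```python
def gated_review_queue(items: list[dict], threshold: float = 0.8) -> list[dict]:
    """Antipattern: only items the model itself rates as uncertain reach a human."""
    return [item for item in items if item["confidence"] < threshold]


# Hypothetical items: "correct" is ground truth the system cannot see at run time.
items = [
    {"id": "trial-A", "confidence": 0.95, "correct": False},  # confidently wrong
    {"id": "trial-B", "confidence": 0.60, "correct": True},   # uncertain but right
]

reviewed_ids = {item["id"] for item in gated_review_queue(items)}
missed_errors = [
    item["id"] for item in items
    if not item["correct"] and item["id"] not in reviewed_ids
]

print(missed_errors)
```

Under the gate, the reviewer’s time goes to the uncertain-but-right item, while the confidently wrong one is released without ever being seen.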

Audit-log-as-explanation

The system records the LLM’s chain-of-thought trace and treats this as the explanation for the decision.

The faithfulness literature has shown this to be unsafe: the trace records what the model said, not what it did. Using the trace as a diagnostic aid is reasonable; treating it as audit evidence is not.

The three-layer pattern is, in part, a structural response to each of these antipatterns. It does not depend on LLM calibration, on chain-of-thought faithfulness, on debate dynamics, or on self-correction reliability. It depends on the LLM being good at evidence-finding — which is empirically true — on the algorithm encoding the published canon — which is verifiable — and on the human being available to override — which is structurally required.

What this means for evaluation, audit, and regulatory acceptance

Adopting the three-layer pattern has follow-on implications for how systems are evaluated, audited, and presented to regulators.

Evaluation gains a structure

Instead of a single end-to-end accuracy metric, the system has three independently evaluable surfaces — LLM evidence-finding accuracy, algorithm correctness, and human-validator behaviour. Each is independently improvable and testable.
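As a rough sketch of what three separate evaluation surfaces could look like in practice, the functions below compute one simple metric per layer. The function names and the choice of metrics are assumptions made for illustration; a real evaluation would use whatever gold standard and worked examples the methodological canon provides.

```python
from typing import Callable


def extraction_accuracy(proposed: dict[str, str], gold: dict[str, str]) -> float:
    """Layer 1: share of gold-standard items the LLM extracted correctly."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if proposed.get(key) == value)
    return hits / len(gold)


def algorithm_pass_rate(
    decide: Callable[[dict], str], cases: list[tuple[dict, str]]
) -> float:
    """Layer 2: share of worked examples the coded procedure reproduces."""
    if not cases:
        return 0.0
    passed = sum(1 for inputs, expected in cases if decide(inputs) == expected)
    return passed / len(cases)


def override_rate(actions: list[str]) -> float:
    """Layer 3: how often validators replaced the algorithm's judgement."""
    if not actions:
        return 0.0
    return actions.count("override") / len(actions)


# Illustrative inputs only.
print(extraction_accuracy({"n": "248", "itt": "yes"}, {"n": "248", "itt": "no"}))
print(override_rate(["accept", "override", "accept", "edit"]))
```

Keeping the metrics separate means a drop in end-to-end quality can be attributed to a layer: a falling extraction score points at prompts or models, a failing worked example points at the coded procedure, and a rising override rate points at a mismatch between the procedure and the cases it is being applied to.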

Audit gains a structure too

Each consequential output traces to three things: the LLM’s evidence proposals with their source citations, the algorithm’s decision with its canonical reference, and the human’s validation action with the validator’s identity and rationale. There is never a state in which the answer to “how was this decided?” terminates at the LLM. The audit trail becomes, in the literal regulatory sense, a procedure record rather than a reasoning log.

Regulatory alignment becomes natural

The “augmentation, not replacement” principle now standard in regulatory guidance is structurally enforced: the LLM is architecturally barred from being the final decision-maker on any judgement, not by policy but by code path. Records show how decisions were made, not just what they were. The widely articulated requirement to declare AI use and explain method choice maps directly to the algorithm’s canonical citation and the human’s validation record. The EU AI Act’s high-risk system audit-trail requirements are satisfied by construction. Cochrane MECIR’s “two independent reviewers” requirement remains intact because the LLM, as evidence-finder, is not a reviewer in the methodological sense; it is a research assistant for whichever human exercises the validation role.
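A procedure record of the kind described above can be sketched as a single structure with one part per layer. The `AuditRecord` type, its field names, and the `missing_layers` check are hypothetical, and the citation string is a placeholder rather than a real reference.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    """One consequential output and its three-part trace. Field names are hypothetical."""
    output_id: str
    evidence: list          # layer 1: dicts with "quote" and "page" keys
    procedure_ref: str      # layer 2: citation of the published procedure
    ruleset_version: str    # layer 2: version of the code that encodes it
    judgement: str
    reviewer_id: str        # layer 3: who validated
    action: str             # layer 3: "accept", "edit" or "override"
    rationale: str


def missing_layers(record: AuditRecord) -> list:
    """Name every layer whose part of the trace is absent or incomplete."""
    missing = []
    cited = all(e.get("quote") and e.get("page") for e in record.evidence)
    if not record.evidence or not cited:
        missing.append("evidence")
    if not (record.procedure_ref and record.ruleset_version):
        missing.append("procedure")
    if not (record.reviewer_id and record.action):
        missing.append("validation")
    return missing


def to_json(record: AuditRecord) -> str:
    """Serialise with sorted keys so identical records produce identical text."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    output_id="domain-1/trial-A",
    evidence=[{"quote": "randomised 1:1", "page": 3}],
    procedure_ref="<citation of the published procedure>",  # placeholder
    ruleset_version="toy-rules-0.1",
    judgement="some concerns",
    reviewer_id="reviewer-17",
    action="accept",
    rationale="",
)
print(missing_layers(record))
```

A record for which `missing_layers` returns anything other than an empty list is one where the question “how was this decided?” cannot be fully answered, and it can be blocked from release on that basis.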

This is not the only viable architecture for agentic AI in regulated workflows. We do not claim it is. Other architectures may achieve similar goals through different structural means. What we want to suggest is that whatever the architecture, the same three questions must be answerable: who finds the evidence, who applies the decision procedure, who validates the result. If any of those three answers reduces to “the LLM”, the system probably does not meet the regulatory bar — even if it produces correct outputs most of the time.

The three-layer pattern is a starting point, not a destination. We have found it the most defensible foundation for an agentic system whose outputs need to be defensible — and we expect the questions that follow from it (how the pattern extends to non-randomised studies, where the line falls between procedure and judgement, what validation methodology applies to the algorithm layer itself) to define a meaningful share of the methodological work in agentic evidence synthesis over the coming years.