
Validation, assumptions, and current limitations

How Axelium validates extracted data, what the platform currently supports, and how to handle edge cases.

Validation philosophy

Axelium combines deterministic validation rules with human review. The goal is to catch obvious inconsistencies early and make it easy for reviewers to spot, correct, and document issues before they reach the analysis.

What Axelium validates

  • Events ≤ totals for dichotomous outcomes.
  • Non-negative variances and standard deviations.
  • Confidence interval bounds and ordering checks.
  • Numeric field types and required field presence.
  • Selected content-level checks such as unit consistency where possible.

Fig 1. Data lineage: every value in a model run can be traced back to the extracted source, supporting audit and manual verification.
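
The deterministic rules listed above can be pictured as small pure functions. The following is a minimal sketch, not Axelium's actual implementation: the interfaces, field names, and message strings are hypothetical, and the "estimate lies inside its CI" check is an assumption added for illustration.

```typescript
// Hypothetical shapes for illustration only; not Axelium's real schema.
interface DichotomousArm {
  events: number;
  total: number;
}

interface EffectEstimate {
  estimate: number;
  ciLower: number;
  ciUpper: number;
  sd?: number; // standard deviation, when reported
}

// Returns human-readable issues; an empty array means every check passed.
function validateArm(arm: DichotomousArm): string[] {
  const issues: string[] = [];
  if (!Number.isFinite(arm.events) || !Number.isFinite(arm.total)) {
    issues.push("events and total must be finite numbers");
    return issues;
  }
  if (arm.events < 0 || arm.total < 0) issues.push("counts must be non-negative");
  if (arm.events > arm.total) issues.push("events exceed total");
  return issues;
}

function validateEstimate(e: EffectEstimate): string[] {
  const issues: string[] = [];
  if (e.sd !== undefined && e.sd < 0) issues.push("negative standard deviation");
  if (e.ciLower > e.ciUpper) issues.push("CI bounds out of order");
  // Assumed extra check: the point estimate should sit inside its interval.
  if (e.estimate < e.ciLower || e.estimate > e.ciUpper) {
    issues.push("estimate lies outside its CI");
  }
  return issues;
}
```

For example, an arm with 12 events out of 10 participants would yield a single "events exceed total" issue, which a reviewer could then correct against the source table.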

Fulltext recovery limits

  • Captcha-walled publishers cannot be auto-fetched. Sites behind Cloudflare, reCAPTCHA, hCaptcha, or Incapsula block automated retrieval. Axelium stamps these documents fetch_blocked and surfaces them on the Bulk Upload page (/analysis/[id]/fulltext/missing) so reviewers can hand-fetch the PDFs and drag-drop them onto the matching study rows.
  • OA backfill is intentionally rate-limited. A Layer-2 recovery pass queries PMC mirrors and NCBI efetch.fcgi for open-access copies of missing fulltexts, but per-document and inter-candidate delays are baked in to dodge PMC’s bot-gate. Throughput is bounded by design, so a large backlog of missing PDFs may take several minutes to drain even when most are recoverable.
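
To see why a rate-limited backfill takes minutes rather than seconds, the arithmetic can be sketched as below. This is a rough lower-bound estimate under stated assumptions: the function name, the pacing fields, and every delay value are hypothetical (the actual delays are not documented here), and network time per request is ignored.

```typescript
// All delay values passed to this function are hypothetical examples.
interface BackfillPacing {
  perDocumentDelayMs: number;    // pause between consecutive documents
  interCandidateDelayMs: number; // pause between candidate sources for one document
}

// Lower bound on the time spent waiting, ignoring request latency.
function estimateDrainMs(
  missingDocs: number,
  candidatesPerDoc: number,
  pacing: BackfillPacing,
): number {
  if (missingDocs <= 0) return 0;
  const gapsWithinDoc = Math.max(candidatesPerDoc - 1, 0);
  const waitPerDoc = gapsWithinDoc * pacing.interCandidateDelayMs;
  const gapsBetweenDocs = missingDocs - 1;
  return missingDocs * waitPerDoc + gapsBetweenDocs * pacing.perDocumentDelayMs;
}
```

With 100 missing documents, 3 candidates each, a 500 ms inter-candidate delay, and a 2 s per-document delay, the waiting alone adds up to 298,000 ms, or roughly five minutes.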

Screening limitations

  • Unsure studies require resolution. Studies marked “unsure” during screening are not counted as included or excluded in PRISMA reporting. A large unsure bucket means the meta‑analysis is incomplete rather than just underpowered. Use confidence bands, custom instructions, PICO refinement, and escalation to drive the unsure rate below 5%.
  • Abstract‑only screening. The AI screener evaluates titles and abstracts only. Studies whose abstracts lack key details (biomarker status, exact comparator, age range) may be marked unsure even when the full text would confirm eligibility. Custom screening instructions can help the AI handle common missing‑detail patterns.
  • Escalation decisions are auditable but automated. When escalation forces a decision on an unsure study, the tie‑breaker bias (default: include) means some borderline studies may be included that a human reviewer would have excluded. All escalated decisions are tagged as pass 2, so they can be reviewed and overridden.
  • Protocol papers can match PICO criteria. Protocol and methodology papers describe the same population, intervention, and comparator as their results papers. The screener may incorrectly include them. Title‑based heuristics and study‑type flags help catch these, but manual review of included studies is still recommended.
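
The unsure-rate target and the escalation tie-breaker can be illustrated with a short sketch. The types and function names are hypothetical, and the escalation step is deliberately simplified: it applies the tie-breaker to every unsure study, whereas the real escalation presumably re-evaluates each study before falling back on the bias.

```typescript
type ScreeningDecision = "include" | "exclude" | "unsure";

interface ScreenedStudy {
  id: string;
  decision: ScreeningDecision;
  pass: 1 | 2; // pass 2 marks an escalated decision
}

// Fraction of screened studies still marked unsure (0 for an empty list).
function unsureRate(studies: ScreenedStudy[]): number {
  if (studies.length === 0) return 0;
  const unsure = studies.filter((s) => s.decision === "unsure").length;
  return unsure / studies.length;
}

// Simplified escalation: force the tie-breaker decision and tag it as pass 2
// so the forced decisions stay easy to find and override.
function escalate(
  studies: ScreenedStudy[],
  tieBreaker: "include" | "exclude" = "include",
): ScreenedStudy[] {
  return studies.map((s) =>
    s.decision === "unsure" ? { ...s, decision: tieBreaker, pass: 2 as const } : s,
  );
}
```

Filtering the result on `pass === 2` gives the list of borderline studies that most deserve a manual second look.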

Extraction quality safeguards

The extraction pipeline includes several layers of quality assurance beyond basic field validation:

  • Confidence scoring — every extracted value carries a confidence score. Values below configurable thresholds are routed to the human Review Queue rather than auto‑accepted.
  • Multi‑dimensional scoring — the quality check evaluates evidence adequacy, arm alignment, schema completeness, and provenance quality independently. A weakness in any dimension triggers human review.
  • Conflict detection — when registry data and PDF values disagree, or when successive extraction runs produce different results, the conflict is surfaced in a side‑by‑side view for explicit resolution.
  • Arm swap detection — the math engine identifies when treatment and control arms may have been swapped and flags the extraction for review.
  • Cross‑outcome HR deduplication — when the same hazard ratio is reported for multiple outcomes, the system detects the duplicate and prevents double‑counting.
  • HR‑in‑OR detection — catches cases where a hazard ratio value has leaked into an odds ratio or risk ratio field (e.g., an HR of 0.59 appearing as pCR) and clears the invalid value automatically.
  • Events derivation — when studies only report event rates as percentages, the math engine derives absolute event counts so they can be pooled with studies that report raw counts.
  • Extraction failure provenance — every sniper failure persists its input snippet and failure reason to a sentinel extraction_provenance row plus values_json.provenance._sniper_input and _debug, so reviewers can see exactly what the snippet looked like when a value did not parse. Outcomes whose sniper outputs come back entirely null automatically retry with a second pass before being marked as missing.
  • Spotter synonym expansion — generic outcome names like “anxiety” or “depression” are expanded against a static clinical-scale synonym table (HADS-A, GAD-7, mRS, and similar) so snippets that only mention the scale name are still located by the spotter.
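
Two of the safeguards above, events derivation and confidence-based routing, reduce to simple arithmetic and a threshold comparison. The sketch below assumes hypothetical function names, a hypothetical rounding rule (nearest integer), and a caller-supplied threshold; none of these are confirmed details of the math engine.

```typescript
// Derives an absolute event count from a reported percentage and arm size.
// Returns null when the inputs cannot describe a valid arm.
function deriveEvents(ratePercent: number, total: number): number | null {
  if (!Number.isFinite(ratePercent) || !Number.isInteger(total)) return null;
  if (ratePercent < 0 || ratePercent > 100 || total <= 0) return null;
  // Assumption: round to the nearest whole event.
  return Math.round((ratePercent / 100) * total);
}

// Routes a value by its confidence score; the threshold is configurable,
// so it is passed in rather than hard-coded.
function routeByConfidence(
  confidence: number,
  threshold: number,
): "auto_accept" | "review_queue" {
  return confidence < threshold ? "review_queue" : "auto_accept";
}
```

A study reporting a 24.5% event rate in an arm of 200 would be pooled as 49 events out of 200, alongside studies that report raw counts.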

Automated quality control

After extraction completes, an automated QC pass inspects every outcome for five categories of issues before the result is surfaced for human review:

  • Missing effect sizes — required numeric fields (effect estimate, CI, events/totals) are absent.
  • Impossible values — events exceed totals, negative standard deviations, or confidence intervals in the wrong order.
  • Implausible magnitudes — effect sizes that fall outside clinically plausible ranges for the outcome type (e.g., HR below 0.20 for EFS or below 0.25 for OS).
  • Subgroup‑vs‑overall confusion — values that appear to come from a subgroup breakdown (ctDNA, PD‑L1, squamous‑specific) rather than the overall population.
  • Cross‑measure leakage — hazard ratios appearing in outcomes that expect odds ratios or risk ratios, caught by the HR‑in‑OR detector.

When a trigger fires, the system re‑runs extraction with targeted repair hints describing the specific issue. Only if the re‑extraction still fails QC is the outcome routed to the human Review Queue.

Automated QC flow: extract → check → repair → re-check → accept or escalate to human review.
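
The extract, check, repair, re-check flow can be sketched as a small control loop. This is an illustration under assumptions: the types, trigger names, and the `Extractor` callback are hypothetical, only three of the five trigger categories are implemented, and a single repair attempt is assumed. The HR floors for EFS and OS are the example values quoted above.

```typescript
type QcTrigger = "missing_effect_size" | "impossible_value" | "implausible_magnitude";

interface ExtractedOutcome {
  measure: "HR" | "OR" | "RR";
  endpoint: string;
  estimate: number | null;
  ciLower: number | null;
  ciUpper: number | null;
}

// Example plausibility floors for hazard ratios, taken from the text above.
const HR_FLOOR: Partial<Record<string, number>> = { EFS: 0.2, OS: 0.25 };

function checkOutcome(o: ExtractedOutcome): QcTrigger[] {
  if (o.estimate === null || o.ciLower === null || o.ciUpper === null) {
    return ["missing_effect_size"];
  }
  const triggers: QcTrigger[] = [];
  if (o.ciLower > o.ciUpper) triggers.push("impossible_value");
  const floor = HR_FLOOR[o.endpoint];
  if (o.measure === "HR" && floor !== undefined && o.estimate < floor) {
    triggers.push("implausible_magnitude");
  }
  return triggers;
}

// Hypothetical extractor: receives the triggers from the previous run as repair hints.
type Extractor = (repairHints: QcTrigger[]) => ExtractedOutcome;

interface QcResult {
  status: "accepted" | "review_queue";
  outcome: ExtractedOutcome;
  history: QcTrigger[][]; // triggers fired on each run, oldest first
}

function extractWithQc(extract: Extractor): QcResult {
  const first = extract([]);
  const firstTriggers = checkOutcome(first);
  if (firstTriggers.length === 0) {
    return { status: "accepted", outcome: first, history: [firstTriggers] };
  }
  // One targeted repair attempt, then escalate if QC still fails.
  const second = extract(firstTriggers);
  const secondTriggers = checkOutcome(second);
  return {
    status: secondTriggers.length === 0 ? "accepted" : "review_queue",
    outcome: second,
    history: [firstTriggers, secondTriggers],
  };
}
```

Keeping the per-run trigger history alongside the result mirrors the QC history that reviewers see in the Review Queue.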

Review and adjudication

Outcomes that reach the Review Queue receive per‑outcome Accept, Reject, or Re‑extract decisions — not batch approval. Reviewers see full context:

  • QC history — which triggers fired, how many retry attempts were made, and how scores changed between runs.
  • Agent‑assisted adjudication — when two extraction candidates produce conflicting values, an AI agent evaluates both options and recommends the most likely correct value with a confidence score and rationale. Reviewers can accept, override, or escalate.
  • Rerun instructions — reviewers can provide free‑text guidance (e.g., “use Table 2, not Table S3”) that is passed to the extraction engine for a targeted re‑extraction before the outcome is persisted.

Expected error modes

Complex tables, unusual study designs, and low-quality PDFs can still cause extraction errors. AI suggestions should always be reviewed before final analysis. The confidence badges and Review Queue help prioritize which extractions need attention.

Corrections are designed to be straightforward and auditable — updates to extracted values are tracked so changes are transparent over time.

Risk of Bias 2.0 — scope and v1 limitations

Axelium implements the Cochrane RoB 2.0 tool (2016 individually-randomised-trials guidance) for per-result risk-of-bias assessment. The LLM proposes signaling-question answers grounded in PDF source spans; the published Cochrane decision tables derive the domain judgements deterministically; a human reviewer owns final validation via the dual-review flow.

Known scope limits in the current release:

  • RCTs only. Non-randomised studies (cohort, case-control, etc.) need ROBINS-I, which is on the roadmap as v2. The RoB Evaluator agent skips non-randomised studies and records a rob_non_randomised_determined activity entry.
  • 2016 wording. Signaling-question text is encoded from the 2016 detailed guidance. The 2019 update revised D3.3 and D5.1 with inverted polarity; switching to 2019 wording without also flipping the truth tables would silently corrupt every assessment. Alignment with the 2019 wording is tracked as a follow-up.
  • Individually-randomised only. Cluster-randomised trials need an additional Domain 1b that is not yet supported.
  • Single human reviewer. Reviewer A is the LLM (implicit on the rob_assessments row); Reviewer B is the human. Multi-rater dual-human RoB is not yet supported.
  • Blinded review queue is judgement-only. The review queue captures Reviewer B's domain judgements but not signaling-answer-level edits. Reviewers who want to record their own signaling answers use the matrix drawer instead.

Defaults are conservative: legacy analyses without explicit rob.enabled: true in their config skip the RoB stage entirely, protecting production from a runaway auto-trigger.
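
The opt-in default and the non-randomised skip can be expressed as a small gating function. This is a sketch under assumptions: the `AnalysisConfig` shape, the design labels, and the function name are hypothetical; only the rob.enabled flag and the rob_non_randomised_determined activity name come from the text above.

```typescript
interface AnalysisConfig {
  rob?: { enabled?: boolean };
}

type StudyDesign = "rct" | "non_randomised";

interface RobPlan {
  run: boolean;
  activity: string | null; // activity-log entry to record, if any
}

function planRobStage(config: AnalysisConfig, design: StudyDesign): RobPlan {
  // Opt-in only: a missing or false flag skips the stage entirely.
  if (config.rob?.enabled !== true) return { run: false, activity: null };
  // Non-randomised studies are skipped and the skip is recorded.
  if (design === "non_randomised") {
    return { run: false, activity: "rob_non_randomised_determined" };
  }
  return { run: true, activity: null };
}
```

Comparing against `true` explicitly, rather than testing for truthiness, keeps a legacy config with no rob block on the safe side of the default.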

GRADE Summary of Findings — scope and v1 limitations

Axelium auto-derives a Cochrane GRADE certainty rating for every outcome a stats run produces, with reviewer-finalisation and per-factor override via the SoF drawer. Derivation is pure TypeScript (no LLM in the certainty path); the SQL twin recomputes the arithmetic CHECK on every write so app-side and DB-side cannot drift.

Known scope limits in the current release:

  • RCT starting level only. The engine starts every body of evidence at “high” because the stats-v6 MVP only ingests randomised designs. A startingLevel input is exposed so the starting level can be overridden when observational pathways land, but there is no UI affordance for it yet.
  • Indirectness is reviewer-only. Auto-derivation cannot assess PICO directness from numeric output alone, so the engine always emits not_serious. Reviewers escalate manually via the drawer.
  • Imprecision thresholds. Total-events thresholds (<100 very_serious, <300 serious) are applied only to dichotomous outcomes. Continuous outcomes fall through to the CI vs MID / no-effect rule alone, even when the per-arm sample size is very small. A future release may add an analogous total-N threshold.
  • Publication bias requires k ≥ 10. With fewer than 10 studies, below the point at which Cochrane guidance considers Egger's test adequately powered, the factor is pinned to not_serious, even when the funnel plot is visibly asymmetric. Reviewers can manually override based on the funnel-plot artifact rendered by the publication-bias tool.
  • Per-stratum sub-grading not supported. The save-hook always writes stratum = NULL. Reviewers cannot sub-grade outcomes by subgroup; the SoF table renders the stratum line conditionally so adding stratum support won't reshape rows.
  • Single reviewer. GRADE does not use the dual-review conflict pattern (RoB does). A second reviewer who disagrees with a finalised assessment must un-finalise and re-finalise; the audit trail captures the chain of reviewers via activity_log entries.
  • Egger's test ignores estimator transforms. The parser accepts the bias-artifact payload as metafor produces it. If a future stats path returns Egger statistics on a log-transformed scale, the engine's p-value comparison may behave unexpectedly.

Defaults are ON for new analyses (the derivation is pure-TS so there's no LLM-cost concern). Set configJson.grade.auto_enabled: false to disable the save-hook for a specific analysis; the choice is recorded via a grade_auto_skipped activity-log entry.
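
The arithmetic behind the certainty rating can be sketched as follows. This is not the engine's code: it assumes the conventional GRADE scheme (start at a level, drop one level per serious factor and two per very_serious, floor at very_low), uses hypothetical function and field names, assumes an Egger significance cut-off of 0.1, and omits the CI vs MID / no-effect rule. The total-events thresholds and the k ≥ 10 rule are the ones listed above.

```typescript
type Severity = "not_serious" | "serious" | "very_serious";
type Certainty = "high" | "moderate" | "low" | "very_low";

const PENALTY: Record<Severity, number> = { not_serious: 0, serious: 1, very_serious: 2 };
const SCORE: Record<Certainty, number> = { high: 4, moderate: 3, low: 2, very_low: 1 };

// Total-events rule for dichotomous outcomes only.
function imprecisionFromEvents(totalEvents: number): Severity {
  if (totalEvents < 100) return "very_serious";
  if (totalEvents < 300) return "serious";
  return "not_serious";
}

// Pinned to not_serious below k = 10; alpha is an assumed cut-off.
function publicationBiasSeverity(k: number, eggerP: number | null, alpha = 0.1): Severity {
  if (k < 10 || eggerP === null) return "not_serious";
  return eggerP < alpha ? "serious" : "not_serious";
}

interface GradeFactors {
  riskOfBias: Severity;
  inconsistency: Severity;
  indirectness: Severity; // auto-derivation always emits not_serious
  imprecision: Severity;
  publicationBias: Severity;
}

function deriveCertainty(f: GradeFactors, startingLevel: Certainty = "high"): Certainty {
  const downgrades =
    PENALTY[f.riskOfBias] +
    PENALTY[f.inconsistency] +
    PENALTY[f.indirectness] +
    PENALTY[f.imprecision] +
    PENALTY[f.publicationBias];
  const score = Math.max(SCORE[startingLevel] - downgrades, 1);
  if (score >= 4) return "high";
  if (score === 3) return "moderate";
  if (score === 2) return "low";
  return "very_low";
}
```

Because the rating is plain integer arithmetic over five factors, the same computation is straightforward to mirror in a database CHECK constraint, which is the kind of app-side and DB-side agreement described above.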

Living reviews — scope and limitations

A living review re-runs the full pipeline on a schedule and alerts you when the pooled evidence changes. See Key concepts · living reviews for the full walkthrough. A few things to be aware of before enabling automation:

  • The GRADE-certainty-downgrade alert is not yet implemented. A living-review cycle does not currently capture a per-cycle GRADE certainty rating, so an alert on a certainty downgrade cannot fire. It is therefore not offered as an alert rule in the Living-Review Automation dialog. The working alert rules are the effect-size delta, the CI-crossing-the-null flip, and the I² jump.
  • Unattended cycles spend LLM tokens. Every scheduled cycle searches, screens, and (when enabled) extracts and assesses risk of bias — all of which consume tokens against your plan, with no human in the loop deciding whether each step is worth it. Cost guardrails run at every stage and a quota-exhausted account is stopped before a cycle spends anything, but a daily cadence on a large review is meaningfully more expensive than a monthly one. Pick a cadence, a Max Studies / Run cap, and the auto-extraction / auto-RoB toggles deliberately.
  • Auto-runs require a locked protocol. The scheduler only picks up analyses whose protocol and search queries are locked. This is intentional — an unattended cycle must not search against an in-flux PICO — but it means automation does not begin until you lock the review. A draft analysis is silently skipped by the scheduler.
  • The live Provenance view is not a single point-in-time record. The Provenance page recomputes on every load. A cycle's pooled results are immutable — each one is a snapshot taken when that cycle ran — but the agentic-extraction provenance and the analysis metadata shown alongside them reflect current state: re-running extraction, or editing the analysis, changes what the live view shows for past cycles. The live view therefore mixes per-cycle-immutable results with current-state provenance and is not guaranteed to be internally consistent as a whole. Use Freeze for submission to capture a frozen snapshot — a self-contained, immutable copy of one instant — whenever you need a consistent point-in-time artefact for an audit trail or an HTA submission.
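
The three working alert rules can be sketched as a comparison between two consecutive cycles. This is an illustration only: the types, rule identifiers, and threshold fields are hypothetical, the threshold values are left to the caller, and the effect-size delta is taken on the raw scale (a ratio measure might instead be compared on the log scale).

```typescript
interface CycleResult {
  estimate: number;
  ciLower: number;
  ciUpper: number;
  iSquared: number; // heterogeneity, in percent
}

interface AlertThresholds {
  effectDelta: number;  // minimum absolute change in the pooled estimate
  iSquaredJump: number; // minimum increase in I², in percentage points
  nullValue: number;    // 1 for ratio measures, 0 for differences
}

type AlertRule = "effect_size_delta" | "ci_null_flip" | "i2_jump";

function ciExcludesNull(r: CycleResult, nullValue: number): boolean {
  return r.ciLower > nullValue || r.ciUpper < nullValue;
}

// Returns the rules that fire when moving from the previous cycle to the current one.
function evaluateAlerts(prev: CycleResult, curr: CycleResult, t: AlertThresholds): AlertRule[] {
  const fired: AlertRule[] = [];
  if (Math.abs(curr.estimate - prev.estimate) >= t.effectDelta) fired.push("effect_size_delta");
  if (ciExcludesNull(prev, t.nullValue) !== ciExcludesNull(curr, t.nullValue)) {
    fired.push("ci_null_flip");
  }
  if (curr.iSquared - prev.iSquared >= t.iSquaredJump) fired.push("i2_jump");
  return fired;
}
```

Note that there is no rule for a GRADE certainty downgrade, consistent with the limitation described in the first bullet of this section.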