Validation, assumptions, and current limitations
How Axelium validates extracted data, what the platform currently supports, and how to handle edge cases.
Validation philosophy
Axelium combines deterministic validation rules with human review. The goal is to catch obvious inconsistencies early and make it easy for reviewers to spot, correct, and document issues before they reach the analysis.
What Axelium validates
- Events ≤ totals for dichotomous outcomes.
- Non-negative variances and standard deviations.
- Confidence interval bounds and ordering checks.
- Numeric field types and required field presence.
- Selected content-level checks such as unit consistency where possible.
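The deterministic checks above can be sketched as a small rule function. This is an illustrative sketch only, assuming a simplified record shape; the field names and `validate` function are not Axelium's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record shape -- field names are illustrative, not Axelium's schema.
@dataclass
class DichotomousOutcome:
    events: int
    total: int
    ci_lower: float
    ci_upper: float
    sd: Optional[float] = None

def validate(o: DichotomousOutcome) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if o.events < 0 or o.total < 0:
        errors.append("events and totals must be non-negative")
    if o.events > o.total:
        errors.append("events exceed total (events <= totals rule)")
    if o.sd is not None and o.sd < 0:
        errors.append("standard deviation must be non-negative")
    if o.ci_lower > o.ci_upper:
        errors.append("confidence interval bounds out of order")
    return errors

# Example: 12 events out of 10 participants, with inverted CI bounds,
# fails two rules at once.
bad = DichotomousOutcome(events=12, total=10, ci_lower=1.4, ci_upper=0.9)
print(validate(bad))
```

Because the rules are deterministic, a record either passes or produces an explicit, reviewable list of violations, which is what lets obvious inconsistencies be caught before human review.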

What Axelium does not support yet
- Individual participant data (IPD) meta-analysis.
- Network meta-analysis across multiple interventions.
- Some non-English PDFs, and heavily unstructured scans whose OCR output is unreadable.
- Full automation of search execution within subscription databases that require manual access.
Some of these capabilities are on the roadmap, and others can be handled with manual or custom scripting workflows outside Axelium.
Screening limitations
- Unsure studies require resolution. Studies marked “unsure” during screening are not counted as included or excluded in PRISMA reporting. A large unsure bucket means the meta‑analysis is incomplete rather than just underpowered. Use confidence bands, custom instructions, PICO refinement, and escalation to drive the unsure rate below 5%.
- Abstract‑only screening. The AI screener evaluates titles and abstracts only. Studies whose abstracts lack key details (biomarker status, exact comparator, age range) may be marked unsure even when the full text would confirm eligibility. Custom screening instructions can help the AI handle common missing‑detail patterns.
- Escalation decisions are auditable but automated. When escalation forces a decision on an unsure study, the tie‑breaker bias (default: include) means some borderline studies may be included that a human reviewer would have excluded. All escalated decisions are tagged as pass 2, so they can be reviewed and overridden.
- Protocol papers can match PICO criteria. Protocol and methodology papers describe the same population, intervention, and comparator as their results papers. The screener may incorrectly include them. Title‑based heuristics and study‑type flags help catch these, but manual review of included studies is still recommended.
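The escalation behavior described above can be sketched as a simple tie-breaker. The function name, record fields, and return shape are assumptions for illustration; only the default-include bias and the pass-2 tagging come from the description above.

```python
# Illustrative sketch of the pass-2 escalation tie-breaker.
# The escalate() function and its record fields are hypothetical.
def escalate(decision: str, tie_breaker_bias: str = "include") -> dict:
    """Force a final decision on a study left 'unsure' after screening.

    Escalated decisions are tagged as pass 2 so reviewers can find
    and override them later.
    """
    if decision != "unsure":
        return {"decision": decision, "pass": 1, "escalated": False}
    return {"decision": tie_breaker_bias, "pass": 2, "escalated": True}

print(escalate("unsure"))   # biased toward inclusion by default
print(escalate("exclude"))  # firm decisions pass through untouched
```

The tagging is what makes the automation auditable: every forced decision remains distinguishable from a first-pass human or AI decision.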
Extraction quality safeguards
The extraction pipeline includes several layers of quality assurance beyond basic field validation:
- Confidence scoring — every extracted value carries a confidence score. Values below configurable thresholds are routed to the human Review Queue rather than auto‑accepted.
- Multi‑dimensional scoring — the quality check evaluates evidence adequacy, arm alignment, schema completeness, and provenance quality independently. A weakness in any dimension triggers human review.
- Conflict detection — when registry data and PDF values disagree, or when successive extraction runs produce different results, the conflict is surfaced in a side‑by‑side view for explicit resolution.
- Arm swap detection — the math engine identifies when treatment and control arms may have been swapped and flags the extraction for review.
- Cross‑outcome HR deduplication — when the same hazard ratio is reported for multiple outcomes, the system detects the duplicate and prevents double‑counting.
- HR‑in‑OR detection — catches cases where a hazard ratio value has leaked into an odds ratio or risk ratio field (e.g., an HR of 0.59 appearing in the pCR field) and clears the invalid value automatically.
- Events derivation — when studies only report event rates as percentages, the math engine derives absolute event counts so they can be pooled with studies that report raw counts.
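The events-derivation step is the most mechanical of these safeguards and can be sketched directly. Rounding to the nearest integer is an assumption here; the real engine may carry extra precision or flag ambiguous roundings for review, and the function name is illustrative.

```python
def derive_event_counts(event_rate_pct: float, total: int) -> int:
    """Convert a reported event rate (%) into an absolute event count.

    Nearest-integer rounding is an assumption for this sketch; the
    actual math engine's rounding policy is not documented here.
    """
    return round(event_rate_pct / 100 * total)

# A study reporting "34.5% of 142 participants" pools as 49 events / 142.
print(derive_event_counts(34.5, 142))
```

Once converted, the derived counts can be pooled alongside studies that report raw counts, subject to the same events ≤ totals validation as directly extracted values.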
Automated quality control
After extraction completes, an automated QC pass inspects every outcome for five categories of issues before the result is surfaced for human review:
- Missing effect sizes — required numeric fields (effect estimate, CI, events/totals) are absent.
- Impossible values — events exceed totals, negative standard deviations, or confidence intervals in the wrong order.
- Implausible magnitudes — effect sizes that fall outside clinically plausible ranges for the outcome type (e.g., HR below 0.20 for EFS or below 0.25 for OS).
- Subgroup‑vs‑overall confusion — values that appear to come from a subgroup breakdown (ctDNA, PD‑L1, squamous‑specific) rather than the overall population.
- Cross‑measure leakage — hazard ratios appearing in outcomes that expect odds ratios or risk ratios, caught by the HR‑in‑OR detector.
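The implausible-magnitude check can be sketched as a per-outcome floor lookup. The floors below (HR < 0.20 for EFS, HR < 0.25 for OS) are the examples given above; the dict and function names are illustrative, not the platform's internals.

```python
# Clinically plausible HR floors per outcome type, using the
# example thresholds from the QC description above.
HR_FLOOR = {"EFS": 0.20, "OS": 0.25}

def implausible_hr(outcome_type: str, hr: float) -> bool:
    """Flag hazard ratios below the plausible floor for the outcome type.

    Outcome types without a configured floor are never flagged.
    """
    floor = HR_FLOOR.get(outcome_type)
    return floor is not None and hr < floor

print(implausible_hr("EFS", 0.15))  # flagged: triggers a QC repair run
print(implausible_hr("OS", 0.59))   # not flagged: within plausible range
```

A value that trips this gate is not discarded; per the flow described below, it first triggers a targeted re-extraction and only reaches the human Review Queue if the repair attempt also fails QC.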
When a trigger fires, the system re‑runs extraction with targeted repair hints describing the specific issue. Only if the re‑extraction still fails QC is the outcome routed to the human Review Queue.
Review and adjudication
Outcomes that reach the Review Queue receive per‑outcome Accept, Reject, or Re‑extract decisions — not batch approval. Reviewers see full context:
- QC history — which triggers fired, how many retry attempts were made, and how scores changed between runs.
- Agent‑assisted adjudication — when two extraction candidates produce conflicting values, an AI agent evaluates both options and recommends the most likely correct value with a confidence score and rationale. Reviewers can accept, override, or escalate.
- Rerun instructions — reviewers can provide free‑text guidance (e.g., “use Table 2, not Table S3”) that is passed to the extraction engine for a targeted re‑extraction before the outcome is persisted.
Expected error modes
Complex tables, unusual study designs, and low-quality PDFs can still cause extraction errors. AI suggestions should always be reviewed before final analysis. The confidence badges and Review Queue help prioritize which extractions need attention.
Corrections are designed to be straightforward and auditable — updates to extracted values are tracked so changes are transparent over time.