Statistical analysis with the Stats Agent
Axelium's Stats Agent is a conversational tool orchestrator that runs pooled meta-analyses, sensitivity checks, and publication bias assessments using R/metafor — then helps you interpret the results and pin evidence for your report.
Overview
The Stats Agent lives in the Analysis tab of each project. Type a natural-language request — for example "Run a random-effects meta-analysis for pCR" — and the agent selects the right tool, executes R code via metafor, and returns a forest plot with heterogeneity metrics. Every number it reports comes directly from a tool call; it never invents statistics.

Analysis types
The agent provides quick-prompt buttons based on how many studies (k) are available for a given outcome and timepoint.
| Analysis | Minimum k | What it does |
|---|---|---|
| Pooled meta-analysis | 2 | Fixed-effect or random-effects pooling with forest plot, I², τ², and Q statistic. |
| Sensitivity analysis | 3 | Leave-one-out, alternative τ² estimators, and Knapp-Hartung adjustment. |
| Publication bias | 3 | Egger's regression, Begg's rank test, trim-and-fill, and funnel plot. |
| Subgroup analysis | 4 | Split pooling by a categorical variable with Q-between test for subgroup differences. |
| Meta-regression | 4 | Linear regression of effect size on a continuous covariate. |
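The minimum-k gating in the table can be sketched as a small lookup. This is an illustration only: `MIN_K` and `availableAnalyses` are hypothetical names, not part of the Stats Agent's API, and the real quick-prompt logic may apply further checks.

```typescript
// Minimum study counts (k), copied from the table above.
const MIN_K = {
  pooled: 2,
  sensitivity: 3,
  publicationBias: 3,
  subgroup: 4,
  metaRegression: 4,
} as const;

type AnalysisKind = keyof typeof MIN_K;

// Hypothetical helper: analyses whose minimum k is met for one outcome/timepoint.
function availableAnalyses(k: number): AnalysisKind[] {
  const kinds = Object.keys(MIN_K) as AnalysisKind[];
  return kinds.filter((kind) => k >= MIN_K[kind]);
}
```

With three studies, for example, this sketch would offer pooling, sensitivity analysis, and publication bias, but not subgroup analysis or meta-regression.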
Pooled meta-analysis
The core analysis pools study-level effect sizes into a single summary estimate. You can choose between fixed-effect (assumes a common true effect) and random-effects (allows effects to vary across studies) models. The default τ² estimator is REML, but the agent can switch to DL, HS, SJ, PM, HE, ML, or EB on request.
Every pooled analysis returns:
- Pooled estimate and 95% CI — the summary effect on the scale of your chosen measure (RR, OR, HR, MD, SMD, or proportion).
- Forest plot — SVG visualisation of each study's weight and the diamond summary.
- Heterogeneity metrics — I², τ², Q statistic, and their interpretation.
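To make the pooled estimate concrete, here is a minimal sketch of fixed-effect inverse-variance pooling. It is not the agent's implementation (the agent delegates to metafor in R); the type and function names are hypothetical, and effects are assumed to be on the analysis scale (log scale for RR, OR, HR).

```typescript
// One study's effect estimate (log scale for RR, OR, HR) and its sampling variance.
interface StudyEffect {
  label: string;
  yi: number;
  vi: number;
}

interface PooledResult {
  estimate: number;
  se: number;
  ciLower: number;
  ciUpper: number;
}

const Z_95 = 1.959964; // standard normal quantile for a two-sided 95% interval

// Fixed-effect pooling: weight each study by the inverse of its variance.
function poolFixedEffect(studies: StudyEffect[]): PooledResult {
  if (studies.length < 2) {
    throw new Error("Pooling needs at least two studies");
  }
  const sumW = studies.reduce((acc, s) => acc + 1 / s.vi, 0);
  const estimate = studies.reduce((acc, s) => acc + s.yi / s.vi, 0) / sumW;
  const se = Math.sqrt(1 / sumW);
  return { estimate, se, ciLower: estimate - Z_95 * se, ciUpper: estimate + Z_95 * se };
}
```

For ratio measures the estimate and interval bounds would be exponentiated before display. A random-effects model uses the same formula with weights 1 / (vi + τ²).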

Living-review pooling. The dynamic-significance stage that runs after each scheduled living-review cycle pools the validated outcomes through the same hosted-R metafor engine that interactive stats-v6 uses (REML for random-effects pools). There is one engine and no fallback pooler: if that service is unreachable, or a cycle cannot load its locked analysis recipes, the cycle fails or skips rather than re-pooling by a different method, so every snapshot is comparable. The threshold-tripping decision is fully deterministic; only the narrative copy in the “Evidence change detected” alert is LLM-written, and the judge agent CANNOT alter the alert decision. An outcome is flagged when an alarm rule fires — a large effect-delta, an I² jump, or the confidence interval crossing the null — and the evidence base also changed (trials added or removed). Independently of trial count, a CI flip across the null, or an effect move of at least twice the configured threshold, flags the outcome on its own. For ratio measures (RR, OR, HR) the effect-delta is measured on the log scale, so a halving and a doubling register as equal-magnitude moves.
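One reading of the flagging rule above, written as a deterministic function. This is a sketch under stated assumptions: the `Thresholds` fields are hypothetical configuration names, an "I² jump" is taken to mean an increase in percentage points, and a "CI flip" is taken to mean that the interval changed between including and excluding the null value. The production rule may differ in these details.

```typescript
type Measure = "RR" | "OR" | "HR" | "MD" | "SMD";

interface PooledSnapshot {
  estimate: number; // natural scale, e.g. RR = 0.8
  ciLower: number;
  ciUpper: number;
  i2: number; // percent
}

// Hypothetical configuration: effectDelta is in log units for ratio measures.
interface Thresholds {
  effectDelta: number;
  i2Jump: number;
}

function isRatio(m: Measure): boolean {
  return m === "RR" || m === "OR" || m === "HR";
}

// Ratio measures are compared on the log scale, so halving and doubling are equal moves.
function effectDelta(m: Measure, prev: number, curr: number): number {
  return isRatio(m) ? Math.abs(Math.log(curr) - Math.log(prev)) : Math.abs(curr - prev);
}

function ciIncludesNull(m: Measure, s: PooledSnapshot): boolean {
  const nullValue = isRatio(m) ? 1 : 0;
  return s.ciLower <= nullValue && nullValue <= s.ciUpper;
}

function shouldFlag(
  m: Measure,
  prev: PooledSnapshot,
  curr: PooledSnapshot,
  evidenceChanged: boolean,
  t: Thresholds,
): boolean {
  const delta = effectDelta(m, prev.estimate, curr.estimate);
  const ciFlip = ciIncludesNull(m, prev) !== ciIncludesNull(m, curr);
  // These two conditions flag the outcome on their own, regardless of trial count.
  if (ciFlip || delta >= 2 * t.effectDelta) return true;
  const i2Jumped = curr.i2 - prev.i2 >= t.i2Jump;
  return evidenceChanged && (delta >= t.effectDelta || i2Jumped);
}
```

Because the function takes only numbers and a boolean, the same pair of snapshots always yields the same decision, which is the property the paragraph above relies on.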
Per-cycle assessment record. Once the threshold decision is made, an lr-persist-assessment step writes the cycle's verdict to a durable per-cycle record — the pooled outcomes it assessed, which of them tripped an alert rule, and the judge's narrative rationale. This is saved on every cycle, including the ones that find no material change, so “the cycle ran and nothing moved” is a recorded fact rather than an absence. These per-cycle records, together with the agentic extraction traces tagged to each cycle, roll up into the analysis Provenance view — a cycle timeline plus the cross-cycle trajectory of every monitored outcome — which can be frozen into an immutable snapshot for audit or health-technology-assessment submission.
Understanding heterogeneity
Heterogeneity measures how much variability across studies exceeds what you would expect from sampling error alone.
- I² — percentage of total variability due to between-study differences. Low (<25%), moderate (25–75%), or high (>75%).
- τ² — absolute between-study variance on the log scale (for ratio measures). Useful for comparing heterogeneity across analyses.
- Cochran's Q — chi-squared test for heterogeneity. A low p-value (p < 0.05) suggests real differences between studies.
The agent always reports I² and τ² and flags the heterogeneity level. If heterogeneity is high, consider running subgroup analysis or meta-regression to investigate sources of variation.
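The three metrics are linked by simple arithmetic. The sketch below computes Cochran's Q, the DerSimonian-Laird (DL) τ², and a Q-based I², then applies the bands listed above. It is illustrative only: the agent's default estimator is REML (iterative, via metafor), so its τ² and I² can differ from these closed-form DL values, and the names here are hypothetical.

```typescript
interface StudyEffect {
  yi: number; // effect estimate (log scale for ratio measures)
  vi: number; // sampling variance
}

interface Heterogeneity {
  q: number;
  df: number;
  tau2: number;
  i2: number; // percent
  band: "low" | "moderate" | "high";
}

function heterogeneityDL(studies: StudyEffect[]): Heterogeneity {
  if (studies.length < 2) {
    throw new Error("Heterogeneity needs at least two studies");
  }
  const sumW = studies.reduce((acc, s) => acc + 1 / s.vi, 0);
  const sumW2 = studies.reduce((acc, s) => acc + 1 / (s.vi * s.vi), 0);
  const mu = studies.reduce((acc, s) => acc + s.yi / s.vi, 0) / sumW;
  // Q: weighted squared deviations from the fixed-effect estimate.
  const q = studies.reduce((acc, s) => acc + ((s.yi - mu) * (s.yi - mu)) / s.vi, 0);
  const df = studies.length - 1;
  const c = sumW - sumW2 / sumW;
  const tau2 = Math.max(0, (q - df) / c);
  const i2 = q > 0 ? Math.max(0, (q - df) / q) * 100 : 0;
  const band = i2 < 25 ? "low" : i2 <= 75 ? "moderate" : "high";
  return { q, df, tau2, i2, band };
}
```

Both τ² and I² are truncated at zero: when Q is no larger than its degrees of freedom, the observed spread is no more than sampling error would produce.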
Sensitivity analysis
Sensitivity analysis tests whether your results are robust to analytical decisions. The agent supports three approaches, run individually or together:
Leave-one-out
Re-runs the pooled analysis k times, each time dropping one study. If removing a single study shifts the estimate substantially or changes statistical significance, that study has outsized influence and warrants closer inspection.
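A minimal sketch of the leave-one-out loop, using fixed-effect pooling and a two-sided 5% z-test for brevity. The names are hypothetical, and the agent itself re-runs the configured model in metafor rather than this simplified pooler.

```typescript
interface StudyEffect {
  label: string;
  yi: number;
  vi: number;
}

interface Pooled {
  estimate: number;
  se: number;
  significant: boolean;
}

function poolFixed(studies: StudyEffect[]): Pooled {
  const sumW = studies.reduce((acc, s) => acc + 1 / s.vi, 0);
  const estimate = studies.reduce((acc, s) => acc + s.yi / s.vi, 0) / sumW;
  const se = Math.sqrt(1 / sumW);
  return { estimate, se, significant: Math.abs(estimate / se) > 1.959964 };
}

interface LeaveOneOutRow {
  omitted: string;
  estimate: number;
  changesSignificance: boolean;
}

// Re-pool k times, dropping one study each time, and compare with the full pool.
function leaveOneOut(studies: StudyEffect[]): LeaveOneOutRow[] {
  if (studies.length < 3) {
    throw new Error("Leave-one-out needs at least three studies");
  }
  const full = poolFixed(studies);
  return studies.map((omitted, i) => {
    const pooled = poolFixed(studies.filter((_, j) => j !== i));
    return {
      omitted: omitted.label,
      estimate: pooled.estimate,
      changesSignificance: pooled.significant !== full.significant,
    };
  });
}
```

Rows with `changesSignificance` set are the ones that deserve the closer inspection described above.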
Estimator comparison
Re-runs the analysis with each of 8 τ² estimators (REML, DL, HS, SJ, PM, HE, ML, EB). If the pooled estimate is stable across estimators, the result is robust to the choice of heterogeneity method.
Knapp-Hartung adjustment
Applies a t-distribution instead of the normal distribution for confidence intervals — a more conservative approach when the number of studies is small. The agent returns the adjusted CI, t-value, degrees of freedom, and p-value.
Publication bias assessment
Publication bias occurs when studies with significant results are more likely to be published, skewing the pooled estimate. The agent runs four complementary tests:
- Egger's regression: weighted least-squares regression of effect sizes on their standard errors. A significant coefficient on the standard error (equivalent to the intercept in Egger's original formulation; p < 0.05) suggests funnel plot asymmetry.
- Begg's rank correlation (k ≥ 4) — Kendall's τ rank correlation between effects and standard errors.
- Trim-and-fill — imputes "missing" studies and recalculates the pooled estimate, showing how much publication bias might shift your result.
- Funnel plot — scatter of each study's effect vs. precision. Asymmetry suggests bias; the plot includes 95% and 99% confidence envelopes.
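WARN · Power caveat: with few studies these tests have low power. Egger's test in particular is underpowered below k = 10, so a non-significant result does not rule out publication bias.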
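Egger's regression reduces to a weighted least-squares fit, sketched here with weights 1/SE². In this weighted form the coefficient on the standard error plays the role of the intercept in Egger's original formulation (standardised effect regressed on precision). The function name is hypothetical, and the p-value is left out because it needs a t-distribution routine that TypeScript does not provide out of the box.

```typescript
interface StudyEffect {
  yi: number;  // effect estimate
  sei: number; // standard error of the estimate
}

interface EggerFit {
  intercept: number;
  asymmetryCoefficient: number; // slope on the standard error
  coefficientSe: number;
  tStatistic: number;
  df: number;
}

function eggerRegression(studies: StudyEffect[]): EggerFit {
  const k = studies.length;
  if (k < 3) {
    throw new Error("Egger's regression needs at least three studies");
  }
  const w = (s: StudyEffect): number => 1 / (s.sei * s.sei);
  const sumW = studies.reduce((acc, s) => acc + w(s), 0);
  const xBar = studies.reduce((acc, s) => acc + w(s) * s.sei, 0) / sumW;
  const yBar = studies.reduce((acc, s) => acc + w(s) * s.yi, 0) / sumW;
  const sxx = studies.reduce((acc, s) => acc + w(s) * (s.sei - xBar) * (s.sei - xBar), 0);
  const sxy = studies.reduce((acc, s) => acc + w(s) * (s.sei - xBar) * (s.yi - yBar), 0);
  const slope = sxy / sxx; // NaN if every study has the same standard error
  const intercept = yBar - slope * xBar;
  // Weighted residual sum of squares gives the dispersion estimate.
  const rss = studies.reduce((acc, s) => {
    const r = s.yi - intercept - slope * s.sei;
    return acc + w(s) * r * r;
  }, 0);
  const coefficientSe = Math.sqrt(rss / (k - 2) / sxx);
  return {
    intercept,
    asymmetryCoefficient: slope,
    coefficientSe,
    tStatistic: slope / coefficientSe,
    df: k - 2,
  };
}
```

The t statistic would be compared against a t distribution with k − 2 degrees of freedom. With only a handful of studies that comparison has little power, which is the reason for the caveat above.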
Subgroup analysis and meta-regression
Subgroup analysis
Split the pooled analysis by a categorical variable (e.g., region, risk of bias, treatment line). The agent uses study tags — either pre-extracted during data collection or classified on the fly via LLM-based tagging. The result includes a per-subgroup pooled estimate and a Q-between test for subgroup differences.
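The Q-between statistic can be sketched as follows for the fixed-effect case: pool each subgroup separately, then measure how far the subgroup estimates sit from the overall estimate. Names are hypothetical, and the agent's metafor-based computation (including random-effects variants) may differ.

```typescript
interface TaggedStudy {
  subgroup: string; // categorical tag, e.g. region
  yi: number;
  vi: number;
}

interface SubgroupSum {
  name: string;
  k: number;
  sumW: number;
  sumWY: number;
}

function subgroupQBetween(studies: TaggedStudy[]) {
  const groups: SubgroupSum[] = [];
  for (const s of studies) {
    let g = groups.find((x) => x.name === s.subgroup);
    if (g === undefined) {
      g = { name: s.subgroup, k: 0, sumW: 0, sumWY: 0 };
      groups.push(g);
    }
    g.k += 1;
    g.sumW += 1 / s.vi;
    g.sumWY += s.yi / s.vi;
  }
  const totalW = groups.reduce((acc, g) => acc + g.sumW, 0);
  const overall = groups.reduce((acc, g) => acc + g.sumWY, 0) / totalW;
  const subgroups = groups.map((g) => ({ name: g.name, k: g.k, estimate: g.sumWY / g.sumW }));
  // Q-between: weighted squared distance of each subgroup estimate from the overall estimate.
  const qBetween = groups.reduce((acc, g) => {
    const d = g.sumWY / g.sumW - overall;
    return acc + g.sumW * d * d;
  }, 0);
  return { subgroups, qBetween, df: groups.length - 1 };
}
```

Q-between is referred to a chi-squared distribution with (number of subgroups − 1) degrees of freedom to obtain the test for subgroup differences.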
Meta-regression
Fit a linear regression of effect size on a continuous covariate (e.g., median age, baseline risk). The agent returns the slope (β), standard error, and p-value. A significant β suggests the covariate partially explains heterogeneity.
Pinning results to the Evidence Board
Every forest plot and analysis result can be pinned to the Evidence Board with a single click. Pinned evidence appears in the Reports & Evidence tab and can be included directly when generating your final report. The agent creates two artifacts per analysis: a plot artifact (the SVG forest plot) and a model artifact (numeric results as JSON). Both can be pinned independently.
Data exploration tools
Before running an analysis, the agent can inspect the dataset to help you decide what to analyse. These tools are available in the conversation:
- List dimensions — discovers all outcome/timepoint combinations and how many studies (k) are available for each.
- Preview missingness — checks for missing data fields and flags studies that would be excluded from a given analysis.
- Query data — browse, filter, and sort the extracted data table with arbitrary queries.
- Read source documents — look up parsed table assets, document sections, or registry data for specific studies.
- Classify studies — create ad-hoc categorical tags (e.g., region, line of therapy) via LLM for use in subgroup analysis.
- SQL query — run read-only SQL against the whitelisted outcome views when you need joins, aggregations, or shape-of-the-data checks the higher-level explorer tools don't cover. A companion schema tool lists the available columns and types.
- Merge timepoints & patch values — relabel rows from source timepoints into a target timepoint so they pool together, or correct individual numeric fields (e.g. convert SE to SD, fill in a missing N) without re-running extraction.
The agent surfaces these alongside its Analysis, Sensitivity, and Publication-bias tools — five tool groups in total: Analysis, Dataset, Explorer, SQL, and Tagging.
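The SE-to-SD correction mentioned under "Merge timepoints & patch values" follows the standard relation SD = SE × √N for the standard error of a mean. A minimal sketch, with a hypothetical function name:

```typescript
// Convert the standard error of a mean to a standard deviation: SD = SE * sqrt(N).
function seToSd(se: number, n: number): number {
  if (!(se >= 0) || !(n > 0)) {
    throw new Error("SE must be non-negative and N must be positive");
  }
  return se * Math.sqrt(n);
}
```

For example, an arm reporting SE = 0.5 with N = 100 has SD = 5. The relation applies to a single-arm mean; a standard error reported for a between-group difference needs a different conversion.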
What reaches the Stats Agent
The quality of any pooled estimate depends on the rows that land in the dataset. Recent changes to the upstream extraction pipeline broaden what the agent has to work with and make extraction failures easier to diagnose:
- Clinical-scale synonyms — outcomes with generic names (e.g. "anxiety", "disability") now also match snippets that mention only the clinical scale (HADS-A, GAD-7, mRS, and similar), so scale-only reporting no longer goes missing.
- Results-section fallback — when the snippet spotter finds nothing but the document has a Results section, the full section is handed to the per-outcome extractor instead of giving up.
- PMC tables as markdown — JATS XML tables from PMC are now rendered as GFM markdown before being shown to the extractor, which lifts numeric recall on table-heavy papers.
- Retry & failure provenance — an extractor pass that returns all-null is automatically retried once, and every failure persists the input snippet it saw, so you can audit why a study didn't produce a row.
- Looser numeric schemas — median/IQR reporting, count outcomes without an explicit denominator, and mRS shift-distribution rows are now accepted by the extractor schemas, so fewer studies are silently dropped before pooling.
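The retry-and-provenance behaviour can be pictured with a small wrapper. This is a sketch, not the pipeline's code: the types and function names are hypothetical, and the extractor is modelled as a synchronous callback for brevity where a real one would be asynchronous.

```typescript
type ExtractedRow = Record<string, number | null>;

interface ExtractionResult {
  row: ExtractedRow | null;
  attempts: number;
  failedSnippet?: string; // kept only on failure, for later audit
}

function isAllNull(row: ExtractedRow): boolean {
  return Object.keys(row).every((key) => row[key] === null);
}

// Run the extractor, retry once if every field came back null,
// and keep the input snippet when both passes fail.
function extractWithRetry(
  snippet: string,
  extract: (snippet: string) => ExtractedRow,
): ExtractionResult {
  const maxAttempts = 2; // first pass plus one automatic retry
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const row = extract(snippet);
    if (!isAllNull(row)) {
      return { row, attempts: attempt };
    }
  }
  return { row: null, attempts: maxAttempts, failedSnippet: snippet };
}
```

Persisting `failedSnippet` is what makes a missing row auditable: a reviewer can see exactly what text the extractor was given.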
Tips for working with the Stats Agent
- Start with the quick-prompt buttons — they encode the right tool and minimum-k checks automatically.
- If you have multiple timepoints for an outcome, specify which one (e.g., "pCR at 12 months"). The agent will ask if ambiguous.
- For subgroup analysis, the agent can classify studies on the fly. Just ask: "Run subgroup analysis by region."
- Pin your key forest plots and sensitivity results as you go — they are easier to find later on the Evidence Board than scrolling through chat history.
- The agent never outputs raw R code. If you need to reproduce the analysis outside Axelium, export the extracted dataset and run metafor manually.
Risk of Bias 2.0 — methodology
Quality assessment runs as a dedicated stage between extraction and finalize. We implement the Cochrane RoB 2.0 tool (2016 guidance for individually randomised trials) with a strict separation of concerns between the LLM, the deterministic algorithm, and the human reviewer.
The three-layer split
- The LLM reads the PDF. A dedicated “RoB Evaluator” Mastra agent receives the parsed full text + the already-extracted outcomes for one (study × effect_of_interest) tuple. It answers Cochrane's 14-16 signaling questions per domain (depending on Domain 2 variant + cluster-RCT flag) using the 5-level scale (Y / PY / PN / N / NI) plus NA for conditional questions whose gating answer makes them irrelevant. Each non-NI answer must cite a quoted PDF evidence span.
- The algorithm derives the judgement. A pure-TypeScript module encodes Tables 4 / 6 / 8 / 10 / 12 / 14 from the published Cochrane guidance verbatim. The LLM's signaling answers are fed into the deterministic lookup table; the algorithm output is the domain judgement. The LLM cannot drift from methodology because it is never permitted to invent a domain judgement. If the LLM's claimed judgement disagrees with the algorithm output and the LLM did not flag judgement_source='reviewer_override' with a non-empty rationale, the algorithm wins and the LLM's justification is prefixed with [Algorithm corrected LLM's claimed X → Y] so human reviewers see the disagreement explicitly.
- The human owns the validation. Reviewer B (the primary human reviewer; the LLM is implicit Reviewer A on every rob_assessments row) submits a blinded review via the /analysis/[id]/rob/review queue. Agreement flips the assessment to status='validated'; disagreement creates a conflict that routes through the existing unified conflict gate. The Cochrane Table 1 escalation of multiple “some concerns” to “high” is captured via an overall_override_per_outcome entry on the agent response.
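The "algorithm wins" rule in the second layer can be sketched as a reconciliation step. The field names and judgement spellings below are assumptions for illustration; only judgement_source='reviewer_override' and the bracketed prefix format come from the description above.

```typescript
type Judgement = "low" | "some_concerns" | "high";

interface LlmDomainAnswer {
  claimedJudgement: Judgement;
  justification: string;
  judgementSource: string; // 'reviewer_override' is the only value named in the text
  overrideRationale?: string;
}

interface DomainResult {
  judgement: Judgement;
  justification: string;
  corrected: boolean;
}

function reconcileDomain(algorithmJudgement: Judgement, llm: LlmDomainAnswer): DomainResult {
  if (llm.claimedJudgement === algorithmJudgement) {
    return { judgement: algorithmJudgement, justification: llm.justification, corrected: false };
  }
  const rationale = (llm.overrideRationale ?? "").trim();
  if (llm.judgementSource === "reviewer_override" && rationale.length > 0) {
    // A flagged override with a rationale is allowed to stand.
    return { judgement: llm.claimedJudgement, justification: llm.justification, corrected: false };
  }
  // Otherwise the deterministic algorithm wins and the disagreement is made visible.
  const prefix = `[Algorithm corrected LLM's claimed ${llm.claimedJudgement} → ${algorithmJudgement}]`;
  return {
    judgement: algorithmJudgement,
    justification: `${prefix} ${llm.justification}`,
    corrected: true,
  };
}
```

Keeping the correction in the justification text, rather than silently replacing the judgement, means the blinded human reviewer sees both the algorithm's output and the LLM's original claim.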
Overall judgement
The default overall judgement is the worst-of-five-domains rule: any “high” → high; all “low” → low; otherwise “some concerns”. The Cochrane Table 1 escalation case (multiple “some concerns” combining into “high”) is a reviewer-discretion call — captured via an explicit overall_override_reason field on the rob_assessments row so the audit trail records the deviation from the worst-domain default.
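The worst-of-five default, with room for a reasoned reviewer override, might look like the sketch below. The judgement spellings and the function shape are assumptions; `overrideReason` mirrors the overall_override_reason field named above.

```typescript
type Judgement = "low" | "some_concerns" | "high";

interface OverallResult {
  judgement: Judgement;
  overrideReason: string | null;
}

function overallJudgement(
  domains: Judgement[],
  override?: { judgement: Judgement; reason: string },
): OverallResult {
  if (domains.length !== 5) {
    throw new Error("RoB 2.0 expects exactly five domain judgements");
  }
  if (override !== undefined && override.reason.trim().length > 0) {
    // Reviewer-discretion escalation, recorded with its reason for the audit trail.
    return { judgement: override.judgement, overrideReason: override.reason };
  }
  if (domains.some((d) => d === "high")) {
    return { judgement: "high", overrideReason: null };
  }
  if (domains.every((d) => d === "low")) {
    return { judgement: "low", overrideReason: null };
  }
  return { judgement: "some_concerns", overrideReason: null };
}
```

An override without a reason is ignored in this sketch, so the deviation from the default can never be recorded without an explanation.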
Domain 2 variant selection
The reviewer picks the target estimand at analysis-configuration time via configJson.rob.effect_of_interest:
- assignment (default): intent-to-treat-like. Domain 2 asks about deviations from intended interventions and their potential impact on the estimated effect.
- adherence: per-protocol-like. Domain 2 asks about co-intervention balance, implementation success, adherence, and whether an appropriate analysis (e.g. IPW, instrumental variables) was used.
GRADE Summary of Findings — methodology
When a stats-v6 run completes, a save-hook auto-derives a Cochrane GRADE assessment for each (outcome × effect × timepoint) tuple the run produced. The derivation is pure deterministic TypeScript — no LLM in the certainty path — so the same numeric inputs always produce the same final certainty.
Starting level + five downgrade factors
Bodies of evidence start at “high” (RCT-derived; the MVP only ingests randomised designs). Each of the five Cochrane GRADE downgrade factors then computes a severity from the run's numeric outputs:
- Risk of bias: weighted share of the pooled estimate that comes from High-RoB studies, conditioned on the validated subset only. ≥40% → serious; ≥70% → very_serious. Reviewer-unreviewed LLM judgements never move the threshold — the safety branch pins not_serious when no contributing study has a validated RoB row.
- Inconsistency: I² ≥ 50% → serious; ≥ 75% → very_serious. k ≤ 2 forces not_serious because the heterogeneity statistic is uninformative at small k.
- Indirectness: auto-derivation cannot assess PICO directness from numeric output alone, so the engine always emits not_serious. Reviewers escalate via the drawer.
- Imprecision: takes max() of two streams. Stream A (events): <100 → very_serious, <300 → serious. Stream B (CI): with MID configured, crosses BOTH MID and no-effect → very_serious; crosses MID only → serious; without MID, crosses no-effect → serious. The max() ensures a sparse event count cannot mask a CI signal and vice versa.
- Publication bias: k < 10 → not_serious (Egger's test is underpowered below the Cochrane k=10 threshold). k ≥ 10 AND Egger p < 0.10 → serious.
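The thresholds in this list translate directly into small pure functions. The sketch below covers the four factors that have numeric rules (indirectness always emits not_serious). Function names and input shapes are hypothetical, the MID is assumed to be a single value on the effect scale, and the case "MID configured, CI crosses no-effect only" is not specified above, so it is treated as not_serious here.

```typescript
type Severity = "not_serious" | "serious" | "very_serious";

const RANK: Record<Severity, number> = { not_serious: 0, serious: 1, very_serious: 2 };

function maxSeverity(a: Severity, b: Severity): Severity {
  return RANK[a] >= RANK[b] ? a : b;
}

// highRobShare: weight share (0 to 1) from High-RoB studies among validated rows,
// or null when no contributing study has a validated RoB row.
function riskOfBias(highRobShare: number | null): Severity {
  if (highRobShare === null) return "not_serious";
  if (highRobShare >= 0.7) return "very_serious";
  return highRobShare >= 0.4 ? "serious" : "not_serious";
}

function inconsistency(i2: number, k: number): Severity {
  if (k <= 2) return "not_serious";
  if (i2 >= 75) return "very_serious";
  return i2 >= 50 ? "serious" : "not_serious";
}

interface Interval {
  lower: number;
  upper: number;
}

function crosses(ci: Interval, value: number): boolean {
  return ci.lower < value && value < ci.upper;
}

function imprecision(events: number, ci: Interval, noEffect: number, mid?: number): Severity {
  const streamA: Severity =
    events < 100 ? "very_serious" : events < 300 ? "serious" : "not_serious";
  let streamB: Severity = "not_serious";
  if (mid === undefined) {
    if (crosses(ci, noEffect)) streamB = "serious";
  } else if (crosses(ci, mid)) {
    streamB = crosses(ci, noEffect) ? "very_serious" : "serious";
  }
  return maxSeverity(streamA, streamB);
}

function publicationBias(k: number, eggerP: number): Severity {
  return k >= 10 && eggerP < 0.1 ? "serious" : "not_serious";
}
```

Because every function is pure, the same run outputs always map to the same severities, which is what allows the derivation to be reproduced and audited.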
SQL ↔ TS arithmetic parity
The final certainty is computed app-side and validated DB-side by a Postgres CHECK constraint (chk_grade_arithmetic_consistent). The CHECK calls a SQL function (compute_grade_final_certainty) that mirrors the TypeScript computeFinalCertainty line-for-line; the TS twin is exhaustively unit-tested across every (starting × downgrade-total) cell, and the SQL CHECK enforces equality at write time, so any divergence between the two fails closed.
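The arithmetic itself is not spelled out here, so the sketch below assumes the conventional GRADE scheme: serious costs one level, very_serious costs two, and the result is floored at very_low. It is deliberately named `finalCertaintySketch` to avoid implying that it matches the real computeFinalCertainty.

```typescript
type Certainty = "high" | "moderate" | "low" | "very_low";
type Severity = "not_serious" | "serious" | "very_serious";

const LEVEL_RANK: Record<Certainty, number> = { very_low: 0, low: 1, moderate: 2, high: 3 };
const DOWNGRADE: Record<Severity, number> = { not_serious: 0, serious: 1, very_serious: 2 };

function certaintyFromRank(rank: number): Certainty {
  if (rank >= 3) return "high";
  if (rank === 2) return "moderate";
  if (rank === 1) return "low";
  return "very_low";
}

// Assumed rule: subtract the summed downgrades from the starting level, floor at very_low.
function finalCertaintySketch(starting: Certainty, factors: Severity[]): Certainty {
  const total = factors.reduce((acc, f) => acc + DOWNGRADE[f], 0);
  return certaintyFromRank(Math.max(0, LEVEL_RANK[starting] - total));
}
```

With four starting levels and a bounded downgrade total, the full input grid is small enough to enumerate, which is what makes the exhaustive cell-by-cell testing described above practical.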
Reviewer finalisation + override
The SoF table at /analysis/[id]/grade renders every auto-derived row. Clicking a row opens a drawer with per-factor severity radios + basis textarea + an expandable auto_signals panel showing the numbers the engine considered. The state machine is single-reviewer (no dual-review conflict gate): auto ⇄ under_review → finalised → auto (the under_review → auto edge is for abandoned edits; the finalised → auto edge is the un-finalise path). Optimistic-lock CAS on every mutation guards against lost updates.
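The state machine and its optimistic lock can be sketched as a transition table plus a version check. The `version` counter and the function name are assumptions made for illustration; in the real system the compare-and-swap would happen in the database write, not in memory.

```typescript
type GradeState = "auto" | "under_review" | "finalised";

// Allowed edges: auto <-> under_review, under_review -> finalised, finalised -> auto.
const ALLOWED: Record<GradeState, GradeState[]> = {
  auto: ["under_review"],
  under_review: ["auto", "finalised"],
  finalised: ["auto"],
};

interface GradeRow {
  state: GradeState;
  version: number; // hypothetical optimistic-lock counter
}

// Compare-and-swap style mutation: reject stale writers, then validate the edge.
function transition(row: GradeRow, to: GradeState, expectedVersion: number): GradeRow {
  if (row.version !== expectedVersion) {
    throw new Error("Stale version: the row was modified by someone else");
  }
  if (ALLOWED[row.state].indexOf(to) === -1) {
    throw new Error(`Illegal transition ${row.state} -> ${to}`);
  }
  return { state: to, version: row.version + 1 };
}
```

A reviewer holding an outdated copy of the row fails the version check instead of overwriting a newer edit, which is the lost-update protection described above.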
Snapshots at sign-off
When a sign-off is approved, the approval transaction also appends every finalised GRADE row (and every validated RoB row) into the grade_assessment_snapshots / rob_assessment_snapshots tables, atomic with the milestone advance. The frozen payload is the authoritative SoF for that analysis-version — subsequent re-finalisation or re-derivation cannot retroactively alter the submission.