Statistical analysis with the Stats Agent

Axelium's Stats Agent is a conversational tool orchestrator that runs pooled meta-analyses, sensitivity checks, and publication bias assessments using R/metafor — then helps you interpret the results and pin evidence for your report.

Overview

The Stats Agent lives in the Analysis tab of each project. Type a natural-language request — for example "Run a random-effects meta-analysis for pCR" — and the agent selects the right tool, executes R code via metafor, and returns a forest plot with heterogeneity metrics. Every number it reports comes directly from a tool call; it never invents statistics.

Stats Agent flow: question → tool selection → R execution → results → evidence pinning.
Fig 1. The Stats Agent interface showing a pooled analysis, forest plot, and heterogeneity metrics.

Analysis types

The agent provides quick-prompt buttons based on how many studies (k) are available for a given outcome and timepoint.

Analysis | Minimum k | What it does
Pooled meta-analysis | 2 | Fixed-effect or random-effects pooling with forest plot, I², τ², and Q statistic.
Sensitivity analysis | 3 | Leave-one-out, alternative τ² estimators, and Knapp-Hartung adjustment.
Publication bias | 3 | Egger's regression, Begg's rank test, trim-and-fill, and funnel plot.
Subgroup analysis | 4 | Split pooling by a categorical variable with Q-between test for subgroup differences.
Meta-regression | 4 | Linear regression of effect size on a continuous covariate.

Pooled meta-analysis

The core analysis pools study-level effect sizes into a single summary estimate. You can choose between fixed-effect (assumes a common true effect) and random-effects (allows effects to vary across studies) models. The default τ² estimator is REML, but the agent can switch to DL, HS, SJ, PM, HE, ML, or EB on request.

Every pooled analysis returns:

  • Pooled estimate and 95% CI — the summary effect on the scale of your chosen measure (RR, OR, HR, MD, SMD, or proportion).
  • Forest plot — SVG visualisation of each study's weight and the diamond summary.
  • Heterogeneity metrics — I², τ², Q statistic, and their interpretation.

Fig 2. A random-effects forest plot: each row is a study with its effect size and 95% CI. The diamond at the bottom is the pooled estimate.
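
As a rough illustration of what the two models do, the sketch below pools effects by inverse-variance weighting. It is not the hosted metafor engine: `StudyEffect`, `PooledResult`, and `poolInverseVariance` are hypothetical names, and τ² is passed in as a number rather than estimated by REML.

```typescript
// One study's effect on the analysis scale (e.g. log RR) with its standard error.
interface StudyEffect {
  label: string;
  yi: number;  // effect size
  sei: number; // standard error
}

interface PooledResult {
  estimate: number;
  se: number;
  ciLower: number;
  ciUpper: number;
}

// Inverse-variance pooling. tau2 = 0 gives the fixed-effect model;
// tau2 > 0 adds between-study variance to every weight (random effects).
function poolInverseVariance(studies: StudyEffect[], tau2 = 0): PooledResult {
  if (studies.length < 2) throw new Error("pooling needs at least 2 studies");
  let sumW = 0;
  let sumWY = 0;
  for (const s of studies) {
    const w = 1 / (s.sei * s.sei + tau2);
    sumW += w;
    sumWY += w * s.yi;
  }
  const estimate = sumWY / sumW;
  const se = Math.sqrt(1 / sumW);
  const z = 1.959964; // 97.5th percentile of the standard normal
  return { estimate, se, ciLower: estimate - z * se, ciUpper: estimate + z * se };
}
```

With τ² > 0 the weights become more equal across studies and the confidence interval widens, which is why the diamond in a random-effects forest plot is usually broader than its fixed-effect counterpart.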

Living-review pooling. The dynamic-significance stage that runs after each scheduled living-review cycle pools the validated outcomes through the same hosted-R metafor engine that interactive stats-v6 uses (REML for random-effects pools). There is one engine and no fallback pooler: if that service is unreachable, or a cycle cannot load its locked analysis recipes, the cycle fails or skips rather than re-pooling by a different method, so every snapshot is comparable. The threshold-tripping decision is fully deterministic; only the narrative copy in the “Evidence change detected” alert is LLM-written, and the judge agent cannot alter the alert decision. An outcome is flagged when an alarm rule fires (a large effect-delta, an I² jump, or the confidence interval crossing the null) and the evidence base has also changed (trials added or removed). Independently of trial count, a CI flip across the null, or an effect move of at least twice the configured threshold, flags the outcome on its own. For ratio measures (RR, OR, HR) the effect-delta is measured on the log scale, so a halving and a doubling register as equal-magnitude moves.
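
The flagging rule can be sketched as a pure function. Everything named here (`PoolSnapshot`, `Thresholds`, `shouldFlag`) is hypothetical, and two readings are assumptions: "CI crossing the null" is taken to mean the CI's exclusion of the null differs between cycles, and an "I² jump" is taken to mean an increase of at least a configured number of percentage points.

```typescript
type Measure = "RR" | "OR" | "HR" | "MD" | "SMD";

interface PoolSnapshot {
  estimate: number; // natural scale (a ratio for RR/OR/HR)
  ciLower: number;
  ciUpper: number;
  i2: number;       // percent
  trialIds: string[];
}

// Hypothetical configuration shape.
interface Thresholds {
  effectDelta: number;
  i2Jump: number;
}

function isRatio(m: Measure): boolean {
  return m === "RR" || m === "OR" || m === "HR";
}

// Ratio measures are compared on the log scale, so x2 and x0.5 are equal moves.
function effectDelta(m: Measure, prev: number, curr: number): number {
  return isRatio(m) ? Math.abs(Math.log(curr) - Math.log(prev)) : Math.abs(curr - prev);
}

function excludesNull(m: Measure, s: PoolSnapshot): boolean {
  const nullValue = isRatio(m) ? 1 : 0;
  return s.ciLower > nullValue || s.ciUpper < nullValue;
}

function shouldFlag(m: Measure, prev: PoolSnapshot, curr: PoolSnapshot, t: Thresholds): boolean {
  const delta = effectDelta(m, prev.estimate, curr.estimate);
  const ciFlip = excludesNull(m, prev) !== excludesNull(m, curr);
  const i2Jump = curr.i2 - prev.i2 >= t.i2Jump;
  const prevIds = new Set(prev.trialIds);
  const evidenceChanged =
    prev.trialIds.length !== curr.trialIds.length ||
    curr.trialIds.some((id) => !prevIds.has(id));
  const alarm = delta >= t.effectDelta || i2Jump || ciFlip;
  // A CI flip or a move of at least twice the threshold flags on its own.
  const standalone = ciFlip || delta >= 2 * t.effectDelta;
  return standalone || (alarm && evidenceChanged);
}
```

Because the function takes only numbers and trial identifiers, the same pair of snapshots always yields the same decision.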

Per-cycle assessment record. Once the threshold decision is made, an lr-persist-assessment step writes the cycle's verdict to a durable per-cycle record — the pooled outcomes it assessed, which of them tripped an alert rule, and the judge's narrative rationale. This is saved on every cycle, including the ones that find no material change, so “the cycle ran and nothing moved” is a recorded fact rather than an absence. These per-cycle records, together with the agentic extraction traces tagged to each cycle, roll up into the analysis Provenance view — a cycle timeline plus the cross-cycle trajectory of every monitored outcome — which can be frozen into an immutable snapshot for audit or health-technology-assessment submission.

Understanding heterogeneity

Heterogeneity measures how much variability across studies exceeds what you would expect from sampling error alone.

  • I² — percentage of total variability due to between-study differences. Low (<25%), moderate (25–75%), or high (>75%).
  • τ² — absolute between-study variance on the log scale (for ratio measures). Useful for comparing heterogeneity across analyses.
  • Cochran's Q — chi-squared test for heterogeneity. A low p-value (p < 0.05) suggests real differences between studies.

The agent always reports I² and τ² and flags the heterogeneity level. If heterogeneity is high, consider running subgroup analysis or meta-regression to investigate sources of variation.
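
To make the three metrics concrete, the sketch below computes Cochran's Q, the classic Higgins-Thompson I² = (Q − df)/Q, and the DerSimonian-Laird τ². This is a stand-in chosen for brevity: the agent's default estimator is REML, which is iterative, so the τ² here will not generally match the reported value. The names `heterogeneity` and `HeterogeneitySummary` are hypothetical.

```typescript
interface StudyEffect {
  yi: number;  // effect size (log scale for ratio measures)
  sei: number; // standard error
}

interface HeterogeneitySummary {
  q: number;
  df: number;
  i2: number;   // percent
  tau2: number; // DerSimonian-Laird estimate, not REML
  level: "low" | "moderate" | "high";
}

function heterogeneity(studies: StudyEffect[]): HeterogeneitySummary {
  const k = studies.length;
  if (k < 2) throw new Error("heterogeneity needs at least 2 studies");
  let sumW = 0;
  let sumW2 = 0;
  let sumWY = 0;
  for (const s of studies) {
    const w = 1 / (s.sei * s.sei);
    sumW += w;
    sumW2 += w * w;
    sumWY += w * s.yi;
  }
  const mean = sumWY / sumW; // fixed-effect pooled estimate
  let q = 0;
  for (const s of studies) {
    const d = s.yi - mean;
    q += (d * d) / (s.sei * s.sei);
  }
  const df = k - 1;
  const i2 = q > 0 ? Math.max(0, (q - df) / q) * 100 : 0;
  const c = sumW - sumW2 / sumW;
  const tau2 = Math.max(0, (q - df) / c);
  // Bands follow the thresholds listed above: <25 low, 25-75 moderate, >75 high.
  const level: HeterogeneitySummary["level"] = i2 < 25 ? "low" : i2 <= 75 ? "moderate" : "high";
  return { q, df, i2, tau2, level };
}
```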

Sensitivity analysis

Sensitivity analysis tests whether your results are robust to analytical decisions. The agent supports three approaches, run individually or together:

Leave-one-out

Re-runs the pooled analysis k times, each time dropping one study. If removing a single study shifts the estimate substantially or changes statistical significance, that study has outsized influence and warrants closer inspection.
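
A minimal sketch of the leave-one-out loop, assuming a fixed-effect pool for brevity (the agent would refit whichever model the analysis uses). `poolFixed`, `leaveOneOut`, and the row shape are hypothetical names.

```typescript
interface StudyEffect {
  label: string;
  yi: number;
  sei: number;
}

interface Pooled {
  estimate: number;
  se: number;
  significant: boolean; // two-sided z-test at the 5% level
}

interface LeaveOneOutRow {
  omitted: string;
  estimate: number;
  shift: number; // change relative to the full pool
  significanceChanged: boolean;
}

function poolFixed(studies: StudyEffect[]): Pooled {
  let sumW = 0;
  let sumWY = 0;
  for (const s of studies) {
    const w = 1 / (s.sei * s.sei);
    sumW += w;
    sumWY += w * s.yi;
  }
  const estimate = sumWY / sumW;
  const se = Math.sqrt(1 / sumW);
  return { estimate, se, significant: Math.abs(estimate / se) > 1.959964 };
}

function leaveOneOut(studies: StudyEffect[]): LeaveOneOutRow[] {
  if (studies.length < 3) throw new Error("leave-one-out needs at least 3 studies");
  const full = poolFixed(studies);
  return studies.map((omitted, i) => {
    const pooled = poolFixed(studies.filter((_, j) => j !== i));
    return {
      omitted: omitted.label,
      estimate: pooled.estimate,
      shift: pooled.estimate - full.estimate,
      significanceChanged: pooled.significant !== full.significant,
    };
  });
}
```

Rows with a large `shift` or `significanceChanged: true` point to the studies worth a closer look.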

Estimator comparison

Re-runs the analysis with each of 8 τ² estimators (REML, DL, HS, SJ, PM, HE, ML, EB). If the pooled estimate is stable across estimators, the result is robust to the choice of heterogeneity method.

Knapp-Hartung adjustment

Applies a t-distribution instead of the normal distribution for confidence intervals — a more conservative approach when the number of studies is small. The agent returns the adjusted CI, t-value, degrees of freedom, and p-value.
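
For reference, the standard Knapp-Hartung formulas for a pooled estimate without moderators are sketched below. The notation is an assumption, not taken from the product: y_i is the effect in study i, v_i its sampling variance, τ̂² the estimated between-study variance, and k the number of studies.

```latex
\begin{aligned}
w_i &= \frac{1}{v_i + \hat\tau^2}, \qquad
\hat\mu = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i} \\[4pt]
\widehat{\operatorname{Var}}_{\mathrm{KH}}(\hat\mu)
  &= \frac{\sum_{i=1}^{k} w_i\,(y_i - \hat\mu)^2}{(k-1)\,\sum_{i=1}^{k} w_i} \\[4pt]
\text{95\% CI:}\quad
  &\hat\mu \;\pm\; t_{k-1,\,0.975}\,\sqrt{\widehat{\operatorname{Var}}_{\mathrm{KH}}(\hat\mu)}
\end{aligned}
```

The point estimate is unchanged; only the variance and the reference distribution (t with k − 1 degrees of freedom) differ from the standard Wald-type interval.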

Publication bias assessment

Publication bias occurs when studies with significant results are more likely to be published, skewing the pooled estimate. The agent runs four complementary tests:

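  • Egger's regression — weighted least-squares regression of effect sizes on standard errors. A significant coefficient on the standard error (p < 0.05), which corresponds to the intercept in Egger's original standardised formulation, suggests funnel plot asymmetry.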
  • Begg's rank correlation (k ≥ 4) — Kendall's τ rank correlation between effects and standard errors.
  • Trim-and-fill — imputes "missing" studies and recalculates the pooled estimate, showing how much publication bias might shift your result.
  • Funnel plot — scatter of each study's effect vs. precision. Asymmetry suggests bias; the plot includes 95% and 99% confidence envelopes.
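
The regression behind Egger's test can be sketched as a weighted least-squares fit of effect on standard error with weights 1/SE². This is the classical form; which variant the hosted engine runs is not stated here, and `eggerRegression` and `EggerFit` are hypothetical names. The asymmetry coefficient is the slope on the standard error, which equals the intercept in Egger's original standardised formulation. The p-value would come from a t-distribution with k − 2 degrees of freedom and is left out of the sketch.

```typescript
interface StudyEffect {
  yi: number;
  sei: number;
}

interface EggerFit {
  intercept: number; // bias-adjusted effect at SE = 0
  slope: number;     // asymmetry coefficient
  slopeSe: number;
  tValue: number;    // compare against a t-distribution with df degrees of freedom
  df: number;
}

function eggerRegression(studies: StudyEffect[]): EggerFit {
  const k = studies.length;
  if (k < 3) throw new Error("Egger's regression needs at least 3 studies");
  if (new Set(studies.map((s) => s.sei)).size < 2) {
    throw new Error("standard errors must not all be equal");
  }
  // Weighted sums with w = 1/se^2, x = se, y = effect.
  let sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
  for (const s of studies) {
    const w = 1 / (s.sei * s.sei);
    sw += w;
    swx += w * s.sei;
    swy += w * s.yi;
    swxx += w * s.sei * s.sei;
    swxy += w * s.sei * s.yi;
  }
  const det = sw * swxx - swx * swx;
  const slope = (sw * swxy - swx * swy) / det;
  const intercept = (swy - slope * swx) / sw;
  // Weighted residual sum of squares gives the dispersion estimate.
  let rss = 0;
  for (const s of studies) {
    const r = s.yi - intercept - slope * s.sei;
    rss += (r * r) / (s.sei * s.sei);
  }
  const df = k - 2;
  const slopeSe = Math.sqrt(((rss / df) * sw) / det);
  return { intercept, slope, slopeSe, tValue: slope / slopeSe, df };
}
```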

WARN · Power caveat

All publication bias tests have low statistical power when k < 10 (Sterne et al., 2011). The agent flags this automatically. Non-significant results in small meta-analyses do not rule out bias.

Subgroup analysis and meta-regression

Subgroup analysis

Split the pooled analysis by a categorical variable (e.g., region, risk of bias, treatment line). The agent uses study tags — either pre-extracted during data collection or classified on the fly via LLM-based tagging. The result includes a per-subgroup pooled estimate and a Q-between test for subgroup differences.
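
The Q-between statistic can be sketched as follows, assuming fixed-effect pooling within each subgroup (a mixed-effects version would add τ² to the weights). `TaggedStudy` and `qBetween` are hypothetical names. The statistic is compared against a chi-squared distribution with (number of subgroups − 1) degrees of freedom.

```typescript
interface TaggedStudy {
  yi: number;
  sei: number;
  subgroup: string; // e.g. a region tag
}

interface SubgroupTest {
  qBetween: number;
  df: number;
  estimates: Map<string, number>; // pooled estimate per subgroup
}

function qBetween(studies: TaggedStudy[]): SubgroupTest {
  // Accumulate inverse-variance sums per subgroup.
  const sums = new Map<string, { w: number; wy: number }>();
  for (const s of studies) {
    const w = 1 / (s.sei * s.sei);
    const acc = sums.get(s.subgroup) ?? { w: 0, wy: 0 };
    acc.w += w;
    acc.wy += w * s.yi;
    sums.set(s.subgroup, acc);
  }
  if (sums.size < 2) throw new Error("need at least 2 subgroups");
  let totalW = 0;
  let totalWY = 0;
  sums.forEach((g) => {
    totalW += g.w;
    totalWY += g.wy;
  });
  const overall = totalWY / totalW;
  const estimates = new Map<string, number>();
  let q = 0;
  sums.forEach((g, name) => {
    const est = g.wy / g.w;
    estimates.set(name, est);
    // Each subgroup contributes its total weight times its squared distance from the overall mean.
    q += g.w * (est - overall) * (est - overall);
  });
  return { qBetween: q, df: sums.size - 1, estimates };
}
```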

Meta-regression

Fit a linear regression of effect size on a continuous covariate (e.g., median age, baseline risk). The agent returns the slope (β), standard error, and p-value. A significant β suggests the covariate partially explains heterogeneity.

NOTE

Subgroup analysis and meta-regression are currently available in interactive (browser) mode only. Server-side execution supports pooling, sensitivity, and publication bias but not subgroup or meta-regression.

Pinning results to the Evidence Board

Every forest plot and analysis result can be pinned to the Evidence Board with a single click. Pinned evidence appears in the Reports & Evidence tab and can be included directly when generating your final report. The agent creates two artifacts per analysis: a plot artifact (the SVG forest plot) and a model artifact (numeric results as JSON). Both can be pinned independently.

Analysis artifacts flow from execution to the Evidence Board and into reports.

Data exploration tools

Before running an analysis, the agent can inspect the dataset to help you decide what to analyse. These tools are available in the conversation:

  • List dimensions — discovers all outcome/timepoint combinations and how many studies (k) are available for each.
  • Preview missingness — checks for missing data fields and flags studies that would be excluded from a given analysis.
  • Query data — browse, filter, and sort the extracted data table with arbitrary queries.
  • Read source documents — look up parsed table assets, document sections, or registry data for specific studies.
  • Classify studies — create ad-hoc categorical tags (e.g., region, line of therapy) via LLM for use in subgroup analysis.
  • SQL query — run read-only SQL against the whitelisted outcome views when you need joins, aggregations, or shape-of-the-data checks the higher-level explorer tools don't cover. A companion schema tool lists the available columns and types.
  • Merge timepoints & patch values — relabel rows from source timepoints into a target timepoint so they pool together, or correct individual numeric fields (e.g. convert SE to SD, fill in a missing N) without re-running extraction.

The agent surfaces these alongside its Analysis, Sensitivity, and Publication-bias tools — five tool groups in total: Analysis, Dataset, Explorer, SQL, and Tagging.
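
As an example of the kind of correction the patch tool covers, the SE-to-SD conversion for a single-arm mean is SD = SE × √N. The sketch below is illustrative only; `ArmRow`, `seToSd`, and `patchMissingSd` are hypothetical names, not the tool's actual schema.

```typescript
// Hypothetical shape of one extracted arm-level row.
interface ArmRow {
  studyId: string;
  mean: number;
  sd: number | null;
  se: number | null;
  n: number | null;
}

// SD = SE * sqrt(N) for the standard error of a single-arm mean.
function seToSd(se: number, n: number): number {
  if (!(se >= 0) || !(n > 0)) throw new Error("se must be >= 0 and n must be > 0");
  return se * Math.sqrt(n);
}

// Fill in SD only when it is missing and both SE and N are available.
function patchMissingSd(row: ArmRow): ArmRow {
  if (row.sd !== null || row.se === null || row.n === null) return row;
  return { ...row, sd: seToSd(row.se, row.n) };
}
```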

What reaches the Stats Agent

The quality of any pooled estimate depends on the rows that land in the dataset. Recent changes to the upstream extraction pipeline broaden what the agent has to work with and make extraction failures easier to diagnose:

  • Clinical-scale synonyms — outcomes with generic names (e.g. "anxiety", "disability") now also match snippets that mention only the clinical scale (HADS-A, GAD-7, mRS, and similar), so scale-only reporting no longer goes missing.
  • Results-section fallback — when the snippet spotter finds nothing but the document has a Results section, the full section is handed to the per-outcome extractor instead of giving up.
  • PMC tables as markdown — JATS XML tables from PMC are now rendered as GFM markdown before being shown to the extractor, which lifts numeric recall on table-heavy papers.
  • Retry & failure provenance — an extractor pass that returns all-null is automatically retried once, and every failure persists the input snippet it saw, so you can audit why a study didn't produce a row.
  • Looser numeric schemas — median/IQR reporting, count outcomes without an explicit denominator, and mRS shift-distribution rows are now accepted by the extractor schemas, so fewer studies are silently dropped before pooling.
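
To illustrate the table-rendering step, the sketch below turns already-parsed header and body cells into a GFM markdown table. It assumes the JATS XML has been parsed elsewhere; `toGfmTable` is a hypothetical helper, not the pipeline's actual function.

```typescript
// Render parsed table cells as a GFM markdown table.
function toGfmTable(header: string[], rows: string[][]): string {
  // Escape pipes and collapse whitespace so each cell stays on one line.
  const clean = (cell: string) => cell.replace(/\|/g, "\\|").replace(/\s+/g, " ").trim();
  const line = (cells: string[]) => `| ${cells.map(clean).join(" | ")} |`;
  // Pad short rows so every line has the same number of columns as the header.
  const padded = rows.map((r) => header.map((_, i) => r[i] ?? ""));
  const divider = `| ${header.map(() => "---").join(" | ")} |`;
  return [line(header), divider, ...padded.map(line)].join("\n");
}
```

For example, `toGfmTable(["Arm", "N"], [["Drug A", "120"]])` yields a three-line table whose last row is `| Drug A | 120 |`.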

Tips for working with the Stats Agent

  • Start with the quick-prompt buttons — they encode the right tool and minimum-k checks automatically.
  • If you have multiple timepoints for an outcome, specify which one (e.g., "pCR at 12 months"). The agent will ask if the request is ambiguous.
  • For subgroup analysis, the agent can classify studies on the fly. Just ask: "Run subgroup analysis by region."
  • Pin your key forest plots and sensitivity results as you go — they are easier to find later on the Evidence Board than scrolling through chat history.
  • The agent never outputs raw R code. If you need to reproduce the analysis outside Axelium, export the extracted dataset and run metafor manually.

Risk of Bias 2.0 — methodology

Quality assessment runs as a dedicated stage between extraction and finalize. We implement the Cochrane RoB 2.0 tool (2016 individually-randomised-trials guidance) with a strict separation of concerns between the LLM, the deterministic algorithm, and the human reviewer.

The three-layer split

  1. The LLM reads the PDF. A dedicated “RoB Evaluator” Mastra agent receives the parsed full text + the already-extracted outcomes for one (study × effect_of_interest) tuple. It answers Cochrane's 14-16 signaling questions per domain (depending on Domain 2 variant + cluster-RCT flag) using the 5-level scale (Y / PY / PN / N / NI) plus NA for conditional questions whose gating answer makes them irrelevant. Each non-NI answer must cite a quoted PDF evidence span.
  2. The algorithm derives the judgement. A pure-TypeScript module encodes Tables 4 / 6 / 8 / 10 / 12 / 14 from the published Cochrane guidance verbatim. The LLM's signaling answers are fed into the deterministic lookup table; the algorithm output is the domain judgement. The LLM cannot drift from methodology because it is never permitted to invent a domain judgement. If the LLM's claimed judgement disagrees with the algorithm output and the LLM did not flag judgement_source='reviewer_override' with a non-empty rationale, the algorithm wins and the LLM's justification is prefixed with [Algorithm corrected LLM's claimed X → Y] so human reviewers see the disagreement explicitly.
  3. The human owns the validation. Reviewer B (the primary human reviewer; the LLM is implicit Reviewer A on every rob_assessments row) submits a blinded review via the /analysis/[id]/rob/review queue. Agreement flips the assessment to status='validated'; disagreement creates a conflict that routes through the existing unified conflict gate. The Cochrane Table 1 escalation of multiple “some concerns” to “high” is captured via an overall_override_per_outcome entry on the agent response.
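
The reconciliation rule in layer 2 can be sketched as below. Type and function names are hypothetical, the 'algorithm' value of `judgementSource` is an assumed counterpart to 'reviewer_override', and the branch in which a valid override keeps the claimed judgement is inferred from the wording rather than stated outright.

```typescript
type Judgement = "low" | "some_concerns" | "high";

interface LlmDomainAnswer {
  claimedJudgement: Judgement;
  judgementSource: "algorithm" | "reviewer_override"; // mirrors judgement_source
  rationale: string;
  justification: string;
}

interface ResolvedDomain {
  judgement: Judgement;
  justification: string;
  corrected: boolean;
}

// algorithmJudgement is the output of the deterministic lookup table.
function reconcileDomain(algorithmJudgement: Judgement, llm: LlmDomainAnswer): ResolvedDomain {
  if (llm.claimedJudgement === algorithmJudgement) {
    return { judgement: algorithmJudgement, justification: llm.justification, corrected: false };
  }
  const validOverride =
    llm.judgementSource === "reviewer_override" && llm.rationale.trim().length > 0;
  if (validOverride) {
    // Assumption: a flagged override with a rationale is allowed to stand.
    return { judgement: llm.claimedJudgement, justification: llm.justification, corrected: false };
  }
  // Otherwise the algorithm wins and the disagreement is made visible.
  const prefix = `[Algorithm corrected LLM's claimed ${llm.claimedJudgement} → ${algorithmJudgement}]`;
  return {
    judgement: algorithmJudgement,
    justification: `${prefix} ${llm.justification}`,
    corrected: true,
  };
}
```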

Overall judgement

The default overall judgement is the worst-of-five-domains rule: any “high” → high; all “low” → low; otherwise “some concerns”. The Cochrane Table 1 escalation case (multiple “some concerns” combining into “high”) is a reviewer-discretion call — captured via an explicit overall_override_reason field on the rob_assessments row so the audit trail records the deviation from the worst-domain default.
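
A minimal sketch of the worst-of-five default and the reviewer escalation follows. The function names and the `overallOverrideReason` property are hypothetical stand-ins for the overall_override_reason field, and the guard that requires at least two "some concerns" domains is an assumed reading of "multiple".

```typescript
type Judgement = "low" | "some_concerns" | "high";

interface OverallResult {
  judgement: Judgement;
  overallOverrideReason: string | null;
}

// Default rule: any "high" gives high; all "low" gives low; otherwise some concerns.
function defaultOverall(domains: Judgement[]): Judgement {
  if (domains.length !== 5) throw new Error("RoB 2.0 has five domains");
  if (domains.some((d) => d === "high")) return "high";
  if (domains.every((d) => d === "low")) return "low";
  return "some_concerns";
}

// Reviewer-discretion escalation: multiple "some concerns" combine into "high",
// and the reason is recorded so the deviation from the default is auditable.
function overallWithEscalation(domains: Judgement[], overrideReason: string | null): OverallResult {
  const base = defaultOverall(domains);
  if (overrideReason === null) return { judgement: base, overallOverrideReason: null };
  const concerns = domains.filter((d) => d === "some_concerns").length;
  if (base !== "some_concerns" || concerns < 2 || overrideReason.trim() === "") {
    throw new Error("escalation needs multiple 'some concerns' domains and a non-empty reason");
  }
  return { judgement: "high", overallOverrideReason: overrideReason };
}
```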

Domain 2 variant selection

The reviewer picks the target estimand at analysis-configuration time via configJson.rob.effect_of_interest:

  • assignment (default): intent-to-treat-like. Domain 2 asks about deviations from intended interventions and their potential impact on the estimated effect.
  • adherence: per-protocol-like. Domain 2 asks about co-intervention balance, implementation success, adherence, and whether an appropriate analysis (e.g. IPW, instrumental variables) was used.

GRADE Summary of Findings — methodology

When a stats-v6 run completes, a save-hook auto-derives a Cochrane GRADE assessment for each (outcome × effect × timepoint) tuple the run produced. The derivation is pure deterministic TypeScript — no LLM in the certainty path — so the same numeric inputs always produce the same final certainty.

Starting level + five downgrade factors

Bodies of evidence start at “high” (RCT-derived; the MVP only ingests randomised designs). Each of the five Cochrane GRADE downgrade factors then computes a severity from the run's numeric outputs:

  • Risk of bias: weighted share of the pooled estimate that comes from High-RoB studies, conditioned on the validated subset only. ≥40% → serious; ≥70% → very_serious. Reviewer-unreviewed LLM judgements never move the threshold — the safety branch pins not_serious when no contributing study has a validated RoB row.
  • Inconsistency: I² ≥ 50% → serious; ≥ 75% → very_serious. k ≤ 2 forces not_serious because the heterogeneity statistic is uninformative at small k.
  • Indirectness: auto-derivation cannot assess PICO directness from numeric output alone, so the engine always emits not_serious. Reviewers escalate via the drawer.
  • Imprecision: takes max() of two streams. Stream A (events): <100 → very_serious, <300 → serious. Stream B (CI): with MID configured, crosses BOTH MID and no-effect → very_serious; crosses MID only → serious; without MID, crosses no-effect → serious. Taking the max() means neither stream can mask the other: the more severe of the two always determines the rating.
  • Publication bias: k < 10 → not_serious (Egger's test is underpowered below the Cochrane k=10 threshold). k ≥ 10 AND Egger p < 0.10 → serious.
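
The thresholds above translate into small pure functions, sketched below. All names are hypothetical. Two details are assumptions: "crosses" is read as the value lying strictly inside the CI, and a CI that crosses no-effect but not a configured MID is treated as not_serious because the text does not cover that case. Indirectness is omitted because it is always emitted as not_serious.

```typescript
type Severity = "not_serious" | "serious" | "very_serious";

const RANK: Record<Severity, number> = { not_serious: 0, serious: 1, very_serious: 2 };

function maxSeverity(a: Severity, b: Severity): Severity {
  return RANK[a] >= RANK[b] ? a : b;
}

// share: weighted share (0..1) of the pooled estimate from validated High-RoB studies;
// null means no contributing study has a validated RoB row (the safety branch).
function riskOfBiasSeverity(share: number | null): Severity {
  if (share === null) return "not_serious";
  if (share >= 0.7) return "very_serious";
  return share >= 0.4 ? "serious" : "not_serious";
}

function inconsistencySeverity(i2Percent: number, k: number): Severity {
  if (k <= 2) return "not_serious";
  if (i2Percent >= 75) return "very_serious";
  return i2Percent >= 50 ? "serious" : "not_serious";
}

interface ImprecisionInput {
  totalEvents: number;
  ciLower: number;
  ciUpper: number;
  noEffect: number; // 0 for differences, 1 for ratios
  mid?: number;     // minimally important difference, if configured
}

function imprecisionSeverity(input: ImprecisionInput): Severity {
  const crosses = (v: number) => input.ciLower < v && input.ciUpper > v;
  const streamA: Severity =
    input.totalEvents < 100 ? "very_serious" : input.totalEvents < 300 ? "serious" : "not_serious";
  let streamB: Severity = "not_serious";
  if (input.mid !== undefined) {
    if (crosses(input.mid) && crosses(input.noEffect)) streamB = "very_serious";
    else if (crosses(input.mid)) streamB = "serious";
  } else if (crosses(input.noEffect)) {
    streamB = "serious";
  }
  return maxSeverity(streamA, streamB);
}

function publicationBiasSeverity(k: number, eggerP: number): Severity {
  if (k < 10) return "not_serious";
  return eggerP < 0.1 ? "serious" : "not_serious";
}
```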

SQL ↔ TS arithmetic parity

The final certainty is computed app-side and validated DB-side by a Postgres CHECK constraint (chk_grade_arithmetic_consistent). The CHECK calls a SQL function (compute_grade_final_certainty) that mirrors the TypeScript computeFinalCertainty line-for-line; the TS twin is exhaustively unit-tested across every (starting × downgrade-total) cell, and the SQL CHECK enforces equality at write time, so any divergence between the two fails closed.
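
The arithmetic itself is not spelled out here, so the sketch below assumes the conventional GRADE scheme: serious subtracts one level, very_serious subtracts two, and the result is floored at very low. `sketchFinalCertainty` is a hypothetical name and is not the product's computeFinalCertainty.

```typescript
type Certainty = "high" | "moderate" | "low" | "very_low";
type Severity = "not_serious" | "serious" | "very_serious";

// Ordered from worst to best so that an index difference is a level difference.
const LEVELS: Certainty[] = ["very_low", "low", "moderate", "high"];

// Assumed downgrade sizes (conventional GRADE): serious = 1, very_serious = 2.
const DOWNGRADE: Record<Severity, number> = { not_serious: 0, serious: 1, very_serious: 2 };

function sketchFinalCertainty(starting: Certainty, factors: Severity[]): Certainty {
  const total = factors.reduce((sum, f) => sum + DOWNGRADE[f], 0);
  const index = Math.max(0, LEVELS.indexOf(starting) - total);
  return LEVELS[index] ?? "very_low";
}
```

Because the function depends only on the starting level and the downgrade total, the full input space is small enough to enumerate in unit tests, which is what makes an exhaustive parity check against a SQL twin practical.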

Reviewer finalisation + override

The SoF table at /analysis/[id]/grade renders every auto-derived row. Clicking a row opens a drawer with per-factor severity radios + basis textarea + an expandable auto_signals panel showing the numbers the engine considered. The state machine is single-reviewer (no dual-review conflict gate): auto ⇄ under_review → finalised → auto (the under_review → auto edge is for abandoned edits; the finalised → auto edge is the un-finalise path). Optimistic-lock CAS on every mutation guards against lost updates.
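
The state machine and the compare-and-swap guard can be sketched in memory as below. `GradeRow`, its `version` counter, and `transition` are hypothetical; in a database the same check would typically be an UPDATE that matches on the expected version.

```typescript
type GradeState = "auto" | "under_review" | "finalised";

// Allowed edges: auto ⇄ under_review, under_review → finalised, finalised → auto.
const ALLOWED: Record<GradeState, GradeState[]> = {
  auto: ["under_review"],
  under_review: ["auto", "finalised"],
  finalised: ["auto"],
};

interface GradeRow {
  id: string;
  state: GradeState;
  version: number; // hypothetical optimistic-lock counter
}

// Applies a transition only if the caller saw the latest version of the row.
function transition(current: GradeRow, to: GradeState, expectedVersion: number): GradeRow {
  if (current.version !== expectedVersion) {
    throw new Error("stale version: the row was modified since it was read");
  }
  if (ALLOWED[current.state].indexOf(to) === -1) {
    throw new Error(`illegal transition ${current.state} → ${to}`);
  }
  return { ...current, state: to, version: current.version + 1 };
}
```

A reviewer who opened the drawer at version 1 and tries to save after someone else has already moved the row to version 2 gets an error instead of silently overwriting the newer state.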

Snapshots at sign-off

When a sign-off is approved, the approval transaction also appends every finalised GRADE row (and every validated RoB row) into the grade_assessment_snapshots / rob_assessment_snapshots tables, atomic with the milestone advance. The frozen payload is the authoritative SoF for that analysis-version — subsequent re-finalisation or re-derivation cannot retroactively alter the submission.