Case study · 13 April 2026 · 6 min read

Can AI reproduce a published meta-analysis? We tested it.

We used Axelium to independently reproduce a published meta-analysis of neoadjuvant immunotherapy in resectable lung cancer. Three of four endpoints matched within 6%—delivered by a single analyst in two working days, versus the five-author team and multi-month timeline of the original publication.

The reproducibility problem

Meta-analyses sit at the top of the evidence hierarchy—they guide treatment guidelines, drug approvals, and clinical decisions. Yet studies have found that 10–40% of published meta-analyses contain errors that could affect conclusions. The problem? Reproducing a meta-analysis manually is painstaking work: literature searching, screening, data extraction, risk-of-bias assessment, statistical pooling. Few teams have the bandwidth to replicate someone else’s work.

We wanted to know: can an AI-assisted platform make reproduction practical? To find out, we picked a real published meta-analysis and tried to reproduce it from scratch using Axelium.

The target: Zhang et al. (2024)

We chose a 2024 meta-analysis by Zhang et al. published in BMC Cancer. It pooled seven randomised controlled trials (2,929 patients) comparing neoadjuvant PD-1/PD-L1 immune checkpoint inhibitors plus chemotherapy versus chemotherapy alone in resectable non-small cell lung cancer (NSCLC). The seven trials were KEYNOTE-671, CheckMate 816, CheckMate 77T, AEGEAN, Neotorch, TD-FOREKNOW, and NADIM II.

Zhang et al. reported pooled estimates for four co-primary endpoints: event-free survival (EFS), overall survival (OS), pathological complete response (pCR), and major pathological response (MPR). All four showed a significant benefit of adding immunotherapy.

How the reproduction worked

The entire reproduction was conducted through the Axelium web interface, following the platform’s standard five-stage workflow. No custom scripting or manual R coding was involved.

1. Configure & Ingest: PICO framework setup, PubMed search, trial linking via NCT IDs.

2. Screen & Retrieve: Abstract screening against eligibility criteria, full-text PDF retrieval.

3. Extract & Validate: AI-assisted data extraction with confidence scores, human validation of every value.

4. Analyse (Stats Agent): Natural-language requests to a conversational agent that executes R/metafor analyses.

5. Evidence & Report: Pin key results to an evidence board, generate a structured narrative report.

Under the hood, Axelium uses the metafor R package with REML estimation for random-effects models. The Stats Agent translates natural-language requests like “Run a random-effects meta-analysis for pCR” into tool calls and returns forest plots, heterogeneity metrics, and diagnostic statistics.
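The pooling step can be sketched in plain Python. This is a simplified illustration on hypothetical log hazard ratios, not the platform's actual code: it uses the DerSimonian-Laird estimator of between-study variance for brevity, whereas Axelium's metafor backend fits random-effects models by REML.

```python
import math

def random_effects_pool(log_effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird tau^2).

    log_effects: study-level effects on the log scale (e.g. log hazard ratios)
    variances:   their within-study variances
    Returns (pooled_log_effect, tau2, Q).
    """
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, log_effects)) / sum(w)
    # Cochran's Q heterogeneity statistic around the fixed-effect estimate
    Q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_effects))
    k = len(log_effects)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)   # truncated at zero
    # Re-weight with tau^2 added to each within-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, log_effects)) / sum(w_star)
    return pooled, tau2, Q

# Hypothetical study-level HRs and variances (NOT the extracted trial data)
log_hr = [math.log(x) for x in (0.55, 0.63, 0.52, 0.60)]
var = [0.01, 0.02, 0.015, 0.012]
pooled, tau2, Q = random_effects_pool(log_hr, var)
print(f"pooled HR = {math.exp(pooled):.2f}, tau^2 = {tau2:.4f}, Q = {Q:.2f}")
```

Because ratio measures (HR, RR) are pooled on the log scale, the result is exponentiated back before reporting; metafor's `rma()` follows the same convention.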

In total, we pinned 50 evidence items to the analysis report—forest plots, sensitivity analyses, publication bias tests, meta-regression, GRADE assessments, and risk-of-bias evaluations. The platform then generated a comprehensive narrative report from the pinned evidence.

The results: endpoint by endpoint

We compared our pooled estimates against Zhang et al.’s reported values across all four co-primary endpoints. All four agreed in direction and statistical significance.

Endpoint    Zhang   Axelium   Difference   Verdict
EFS (HR)    0.58    0.57      2%           Close
pCR (RR)    5.98    5.81      3%           Close
MPR (RR)    2.88    3.06      6%           Partial
OS (HR)     0.57    0.66      16%          Discordant
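The Difference column is the relative difference of each pooled estimate against Zhang et al.'s value; it can be checked directly from the table's own numbers:

```python
# Pooled estimates from the comparison table above
zhang   = {"EFS": 0.58, "pCR": 5.98, "MPR": 2.88, "OS": 0.57}
axelium = {"EFS": 0.57, "pCR": 5.81, "MPR": 3.06, "OS": 0.66}

for endpoint in zhang:
    # Relative difference versus the original publication's estimate
    diff = abs(axelium[endpoint] - zhang[endpoint]) / zhang[endpoint]
    print(f"{endpoint}: {diff:.0%}")
```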
Event-free survival (EFS): 2% difference

Near-identical pooled hazard ratios. Five of six shared study-level HRs matched exactly. Axelium included one additional study (NADIM II), yet the pooled estimate barely shifted.

Pathological complete response (pCR): 3% difference

Both analyses found a nearly six-fold increase in pathological complete response with immunotherapy. The 3% difference is well within the range expected from minor extraction variation.

Major pathological response (MPR): 6% difference

A roughly three-fold benefit in both analyses. The 6% difference has no clinical significance given the large treatment effect. Both I² values were near zero.

Overall survival (OS): 16% difference

Both analyses included the same three trials and found a significant OS benefit (p = 0.001). The difference traces to a single data point: Axelium used the primary 2022 NEJM publication for CheckMate 816, while Zhang et al. appear to have used updated follow-up data.

How fast? Two days, one analyst.

Concordance is only half the story. The other half is speed. The original Zhang et al. analysis was conducted by five authors over a timeline consistent with the multi-month cadence of a conventional systematic review. The Axelium reproduction was completed by one analyst in roughly two working days—an acceleration of one to two orders of magnitude when reviewer-hours are compared end-to-end.

2 days

Reproduction time

One analyst, end-to-end through all five workflow stages.

5 → 1

Team size

Five authors on the original; one analyst on the reproduction.

30+

Analyses in Stage 4

Forest plots, sensitivity, bias tests, subgroups, meta-regression, GRADE—all via natural-language requests.

Prior evaluations of AI in evidence synthesis have focused on a single stage at a time. The most-cited benchmark—Hamel et al. (2020) on DistillerSR’s prioritization tool—reported a 47% median reduction in title/abstract screening burden, saving roughly 30 hours per review at the screening stage alone. Axelium extends that acceleration across the full pipeline: retrieval, extraction, pooling, sensitivity analyses, publication-bias testing, subgroup analyses, meta-regression, GRADE, risk of bias, and narrative report generation. The two-day figure above covers all of it.

The analyst’s time during those two days was concentrated on the two activities automation cannot fully displace: validating AI-extracted study-level values against the source PDFs, and interpreting the results for write-up. Everything in between—literature retrieval, statistical execution, bias assessment, evidence pinning, report drafting—ran through the platform.

Beyond the headline numbers

The reproduction went well beyond pooled estimates. The Axelium analysis also included:

  • Leave-one-out sensitivity analysis confirming no single study drives any result
  • Egger’s and Begg’s tests finding no publication bias across all endpoints
  • Subgroup analysis showing a significantly greater EFS benefit in Asian trials (p = 0.03)
  • GRADE certainty assessment: High for EFS, OS, and pCR; Moderate for MPR
  • Risk of Bias 2.0 assessment: five trials low risk, two with some concerns (open-label design)
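To illustrate what the leave-one-out sensitivity analysis checks, here is a minimal sketch. It assumes a simple inverse-variance fixed-effect pool and hypothetical log hazard ratios purely for illustration; the platform itself runs metafor's random-effects models on the extracted trial data.

```python
import math

def pool_fixed(log_effects, variances):
    # Inverse-variance fixed-effect pool (illustrative simplification)
    w = [1 / v for v in variances]
    return sum(wi * yi for wi, yi in zip(w, log_effects)) / sum(w)

def leave_one_out(log_effects, variances):
    """Re-pool with each study removed in turn. If the estimates stay
    close to the full-data result, no single study drives the finding."""
    results = []
    for i in range(len(log_effects)):
        ys = log_effects[:i] + log_effects[i + 1:]
        vs = variances[:i] + variances[i + 1:]
        results.append(math.exp(pool_fixed(ys, vs)))
    return results

# Hypothetical study-level HRs and variances (not the extracted data)
log_hr = [math.log(x) for x in (0.55, 0.63, 0.52, 0.60, 0.58)]
var = [0.01, 0.02, 0.015, 0.012, 0.018]
print([round(hr, 2) for hr in leave_one_out(log_hr, var)])
```

In metafor this is a one-liner (`leave1out()` on a fitted model); the Stats Agent exposes it through a natural-language request.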

Zhang et al. did not perform GRADE assessment or Egger’s test—these are examples of comprehensive secondary analyses that an automated platform can add efficiently and systematically.

What this means for meta-analysis

This case study demonstrates that AI-assisted reproduction of published meta-analyses is not only feasible but practical:

Automated reproduction works

An AI-assisted platform matched 3 of 4 endpoints within 6%, covering the full pipeline from study identification through GRADE assessment.

Scalable quality assurance

A multi-month effort by a five-author team was reproduced by a single analyst in two working days, with 50 evidence items pinned to a structured report.

Comprehensive by default

The platform automatically added GRADE assessment, Egger’s test, trim-and-fill analysis, and meta-regression—analyses the original study did not include.

Transparent where it matters

The single discordant endpoint (OS, 16%) was traceable to a specific data-source difference—not a methodological error. Full transparency enables trust.

Read the full preprint

The complete medRxiv manuscript includes study-level data tables, full heterogeneity analysis, a PRISMA checklist, and the Axelium-generated supplementary report with all 50 pinned evidence items.

Citation: Ivakhno S, Baridi A. Automated reproducibility of a published neoadjuvant immunotherapy meta-analysis using an AI-assisted platform: a methodological case study. medRxiv preprint, April 2026.

Disclosure: The authors are affiliated with Axelium. All discrepancies, including the OS discordance unfavourable to the platform, are transparently reported in the full manuscript.