The adoption of artificial intelligence (AI)-powered tools is accelerating rapidly across all layers of healthcare systems. Predictive models, decision support tools and generative tools have entered clinical environments1, and large language models are increasingly being used by the general public to seek medical information and advice2. Yet evidence that AI tools create value for patients, providers or health systems remains scarce.

Nonetheless, claims about clinical impact are increasingly common in publications and in product materials, even though there is no clear agreement on what level of evidence should be required before such claims are considered credible. The result is not only scientific uncertainty but also, in many cases, premature implementation and adoption. If AI is to improve care meaningfully, the field must begin to link claims of impact systematically and consistently to appropriate, proportional evidence. A framework for how AI medical technologies should be evaluated, by what metrics and against which benchmarks is urgently needed.

Thus far, evaluation of medical AI has relied mostly on statistical metrics — such as discrimination, calibration, sensitivity and specificity — that measure computational capabilities and the performance of a tool. Although these metrics are certainly important, they do not establish clinical impact on their own. A system may perform very well in retrospective validation and still fail to improve care if its outputs are poorly timed, difficult to interpret, inconsistently acted upon or disruptive to clinical workflows. As a result, when such tools are adopted without more-concrete measures of their clinical impact, health systems and users may invest in products whose real-world value remains uncertain at best and whose unintended consequences may be substantial.
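To make the distinction concrete, the retrospective metrics mentioned above can be produced in a few lines of code. The sketch below is a hypothetical illustration using scikit-learn, with made-up labels, made-up predicted probabilities and an assumed 0.5 decision threshold; it is not drawn from any particular tool or study.

```python
# Minimal sketch of the retrospective performance metrics discussed above.
# Labels, probabilities and the 0.5 threshold are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # observed outcomes (hypothetical)
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.9, 0.2, 0.55])   # model risk estimates (hypothetical)

# Discrimination: how well the model ranks cases above non-cases.
auc = roc_auc_score(y_true, y_prob)

# Calibration (summary measure): agreement between predicted risk and observed frequency.
brier = brier_score_loss(y_true, y_prob)

# Sensitivity and specificity at an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"AUC={auc:.2f}, Brier={brier:.2f}, Se={sensitivity:.2f}, Sp={specificity:.2f}")
```

Notably, none of these numbers indicates whether the output reached a clinician in time to act on it, or whether acting on it changed care, which is precisely the gap between technical performance and clinical impact.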

Claims of clinical impact in medicine have historically required more than demonstration of technical performance alone. For instance, drug development typically requires progressively stronger evidence before clinical benefit is accepted, and oversight mechanisms from government agencies help determine when evidence is sufficient for approval, recommendation or reimbursement. For many reasons, including the rapid pace of technological change, heterogeneous applications and different incentives for evidence generation, the medical AI field has not yet developed comparable norms. Although regulatory frameworks are the subject of ongoing debate and development, they remain inadequate3. Published studies often emphasize technical validity over clinical usefulness4. Implementation decisions are frequently made before core questions of actionability, feasibility, safety and effectiveness have been adequately addressed5. In the absence of a consensus on evidentiary standards, those decisions may rely more on enthusiasm for early adoption than on consistent criteria. Without clearer rules and a direct mandate to provide robust evidence, the threshold for claiming value remains too variable.

Going forward, the medical AI field must develop a consistent framework that connects claims about an AI tool's clinical value to the type of evidence needed to support them. For example, claims of analytic performance should require robust validation in the intended setting and population, whereas claims of clinical actionability should require evidence that outputs are interpretable and can support reasonable decisions. Claims of workflow benefit should require implementation studies showing that tools can be integrated without introducing delay, burden or unintended harm. Claims of improved outcomes or efficiency should require stronger prospective evidence, including comparative evaluations against standard of care, where appropriate. Moreover, because model performance may shift over time, post-deployment monitoring should be treated as an institutional expectation rather than as a late, optional addition.
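One way such a claim-to-evidence mapping could be made operational, for instance in an institutional review checklist, is to encode it explicitly. The sketch below is a hypothetical encoding of the tiers described above; the category names and wording are illustrative assumptions, not a proposed standard.

```python
# Hypothetical encoding of the claim-to-evidence mapping described above.
# Categories and evidence descriptions are illustrative, not a standard.
from enum import Enum

class Claim(Enum):
    ANALYTIC_PERFORMANCE = "analytic performance"
    CLINICAL_ACTIONABILITY = "clinical actionability"
    WORKFLOW_BENEFIT = "workflow benefit"
    IMPROVED_OUTCOMES = "improved outcomes or efficiency"

MINIMUM_EVIDENCE = {
    Claim.ANALYTIC_PERFORMANCE: "robust validation in the intended setting and population",
    Claim.CLINICAL_ACTIONABILITY: "evidence that outputs are interpretable and support reasonable decisions",
    Claim.WORKFLOW_BENEFIT: "implementation studies showing integration without delay, burden or harm",
    Claim.IMPROVED_OUTCOMES: "prospective, comparative evaluation against standard of care",
}

def evidence_required(claim: Claim) -> str:
    """Return the minimum evidence tier for a given claim of clinical value."""
    return MINIMUM_EVIDENCE[claim]

print(evidence_required(Claim.IMPROVED_OUTCOMES))
```

The point of such an explicit mapping is not the code itself but the discipline it enforces: the stronger the claim, the stronger the evidence that must be recorded before the claim is accepted.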

Having such a framework does not mean that every AI tool must undergo all the staged phases of testing up to a randomized controlled trial before adoption, as is usually required for other medical interventions. In many cases, that would be impractical, given the high costs, the rapid updating of the models underpinning the tools, and the overall complexity and time needed to conduct such studies. At the same time, accepting retrospective performance alone as a sufficient basis for trust is not scientifically rigorous. Therefore, the goal should be proportional evidence, meaning that the stronger the claim, the stronger the evidence needed to support it.

This principle has practical implications for all stakeholders. For example, regulators should clarify which categories of medical AI tools require prospective evidence of clinical impact and which can enter practice under more limited claims. Healthcare organizations and administrators should distinguish among pilot implementation, operational use and evidence of benefit, rather than collapsing these into a single decision. Across these settings, evidence standards should be transparent, claim-specific and open to revision as tools evolve.

Scientific journals, as part of the research ecosystem, have a unique opportunity to define acceptable types of evidence. In emerging fields, the published literature is often viewed as establishing what constitutes valid evidence for an area of research or practice. By enforcing proportional evidentiary standards, journals can help to ensure that published claims of clinical impact rest on genuine evidence rather than on technical promise alone, a role that we will continue to support at Nature Medicine.