Synthetic data can benefit medical research — but risks must be recognized

EDITORIAL
10 September 2025

Artificially generated data can help to train AI models when real data are scant, but more focus is needed on validating the results.

Image caption: Synthetic data have the potential to improve aspects of health care, for example, in the rapid analysis of X-rays. Credit: Costfoto/NurPhoto/Getty

The revolution promised by the application of artificial intelligence (AI) to research and health care has been much discussed, but less attention has been paid to the sub-revolution following in its slipstream. Increasingly, AI models are being trained on ‘synthetic’ data. There is no universally agreed definition for this type of data, but, in this Editorial, we are talking about information used in medical research that was not collected in the real world. A growing number of synthetic data sets exist that were generated by a mathematical model or algorithm, sometimes incorporating real-world data, and are designed to mimic the presumed statistical properties of real-world data.

The use of such data has some clear benefits, for example, for hypothesis generation, for preliminary testing of an idea before actual data collection, and for anticipating the results of experiments. Synthetic data also have much potential to improve studies involving low- and middle-income countries, where real data can be scarce, and barriers to collection high. Synthetic data can also be shared more freely, in part because there are fewer risks of study participants being identified, proponents say [1].

As one example of their application in clinical settings, synthetic data are being used in AI models that can interpret patients’ X-ray scans and generate reference scans [2].
Worldwide, there is a shortage of radiologists, and real-world training data remain limited. The AI models can assist radiologists in making decisions, potentially helping them to work faster and with greater accuracy [3].

As we report this week (Nature https://doi.org/10.1038/d41586-025-02911-1; 2025), some universities and research institutions are waiving ethical review requirements for research in which synthetic data are being used instead of human data. Ordinarily, an independent ethics board would review how studies affect participants’ rights, safety, dignity and well-being. But because synthetic data are, at most, derived from human data, some researchers have been able to persuade their institutions that an ethical review isn’t needed. Considering the speed and scale at which AI is being adopted, this raises at least two concerns.

The first is how to better understand and mitigate the risk that people whose data have been used to generate AI models could be identified. This particular problem might well lessen as models move through their second, third, fourth and fifth iterations, in which synthetic data are essentially generated from other synthetic data and the link to real data becomes more remote. But it needs to be taken seriously if, for example, identities can be revealed without individuals’ consent [4].

The second concern is more deeply rooted. Anyone with skin in this game, such as producers, publishers and users of AI studies, needs to be reassured that the findings of AI models trained on synthetic data can be validated, not least because of the risk of ‘model collapse’. In this scenario, AI models trained on successive generations of synthetic data start to generate nonsense [5]. Validation can take different forms, but at its core is the principle that a result needs to be confirmed by researchers independent of those reporting it.
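
The ‘model collapse’ scenario described above can be illustrated with a toy simulation. The sketch below is not taken from this Editorial or from the cited study; it is a minimal illustration under stated assumptions: the "model" is just a Gaussian fitted to a small sample, each generation is trained only on data sampled from the previous generation's fit, and the function name `simulate_collapse` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Toy illustration: refit a Gaussian to its own synthetic samples.

    Returns the fitted standard deviation at each generation,
    starting with the spread of the assumed "real" distribution.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # assumed stand-in for the real-world data
    history = [std]
    for _ in range(generations):
        # Generate synthetic data from the current model only.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Train" the next model on that synthetic data
        # (maximum-likelihood estimates of mean and spread).
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    spread = simulate_collapse()
    print(f"initial spread: {spread[0]:.3g}")
    print(f"final spread:   {spread[-1]:.3g}")
```

In this toy setting, the fitted spread tends to shrink over the generations, so later models reproduce less and less of the variation in the original data. Real generative models are far more complex, and this sketch says nothing about how quickly, or whether, a particular system would degrade.
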
At present, such independent validation often does not happen, and there are no agreed guidelines for how it should. However, researchers are proposing ways in which it could be supported.

Zisis Kozlakidis, a data scientist for the World Health Organization, headquartered in Geneva, Switzerland, says that, in the absence of real-world data against which to test results from synthetic data, researchers should explain how they generated their synthetic data, describing their algorithm, parameters and assumptions. They could also propose how another group might validate their results.

In an article published in BMJ Evidence-Based Medicine in July, bioinformatics researcher Randi Foraker at the University of Missouri in Columbia and her colleagues suggest that there should be reporting standards for synthetic data alongside those that already exist for data and code availability [6]. They recommend that researchers work with publishers to develop these.

Marcel Binz at the Helmholtz Institute for Human-Centred AI in Munich, Germany, makes a powerful case for validation across all forms of AI (Nature https://doi.org/g9r6ct; 2025), following a study, on which he is a co-author, on how AI is being used to predict people’s decision-making [7]. The model that he and his colleagues developed, called Centaur, was trained on data from psychology experiments that recorded more than 10 million choices made by study participants across a range of decision-making tasks. Some researchers are sceptical about its ability to predict human behaviour. Binz told Nature’s news team that the model, which is freely accessible, needs to be externally validated. “It’s probably the worst version of Centaur that we will ever have, and it will only get better from here.”

The benefits of synthetic data are clear.
But the risks need to be recognized and mitigated, and the temptation to accept results as valid and accurate just because a computer says they are must be avoided at all costs.

Nature 645, 283 (2025)
doi: https://doi.org/10.1038/d41586-025-02869-0

References

1. van Breugel, B., Liu, T., Oglic, D. & van der Schaar, M. Nature Rev. Bioeng. 2, 991–1004 (2024).
2. Khosravi, B. et al. eBioMedicine 104, 105174 (2024).
3. Achour, N. et al. Health Technol. 15, 489–501 (2025).
4. Arora, A., Wagner, S. K., Carpenter, R., Jena, R. & Keane, P. A. Lancet Digit. Health 7, e157–e160 (2025).
5. Shumailov, I. et al. Nature 631, 755–759 (2024).
6. Foraker, R. et al. BMJ Evidence-Based Med. https://doi.org/10.1136/bmjebm-2024-113617 (2025).
7. Binz, M. et al. Nature 644, 1002–1009 (2025).