Lessons for synthetic data from care.data’s past


Comment | Open access | Published: 09 August 2025
Sahar Abdulrahman & Markus Trengove
npj Digital Medicine 8, 511 (2025)
Subjects: Data acquisition, Ethics, Government, Law, Medical research

The use of synthetic data to augment real-world data in healthcare can help AI models perform more accurately and more fairly across subgroups. By examining a parallel case study, NHS England's care.data platform, this paper explores why care.data failed and offers recommendations for future synthetic data initiatives, centring on confidentiality, consent and transparency as the key areas that must be addressed to encourage successful adoption.

Introduction

The UK Government's AI Opportunities Action Plan has highlighted synthetic data as an area that requires further exploration, as it may offer a way to share privacy-preserving versions of highly sensitive data, such as National Health Service (NHS) health data [1,2]. This paper adds a novel perspective to existing research by grounding policy recommendations for the use of synthetic health data in a real-world NHS case study that illustrates the challenges of accessing health data in a United Kingdom (UK) context. By exploring the care.data case study, a data platform announced and subsequently abandoned by NHS England [3], the causes of care.data's failure are drawn from the literature and examined. In this Comment, we discuss how frameworks for the responsible use of synthetic data must centre on confidentiality, consent and transparency in order to avoid a repetition of care.data's fate.

What is synthetic data?

Despite the wealth of literature discussing synthetic data, there remains a lack of consensus on what synthetic data is. A definition proposed by the Royal Society and the Alan Turing Institute refers to synthetic data as 'data that has been generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science task' [4]. The Office for National Statistics (ONS) uses a scale to further categorise synthetic data into six levels, better capturing the spectrum of data that falls under the definition [5]. At the start of the scale are 'structural' synthetic datasets that preserve the format and datatypes of the original data and can therefore be used only for very basic code testing, with no analytical value. Moving up the spectrum, univariate synthetic datasets replicate the marginal distributions of the original data, followed by multivariate datasets that replicate joint distributions. At the end of the spectrum are 'replica' synthetic datasets, generated to preserve the format, structure and conditional distributions of the original data, allowing the most accurate statistical analysis relative to the original data.
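To make the spectrum concrete, the following is a minimal sketch in Python of three points on the ONS scale, using a toy two-column dataset of our own invention. The variables, values and structure are illustrative assumptions, not taken from the paper or the ONS pilot.

```python
# Illustrative sketch (our own toy example): three points on the ONS
# synthetic data spectrum, applied to a small health-like dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy "original" data: age and systolic blood pressure, positively correlated.
age = rng.normal(55, 12, 1000)
sbp = 90 + 0.6 * age + rng.normal(0, 8, 1000)
original = pd.DataFrame({"age": age, "sbp": sbp})

# 1. Structural: preserves column names and dtypes only; no analytical value.
structural = pd.DataFrame({"age": np.zeros(10), "sbp": np.zeros(10)})

# 2. Univariate: each column drawn from its own marginal distribution,
#    independently, so the age-sbp relationship is lost.
univariate = pd.DataFrame({
    "age": rng.normal(original["age"].mean(), original["age"].std(), 1000),
    "sbp": rng.normal(original["sbp"].mean(), original["sbp"].std(), 1000),
})

# 3. Multivariate: sampled from a fitted joint distribution (here a
#    multivariate normal), so correlations between columns are preserved.
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()
multivariate = pd.DataFrame(rng.multivariate_normal(mean, cov, 1000),
                            columns=["age", "sbp"])

print(original.corr().loc["age", "sbp"])      # positive in the original
print(univariate.corr().loc["age", "sbp"])    # near zero: joint structure lost
print(multivariate.corr().loc["age", "sbp"])  # close to the original
```

Higher rungs of the scale ('replica' data preserving conditional distributions) would require richer generative models than this sketch, which is exactly where the fidelity gains, and the privacy risks discussed below, arise.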
Synthetic data could 'unlock' health data for researchers, resulting in models with both better and fairer performance. Deep learning, a subset of AI that uses neural networks to learn complex properties from data, has shown particular promise in clinical applications [6]. A significant barrier to developing deep learning models is data availability, as effective model training relies on learning from large volumes of data. A lack of sufficient data can lead models to 'underfit', where complex relationships in the data cannot be learnt [7]. Additionally, where data volumes are sufficient but class types are imbalanced (i.e., some groups are underrepresented), 'overfitting' can occur, limiting model generalisability in diverse real-world populations [7]. Increased access to privacy-preserving versions of original data could therefore address issues of both data volume and diversity.
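As a simplified illustration of using synthetic samples to address class imbalance, the sketch below interpolates new minority-class rows between nearest neighbours, in the style of SMOTE. The data, function name and parameters are our own assumptions; real initiatives would use purpose-built generators rather than this toy approach.

```python
# Illustrative sketch (our own, not from the paper): augmenting an
# underrepresented class with synthetic samples before model training.
import numpy as np

rng = np.random.default_rng(1)

def smote_like(minority: np.ndarray, n_new: int, k: int = 5) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between a sampled
    minority row and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from row i to every other minority row.
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the row itself
        j = rng.choice(neighbours)
        lam = rng.uniform()                   # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy imbalance: 900 majority rows vs 60 minority rows, 4 features each.
majority = rng.normal(0.0, 1.0, size=(900, 4))
minority = rng.normal(2.0, 1.0, size=(60, 4))

augmented_minority = np.vstack([minority, smote_like(minority, n_new=840)])
print(majority.shape, augmented_minority.shape)  # (900, 4) (900, 4)
```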
The UK has a unique opportunity for data-driven innovation: because the NHS is a nationally funded health service, it holds a wealth of data spanning lifetimes and systems (e.g., primary and secondary care) [8]. Moreover, given the UK's relatively diverse population, NHS data can be used to fill data gaps across socioeconomic groups, ethnicities and regions, helping to ensure that models generalise to real-world populations [8].

However, despite the proposed benefits of synthetic data, there are ethical issues that policymakers must consider. The degree to which synthetic data mimics the original data has implications for the privacy of artificially generated datasets; this is often referred to as the privacy-fidelity trade-off [4]. 'Replica' synthetic datasets have maximal fidelity, in that they most closely represent real-world data, whereas 'structural' synthetic datasets have the lowest fidelity. It is widely accepted that higher-fidelity synthetic data carries the highest re-identification risk, and there is as yet no agreed standard for what appropriate privacy looks like for synthetic data. Beyond re-identification risks, other concerns about synthetic data in healthcare include the risk of amplifying biases present in the original data and issues of data quality [9].

NHS England's care.data scheme

In 2013, NHS England announced the care.data scheme, which aimed to establish a universal platform bringing together data from the electronic health records of General Practice (GP) surgeries and from secondary care [3]. This would have allowed researchers, for the first time, to access comprehensive pseudonymised patient data spanning both primary and secondary care. Care.data was born out of the 2012 Health and Social Care Act, which created a legal basis for GP practices to share patient data [10]. Despite NHS England's rationale that this was needed to 'unlock' health data and foster innovation, the scheme was met with concern from both public and professional organisations, such as medConfidential and the Royal College of General Practitioners (RCGP) [10]. NHS England scrapped the scheme in 2016, by which point the platform had already received 1.5 million opt-outs [10]. By critically examining the reasons for care.data's failure, this paper offers policy recommendations for future synthetic data initiatives (see Fig. 1).

Fig. 1: Synthetic data policy recommendations. The themes of confidentiality, consent and transparency were identified from the care.data case study as critical areas to address; the recommendations for synthetic data policy in healthcare are therefore organised around these three areas, as outlined in the diagram.

Why did care.data fail?

Confidentiality

The risk of breaching patient confidentiality through re-identification became a central plank of the opposition to care.data, including from medConfidential, despite the scheme's legality [10]. Although pseudonymisation removes identifiers from data, a risk of re-identification remains. For this reason, pseudonymised data is bound by the UK General Data Protection Regulation provisions that apply to personal data [11]. Beyond public stakeholders, professional bodies such as the British Medical Association (BMA) and the RCGP also warned that confidentiality worries could damage the patient-doctor relationship and, in turn, patient care. NHS England's failure to reassure the public and professional bodies that the risk of re-identification was low enough to warrant data sharing contributed to care.data's abandonment. Although synthetic data differs from pseudonymised real-world data, both carry re-identification risks, and confidentiality is therefore a critical consideration for synthetic data policy.

Utilising privacy metrics to differentiate synthetic datasets into risk categories (e.g., low, medium and high) would help policymakers mitigate risks appropriately and reassure public and professional bodies about re-identification risk. Set thresholds of privacy risk should be agreed at a national level by cross-functional teams that bridge technical knowledge with sector-specific insights, led by the relevant government department. Low-fidelity data has the lowest risk of re-identification and would warrant much less stringent requirements than current NHS data access processes, allowing greater data sharing without compromising privacy. For example, as part of an NHS pilot, low-fidelity synthetic data made from Hospital Episode Statistics aggregate data is currently publicly available to download [12]. Medium-risk synthetic datasets have a higher re-identification risk and could therefore rely on additional safeguards, such as Trusted Research Environments (TREs). This type of risk stratification also serves as a design criterion for generating organisations: if TREs cannot be supported, low-fidelity synthetic data should be prioritised for generation. For synthetic datasets with the highest re-identification risks, the processes required for real-world data access should be followed in full. Given this, organisations that need high-fidelity synthetic data may instead choose to focus on real-world data acquisition, as the requirements for access would be the same.
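As a sketch of what such risk stratification could look like in practice, the Python below scores a synthetic dataset with one commonly used privacy proxy, distance to closest record (DCR), and buckets it into the low/medium/high tiers proposed above. The metric choice, thresholds and function names are our own illustrative assumptions, not agreed national standards.

```python
# Illustrative sketch: bucketing a synthetic dataset into risk tiers
# using distance to closest record (DCR). Thresholds are assumptions.
import numpy as np

def dcr(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Median Euclidean distance from each synthetic row to its nearest
    real row; smaller values suggest higher re-identification risk."""
    diffs = synthetic[:, None, :] - real[None, :, :]   # (n_syn, n_real, d)
    dists = np.linalg.norm(diffs, axis=2)
    return float(np.median(dists.min(axis=1)))

def risk_tier(score: float, low: float = 1.0, high: float = 0.25) -> str:
    """Map a DCR score onto the low/medium/high tiers described above."""
    if score >= low:
        return "low risk: relaxed access requirements"
    if score >= high:
        return "medium risk: access via a Trusted Research Environment"
    return "high risk: treat as real-world data for access purposes"

rng = np.random.default_rng(2)
real = rng.normal(size=(500, 6))
synthetic = real + rng.normal(scale=0.05, size=real.shape)  # near-copies

print(risk_tier(dcr(real, synthetic)))  # near-copies land in the high tier
```

In a real deployment, the nationally agreed thresholds would be calibrated against the sensitivity of the underlying data and validated with more robust privacy attacks than a single distance metric.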
Consent

A significant critique of care.data concerned the failure to adequately consent patients, which many saw as a violation of patient autonomy [13,14]. The strategies NHS England suggested for obtaining informed consent included posters displayed at GP practices and leaflets posted to homes [15]. The obvious problems with these approaches include the assumption that the materials would be read, as well as their exclusionary effect on patients for whom written text is not accessible (e.g., because of language barriers or literacy levels). Furthermore, the circulated unaddressed leaflets were often mistaken for junk mail, and households that had opted out of junk mail deliveries were missed altogether [16]. Among those who did receive and read the leaflet, feedback indicated that care.data was never mentioned by name and that detailed risk information was lacking, including the possibility of re-identification and details of how to opt out [16].

For synthetic data efforts to be successful, public acceptance is key. Although data sharing initiatives may be lawful, legal authority does not always equate to social legitimacy [17]. Carter et al. [17] explore this further, arguing that data sharing initiatives rely on a 'social contract' that requires trust and transparency in managing data, so that patients continue to consent to its use for research. Patients must understand what synthetic data is, its risks and benefits, and their rights as data subjects. Without proper provision of information, synthetic data initiatives are likely to face contestation similar to care.data's, because of a breakdown in the social contract on which data-sharing initiatives depend. Policymakers should therefore prioritise meaningful engagement with Patient and Public Involvement and Engagement groups.

Transparency

Another significant critique that contributed to care.data's abandonment centred on the backlash against the lack of transparency about who would be able to access the data [15]. In response to these concerns, the Care Act 2014 was amended to prohibit data release to certain commercial companies (i.e., marketing and insurance) [10]. Although research has shown that public concern about the commercial use of health data reduces when conditions are applied, such as requiring data access requests to have a clear public benefit, care.data's clarifications came too late [18]. More recently, NHS plans for data sharing through a federated learning platform have also faced obstacles, largely due to the controversy surrounding the award of a contract to Palantir, a multibillion-dollar US tech company [19].

In light of care.data's transparency failures, synthetic data initiatives must clearly communicate to patients who the intended users are prior to roll-out. To safeguard patients' interests and garner trust, external organisations requesting access to medium- and high-fidelity synthetic data should also go through a vetting process to ensure that their rationale for data use serves the public benefit. Furthermore, patients should be able to opt out of synthetic data generation from their real data, with the ability to choose whether commercial entities should be given access. This would help ease the concerns of individuals who oppose commercial access, respecting patient autonomy and allowing patient choice; a sketch of how such choices could be applied follows below.

In response to the current federated learning controversies stemming from NHS England awarding Palantir a £480 million contract to create and run its platform, synthetic data initiatives must also be transparent about who is creating the data [19]. A designated public body within the NHS would be best placed to own the creation of, and the management of access to, synthetic data, a move that would reassure the public and avoid the controversies currently seen around other privacy-enhancing technology (PET) initiatives. Where outsourcing is necessary, conflicts of interest must be properly considered and published to ensure partners can be trusted with sensitive data. For example, the OpenSAFELY federated platform, a publicly funded collaborative project, has gained support from the BMA, the RCGP and medConfidential, highlighting that trust in the same technology can be eroded or sustained depending on who manages the platform [20].
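One hypothetical way to operationalise the opt-out and commercial-access choices proposed above is a consent filter applied before any generation run. The record structure, flag names and logic below are our own illustration, not an existing NHS mechanism.

```python
# Hypothetical sketch: applying patient opt-out and commercial-access
# choices before synthetic data generation. Field names are invented.
from dataclasses import dataclass

@dataclass
class PatientRecord:
    nhs_pseudo_id: str        # pseudonymised identifier
    opted_out: bool           # excluded from synthetic generation entirely
    allow_commercial: bool    # consents to commercially accessible outputs

def eligible_records(records: list[PatientRecord],
                     commercial_use: bool) -> list[PatientRecord]:
    """Return records usable for a given generation run: opted-out
    patients are always excluded; runs feeding commercially accessible
    outputs additionally require explicit commercial consent."""
    return [r for r in records
            if not r.opted_out
            and (r.allow_commercial or not commercial_use)]

cohort = [
    PatientRecord("p001", opted_out=False, allow_commercial=True),
    PatientRecord("p002", opted_out=False, allow_commercial=False),
    PatientRecord("p003", opted_out=True,  allow_commercial=False),
]

print(len(eligible_records(cohort, commercial_use=False)))  # 2
print(len(eligible_records(cohort, commercial_use=True)))   # 1
```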
In summary, synthetic data offers promise in solving issues of data availability and imbalance when developing AI models. For synthetic data initiatives to be successful in a UK context, lessons from previous endeavours, such as care.data, must be learned, by prioritising patient confidentiality, consent and organisational transparency, with the ultimate aim of improving patient care.

Data availability

No datasets were generated or analysed during the current study.

References

1. Department for Science, Innovation & Technology. AI Opportunities Action Plan. (Department for Science, Innovation & Technology, 2025).
2. Blair, T. & Hague, W. The New AI Action Plan Offers the UK a Way to Get Back on Track. (Tony Blair Institute, 2025).
3. Godlee, F. What can we salvage from care.data? BMJ 354, i3907 (2016).
4. Jordon, J. et al. Synthetic Data – What, Why and How? https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf (2022).
5. Bates, A., Spakulova, I. & Mealor, A. Synthetic Data Pilot. (Office for National Statistics, 2019).
6. Alowais, S. A. et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med. Educ. 23, 689 (2023).
7. Aliferis, C. & Simon, G. Overfitting, underfitting and general model overconfidence and under-performance pitfalls and best practices in machine learning and AI. In Artificial Intelligence and Machine Learning in Health Care and Medical Sciences (eds Simon, G. J. & Aliferis, C.) 477–524 (Springer International Publishing, 2024).
8. Goldacre, B. Better, Broader, Safer: Using Health Data for Research and Analysis - Executive Summary. (Oxford Internet Institute, 2022).
9. Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 186 (2023).
10. Vezyridis, P. 'Kindling the fire' of NHS patient data exploitations: the care.data controversy in news media discourses. Soc. Sci. Med. 348, 116824 (2024).
11. Lodie, A. & Lauradoux, C. Is it personal data? Solving the Gordian knot of anonymisation. In Privacy Symposium 2024 (eds Hoepman, J.-H., Jensen, M., Porcedda, M. G., Schiffner, S. & Ziegler, S.) 83–109 (Springer Nature, 2025).
12. NHS Digital. Artificial Data Pilot. (NHS Digital, 2025).
13. Stancic, H. Trust and Records in an Open Digital Environment. (Routledge, 2021).
14. Hays, R. & Daker-White, G. The care.data consensus? A qualitative analysis of opinions expressed on Twitter. BMC Public Health 15, 838 (2015).
15. Sterckx, S., Rakic, V., Cockbain, J. & Borry, P. "You hoped we would sleep walk into accepting the collection of our data": controversies surrounding the UK care.data scheme and their wider relevance for biomedical research. Med. Health Care Philos. 19, 177–190 (2016).
16. McCartney, M. Care.data doesn't care enough about consent. BMJ 348, g2831 (2014).
17. Carter, P., Laurie, G. T. & Dixon-Woods, M. The social licence for research: why care.data ran into trouble. J. Med. Ethics 41, 404–409 (2015).
18. Kalkman, S. et al. Patients' and public views and attitudes towards the sharing of health data for research: a narrative review of the empirical evidence. J. Med. Ethics 48, 3–13 (2022).
19. Abbasi, K. Trust and the Palantir question. BMJ 388, r452 (2025).
20. Mahase, E. Researchers could soon access GP patient data—how will it work? BMJ 388, r375 (2025).
Acknowledgements

This research was completed as part of the GSK.ai fellowship within the Responsible AI department. The views expressed in the paper are the authors' own and do not necessarily reflect the views of GSK or GSK.ai.

Author information

Authors and affiliations: Sahar Abdulrahman & Markus Trengove, GSK.ai, GSK, King's Cross, London, UK.

Contributions: S.A. completed the literature search and planned, wrote and edited the manuscript. M.T. reviewed and edited the manuscript.

Corresponding author: Sahar Abdulrahman.

Ethics declarations

Competing interests: S.A. and M.T. are current employees and shareholders of GSK, a pharmaceutical company that conducts AI research.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.