Quality assessment of large language models’ output in maternal health


Introduction

Access to safe and reliable information, as well as the capacity to use it effectively, is critical during pregnancy, birth, and puerperium1. With the ultimate goal of improving outcomes, antenatal education is a priority to enhance maternal health literacy and increase women's engagement during pregnancy. In turn, interventions that take advantage of society's exponentially increasing digital resources represent a valuable opportunity to support sustainable development goals for maternal and child health, particularly in developing countries2.

Despite the challenges of universal accessibility and a persisting gender gap, internet use in Low- and Middle-Income Countries (LMIC) has grown over the last four years3. Nevertheless, common online sources of health information often lack content accuracy and suffer from poor accessibility and readability. This can lead to misperceptions about diagnosis, management, and prognosis, which may misguide or discourage patients from seeking appropriate care.

In this regard, chat-based artificial intelligence (AI) platforms based on Large Language Models (LLM) hold the potential to fundamentally improve healthcare information-seeking mechanisms. LLMs are advanced AI systems trained to understand and generate human-like text, and their applications in healthcare have gained momentum since their introduction in the early 2020s. The safe and responsible deployment of LLMs may provide accurate, reliable, and culturally relevant maternal healthcare information, a critical issue in LMIC2. In fact, AI may have the greatest potential benefit in the often resource-limited healthcare systems of LMIC.

Improving maternal health is one of the World Health Organization's key priorities, and uneven access to quality maternal health care poses a significant barrier to healthcare equity. Evaluation and continuous quality improvement are therefore paramount to successfully integrating LLMs in healthcare and ensuring their safety and reliability. To date, despite great enthusiasm, the applicability of LLMs to LMIC healthcare systems and maternal health has not been formally assessed. LLMs tailored towards LMIC may function as a source of antenatal education, and digital literacy may enable informed decisions, which are critical for women's empowerment during pregnancy, birth, and puerperium. Although the majority of LLMs are trained on English databases, English is often not the primary language in LMIC. Furthermore, the real-world utility of this technology is influenced by cultural nuances. Translation tools may therefore play a pivotal role in overcoming this barrier fairly and transparently. Without proper evaluation, LLMs risk spreading harmful or biased information, making assessment crucial for safety and reliability. Identifying key features of a high-performing LLM for LMIC healthcare scenarios could help guide the development and use of this technology in vulnerable communities.

Therefore, the objective of the current study was to assess the potential applicability of LLMs in LMIC healthcare systems. Specifically, we aimed to assess the quality of outputs generated by LLMs pertaining to maternal health. Using a mixed-methods, cross-sectional survey approach, an international panel of obstetrics and gynecology specialists from Brazil, the United States, and Pakistan assessed LLM-generated responses, in their native languages, to a set of questions relating to maternal health. The LLMs' responses were evaluated using metrics for information quality, clarity, readability, and adequacy for the target audience in technical and non-technical domains. To our knowledge, this is the first study to evaluate the potential applicability of LLMs as maternal healthcare resources across various languages.
Methods

This cross-sectional survey study adopted a mixed-methods approach, utilizing both quantitative and qualitative evaluation techniques to assess the performance of several LLMs in responding to a series of questions pertaining to maternal health. This study was approved by the Institutional Review Boards of the Universidade Federal de Minas Gerais (UFMG), The Aga Khan University, and The Ohio State University. We confirm that all research was performed in accordance with relevant guidelines/regulations and the Declaration of Helsinki. Informed consent was obtained from all participants. All interactions with the LLMs were conducted in compliance with OpenAI's use case policy and The Bill and Melinda Gates Foundation policies4.

Large language model selection

We compared the performance of four LLMs: GPT-3.5 (OpenAI, Inc., San Francisco, CA),5 a custom version of GPT-3.5, GPT-4,6 and Meditron-70B7. ChatGPT is an LLM based on the GPT architecture developed by OpenAI and built upon either GPT-3.5 or GPT-4; the former is freely available to all users, whereas the latter is an advanced version provided to paid subscribers8. Meditron-70B is an open-source medical LLM adapted from Llama 2 to the medical domain. GPT-3.5 and GPT-4 were chosen for their popularity, their applicability in a general context, and the vast scale of their training. Meditron-70B was selected because it was one of the largest LLMs specialized in the medical field available at the time of this study. Taking advantage of its availability, we also utilized Meditron's training dataset to fine-tune the custom version of GPT-3.5.

Custom GPT-3.5 was fine-tuned via OpenAI's proprietary interface using supervised learning with cross-entropy loss, batched inputs, mixed precision, and AdamW optimization. The training corpus included around 48.1B tokens from clinical and biomedical sources, preprocessed as described in the original Meditron pipeline. Hyperparameters (e.g., batch size, learning rate, epochs) and hardware specifications were not disclosed or user-configurable. No reinforcement learning (e.g., RLHF) or parameter-efficient methods (e.g., LoRA) were applied.
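As a rough, illustrative sketch only: the study used OpenAI's proprietary interface, and the submission mechanism, file name, corpus formatting, and model identifier below are assumptions rather than reported details. A supervised fine-tuning job of this general kind could be launched through OpenAI's fine-tuning API as follows.

```python
# Illustrative sketch only: file name, corpus format, and model identifier are
# assumptions; the actual job ran through OpenAI's proprietary interface with
# undisclosed, non-configurable hyperparameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted examples derived from the medical corpus
training_file = client.files.create(
    file=open("medical_corpus_chat_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a supervised fine-tuning job on the GPT-3.5 base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```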
Questions and composition

A set of three questions was obtained from a de-identified, curated maternal health Q&A database developed by specialists from UFMG,9 based on the specialists' clinical experience and relevant topics observed in clinical practice. These were subsequently translated into English and Urdu. All questions were simple and direct, modeled on common user inputs, each depicting one of the key phases surrounding pregnancy, birth, and puerperium (prenatal, labor, and nursing) and common topics seen in clinical practice. The questions were:

1. "I already had a C section, can I have a natural birth in my next pregnancy?"
2. "What are my pain relief options during labor and childbirth?"
3. "How many times a day should I breastfeed my baby?"

Each question was submitted once to the web chat interface of each of the four above-mentioned LLMs on March 12, 2024, and the responses were collected. All questions and generated answers are summarized in Supplementary Table 1.

LLM response generation protocol

A standardized prompt template was developed to guide the response generation of the various LLM architectures in a neutral and non-leading way, thus minimizing output bias. This was paramount to maintaining content integrity across the various scenarios presented. The template provided explicit instructions to operate under the persona of a general medical practitioner tasked with addressing medical inquiries in the most comprehensive and informative manner. Specifically, the models were directed to structure their responses using bullet points and paragraphs to enhance readability and clarity. The prompt further stipulated that each response should strive for completeness and be devoid of medical advice, thus focusing solely on providing informational content. Each LLM received identical instructions to ensure uniformity. The first generated response was always used in the analysis, preventing the variability introduced by regenerating or selecting among multiple responses.

The LLMs were engaged in two distinct rounds of response generation. In the first round ("Survey 1"), prompts were input in English (EN-US) and responses were output accordingly. These responses were then translated into Portuguese (PT-BR) and Urdu using the Google Translate API to standardize comparative analysis across the target demographics of Brazil, the United States, and Pakistan; in this way, evaluators assessed outputs in their native language10 (see the illustrative sketch at the end of this subsection). The second round ("Survey 2") involved direct prompting in Portuguese, with responses analyzed exclusively for the Brazilian branch of the study. This was done to evaluate whether responses generated directly in a language other than English differed in quality, completeness, adequacy, and clarity from those produced in English and later translated, allowing more granular data on the impact of prompting in LMIC languages. The structure of the LLM Response Generation Protocol is depicted in Supplementary Figs. 1 and 2.
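To illustrate the translation step described above, a minimal sketch using the google-cloud-translate client library is shown below; the library choice, helper function, and placeholder text are assumptions, as the study specifies only that the Google Translate API was used.

```python
# Illustrative sketch only: assumes the google-cloud-translate (v2) client and
# application-default credentials; the study reports only that the Google
# Translate API was used.
from google.cloud import translate_v2 as translate

client = translate.Client()

def translate_response(text: str, target: str) -> str:
    """Translate an English LLM response into a target language (e.g., 'pt', 'ur')."""
    result = client.translate(text, target_language=target, source_language="en")
    return result["translatedText"]

answer_en = "Example LLM response text from the English-language round."  # placeholder
answer_pt = translate_response(answer_en, "pt")  # Portuguese (PT-BR)
answer_ur = translate_response(answer_en, "ur")  # Urdu
```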
Evaluation

The evaluation was conducted by an international panel comprising 47 obstetrics and gynecology specialists from Brazil, the United States, and Pakistan. These settings provide geographic and socioeconomic diversity and represent distinct healthcare systems and cultural contexts. Each specialist was presented with a standardized survey that included the three sets of questions and answers for every LLM, resulting in a total of twelve responses per evaluator. Responses were assessed for information quality (i.e., whether the content was correct, complete, and relevant) and clarity (i.e., whether the response could be easily understood). If an evaluator gave an insufficient score to any of the questions, two follow-up queries were initiated to inquire about the reasons for that score and thus collect more granular data on the rationale behind the assessment. Each metric was evaluated using a five-point Likert scale, ranging from strong disagreement (1) to strong agreement (5) that the criteria were fully met. Evaluators were also invited to provide feedback outlining the strengths or weaknesses observed in each LLM response. After concluding their assessment, evaluators were asked to rank the four LLM answers according to their preference for each question. Finally, a brief section allowed qualitative comments on the applicability of this technology in daily practice. This structured approach allowed for a comprehensive comparison of the models' abilities to handle a variety of medical queries across different cultural and linguistic contexts.

To evaluate the quality, clarity, and adequacy of model responses when framed for professional versus general audiences, specialists conducted their evaluation in either a technical or a non-technical domain. Each specialist analyzed the adequacy of responses for the target audience to which they were randomly assigned, judging whether the responses were appropriate to be read by domain specialists (an audience with technical knowledge) or by lay individuals/regular patients (an audience without technical knowledge). In aggregate, to capture the nuances of the two domains, identify the influence of EN-US versus PT-BR prompting, and evaluate metrics for information quality and clarity in all three languages, a total of eight surveys were created (Supplementary Figs. 1 and 2).

To ensure the objectivity and reliability of the evaluation process, each evaluator independently assessed the LLM responses through a standardized online survey designed to elicit detailed scrutiny of the responses' relevance, accuracy, adequacy, and comprehensiveness. To safeguard against potential bias from peer influence, evaluators were blinded to the assessments made by their colleagues. Each evaluator was randomly assigned to respond to only one survey, either Survey 1 or Survey 2, in either the technical or the non-technical domain, in their native language. This methodological rigor was intended to enhance the validity of the study's findings by reducing subjective variability. All surveys are appended as supplementary material in Annex 1.

Statistical analysis

The Shapiro-Wilk test revealed a non-normal distribution of the data; therefore, the non-parametric Kruskal-Wallis test was utilized to compare scores between different LLM answers. Continuous variables were presented as medians with interquartile ranges (IQR) and compared using the Kruskal-Wallis test. Categorical variables were presented as numbers and percentages and compared with the chi-square test, or Fisher's exact test, to evaluate the hypothesis of independence. Descriptive statistics, including means and standard deviations (SD), were computed for all answers in the readability analysis. Given the skewed, ordinal nature of the data, the intraclass correlation coefficient (ICC) was employed to evaluate inter-rater reliability. Specifically, we used an average-rating, fixed-effects, consistency ICC model, focusing on the correlation of ratings among evaluators rather than their absolute agreement11,12. All tests were 2-sided, p
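For concreteness, the sketch below illustrates how these comparisons could be run in Python with SciPy and pingouin; the data file, column names, and library choices are illustrative assumptions rather than the study's actual analysis code. In pingouin, the average-rating, fixed-effects, consistency model corresponds to the ICC3k estimate.

```python
# Illustrative sketch only: file and column names are hypothetical; the study
# does not report which statistical software was used.
import pandas as pd
import pingouin as pg
from scipy import stats

ratings = pd.read_csv("likert_scores.csv")  # one row per evaluator x LLM response

# Kruskal-Wallis test comparing quality scores across the four LLMs
groups = [g["quality"].to_numpy() for _, g in ratings.groupby("model")]
kw_stat, kw_p = stats.kruskal(*groups)

# Chi-square test of independence for categorical outcomes
# (Fisher's exact test would replace this for sparse 2x2 tables)
table = pd.crosstab(ratings["model"], ratings["adequate"])
chi2, chi2_p, dof, _ = stats.chi2_contingency(table)

# Inter-rater reliability: average-rating, fixed-effects, consistency ICC (ICC3k)
icc = pg.intraclass_corr(
    data=ratings, targets="response_id", raters="evaluator", ratings="quality"
)
icc3k = icc.loc[icc["Type"] == "ICC3k", ["ICC", "CI95%"]]

print(f"Kruskal-Wallis: H={kw_stat:.2f}, p={kw_p:.3f}")
print(f"Chi-square: chi2={chi2:.2f}, df={dof}, p={chi2_p:.3f}")
print(icc3k)
```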