by Manuel Corpas, Alfredo Iacoangeli

Biobank-scale datasets such as the UK Biobank have become foundational resources for advancing biomedical discovery. Yet the complexity and heterogeneity of these resources, spanning genomics, imaging, clinical records, and metadata, pose substantial barriers to access and interpretation. Large Language Models (LLMs) offer a promising avenue for making such datasets more navigable through natural language interfaces. However, the extent to which current general-purpose LLMs can retrieve and synthesize biobank-specific insights has not yet been systematically evaluated. In this study, we present a reproducible, multi-metric evaluation framework for benchmarking the capabilities of leading LLMs. We evaluated six models: Gemini 3 Pro, Claude Opus 4.5, Claude Sonnet 4, GPT-5.2, Mistral Large, and DeepSeek V3, on four benchmark tasks designed to assess biobank-related knowledge retrieval. We evaluated model performance across six dimensions (coverage, semantic accuracy, factual correctness, domain knowledge, reasoning quality, and biobank specificity) and assessed output consistency using curated UK Biobank references and a robust random baseline. All models outperformed the baseline by 16× to 25×, with strong statistical separation (p
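The fold-over-baseline comparison described in the abstract can be sketched as follows. This is a minimal illustration only: the score values, the 0-1 scale, and the use of a one-sided permutation test are assumptions for demonstration, not the paper's actual data or statistical procedure.

```python
import random
import statistics

def fold_over_baseline(model_scores, baseline_scores):
    """Fold-change of a model's mean score over the random baseline's mean."""
    return statistics.mean(model_scores) / statistics.mean(baseline_scores)

def permutation_p(model_scores, baseline_scores, n_perm=10_000, seed=0):
    """One-sided permutation test: probability of a mean difference at least
    as large as the observed one under random label shuffling."""
    rng = random.Random(seed)
    observed = statistics.mean(model_scores) - statistics.mean(baseline_scores)
    pooled = list(model_scores) + list(baseline_scores)
    k = len(model_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:k]) - statistics.mean(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-task scores on a 0-1 scale (not the study's data).
model = [0.82, 0.79, 0.85, 0.81, 0.88]
baseline = [0.04, 0.05, 0.03, 0.04, 0.05]
print(round(fold_over_baseline(model, baseline), 1))  # → 19.8
print(permutation_p(model, baseline) < 0.05)          # → True
```

With well-separated score distributions like these, the fold-change falls in the 16×-25× range the abstract reports, and the permutation p-value is driven entirely by how often shuffled labels reproduce the observed separation.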