A multi-view similarity network fusion framework for syndrome discovery from aggregated health records


Syndrome discovery, the identification of clinically meaningful groupings of signs and symptoms, is a foundational but labor-intensive task in syndromic surveillance, and the COVID-19 pandemic exposed the rigidity of expert-curated definitions in the face of novel threats. Unsupervised, data-driven methods are well suited to this problem but remain underused. We propose an unsupervised framework based on Similarity Network Fusion (SNF) that operates on only five variables: diagnosis code, sex, age group, epidemiological week and year, and encounter count. Each diagnosis code is represented through three complementary views corresponding to the fundamental questions of syndromic surveillance: what condition is recorded (clinical, via SapBERT embeddings), who is affected (demographic, via chi-square distances), and when it occurs (temporal, via Move-Split-Merge distances). The fused affinity matrix is partitioned by spectral clustering, and the resulting clusters are exported directly in the Open Syndrome Definition (OSD) format for downstream integration. To our knowledge, this is the first application of SNF to syndrome discovery. We apply the framework to 72.9 million primary care encounters across ten Brazilian municipalities. Treating each city as an independent experiment with no shared training signal yields 47 candidate syndromes, 72% of which are rated fully valid by an expert blind to the procedure. By requiring no predefined targets, the framework discovers candidate syndromes at scale, including ones never explicitly sought, and emits them in a deployable format, shortening the path from emerging signal to usable definition.
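To make the fusion step concrete, the following minimal sketch fuses three per-view distance matrices with a simplified SNF and then splits the fused graph in two with a spectral cut. It is an illustration under stated assumptions, not the authors' implementation. The three views are synthetic Euclidean stand-ins for the clinical, demographic, and temporal distances. The Gaussian kernel uses one global scale rather than SNF's local scaling, and the two-way Fiedler-vector split stands in for full spectral clustering. All function names and parameter values (`mu`, `k`, `iterations`) are hypothetical.

```python
import numpy as np

def affinity_from_distance(dist, mu=0.5):
    """Gaussian affinity with one global scale (a simplification of SNF's local scaling)."""
    scale = mu * np.median(dist[dist > 0])
    return np.exp(-(dist ** 2) / (2 * scale ** 2))

def full_kernel(w):
    """Row-normalised kernel P: 1/2 on the diagonal, the other 1/2 spread over off-diagonal entries."""
    off = w.copy()
    np.fill_diagonal(off, 0.0)
    p = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(p, 0.5)
    return p

def local_kernel(w, k):
    """Sparse kernel S: keep each row's k strongest neighbours, then row-normalise."""
    off = w.copy()
    np.fill_diagonal(off, 0.0)
    idx = np.argsort(-off, axis=1)[:, :k]
    rows = np.arange(off.shape[0])[:, None]
    s = np.zeros_like(off)
    s[rows, idx] = off[rows, idx]
    return s / s.sum(axis=1, keepdims=True)

def snf(distances, k=2, iterations=10):
    """Fuse several distance matrices into one symmetric affinity matrix."""
    ws = [affinity_from_distance(d) for d in distances]
    ps = [full_kernel(w) for w in ws]
    ss = [local_kernel(w, k) for w in ws]
    for _ in range(iterations):
        updated = []
        for v, s in enumerate(ss):
            # Diffuse the average of the *other* views through this view's local graph.
            others = np.mean([p for u, p in enumerate(ps) if u != v], axis=0)
            updated.append(full_kernel(s @ others @ s.T))
        ps = updated
    fused = np.mean(ps, axis=0)
    return (fused + fused.T) / 2.0

def two_way_spectral_cut(affinity):
    """Split nodes by the sign of the Fiedler vector of the normalised Laplacian."""
    a = affinity.copy()
    np.fill_diagonal(a, 0.0)
    inv_sqrt_deg = 1.0 / np.sqrt(a.sum(axis=1))
    lap = np.eye(len(a)) - inv_sqrt_deg[:, None] * a * inv_sqrt_deg[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues returned in ascending order
    return (vecs[:, 1] * inv_sqrt_deg > 0).astype(int)

def pairwise_euclidean(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

# Synthetic stand-in data: six hypothetical diagnosis codes in two groups of three,
# observed through three noisy views.
rng = np.random.default_rng(0)
centres = np.repeat(np.array([[0.0, 0.0], [5.0, 5.0]]), 3, axis=0)
views = [pairwise_euclidean(centres + rng.normal(scale=0.3, size=centres.shape))
         for _ in range(3)]

fused = snf(views)
labels = two_way_spectral_cut(fused)
```

In the framework described above, the three inputs would instead be distances between SapBERT embeddings, chi-square distances between demographic profiles, and Move-Split-Merge distances between weekly time series. A multi-cluster spectral clustering step on the fused matrix would replace the two-way cut.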