SOuLMuSiC, a novel tool for predicting the impact of mutations on protein solubility

Wait 5 sec.

SOuLMuSiC, a novel tool for predicting the impact of mutations on protein solubilityDownload PDF Download PDF ArticleOpen accessPublished: 29 July 2025Simone Attanasio1,2,Jean Kwasigroch1,2,Marianne Rooman1,2 na1 &…Fabrizio Pucci1,2 na1 Scientific Reports volume 15, Article number: 27531 (2025) Cite this articleSubjectsComputational biology and bioinformaticsComputational modelsProtein designProtein foldingProteome informaticsAbstractProtein solubility problems arise in a wide range of applications, from antibody development to enzyme production, and are linked to several major disorders, including cataracts and Alzheimer’s diseases. To assist scientists in designing proteins with improved solubility and better understand solubility-related diseases, we introduce SOuLMuSiC, a computational tool for the fast and accurate prediction of the impact of single-site mutations on protein solubility. Our model is based on a simple artificial neural network that takes as input a series of features, including biophysical properties of wild-type and mutated residues, energetic values computed using various statistical potentials, and mutational scores derived from protein language models. SOuLMuSiC has been trained on a curated dataset of about 700 single-site mutations with known solubility values, collected and manually verified from original literature. It significantly outperforms current state-of-the-art predictors in strict cross validation: the Spearman correlation reaches 0.5 when solubility changes are represented categorically; for the subset with quantitative values, it increases to 0.7. SOuLMuSiC also shows good performance on external datasets containing high-throughput enzyme solubility-related data as well as protein aggregation propensities. In summary, SOuLMuSiC is a valuable tool for identifying mutations that impact protein solubility, and can play a major role in the rational design of proteins with improved solubility and in understanding genetic variants’ effect. It is freely available for academic use at http://babylone.ulb.ac.be/SoulMuSiC/.IntroductionSolubility is a key biophysical property of proteins, the lack of which is often a major bottleneck for their production and storage1,2,3. Numerous applications in protein structure determination1, pharmaceutics (e.g., production of protein-based therapeutics)4,5, and biotechnology (e.g., protein heterologous expression)6,7 require high-concentration protein formulation. Moreover, problems related to poor protein solubility and aggregation are central in a wide series of protein misfolding-related diseases, also known as proteinopathies. For example, Alzheimer’s and Parkinson’s diseases8,9 are characterized by the growth of insoluble deposits of misfolded proteins, cataracts are related to a decrease in the solubility of human \(\gamma\)-crystallin10,11, and amylin (Islet amyloid polypeptide - IAPP) is involved in diabetes12. The different solubility states of these proteins are also important for understanding how they spread and propagate between cells13.Despite the crucial importance of protein solubility and the considerable efforts made by the scientific community, a full comprehension of the mechanisms behind protein solubility remains out of reach. Solubility is influenced by a complex interplay of intrinsic factors such as residue-residue interactions, protein flexibility, amino-acid composition, and hydrophobicity, as well as extrinsic variables such as pH, environmental temperature, ionic strength, and protein concentration2,14,15,16,17.In recent decades, multiple experimental and computational approaches have been developed to engineer proteins with improved solubility properties3,18,19. For example, when inclusion bodies form in heterologous expression, experimental protocols, including the solubilization of these bodies through the addition of denaturant compounds followed by the refolding of the solubilized proteins, have been designed to get bioactive recombinant proteins20,21. Another experimental approach to enhance the solubility of a target protein involves fusing it with solubility-enhancing tags, even though this requires additional chromatographic steps to obtain tag-free recombinant proteins22.All these procedures remain, however, labor-intensive and quite unsatisfactory. Therefore, computational methods have been designed to speed up and complement the experimental approaches. A series of methods for predicting the solubility of a full protein have been developed, which leverage experimental solubility data23 to train computational models15,24,25,26,27,28,29,30.The challenge of predicting the impact of mutations on protein solubility has been underinvestigated due to the scarcity of mutagenesis data. Indeed, only recently has a database of mutations with known changes in solubility been collected31. Moreover, the variability in environmental conditions and in the methods used for variant characterization make constructing datasets and testing variant predictors challenging.Over the past decade, a few methods have been developed to address this challenge, ranging from simple linear combinations of features to machine learning approaches, either considering the amino-acid sequence or the three-dimensional structure. Renowned methods include CamSol32, SODA33 and PON-Sol234. However, these methods do not achieve satisfactory performance and moreover, the limited datasets they use for training make them prone to overfitting. A rigorous benchmark of their performances does not exist at the moment in the literature. For these reasons, we proceeded in this study to collect and manually curate a new dataset of mutations with an experimentally determined solubility value, and used it to develop a new method called SOuLMuSiC for the prediction of the impact of mutations on solubility. We compared SOuLMuSiC’s performance with the above mentioned algorithms and tested its ability to generalize to unseen data.MethodsDataset construction and curationWe collected a set of mutations whose effects on protein solubility have been experimentally characterized. Although protein solubility can be rigorously defined in thermodynamics as the protein concentration in a saturated solution that is in equilibrium with a solid phase2, its experimental measurement is far from trivial. Different techniques were used to estimate the solubility of proteins, either directly or indirectly35,36. They include measuring protein activity in a pellet after centrifugation37, sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS-PAGE)38, and dynamic light scattering39.To set up our first mutation dataset \(\mathcal {D}_{Sol}\), we started to collect mutations and their experimentally measured solubility changes :$$\begin{aligned} \Delta S = S_{\text {mut}} - S_{\text {wt}}, \end{aligned}$$(1)from the mutational solubility database SoluProtMutDB31. We then performed a literature search involving manually reviewing and curation of each SoluProtMutDB entry in the original literature to correct possible errors. Additionally, we extended the search to incorporate recent literature and patent data that were not included in SoluProtMutDB. We excluded data originating from high-throughput techniques such as fluorescence-activated cell sorting, and from mutations in membrane proteins. We focused exclusively on single-site mutations. This led to a total of 702 curated entries in \(\mathcal {D}_{Sol}\).Due to the heterogeneity of experimental approaches and the dependence of solubility on the environmental conditions used in the experiments, such as pH, temperature, and ion buffer, the quantitative values of the changes in solubility upon mutation are often not precisely reported in the original articles. Only 225 entries out of 702 have a numerical estimation of \(\Delta S\) (in %). The remaining entries are annotated by their authors with qualitative labels such as ’strongly decrease/increase solubility’, ’decrease/increase solubility,’ or ’no impact.’ We therefore decided to use five discrete solubility scores (−3, −1, 0, 1, 3) or equivalently, five solubility classes (− −, −, =, +, ++). We classified the effects of mutations into these classes according to qualitative labels or, when available, based on their reported \(\Delta S\) values, as defined in Table 1. Negative solubility scores represent a decrease in solubility upon mutation and positive values, an increase. A value of zero indicates no significant change in solubility. We also defined the subset of \(\mathcal {D}_{Sol}\) that contains the 225 entries with quantitative \(\Delta S\) values, called \(\mathcal {D}_{Sol}^Q\).Table 1 Correspondence between quantitative solubility changes upon mutations (in %) and qualitative solubility scores and classes.Full size tableWe collected the 3-dimensional (3D) structures of all wild-type proteins in \(\mathcal {D}_{Sol}\). We considered the experimental structures in the Protein Data Bank (PDB)40 when available, but only if their resolution was smaller than 2.5 Å and their sequence 100% identical to the one considered. Otherwise, we modeled the structure using AlphaFold 241. In the few cases in which the oligomeric structures were too large to be modeled with AlphaFold, we used the homology modeling algorithm SWISS-MODEL42. The final dataset \(\mathcal {D}_{Sol}\) consists of 702 mutations with experimentally characterized effects on protein solubility, inserted in 80 proteins with experimental or modeled structures. For more details about the \(\mathcal {D}_{Sol}\) dataset, see Supplementary Section 1.We set up a second dataset, called \(\mathcal {D}_{Inv}\), containing the reverse mutations of \(\mathcal {D}_{Sol}\). More precisely, for each mutation (wt \(\rightarrow\) mut) in \(\mathcal {D}_{Sol}\) with a given \(\Delta S\) value, we included the reverse mutation (mut \(\rightarrow\) wt) in \(\mathcal {D}_{Inv}\) and assigned it a value of \(-\Delta S\). We created this dataset to test the antisymmetry properties of our predictor, as non-symmetric models that are biased towards the training set are often constructed, as shown in43. The structures of the 702 mutant proteins in \(\mathcal {D}_{Inv}\) were modeled with the homology algorithm Modeller44 using as template the structures contained in \(\mathcal {D}_{Sol}\).As an additional test mutation dataset, called \(\mathcal {D}_{LGK}\), we used solubility data of Levoglucosan kinase (LGK) from Lipomyces starkeyi. This enzyme catalyzes the phosphorylation of levoglucosan45. Deep mutational scanning was performed on LGK using yeast surface display (YSD)46: proteins were fused with an N-terminal domain to localize the protein on the outer cell surface, and with a C-terminal epitope tag, which binds to a fluorescent antibody, enabling the identification of only the variants expressed on the cell surface. Misfolded protein variants are less expressed on the cell surface as the proteasome degrades them to ensure protein quality control. In total, \(\mathcal {D}_{LGK}\) consists of 6,246 single-site mutations with known solubility values. Note that these values are only partially related to real solubility as the proteasome can also degrade misfolded soluble proteins. The structure that we used for LGK is homodimeric and has the PDB code 4ZFV.In the last dataset, referred to as \(\mathcal {D}_{A\beta }\), we collected the aggregation propensity of variants of A\(\beta\)42, a protein known to play a key role in the pathogenesis of Alzheimer’s disease. The variant scores were measured47 using a yeast-based selection assay in which A\(\beta\) is fused to the essential protein dihydrofolate reductase. Consequently, the aggregation of the variants is inversely related to the yeast growth. In total, we collected 790 variants with known changes in aggregation propensity. These scores are only partially related to solubility, as aggregation phenomena differ from poor solubility precipitation. The structure that we used for A\(\beta\)42 has the PDB code 1IYT.The datasets \(\mathcal {D}_{Sol}\), \(\mathcal {D}_{Inv}\), \(\mathcal {D}_{LGK}\) and \(\mathcal {D}_{A\beta }\), as well as all structures used in this study, can be retrieved from our GitHub repository at github.com/3BioCompBio/SOuLMuSiC.FeaturesIn our SOuLMuSiC model, we used two types of features: structure-based and sequence-based. The former are mainly based on statistical potentials, and the latter, on various amino-acid scales and on protein large language models. Below, we provide a brief description of these features.\(\bullet\) Statistical potentials, which are well-known mean force potentials extracted from frequencies of associations between sequence elements, se, and structure motifs, st, in datasets of protein structures48,49. Using this formalism, the folding free energy \(\Delta W(st, se)\) of the sequence-structure pair (st, se) can be computed in terms of the probabilities of se, st and (st, se) using the inverse Boltzmann law, as defined in the first equality :$$\begin{aligned} \Delta W(st, se) = -k_B T \log \frac{P(st, se)}{P(st) P(se)} \simeq -k_B T \log \frac{F(st, se)}{F(st) F(se)} \end{aligned}$$(2)where \(k_B\) is the Boltzmann constant and \(T\) the absolute temperature taken to be room temperature49. As shown in the second approximate equality, the probabilities P can be estimated from the frequencies F of observation of se, st and (st, se) in a high-quality, non-redundant dataset of experimental protein 3D structures.In our model, we used four types of statistical potentials, differing in the structure and sequence elements considered. Their list and characteristics are given in Table 2. Among these four potentials, two describe local interactions along the polypeptide chain and two are distance potentials that describe tertiary interactions and the likelihood of amino acids being separated by specific spatial distances. Note that more potentials can be defined, but we combined some of them to limit the number of free parameters that we have to optimize and thus to avoid overfitting.Once these potentials were derived, we used them to compute the change in folding free energy \(\Delta \Delta G\) between the wild type (wt) and mutant (mut) proteins. Considering for example the distance potentials \(\Delta W_{sds}\) defined in Table 2, the associated free energy change \(\Delta \Delta G_{SDS}\) is defined as:$$\begin{aligned} \Delta \Delta G_{SDS} = \Delta G_{SDS} (\text {mut}) - \Delta G_{SDS} (\text {wt}) \end{aligned}$$(3)where$$\begin{aligned} \Delta G_{SDS} = \sum _{i