Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods

Wait 5 sec.

MainEarth is home to several millions of species1. Among these, most are unknown2 and rare3. Innovations in sensor technologies are now providing unprecedented capacity to record patterns and changes in the abundance and distribution of all kinds of taxa, from the named to the previously unnamed and from the rare to the common. These technologies include DNA-based monitoring, passive acoustic monitoring and visual sensors4,5. By allowing the efficient recording of thousands to hundreds of thousands of species in time and space, the accumulation of high-dimensional ‘novel community data’ is transforming our access to information on species distributions and abundances4. As a particularly exciting development, the emergence of novel community data allows us to target the speciose groups accounting for the main part of global biodiversity1,2. Where species records so far have been massively biased toward vertebrates, one of the least species-rich taxa3, recent methods are now making hyper-diverse taxa such as arthropods and fungi arguably easier to sample than vertebrates and plants. As these speciose taxa can be mass-sampled and mass-identified, we can derive automated characterizations of what taxa occur where5,6,7. Nonetheless, the revolution in the generation of data is awaiting a matching insurgence of methods to analyze the data.While most species on Earth are rare, these are the species that we know least about, partially because rare species are the most challenging to model8. Paradoxically, the rare species also encompass the taxa in greatest need of protection, and thus the very species for which information on their distributions and ecological requirements is most critical (the rare species paradox9). Understanding biodiversity change necessitates models that can relate species occurrences to environmental, biotic and spatial predictors, and which can predict changes in species communities with changes in the state of these drivers10. Hence, the need for predictive tools for rare species has been repeatedly highlighted11,12,13,14. However, the inherent rarity of most species results in highly skewed species-abundance distributions, where a few species are common whereas most species are present at few sites in low numbers. Typical approaches to species-level modeling will then impose a cutoff on species occurrences or abundances15,16, arguing that for the rarest species, the data are simply insufficient for any quantitative inference regarding the drivers of their distribution. In a world where rare is common3, this can and will typically amount to rejecting most data, and all the information there then remains hidden. To make the most of increasingly available data, we need modeling approaches that can fully exploit such data.With species distribution models, rare species may be modeled through ensemble predictions from multiple small models, each of which contains just a few predictors to avoid overfitting9,17,18. Because closely related species are generally ecologically more similar than distantly related species19,20, phylogenetic information may be used to infer the distributions of rare species21,22,23,24,25. Joint species distribution models (JSDM)26 allow leveling up by modeling large numbers of species simultaneously. This enables efficient borrowing of information across species through their shared responses to environmental variation27. Furthermore, when data on species phylogenies and/or traits are available, information can be borrowed especially across similar species28,29. This can lead to improved predictions, especially for rare species30.The high-dimensional, and often extremely sparse, nature of species occurrence data, compounded with spatiotemporal and phylogenetic dependencies, presents major challenges for statistical analyses and computation. High-performance computing can scale some existing JSDMs to thousands of species31,32. Two-stage methods, which make small concessions by cutting dependence between species via approximate likelihoods33,34, can scale to tens of thousands of species while still retaining reasonable uncertainty estimates. Unfortunately, these approaches do not yet scale to the millions of species that comprise the Earth’s biodiversity1. What is more, they may perform poorly for extremely sparse rare species, by compromising model structures in the interest of gaining computational advantage.ResultsThe HMSC frameworkIn this paper, we apply Bayesian transfer learning35 to develop the common to rare transfer learning (CORAL) approach (Fig. 1). Transfer learning refers to a broad class of multi-stage analysis methods that leverage information from a pretrained model to improve performance for a new but related inference task. In a Bayesian context, this is often achieved by using the posterior model from one dataset to define an intelligent prior model for another dataset. Sharing information between models can improve parameter estimates and substantially boost out-of-sample performance, particularly when studying new, smaller datasets. Our transfer learning method builds on the hierarchical modeling of species communities (HMSC)10,29,36 approach to joint species distribution modeling26. The core idea of CORAL is very general and will thus apply to many other JSDM approaches, too. What makes its application in the HMSC context so intuitive is that HMSC models species responses to predictors as a function of species traits and phylogenetic relationships. This feature can be efficiently harnessed for transfer learning.Fig. 1: High-level description of the CORAL modeling approach and the generic CORAL model structure.CORAL is based on fitting a backbone JSDM model to a subset of the most common species in the data and then borrowing information from this backbone to model the rare species. The backbone model learns about latent factors representing relevant missing environmental predictors, as well as about the species responses to both the measured and the latent predictors. The backbone model provides an informative prior distribution for each rare species. This is particularly efficient when we have access to phylogenetic or trait information, allowing information to be borrowed especially from common species closely related to the rare species, or species sharing similar traits with the rare species. CORAL simplifies a fully Bayesian JSDM by replacing latent factors with a pre-estimated point estimate and by accounting for dependence from common to rare species, but not for dependence from rare to common species.Full size imageIn brief, HMSC is a multivariate generalized linear model fitted in a Bayesian framework. As a response it considers a matrix of species occurrences or abundances. We exemplify our approach with presence–absence data, denoting by yij = 1 if species j (with j = 1,…ns) is present in sample i (with i = 1,…ny) and yij = 0 if this is not the case. Presence–absence data are modeled in HMSC with probit regression: $\Pr \left({y}_{{ij}}=1\right)=\Phi ({L}_{{ij}})$, where $\Phi (.)$ is the standard normal cumulative distribution function and Lij is the linear predictor modeled as:$${L}_{{ij}}=\mathop{\sum }\limits_{k}^{{n}_{\mathrm{c}}}{x}_{{ik}}{\beta }_{{kj}}+\mathop{\sum }\limits_{k}^{{n}_{f}}{\eta }_{{ik}}{\lambda }_{{kj}},$$(1)where xik are measured predictors, ηik are latent predictors, and βkj and λkj are regression coefficients quantifying responses of the species to the measured and the latent predictors. The latent features induce within-sample dependence across species; these features may encode characteristics of the habitat, the environment and the spatiotemporal setting not captured by the xiks. HMSC uses a Bayesian hierarchical model to (1) automatically infer how many latent features nf are needed and (2) to borrow information across species in inferring the βkjs. For point (2), HMSC estimates to what degree taxonomically or phylogenetically related species, or species with similar traits, show similar responses βkj to environmental variation through the multivariate normal distribution10,28$${\rm{vec}}\left(B\right) \sim N\left(\mu ,P\otimes V\right).$$(2)Here, B is the matrix of the regression parameters βkj of the ns species, μ is the average response, P = ρC + (1 − ρ)Ins is a weighted average between the phylogenetic or taxonomic correlation matrix C and the identity matrix I corresponding to unrelated species, 0 ≤ ρ ≤ 1 is the strength of the phylogenetic signal, and V is the variance–covariance matrix of species-specific deviations from the average μ. The average response μ is further modeled as μ = vec(ΓTT), where T is a matrix of species traits, Γ are the estimated responses of the traits to environmental variation and the superscript T denotes the matrix transpose. With equation (2), HMSC learns if and to what extent related species, or species with similar traits, show similar environmental responses. This allows for effective borrowing of information among species; for example, improving parameter estimation for rare species, for which it would be difficult to obtain accurate estimates if considering the data in isolation from the community context29. As a result, HMSC shows higher predictive performance compared to approaches that do not enable such borrowing of information30.CORAL priors for rare speciesOur key idea is that even if it is not feasible to include 100,000+ species in a JSDM model such as HMSC, one can still borrow information from the common species. The structure of our approach follows naturally from three assumptions, namely that (1) users have enough data to perform high-quality inference on common species without leveraging rare species data, (2) information from these common species is relevant for modeling rare species, and (3) rare species may be viewed as conditionally independent given the common species data and measured sample covariates. This suggests a two-stage analysis that first studies the common species jointly and then studies each rare species independently using the results of the common species analysis.The first stage of CORAL is to fit an HMSC to the common species to pre-estimate latent factors (Fig. 1). From this analysis we obtain a point estimate of the latent features ηi = ηi1,… ${\eta}_{{in}_{f}}$), which provides key information not captured in xi = (xi1,… ${x}_{{in}_{c}}$) about environmental and habitat conditions and the overall biological community represented in sample i. We define a new covariate vector ${\widetilde{\bf{x}}}_{i}={\left({\widetilde{x}}_{i1},\ldots ,{\widetilde{x}}_{i({n}_{c}+{n}_{f})}\right)}^{T}$ by concatenating xi and ηi to be used as a fixed predictor in the second stage of CORAL, which fits HMSC to the common species to provide a backbone model (Fig. 1). The third stage fits CORAL models (independent Bayesian probit models) to each rare species: $\Pr \left({y}_{{ir}}=1\right)=\Phi ({\widetilde{x}}_{i}^{T}{\beta }_{r})$, for $r\in {{\mathcal{J}}}_{r}$ with ${{\mathcal{J}}}_{r}\subset \{1,\ldots ,n_s\}$ the set of rare species (Fig. 1). To reduce mean square error in inferring βr for $j\in {{\mathcal{J}}}_{r}$, we construct a prior that adaptively shrinks toward the common species coefficients accounting for taxonomic or phylogenetic similarity.Our prior is motivated by the prior for fixed-effects coefficients in HMSC. To simplify inference and learn relevant hyper-parameters, we first rerun HMSC with the expanded covariate vector ${\widetilde{\bf{x}}}_{i}$. Under HMSC, the prior conditional distribution of the rare species coefficients given the common species coefficients is$${\beta }_{r} \sim N\left({m}_{r},{S}_{r}\right).$$(3)Here, the mean is given by$${m}_{r}=\varGamma {\bf{t}}_{r}+\left[\left(\rho {c}_{r}^{T}\right)\left({P}^{-1}\otimes {I}_{{n}_{\mathrm{c}}}\right)\right]\left({\rm{vec}}\left(B\right)-\mu \right)$$(4)where $P=\rho C+\left(1-\rho \right){I}_{n_s}$. The variance–covariance matrix is S = krV, where the variance scaling factor kr is given by$${k}_{r}=1-\rho {c}_{r}^{T}{P}^{-1}{\bf{c}}_{r}.$$(5)The vector cr encodes relatedness between a rare species $r\in {{\mathcal{J}}}_{r}$ and all the ns species in the backbone analysis and tr is the trait vector for this rare species. As HMSC is fitted to data with Bayesian inference, parameter uncertainty can be accounted for by defining the prior as a mixture of multivariate normal distributions (each defined by equations (3)–(5)) over the posterior samples. To achieve a simple functional form for the prior, we approximated the mixture by a single multivariate normal distribution, the mean and variance–covariance matrix of which we set equal to those of the mixture.We refer to the above approach as CORAL. Figure 1 shows the full mathematical structure of this approach, with each box corresponding to a separate stage of CORAL inference. This contrasts with the (computationally intractable) joint modeling approach, which would estimate all parameters for all species simultaneously. CORAL is likely to perform well when assumptions (1)–(3) hold: that is, when there is high-quality common species data that spans the phylogenetic tree and when the backbone model estimates that species responses are phylogenetically structured and/or influenced by species traits. To quantify the benefits of CORAL, we compare its performance to that of a baseline model that does not benefit from the backbone analysis. In other words, for the baseline model we fit $\Pr \left({y}_{{ir}}=1\right)=\Phi ({x}_{i}^{T}{\beta }_{r})$, separately for $r\in {{\mathcal{J}}}_{r}$ using a simple Gaussian prior ${\beta }_{r} \sim N({m}_{r},{S}_{r})$. We expect CORAL to have substantial advantages over the baseline model due to two considerations: ${\widetilde{\bf{x}}}_{i}$ contains important latent factor information on top of xi, and CORAL allows the borrowing of information from the βjs for common species to rare species. As both CORAL and the baseline models can be fitted independently for $r\in {{\mathcal{J}}}_{r}$, computational time scales linearly with the number of species. As a result, these computations can be trivially parallelized allowing for inference and prediction for hundreds of thousands or even millions of species.To enable easy application of the CORAL approach to high-dimensional biodiversity data, we provide an R package for fitting these models, visualizing the results and generating predictions37. This software package also includes a simulated case study that demonstrates how the CORAL approach is able to recover the true parameter values that were used to simulate the data.Case study on Malagasy arthropodsWe tested the approach in the context of metabarcoding data on Malagasy arthropods. We applied Malaise trap sampling in 53 locations across Madagascar, each of which was relatively undisturbed and where the vegetation represented the conditions of the local environment. We then applied high-throughput cytochrome c oxidase subunit I (COI) metabarcoding38 and the OptimOTU pipeline39 to score the occurrences of 255,188 species-level operational taxonomic units (henceforth, species) in 2,874 samples. To create a backbone model of common species, we included those 876 species that occurred in at least 50 samples. This left those 254,312 species that occurred fewer than 50 times in the data as rare species, which we model by the CORAL approach. We note that the threshold of 50 occurrences is relatively high so some of the rare species are not so rare. This choice was made to test the hypothesis that borrowing information from the backbone model changes predictions and inference especially for the very rare species, but less so for more common species. Most of these rare species were extremely rare in the sense that 182,402 species (71% of all rare species) were detected in one sample only. Among these extremely rare species, 1,479 were singletons, that is, represented by a single sequence. Some of these taxa may be artifacts, reflecting chimeric sequences or sequencing error. However, most (99.4%) of the rare species were represented by more than one sequence. Thus, the potential interpretation of some sequencing errors as false species is unlikely to qualitatively influence our conclusions.As simple and frequently used predictors of species presence, we included covariates related to seasonality, climatic conditions and sequencing effort. Climatic conditions were modeled through the second order polynomials of mean annual temperature and mean annual precipitation40, while including the interaction between these two climatic predictors. We modeled seasonality through periodic functions $\sin (2\pi d/365)$, $\cos (2\pi d/365)$, $\sin (4\pi d/365)$ and $\cos (4\pi d/365)$, where d is the day of sampling. To capture site-level and sample-level variation not captured by the measured predictors, we included ten site-level (n = 53 sites) and four sample-level (n = 2,874 samples) latent variables. Variation in sequencing effort was modeled by including log-transformed sequencing depth as a predictor. As a proxy of phylogeny, we used taxonomic assignments at the levels of kingdom, phylum, class, order, family, subfamily, tribe, genus and species, including assignments to pseudotaxa for those cases that could not be reliably classified to previously known taxa.The common species responded especially to site-level variation (Fig. 2a). This was shown both by responses to climatic variables, which contributed 48% of the explained variation, and by responses to the site-level random factors, which contributed 42% of the explained variation. The effects of the remaining predictors were much less pronounced, with seasonality contributing 3% of the explained variation, sample-level latent factors 7% and sequencing depth 0.1%. As we did not include any traits in the model, we only based the CORAL models on borrowing information on taxonomic relatedness. The responses of the species to the predictors were strongly phylogenetically structured (posterior mean ρ = 0.65, posterior probability $\Pr \left(\rho > 0\right)=1.00$), thus providing potential for borrowing information especially from related species.Fig. 2: Estimated responses of the species to measured and latent predictors.a,b, The responses are shown for the backbone model of common species (a) and for the CORAL model of all species (b). Responses that were estimated to be positive (red large dots for common species and pink small dots for rare species) or negative (blue large dots for the common species and cyan small dots for the rare species) with at least 95% posterior probability are highlighted. The dots have been made partially transparent and jittered in the horizontal direction to show the responses to many predictors for a very large number of species.Full size imageThe variance scaling factor k varied between 0.13 and 0.70, with a mean value of 0.34, thus showing a substantial reduction in variance. As expected, it was lowest for species with close relatives in the backbone model (Fig. 3a). The conditional prior models predicted variation in the occurrences of the rare species better than random (Fig. 3b). This result is nontrivial as the predictions are made by a completely independent model that has not seen any data for the focal species. The accuracy of the prior predictions increased with the level of relatedness between the focal species and the species in the backbone model and the predictions were more accurate for species occurring at least ten times in the data than for the very rare species (Fig. 3b).Fig. 3: Conditional prior models for rare species constructed by borrowing information from the backbone model of common species.a, Prior model precision measured by the variance scaling factor k (equation 5), as shown in relation to the taxonomic level shared with the closest relative in the backbone model. The numbers on top of the bars indicate the number of rare species in each category. b, Discrimination powers of the conditional prior models, shown separately for each rank of the closest relative in the backbone model (different colors of bars) and for two prevalence classes (at most ten occurrences, left bars; more than ten occurrences, right bars). The blue line shows the null expectation AUC = 0.5. In both panels, the lines show the medians, the boxes the lower and upper quartiles, and the whiskers the minimum and maximum values. In b, the numbers of datapoints, (species) included in each box plot are (from left to right) 29,447, 609, 9,725, 145, 31,565, 692, 71,924, 2,545, 20,806, 899, 13,299, 753, 12,477 and 724.Full size imageTo compare the baseline and CORAL models in terms of inference, we fitted them to all of the 254,312 rare species. Combining the parameter estimates from the backbone and the CORAL models then enabled us to reveal the responses of all species (common and rare) to the measured and latent predictors (Fig. 2b). These responses illustrate how CORAL transfers information from common species to rare species, as in Fig. 2b blocks of red dots tend to spread pink dots in their surroundings, and blue dots tend to spread cyan dots in their surroundings, meaning that common species induce similar responses to taxonomically related rare species. However, there are exceptions to this general rule, as Fig. 2b shows the CORAL posteriors rather than the CORAL priors. Thus, if the data for a rare species has sufficient evidence of for example positive response even if the related common species show negative responses, the estimate of the rare species will be positive. By updating the conditional prior from the backbone model of common species with data from the focal rare species, we achieved improved predictions in the sense that the CORAL models showed higher precision than the baseline models (Fig. 4). This was especially the case for the very rare species (such as the one exemplified in Fig. 4b), for which the baseline models led to very large credible intervals, as may be expected for the low information contained in few occurrences. For more common species (such as the one exemplified in Fig. 4a), the increase in precision was smaller (Fig. 4c). The increase in precision increased with relatedness between the focal species and the species included in the backbone model (Fig. 4c), thus mirroring the relation seen between relatedness and the variance scaling factor (Fig. 3a).Fig. 4: Comparison of inference between CORAL and baseline models.a,b, A specific prediction for two example species, one that is relatively common (a, the wall spider Garcorops madagascar, ten occurrences) and another that is very rare (b, the deer fly Chrysops madagascarensis, one occurrence). The panels show the posterior mean (line) and the interquartile posterior range (shaded area) of the linear predictor under changing precipitation, keeping temperature at its mean value over the data. c, Comparison of the posterior variance between baseline and CORAL models systematically for all species. We averaged the posterior variance over the environmental predictors (excluding intercept and latent factors). The panel shows the difference between posterior variance in the CORAL and baseline models. Thus, for values below 0 (the blue line) the CORAL model shows smaller variance. In c, the left-hand boxes correspond to very rare species (1–10 occurrences), the right-hand boxes to relatively common species (11–49 occurrences), the lines show the medians, the boxes the lower and upper quartiles, and the whiskers the minimum and maximum values. In c, the numbers of datapoints (species) included in each box plot are (from left to right) 29,447, 609, 9,725, 145, 31,565, 692, 71,924, 2,545, 20,806, 899, 13,299, 753, 12,477 and 724.Full size imageOne benefit of the CORAL approach is that its posterior distribution is presented analytically rather than through posterior samples obtained through Markov chain Monte Carlo sampling. This is achieved by approximating the CORAL posterior for each species (both the common and the rare) by a multivariate normal distribution. This saves storage space, which could otherwise become limiting for models with very large numbers of species. The multivariate normal presentation of the CORAL posterior also simplifies downstream analyses as posterior mean occurrence probabilities can be computed analytically without Markov chain Monte Carlo sampling. The use of an analytical approximation may, however, introduce model misspecification, the extent of which we explored by comparing the posterior predictive distribution to the data in terms of relevant summary statistics (Fig. 5). The CORAL model fitted to the Malagasy arthropod data was well calibrated in terms of generally predicting the number of times each species was observed, except for some overestimation for the rarest species (Fig. 5a). The model also satisfactorily predicted the number of species present in each sample, but the overestimation in the occurrences of the rarest species translated to some overestimation of species richness (Fig. 5b). The model fit was uniform across ranges of temperature (Fig. 5c) and humidity (Fig. 5d), suggesting no substantial misspecification in terms of how the effects of these covariates were modeled.Fig. 5: Verification that CORAL posteriors are consistent with the observed data, both in terms of the overall scale of the predicted probabilities as well as the learned covariate effects.a, The expected number of observations for each species under the CORAL posterior compared to the observed prevalence, with the identity function, is shown as a red line in all figures. b, Expected species richness for each sample under the CORAL posterior compared to the observed richness. c,d, Observed versus predicted proportion of occurrences below the median of temperature (c) and precipitation (d), shown for species that occur at least ten times in the data.Full size imageTo compare the baseline and CORAL models in terms of predictive power, we considered the 22,140 species that were not included in the backbone model but occurred at least five times in the data. We applied twofold cross validation, where we randomized the folds separately for each species, resampling until both folds included at least 40% of the occurrences. We compared the models using the area under the curve (AUC), Tjur’s R2, area under the precision recall curve (PRAUC), Brier score, negative log-likelihood and log-determinant posterior covariance. Together, these metrics provide a comprehensive overview of model performance covering predictive power, well calibrated probabilities and useful inference. All metrics of predictive performance improved considerably when moving from the baseline model to the CORAL model: AUC from 0.86 to 0.94, Tjur’s R2 from 0.03 to 0.08, PRAUC from 0.07 to 0.16, Brier score from 0.004 to 0.003, negative log-likelihood from 0.023 to 0.016 and log determinant from −28.2 to −36.4. All these improvements were significant with P 0\right)=1.00\)), forming the basis for borrowing information especially across related species. However, even without phylogenetic signal in the data, or alternatively by fitting a model without phylogeny, CORAL makes it possible to borrow information from the backbone model of common species by identifying sample-level and site-level latent factors, as well as by basing the conditional mean on the average response of all species.To illustrate the scale of the gain, we reiterate the proportion of rare species in our samples: had we imposed a cutoff of species occurrence in 50 samples, we would have omitted 254,312 out of 255,188 species (99.7%), retaining 876 species (0.3%) of the species pool. Leaving the rare species unmodelled would hardly be an efficient use of the massive data painstakingly acquired. For the 22,140 species (8.7%) that occurred at least five times in the data but were not included in the backbone model, we scored a substantial improvement in predictive power by borrowing information from the more common species. This is a major achievement, as it shows how the limited information inherent in the distribution of rare species may be leveraged by gleaning information from more common species.While this study focused on methodological development, our findings are also of major interest for understanding the eco-evolutionary community assembly processes of the Malagasy fauna. We found seasonality and climatic responses of arthropods to vary with their phylogenetic relatedness, suggesting that their distributions across Madagascar are partially constrained by their ancestral niche. This region is characterized by extreme levels of endemism at both a regional and a very small scale41,42,43. Nonetheless, in adapting to local conditions, the species appear to maintain a strong signal of their ancestral niche.MethodsDeriving the CORAL priorCORAL is motivated by the default prior for coefficients in HMSC. Under this prior, the prior for a species r that is not part of the backbone model (that is, a rare species) is given by ${\beta }_{r}{|B}\sim N({m}_{r},{S}_{r})$, with the mean (4) and variance (5) given by the conditional multivariate normal formulas. The moments of this distribution are functions of HMSC parameters including Γ, ρ, V and B and do not include information from the common species data, Yc, a priori. Fitting the backbone model produces a posterior distribution, π, over these parameters, which in turn implies a posterior marginal distribution for the rare species coefficients,$$p\left({{{\beta }}}_{r}|{Y}_{\mathrm{c}}\right)=\int N\left({{{\beta }}}_{r};{m}_{r},{S}_{r}\right){\rm{\pi }}\left({m}_{r},{S}_{r}|{Y}_{\mathrm{c}}\right){{{\mathrm{d}}m}}_{r}{{{\mathrm{d}}S}}_{r}$$This updated distribution is our desired rare species prior. As this distribution is analytically intractable due to the integral over the posterior; we approximate it with a Gaussian:$$p\left({{{\beta }}}_{r}|{Y}_{\mathrm{c}}\right)\approx N\left({{{\beta }}}_{r};{m}_{r}^{{\prime} },{S}_{r}^{{\prime} }\right).$$The mean ${m}_{r}^{{\prime} }$ and variance ${S}_{r}^{{\prime} }$ of this Gaussian are chosen to be the mean and variance of $p\left({\beta }_{r}|{Y}_{\mathrm{c}}\right)$, respectively. These can be calculated using the laws of total expectation and/or variance, resulting in simple expressions in terms of posterior means/variances: ${m}_{r}^{{\prime} }={E}_{{\rm{\pi }}}\left[{m}_{r}\right]$ and ${S}_{r}^{{\prime} }={E}_{{\rm{\pi }}}\left[{S}_{r}\right]+{V}_{{\rm{\pi }}}\left[{m}_{r}\right]$. In practice, we approximate posterior means and/or variances using Monte Carlo with posterior samples returned by HMSC. This completes specification of the CORAL prior.Computational detailsWe fitted the backbone model with a high-performance computing accelerated version32 of the R package Hmsc36, sampling each of the four chains for 37,500 iterations. Of these chains, we omitted the first 12,500 iterations as transient and then thinned the remaining chains by 100 to obtain 250 samples per chain and thus 1,000 posterior samples in total.For each rare species, we fitted a single-species model where we either did not (the baseline model) or did (the CORAL model) use information from the backbone models of common species. The baseline models were simple probit models with a Gaussian prior on the regression coefficients. The baseline models did not include the latent factors as predictors, and they assumed a default prior distribution for the species responses (N(0, 10) for the intercept and N(0, 1) for fixed effect coefficients). In the CORAL models, we included the latent factors as predictors, and assumed the conditional prior distribution based on equations (3)–(5). We obtained 5000 samples after 2,500 transient iterations for each species for both the baseline and CORAL models using MCMCpack44.For each species, we summarized the CORAL model in terms of the mean μ and variance–covariance matrix Σ of the posterior samples. As the model contained 25 parameters (including the intercept), the model for each species was thus represented by 25 + 25(25 + 1)/2 = 350 parameters (accounting for the symmetry of Σ). The collection of models for all the 255,188 species thus contained roughly 89 million parameters, which resulted in the manageable file size of around 1.1 GB. We approximate the CORAL posterior through the multivariate normal distribution N(μ,Σ). For predictor vector xi, the posterior mean of the linear predictor can be then computed as ${x}_{i}^{T}\mu$, and the posterior mean of the occurrence probability as $\Phi ({x}_{i}^{T}\mu /\sqrt{1+{x}_{i}^{T}\Sigma {x}_{i}})$.Metrics used to evaluate model performanceAUC is the probability a randomly chosen positive sample has a higher predicted probability than a randomly chosen negative sample. Tjur’s R2 is a pseudo-R2 value, which can be read like any other R2 value, but typically reaches lower values45. The PRAUC metric quantifies true positives and is useful for analyzing highly imbalanced data where the minority class is of primary interest. The Brier score is the average squared error between predicted probabilities and labels: this metric penalizes overconfidence. Negative log-likelihood directly measures goodness of fit under the proposed Bernoulli model. The determinant of the posterior covariance determines the volume of a 95% credible interval for fixed effect coefficients under a Gaussian approximation, smaller intervals meaning more confident inference. Together, these metrics provide a detailed summary of discrimination ability (AUC, PRAUC), confidence (R2, Brier score), goodness of fit (negative log-likelihood) and precision (log determinant).Sampling of Malagasy arthropodsThe sampling was conducted as part of the worldwide LIFEPLAN biodiversity sampling design46. We selected 53 locations across Madagascar that were relatively undisturbed and where the vegetation represents the conditions of the local environment. Of the sites, 28 were sampled in a spatially nested sampling design with decreasing distances between them (50 km, 5 km and 500 m apart). The other 25 sites were spread across different forested habitats in Madagascar (dry, lowland and montane forests), at elevations ranging from 8 to 1,592 m above sea level. We continuously collected 1-week samples of flying arthropods in 95% ethanol using Malaise traps (ez-Malaise Trap, MegaView Science Co.) Arthropod samples used in this study comply with the regulations for the export and exchange of research samples outlined in the Convention of Biology Diversity and the Convention on International Trade in Endangered Species of Wild Fauna and Flora. Permits to research, collect, and export arthropods were obtained from the Ministry of Environment and Forest as part of an ongoing collaboration between the Ministry of Environment and Forest, Madagascar National Parks, Parc Botanique et Zoologique de Tsimbazaza and the Madagascar Biodiversity Center. Export authorization was provided by the Director of Natural Resources (approval numbers 229/23/MEDD/SG/DGGE/DAPRNE/SCBE.Re). For a detailed description of the sampling, sample shipping and handling, and steps related to DNA extraction and sequencing, we refer to the LIFEPLAN Malaise sample metabarcoding protocol47.The CORAL method can be applied to sample x species occurrence data generated by a wide variety of detection technologies and analysis pipelines. In this paper, we applied CORAL to DNA sequence data analyzed using the OptimOTU bioinformatics pipeline39, which was originally developed for the Global Spore Sampling Project48 and updated to apply to arthropod COI sequence data as part of the LIFEPLAN biodiversity sampling project46. The OptimOTU workflow for COI sequence data consists of primer removal, quality filtering, denoising, de novo and reference-based chimera removal, flagging likely nonanimal sequences, removal of putative nuclear-mitochondrial pseudogenes, probabilistic taxonomic assignment and finally taxonomically guided hierarchical clustering. The OptimOTU pipeline is implemented using the targets v.1.5.1 workflow management package49, here executed using the crew v.0.9.0 (ref. 50) and crew.cluster v.0.3.0 (ref. 51) backends in R v.4.2.3 (ref. 52) on the Puhti cluster at CSC—IT Center For Science, Finland. This yielded a full taxonomic tree with approximate placeholder taxa to group those sequences that could not be reliably identified.Inclusion and ethics statementThe case study on Malagasy arthropods included local researchers (D.R. and E.T.R.) and was conducted in collaboration with a local partner (Madagascar Biodiversity Center). The roles and responsibilities among collaborators were agreed ahead of the research through an Access and Benefit Sharing Agreement.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.Data availabilityThe data needed to reproduce the analyses are available via Zenodo at https://doi.org/10.5281/zenodo.11076832 (ref. 37). All raw sequence data are archived on mBRAVE and are publicly available via the European Nucleotide Archive (ENA) at https://www.ebi.ac.uk/ena; project accession number PRJEB86111; run accession numbers ERR15009869–ERR15018787; sample IDs for each accession and download URLs are available via Zenodo at https://doi.org/10.5281/zenodo.11076832(ref. 37).Code availabilityThe R-based CORAL pipeline needed to reproduce the analyses is available via Zenodo at https://doi.org/10.5281/zenodo.11076832 (ref. 37). The bioinformatics pipeline is available via GitHub at https://github.com/brendanf/CORAL_bioinfo.ReferencesMora, C., Tittensor, D. P., Adl, S., Simpson, A. G. B. & Worm, B. How many species are there on Earth and in the ocean? PLoS Biol. 9, e1001127 (2011).Article CAS PubMed PubMed Central Google Scholar Stork, N. E. How many species of insects and other terrestrial arthropods are there on Earth? Annu. Rev. Entomol. 63, 31–45 (2018).Article CAS PubMed Google Scholar Callaghan, C. T., Borda-de-Água, L., van Klink, R., Rozzi, R. & Pereira, H. M. Unveiling global species abundance distributions. Nat. Ecol. Evol. 7, 1600–1609 (2023).Article PubMed PubMed Central Google Scholar Hartig, F. et al. Novel community data in ecology-properties and prospects. Trends Ecol. Evol. 39, 280–293 (2024).Article CAS PubMed Google Scholar van Klink, R. et al. Towards a global toolkit for insect biodiversity monitoring. Phil. Trans. Roy. Soc. B 379, 20230101 (2024).Article Google Scholar Větrovský, T. et al. A meta-analysis of global fungal distribution reveals climate-driven patterns. Nat. Commun. 10, 5142 (2019).Article PubMed PubMed Central Google Scholar Tedersoo, L. et al. Global patterns in endemicity and vulnerability of soil fungi. Glob. Chang. Biol. 28, 6696–6710 (2022).Article CAS PubMed PubMed Central Google Scholar Jeliazkov, A. et al. Sampling and modelling rare species: conceptual guidelines for the neglected majority. Glob. Chang. Biol. 28, 3754–3777 (2022).Article CAS PubMed Google Scholar Lomba, A. et al. Overcoming the rare species modelling paradox: a novel hierarchical framework applied to an Iberian endemic plant. Biol. Conserv. 143, 2647–2657 (2010).Article Google Scholar Ovaskainen, O. et al. How to make more out of community data? A conceptual framework and its implementation as models and software. Ecol. Lett. 20, 561–576 (2017).Article PubMed Google Scholar Helmstetter, N. A., Conway, C. J., Stevens, B. S. & Goldberg, A. R. Balancing transferability and complexity of species distribution models for rare species conservation. Divers. Distrib. 27, 95–108 (2021).Article Google Scholar Didham, R. K. et al. Interpreting insect declines: seven challenges and a way forward. Insect Conserv. Divers. 13, 103–114 (2020).Article Google Scholar Galante, P. J. et al. The challenge of modeling niches and distributions for data‐poor species: a comprehensive approach to model complexity. Ecography 41, 726–736 (2018).Article Google Scholar Zhang, C., Chen, Y., Xu, B., Xue, Y. & Ren, Y. Improving prediction of rare species’ distribution from community data. Sci. Rep. 10, 12230 (2020).Article CAS PubMed PubMed Central Google Scholar Novotny, V. et al. Low beta diversity of herbivorous insects in tropical forests. Nature 448, 692–695 (2007).Article CAS PubMed Google Scholar van Proosdij, A. S. J., Sosef, M. S. M., Wieringa, J. J. & Raes, N. Minimum required number of specimen records to develop accurate species distribution models. Ecography 39, 542–552 (2016).Article Google Scholar Le Lay, G., Engler, R., Franc, E. & Guisan, A. Prospective sampling based on model ensembles improves the detection of rare species. Ecography 33, 1015–1027 (2010).Article Google Scholar Breiner, F. T., Guisan, A., Bergamini, A. & Nobis, M. P. Overcoming limitations of modelling rare species by using ensembles of small models. Methods Ecol. Evol. 6, 1210–1218 (2015).Article Google Scholar Cadotte, M. W., Cardinale, B. J. & Oakley, T. H. Evolutionary history and the effect of biodiversity on plant productivity. Proc. Natl Acad. Sci. USA 105, 17012–17017 (2008).Article CAS PubMed PubMed Central Google Scholar Wiens, J. J. et al. Niche conservatism as an emerging principle in ecology and conservation biology. Ecol. Lett. 13, 1310–1324 (2010).Article PubMed Google Scholar Guénard, G., von der Ohe, P. C., de Zwart, D., Legendre, P. & Lek, S. Using phylogenetic information to predict species tolerances to toxic chemicals. Ecol. Appl. 21, 3178–3190 (2011).Article Google Scholar Mondanaro, A. et al. ENphylo: a new method to model the distribution of extremely rare species. Methods Ecol. Evol. 14, 911–922 (2023).Article Google Scholar Qiao, H., Peterson, A. T., Ji, L. & Hu, J. Using data from related species to overcome spatial sampling bias and associated limitations in ecological niche modelling. Methods Ecol. Evol. 8, 1804–1812 (2017).Article Google Scholar Morales‐Castilla, I., Davies, T. J., Pearse, W. D. & Peres‐Neto, P. Combining phylogeny and co‐occurrence to improve single species distribution models. Glob. Ecol. Biogeogr. 26, 740–752 (2017).Article Google Scholar Smith, A. B., Godsoe, W., Rodríguez-Sánchez, F., Wang, H.-H. & Warren, D. Niche estimation above and below the species level. Trends Ecol. Evol. 34, 260–273 (2019).Article PubMed Google Scholar Warton, D. I. et al. So many variables: joint modeling in community ecology. Trends Ecol. Evol. 30, 766–779 (2015).Article PubMed Google Scholar Ovaskainen, O. & Soininen, J. Making more out of sparse data: hierarchical modeling of species communities. Ecology 92, 289–295 (2011).Article PubMed Google Scholar Abrego, N., Norberg, A. & Ovaskainen, O. Measuring and predicting the influence of traits on the assembly processes of wood‐inhabiting fungi. J. Ecol. 105, 1070–1081 (2017).Article Google Scholar Ovaskainen, O. & Abrego, N. Joint Species Distribution Modelling (Cambridge Univ. Press, 2020).Norberg, A. et al. A comprehensive evaluation of predictive performance of 33 species distribution models at species and community levels. Ecol. Monogr. 89, e01370 (2019).Article Google Scholar Pichler, M. & Hartig, F. A new joint species distribution model for faster and more accurate inference of species associations from big community data. Methods Ecol. Evol. 12, 2159–2173 (2021).Article Google Scholar Rahman, A. U., Tikhonov, G., Oksanen, J., Rossi, T. & Ovaskainen, O. Accelerating joint species distribution modeling with HMSC-HPC: a 1000× faster GPU deployment. PLoS Comp. Biol. 20, e1011914 (2024).Article CAS Google Scholar Ting, B., Wright, F. & Zhou, Y.-H. Fast multivariate probit estimation via a two-stage composite likelihood. Stat. Biosci. 14, 533–549 (2022).Article Google Scholar Chakraborty, A., Ou, R. & Dunson, D. B. Bayesian inference on high-dimensional multivariate binary responses. J. Am. Stat. Assoc. 119, 2560–2571 (2023).Article PubMed PubMed Central Google Scholar Suder, P. M., Xu, J. & Dunson, D. B. Bayesian transfer learning. Preprint at https://doi.org/10.48550/arXiv.2312.13484 (2023).Tikhonov, G. et al. Joint species distribution modelling with the R‐package HMSC. Methods Ecol. Evol. 11, 442–447 (2020).Article PubMed PubMed Central Google Scholar Ovaskainen, O. et al. Data and script pipeline for: common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods. Zenodo https://doi.org/10.5281/zenodo.11076832 (2025).Elbrecht, V. et al. Validation of COI metabarcoding primers for terrestrial arthropods. PeerJ 7, e7745 (2019).Article PubMed PubMed Central Google Scholar Furneaux, B. et al. OptimOTU: taxonomically aware OTU clustering with optimized thresholds and a bioinformatics workflow for metabarcoding data. Preprint at https://doi.org/10.48550/arXiv.2502.10350 (2025).Wouters, H. Global Bioclimatic Indicators from 1979 to 2018 Derived from Reanalysis (Copernicus Climate Change Service (C3S) Climate Data Store (CDS), 2021).Antonelli, A. et al. Madagascar’s extraordinary biodiversity: evolution, distribution, and use. Science 378, eabf0869 (2022).Article CAS PubMed Google Scholar Vences, M., Wollenberg, K. C., Vieites, D. R. & Lees, D. C. Madagascar as a model region of species diversification. Trends Ecol. Evol. 24, 456–465 (2009).Article PubMed Google Scholar Wilmé, L., Goodman, S. M. & Ganzhorn, J. U. Biogeographic evolution of Madagascar’s microendemic biota. Science 312, 1063–1065 (2006).Article PubMed Google Scholar Martin, A. D., Quinn, K. M. & Park, J. H. MCMCpack: Markov chain Monte Carlo in R. J. Stat. Softw. 42, 1–21 (2011).Article Google Scholar Abrego, N. & Ovaskainen, O. Evaluating the predictive performance of presence–absence models: why can the same model appear excellent or poor? Ecol. Evol. 13, e10784 (2023).Article PubMed PubMed Central Google Scholar Hardwick, B. et al. LIFEPLAN: a worldwide biodiversity sampling design. PLoS ONE 19, e0313353 (2024).Article CAS PubMed PubMed Central Google Scholar deWaard, J. R. et al. LIFEPLAN Malaise sample metabarcoding v1 (protocols.io.5qpvokn3xl4o/v1). protocols.io https://doi.org/10.17504/protocols.io.5qpvokn3xl4o/v1 (2024).Ovaskainen, O. et al. Global Spore Sampling Project: a global, standardized dataset of airborne fungal DNA. Sci. Data 11, 561 (2024).Article CAS PubMed PubMed Central Google Scholar Landau, W. M. The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. J. Open Source Softw. 6, 2959 (2021).Article Google Scholar Landau, W. M. crew: a distributed worker launcher framework. R package version 1.2.1. 10.32614/CRAN.package.crew (2023).Landau, W. M., Levin, M. G. & Furneaux, B. crew.cluster: crew launcher plugins for traditional high-performance computing clusters. R package version 0.3.8. https://doi.org/10.32614/CRAN.package.crew.cluster (2023).R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2023).Download referencesAcknowledgementsThe work was funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant 856506: ERC-synergy project LIFEPLAN to O.O., D.D. and T.R.). O.O.’s group was also funded by Academy of Finland (grants 336212 and 345110) and the European Union: HORIZON-CL6-2021-BIODIV-01 project 101059492 (Biodiversity Genomics Europe). J.R.d.W., S.L.d.W., M.P., S.R., J.E.S., E.V.Z. and P.D.N.H. were funded by grants awarded to P.D.N.H. from the Government of Canada through its New Frontiers in Research Fund (grant NFRFT-2020-00073), the Large Scale Applied Research Program administered by Genome Canada and Ontario Genomics (OGI-208) and the Major Science Initiatives Fund administered by Canada Foundation for Innovation (project 42450). We acknowledge CSC—IT Center for Science, Finland, for computational resources.FundingOpen Access funding provided by University of Jyväskylä (JYU).Author informationAuthor notesJeremy R. deWaard & Stephanie L. deWaardPresent address: Smithsonian National Museum of Natural History, Washington, DC, USAThese authors contributed equally: Otso Ovaskainen, Steven Winter.Authors and AffiliationsDepartment of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, FinlandOtso Ovaskainen, Nerea Abrego, Sten Anslan & Brendan FurneauxOrganismal and Evolutionary Biology Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, FinlandOtso Ovaskainen, Gleb Tikhonov, Bess Hardwick, Panu Somervuo & Tomas RoslinDepartment of Statistical Science, Duke University, Durham, NC, USASteven Winter & David DunsonCentre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, CanadaJeremy R. deWaard, Stephanie L. deWaard, Sujeevan Ratnasingham, Jayme E. Sones, Evgeny V. Zakharov & Paul D. N. HebertEntomology, California Academy of Sciences, San Francisco, CA, USABrian L. FisherMadagascar Biodiversity Center, Parc Botanique et Zoologique de Tsimbazaza, Antananarivo, MadagascarBrian L. Fisher, Dimby Raharinjanahary & Eric Tsiriniaina RajoelisonDepartment of Ecology, Swedish University of Agricultural Sciences (SLU), Uppsala, SwedenDeirdre Kerdraon & Tomas RoslinCanadian National Collection of Insects, Arachnids and Nematodes, Agriculture and Agri-Food Canada, Ottawa, Ontario, CanadaMikko PentinsaariAuthorsOtso OvaskainenView author publicationsSearch author on:PubMed Google ScholarSteven WinterView author publicationsSearch author on:PubMed Google ScholarGleb TikhonovView author publicationsSearch author on:PubMed Google ScholarNerea AbregoView author publicationsSearch author on:PubMed Google ScholarSten AnslanView author publicationsSearch author on:PubMed Google ScholarJeremy R. deWaardView author publicationsSearch author on:PubMed Google ScholarStephanie L. deWaardView author publicationsSearch author on:PubMed Google ScholarBrian L. FisherView author publicationsSearch author on:PubMed Google ScholarBrendan FurneauxView author publicationsSearch author on:PubMed Google ScholarBess HardwickView author publicationsSearch author on:PubMed Google ScholarDeirdre KerdraonView author publicationsSearch author on:PubMed Google ScholarMikko PentinsaariView author publicationsSearch author on:PubMed Google ScholarDimby RaharinjanaharyView author publicationsSearch author on:PubMed Google ScholarEric Tsiriniaina RajoelisonView author publicationsSearch author on:PubMed Google ScholarSujeevan RatnasinghamView author publicationsSearch author on:PubMed Google ScholarPanu SomervuoView author publicationsSearch author on:PubMed Google ScholarJayme E. SonesView author publicationsSearch author on:PubMed Google ScholarEvgeny V. ZakharovView author publicationsSearch author on:PubMed Google ScholarPaul D. N. HebertView author publicationsSearch author on:PubMed Google ScholarTomas RoslinView author publicationsSearch author on:PubMed Google ScholarDavid DunsonView author publicationsSearch author on:PubMed Google ScholarContributionsO.O. acquired funding, coined the original idea, codeveloped the statistical methodology, performed statistical analyses, contributed to software implementation and cowrote the first draft of the paper. S.W. coined the original idea, codeveloped the statistical methodology, performed statistical analyses, contributed to software implementation and cowrote the first draft of the paper. G.T. codeveloped the statistical methodology, led software implementation and contributed to the first draft of the paper. N.A. contributed to first draft of the paper, in particular placing the statistical framework into ecological context. S.A. contributed to the implementation of the bioinformatics workflow. J.R.d.W. contributed to sample management, DNA extraction and sequencing, and commented on the paper. S.L.d.W. contributed to sample management, DNA extraction and sequencing. B.L.F. acquired funding, participated in project coordination, participated in data collection and commented on the paper. B.F. led the implementation of the bioinformatics workflow and commented on the paper. B.H. participated in project coordination, participated in data collection and commented on the paper. D.K. participated in project coordination and participated in data collection. M.P. contributed to the implementation of the probabilistic taxonomic classification method used in the bioinformatics pipeline. D.R. participated in project coordination and participated in data collection. E.T.R. participated in project coordination and participated in data collection. S.R. contributed to sample management, DNA extraction and sequencing. P.S. led the implementation of the probabilistic taxonomic classification method used in the bioinformatics pipeline and commented on the paper. J.E.S. contributed to sample management, DNA extraction and sequencing. E.V.Z. contributed to sample management, DNA extraction and sequencing. P.D.N.H. contributed to sample management, DNA extraction and sequencing, and commented on the paper. T.R. contributed to first draft of the paper, in particular placing the statistical framework into ecological context. D.D. acquired the funding, coined the original idea, codeveloped the statistical methodology and contributed to the first draft of the paper.Corresponding authorCorrespondence to Otso Ovaskainen.Ethics declarationsCompeting interestsThe authors declare no competing interests.Peer reviewPeer review informationNature Methods thanks Anne Chao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.Additional informationPublisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.Extended dataExtended Data Fig. 1 Test of CORAL approach’s ability to infer the environmental responses of the rare species.Shown is the correspondence between CORAL (vertical axis) and HMSC (horizontal axis) posteriors for species responses to the fixed effects (a–j), sample-level latent factors (k–n), and site-level latent factors (o–x) in the Madagascar arthropod model. In each panel, the dots correspond to the 876 common species that were included in this analysis. The HMSC posterior is based on fitting the full backbone model to the data. The CORAL posterior is based on applying the CORAL approach to rarefied occurrence data, obtained by masking 90% of occurrences, thus simulating the case that the common species would actually be rare species.Supplementary informationReporting SummaryPeer Review FileRights and permissionsOpen Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.Reprints and permissionsAbout this article