Finding missing links in interaction networks

Documenting which species interact within ecological communities is challenging and labour-intensive. As a result, many interactions remain unrecorded, potentially distorting our understanding of network structure and dynamics. We test the utility of four structural models and a new coverage-deficit model for predicting missing links in both simulated and empirical bipartite networks. We find they can perform well, but that the predictive power of structural models varies with the underlying network structure. Predictions can be improved by ensembling multiple models. Sample-coverage estimators of the number of missed interactions are highly correlated with the number of missed interactions, but strongly biased towards underestimating the true number of missing links. Augmenting observed networks with most-likely missing links improves estimates of qualitative network metrics. Tools to identify likely missing links can be simple to implement, allowing the prioritisation of research effort and more robust assessment of network properties.

[...] existing is found by: [...]

As in the latent-trait model, λ is defined to be positive, k is a constant intercept term and all parameters are found by maximum likelihood. We again apply a weak Cauchy prior centred at 0 onto the latent trait terms m and fit 10 models, averaging predictions of the best five.

Ecological networks, especially antagonistic networks, frequently show comparatively discrete [...] (such as nocturnal/diurnal partitioning) or spatial segregation. This grouping can be represented by stochastic block models (SBMs), which have been shown to [...] and are used increasingly in ecology (Sander et al. 2015, Kéfi et al. 2016). In SBMs, each species is assigned to a group, [...], in a defined set: [...]. The probability of interaction between two species is determined based on their group membership: [...] The elements of ω are the between-group interaction probabilities and are directly specified as the fraction of observed interactions between each of the groups. We find optimal group assignations and fit the model using a degree-corrected bipartite-SBM specific algorithm [...]

Individually the above models each capture discrete pieces of information about the identity of missing links. However, the structure of ecological networks is the product of many separate drivers. To capture this diversity, we combine the predictions of multiple models into ensembles.
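As a rough illustration of the block-model step described above, the sketch below computes between-group interaction probabilities as the fraction of possible species pairs in each pair of groups that have an observed interaction, and then scores each unobserved pair by the probability of its group pair. The group assignments are assumed to be already known; the degree-corrected fitting algorithm used to find them is not reproduced here. The function names and the toy matrix are hypothetical.

```python
from collections import defaultdict


def block_probabilities(observed, row_groups, col_groups):
    """Fraction of possible pairs with an observed link, for each pair of groups."""
    links = defaultdict(int)   # observed interactions per (row group, column group)
    pairs = defaultdict(int)   # possible interactions per (row group, column group)
    for i, row in enumerate(observed):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            pairs[key] += 1
            if value > 0:
                links[key] += 1
    return {key: links[key] / pairs[key] for key in pairs}


def score_unobserved(observed, row_groups, col_groups):
    """Give each unobserved pair the interaction probability of its group pair."""
    omega = block_probabilities(observed, row_groups, col_groups)
    return {
        (i, j): omega[(row_groups[i], col_groups[j])]
        for i, row in enumerate(observed)
        for j, value in enumerate(row)
        if value == 0
    }


# Toy observed network: rows and columns are the two classes of species.
observed = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
row_groups = ["A", "A", "B", "B"]   # assumed group labels, not fitted
col_groups = ["X", "X", "Y", "Y"]

omega = block_probabilities(observed, row_groups, col_groups)
scores = score_unobserved(observed, row_groups, col_groups)
```

In this toy case the unobserved pair inside a densely connected block, such as row 1 and column 1, receives a higher score than unobserved pairs between sparsely connected groups.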

We test combining the matching-centrality model with the block model, and each of the 'structural' models with the coverage-deficit model.

We test two ensembling approaches, multiplication and averaging. Multiplying the relative probabilities assigned to each putative missing link emphasises the extreme probabilities of the constituent models. Averaging the relative probabilities, [...] of probabilities assigned to unobserved interactions to sum to 1.

Testing the efficacy of extrapolations requires knowledge of the 'true' network. We take two complementary approaches to defining our 'true' networks. First, we generate a set of simulated networks that vary over a wide range of network properties. Second, we use a large and diverse [...] are highly likely to have been observed and the rarest missed.
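The two ensembling approaches described above, multiplication and averaging, can be sketched as follows. This is a minimal illustration, assuming each model returns a non-negative score for the same set of unobserved interactions and that scores are rescaled to sum to 1 before combining. The function names, link labels and scores are hypothetical.

```python
def normalise(scores):
    """Rescale scores for unobserved interactions so that they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {link: value / total for link, value in scores.items()}


def _check_same_links(a, b):
    if set(a) != set(b):
        raise ValueError("both models must score the same unobserved interactions")


def ensemble_multiply(a, b):
    """Multiply relative probabilities; extreme values in either model dominate."""
    _check_same_links(a, b)
    a, b = normalise(a), normalise(b)
    return normalise({link: a[link] * b[link] for link in a})


def ensemble_average(a, b):
    """Average relative probabilities; the result still sums to 1."""
    _check_same_links(a, b)
    a, b = normalise(a), normalise(b)
    return {link: (a[link] + b[link]) / 2 for link in a}


# Hypothetical raw scores from a structural model and a coverage-deficit model.
structural = {"L1": 6.0, "L2": 3.0, "L3": 1.0}
coverage = {"L1": 1.0, "L2": 1.0, "L3": 3.0}

multiplied = ensemble_multiply(structural, coverage)
averaged = ensemble_average(structural, coverage)
```

With these toy scores, multiplication ranks L1 clearly first, while averaging keeps L3 in second place because one model rates it highly.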

We initially generated 2000 simulated networks from a probabilistic two-trait niche model depicted in SI 1.

Our objective was to generate interaction matrices that represent a wide range of ecologically [...] have many similarities to the predictive models described above. This is for a good reason - [...] reduce circularity, our generative model is considerably more complex than our predictive models and we examine how predictive model performance changes in response to network properties.

Using the 'true' networks we took 300-2000 samples per network to generate our 'observed' [...] proportion of the interactions present.

We collated a diverse set of 113 empirical networks representing antagonistic, mutualistic and commensalistic interaction types (SI 2). We collated quantitative single-class bipartite networks [...] across a wide geographic range), 2 plant-ant mutualism (2 sources) and 10 seed disperser (from 7 sources) networks. We supplemented this with 25 mammal-dung beetle interaction networks [...]

We assess performance at identifying missing links with the area under the receiver operating characteristic curve (AUC) metric to assess the information content of a signal, using the pROC [...] Spearman's rank-correlation with the 'true' relative interaction strengths of the unobserved interactions for each model.

[...] challenge now is to put these tools into practice to leverage additional ecological insight.
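The two performance measures named above can be illustrated with a short standard-library Python sketch. It is not the pROC-based workflow referred to in the text, and the function names and toy numbers are hypothetical. AUC is computed here as the probability that a true missing link receives a higher score than a true absence, with ties counted as one half. Spearman's correlation is computed as the Pearson correlation of average ranks.

```python
def auc(scores, labels):
    """Probability that a true missing link outscores a true absence (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one true link and one true absence")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (vx * vy) ** 0.5


# Hypothetical scores for five unobserved pairs; 1 marks a link in the 'true' network.
predicted = [0.9, 0.8, 0.4, 0.3, 0.1]
is_true_link = [1, 0, 1, 0, 0]
example_auc = auc(predicted, is_true_link)

# Hypothetical predicted scores and 'true' strengths for three missed interactions.
example_rho = spearman([0.9, 0.4, 0.6], [10, 2, 5])
```

An AUC of 0.5 corresponds to scores that carry no information about which unobserved pairs are true links, and 1 to a perfect ranking.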

Choosing between predictive models

Ideally, a model used to infer missed interactions would capture the true ecological drivers, but this is not essential for all purposes. Given the diversity of ecological networks, there will never be a single 'best' model and victors in comparisons will depend on the data set. In our simulated data our block model appears to perform best. In our empirical datasets, the matching-centrality [...] distrust the empirical datasets (discussed below). The pronounced, likely artefactual, abundance-generality relationship will favour degree-models.

Nonetheless, splitting hairs over the best-performing model is not necessarily a productive route.

Networks are structured by multiple processes and in our simulated data sets the very best [...] interactions. Structure-based models pick out the more frequent interactions while coverage-deficit models highlight comparatively infrequent interactions that would be the hardest to determine through further undirected sampling. Future progress will come from operationalising estimated missing links, rather than from further incremental model refinements.

The empirical and simulated datasets overlapped substantially in key network metrics. The main relevant difference between these datasets is the stronger correlation in empirical networks [...] This can account for the poor performance of the coverage-deficit model in these cases.

Disentangling the extent to which apparent specialism of rarely observed species is a sampling [...] likely bias in the structure of empirical ecological networks has the consequence that predictive models may identify missing links introduced both by our subsampling procedure, and due to gaps in the original dataset. We therefore place more weight on the results from the simulated data, while noting that, despite the obstacles, the structural models are still able to perform reasonably well on the sparse empirical data.

[...] were gathered only during the day. Because the naïve models used here do not use external information, or the judgment of the [...] there will always be an upper limit to the predictive capacity of any statistical model. We suggest that, rather than developing new models to gain marginal improvements in predictive capacity, a more productive focus would be to develop frameworks exploiting this information to test the [...] interactions that will never be observed in a realistic sampling regime. Furthermore, there is [...] principal uses for inferred missing links.

First, inferring missing links will direct further sampling where the goal is a descriptive network.

In many cases the topography of the network is of principal interest, given the potential for [...] accepted, this will also be archived in a public repository and the data DOI included.