Comparing sample bias correction methods for species distribution modeling using virtual species

A key assumption in species distribution modeling (SDM) with presence-background (PB) methods is that sampling of occurrence localities is unbiased and that any sampling bias is proportional to the background distribution of environmental covariates. This assumption is rarely met when SDM practitioners rely on federated museum records from natural history collections for geo-located occurrences due to inherent sampling bias found in these collections. We use a simulation approach to explore the effectiveness of three methods developed to account for sampling bias in SDMwith PB frameworks. Two of the methods rely on careful filtering of observation data—geographic thinning (G-Filter) and environmental thinning (E-Filter) —while a third, FactorBiasOut, creates selection weights for background data to bias locations toward areas where the observation dataset was sampled. While these methods have been assessed previously, evaluation has emphasized spatial predictions of habitat potential. Here, we dig deeper into the effectiveness of these methods by exploring how sampling bias not only affects predictions of habitat potential, but also our understanding of niche characteristics such as which explanatory variables and response curves best represent species–environment relationships. We simulate 100 virtual species ranging from generalist to specialist in their habitat preferences and introduce geographic and environmental bias at three intensity levels to measure the effectiveness of each correction method to (1) predict true probability of occurrence across a study area, (2) recover true species–environment relationships, and (3) identify true explanatory variables. We find that the FactorBiasOut most often showed the greatest improvement in recreating known distributions but did no better at correctly identifying environmental covariates or recreating species–environment relationships than G-Filter or E-Filter methods. Narrow niche species are most problematic for biased calibration datasets, such that correction methods can, in some cases, make predictions worse.


INTRODUCTION
Understanding the distributions of organisms and their habitat is a fundamental question in conservation ecology and is needed for implementing landscape conservation plans. When implementing such plans, resource managers are often tasked with resource inventory or monitoring. This has become essential for land managers who must respond to dramatic changes in species composition due to changing climates (Travis 2003, Rosenzweig et al. 2008, Butchart et al. 2010, Rosenberg et al. 2019). In addition to local declines and extirpations caused by changing climates, surface-disturbing land uses (e.g., urbanization, transportation corridors, military training, agriculture, recreational activities, and energy development) are altering vast areas of terrestrial landscapes (Leu et al. 2008), fragmenting habitat (Opdam andWascher 2004, Walston et al. 2016), and disrupting landscape linkages (Andrews 1990, Schneider 2009). Approaches that can help quantify changes in distributions of habitat from these compounding stressors are useful to land managers responsible for managing landscapes when sustainable environmental, social, and economic outcomes are desired (Freeman et al. 2015).
One such methodology to quantify species distributions is species distribution modeling (SDM), also referred to as ecological niche modeling (ENM), which is being used with increasing frequency in land-use planning and conservation (Schwartz 2012, Guisan et al. 2013. SDM is a quantitative modeling approach that relates locations of species occurrences to environmental covariates hypothesized to influence or define habitat potential (Franklin 2010). While SDM encompasses different statistical methods and approaches, one method has emerged as especially user-friendly and is now used widely: Maxent (Phillips et al. 2006). Since its publication in 2006, the original paper has been cited 6796 times in the Web of Science (accessed April 2020), with well over 1300 papers using "maxent" and "conservation" as keywords. Due to its user-friendly interface and accessibility, Maxent is arguably the most commonly used method for SDM in conservation and land-use planning. For example, biodiversity assessments for the UN Paris Agreement on climate change used Maxent to estimate distributions for over 80,000 species of five terrestrial taxa-plants, amphibians, reptiles, birds, and mammals (Warren et al. 2013(Warren et al. , 2018. In addition to its ease of use, the Maxent algorithm does not require knowledge of locations where organisms are absent, which can be difficult to obtain. Instead, a presence-background (PB) modeling approach is used that compares environmental conditions at locations where a species has been observed to environmental conditions across a study area (background). Methods such as Maxent using this approach can also leverage the ever-growing wealth of global biodiversity observation information available from online repositories such as the Global Biodiversity Information Facility (GBIF; https://www.gb if.org/), whereas presence-absence data require costly surveys.
While PB methods benefit from availability of data and software, they rely on several assumptions that are not always acknowledged in practice . One key assumption of PB methods is that sampling effort is uniform across the landscape such that environmental conditions are sampled proportionally to their relative availability (Hastie and Fithian 2013) and that a species' niche is sampled over the full range of environmental conditions in which it occurs (Phillips et al. 2009). This assumption is rarely met when observations are obtained from any of the growing number of biodiversity repositories providing geo-located species occurrence records because these collections generally represent opportunistic and haphazard sampling efforts (Newbold 2010) that introduce sampling bias to the locality information for thousands of species. Random or stratified-random sampling designs can be used to collect observations in a manner free of sampling bias (Edwards et al. 2006), but the high cost of conducting these surveys often precludes their use and has resulted in the reliance on widely available biodiversity observations from federated repositories such as GBIF for SDM (Newbold 2010).
Two sources of sampling bias are problematic for SDM: (1) incomplete sampling and (2) oversampling; when unknown, both result in unquantifiable spatial heterogeneity in the sampling intensity across a landscape and its environmental space (Freitag et al. 1998, Bystriakova et al. 2012, Guillera-Arroita et al. 2015. The former occurs when not all parts of the environmental space where a species can occur (occupied niche space) are sampled, thereby leaving some combinations of environmental conditions un-or under-sampled and results in few observations in areas that are in fact occupied (Hefley et al. 2013). Incomplete sampling may occur when certain areas are inaccessible due to administrative reasons such as private property or governmental regulations but may also rise from imperfect or nondetection when occupied areas are sampled, yet no observations are obtained (Kramer-Schadt et al. 2013, Hefley et al. 2013. Imperfect detection becomes problematic when knowledge of the true sampling effort is unavailable, thus yielding incomplete sampling. Oversampling occurs when some occupied regions of environmental space are sampled at higher intensities than others (Bean et al. 2012, Gelfand et al. 2012), thereby causing model coefficients to become more representative of areas that have been heavily sampled than of a species' entire range (Hortal et al. 2008, Hijmans 2012, Guillera-Arroita et al. 2015. Areas proximal to roadways and access routes, for example, may be sampled more heavily than remote areas (Freitag et al. 1998, Kadmon et al. 2004, and administrative designations, such as national parks, wildlife refuges, or other locations of interest, may introduce unknown heterogeneity in sampling intensity by being sampled preferentially. These two sources of sampling bias can manifest in both geographic and environmental spaces. The latter, environmental sampling bias, occurs when incomplete or over-sampling of the environmental space of the true distribution causes covariance between the sampling probability and the environmental variables used to estimate a species' distribution (Phillips et al. 2009). Environmental sampling bias can lead to errors in model fitting (Hortal et al. 2008, Yackulic et al. 2012 and is therefore the primary concern for SDM (Guillera-Arroita et al. 2015). However, geographic sampling bias can often cause environmental bias in the observational dataset (Hortal et al. 2008, Bystriakova et al. 2012. Moreover, it is geographic sampling bias that is readily apparent to practitioners and has been the focus of many studies (Lobo and Tognelli 2014, Aiello-Lammens et al. 2015, Fernández and Nakamura 2015, Soley-Guardia et al. 2016, Qiao et al. 2017. We therefore address these two sources of sampling bias because both can occur in museum records because these opportunistic datasets rarely stem from systematic sampling regimes (Newbold 2010).
We use a simulation approach to explore the effects of sampling bias on SDM in the Maxent PB framework and to compare three bias correction methods that are often used but have not been systematically compared across a robust set of virtual species. Where these methods have been evaluated previously (e.g., Kramer-Schadt et al. 2013, Varela et al. 2014, Fourcade et al. 2014, Stolar and Nielsen 2015, Gábor et al. 2019, emphasis has been on spatial predictions of habitat potential. Here, we dig deeper into their use by exploring how sampling bias not only affects spatial predictions, but also our understanding of fundamental niche characteristics such as which explanatory variables and species-environment relationships best represent a species' true niche. We simulate 100 virtual species ranging in landscape prevalence and niche breadth from habitat specialists to generalists and introduce geographic and environmental sampling bias at three intensity levels designed to mimic bias levels found in real-world situations to measure the effectiveness of each correction method to (1) predict the true habitat potential across our study area, (2) recreate true species-environment relationships, and (3) identify true explanatory variables.

Study area
We use environmental maps of an arid, interior southwestern region of the United States to define the environmental domain and correlates of realistic, but simulated, species distributions. Our study area covered 918,557 km 2 and incorporated a broad range of ecosystems ranging from Southern and Central California Chaparral and Oak Woodlands to the southwestern Mojave, Sonoran, and Chihuahua deserts. We chose this area for the availability of a rich set of environmental data (e.g., Inman et al. 2014) describing a broad range of physiographic (e.g., landform type, surface texture, and geologic character) and climatic (e.g., temperature, precipitation norms, and extremes) environmental conditions available to develop distributions for virtual species that could plausibly inhabit this region.

Virtual species
We developed 100 virtual species distributions using the package virtualspecies (Leroy et al. 2015) in R 3.3.2 (R Core Team 2020). Each species' niche was defined using two to five randomly v www.esajournals.org selected environmental covariates from a suite of ten uncorrelated explanatory covariates spanning physiographic and climatic constraints. These raster layers represented environmental conditions hypothesized to influence the distribution of many actual species in the region (Inman et al. 2014), and were generalized to a spatial resolution of 1 km, a grain size typical of widely available environmental covariates (e.g., Amatulli et al. 2018). We provide environmental covariates and all code used to generate virtual species in Data S1. For simplicity, we set the occupied niches of these species equal to their fundamental niches, precluding the need to simulate predation or competition, and assumed that these virtual species were in equilibrium with their environment and therefore exhibited stable population sizes.
In order to create realistic habitat preferences and distributions, we ensured that species-environment relationships (hereafter "response curves") could not contradict one another on the landscape (e.g., result in species living simultaneously in the hottest and coldest portions of the study area) using the realistic.sp option, wherein mechanistic response curves were created sequentially ( Fig. 1). In the first step, a single environmental covariate was selected, and a corresponding response curve was randomly generated from a linear, quadratic, logistic, or Gaussian function to create a map of habitat potential values based only on that single environmental descriptor. Prior to generating additional response curves, habitat potential values were linearly rescaled between 0 and 1 to identify high habitat potential areas with values >0.5. For each additional explanatory variable, a response curve was randomly generated and used to predict new habitat potential values as the sum of the response curve functions within the high habitat potential areas, and linearly rescaled between 0 and 1 (step 2 in Fig. 1). This process was iterated until each of the two to five randomly selected environmental covariates were included, with the distribution of true habitat potential values for each virtual species set as the habitat potential values resulting from the sum of all response curve functions (step 3 in Fig. 1). We aimed to create a range of specialist and generalist species by modifying the shape parameters for each of the response curve functions to ensure that some species occupied rare environments, while others occurred widely across the study area. Occupied areas were selected with a probabilistic approach rather than thresholding, which has been shown to introduce distortion when attempting to recover response curves (Meynard et al. 2019). Cells that were occupied were selected with the probability: x À0:65 À0:05 (1) where x (i) is the habitat potential value for cell (i) for x explanatory variable (Meynard et al. 2019).
We calculated the prevalence of occupied areas for each species, which spanned 0.05% to over 97% of the study area, with mean landscape habitat potential values ranging from 0.13 to 0.94.

Rarity and niche breadth
We hypothesized that the breadth of each species' niche and their landscape prevalence would affect how biased sampling distributions degraded SDM performance. Specifically, we predicted that rare specialists would show erratic responses to differing levels of sampling bias because, while rare species may require fewer samples to adequately estimate their distributions (Franklin et al. 2009), they may also be missed entirely or sampled sparsely at extreme levels of sampling bias. In cases of the latter, we assumed that SDM performance would be severely degraded. We measured rarity by landscape prevalence as the occupied proportion of the study area, and niche breadth with a novel approach quantifying the uniqueness of the environmental conditions defining each species' geographic distribution. Metrics of niche breadth can range from volumetric measurements of ndimensional hypervolumes (e.g., Blonder et al. 2014), to reduced principal component axes of occupied regions of environmental space (e.g., Saupe et al. 2015), to counts of unique habitat types presumed to be occupied (e.g., Nürnberg and Aberhan 2013). Our approach to quantifying the environmental uniqueness of occupied habitat draws on the Mahalanobis distances in environmental space because we were interested in how unique the occupied habitat was from the rest of the study area, and we knew the true v www.esajournals.org environmental variables determining each species' distribution. The niche breadth value for each species was defined as the median of the squared Mahalanobis distance of all occupied cells: where x occ is the matrix of explanatory variables used to define the species' niche over all occupied cells, x sa is the matrix of the same explanatory variables over all cells in the study area, µ sa is a vector of variable means of the study area, and Σ is the covariance matrix of x sa . Occupied cells with highD 2 values have greater environmental distance from all other cells and are therefore more unique, indicative of specialist species. D 2 values for this region and environmental space can range from near 0 to well over 400, a completely unrealistic value representing the most specialist species possible in our study area, namely a species occurring on a single grid cell with the most unique environmental conditions. We therefore rescaledD 2 into niche breadth Fig. 1. Multi-step virtual species creation. In the first step, a single environmental covariate was selected and a corresponding response curve was randomly generated from a linear, quadratic, logistic, or Gaussian function to create a map of habitat potential values based only on that single environmental descriptor. Habitat potential values were linearly rescaled between 0 and 1 to identify high habitat potential areas as those with values greater than 0.5. For each additional explanatory variable (step 2), a response curve was randomly generated and used to generate new habitat potential values as the sum of the response curve functions within the high habitat potential areas and linearly rescaled between 0 and 1 after each step. This step was iterated until each of the two to five randomly selected environmental covariates were included, with the distribution of true habitat potential values for each virtual species set as the habitat potential values resulting from the sum of all response curve functions (step 3). Occupied areas were then selected with a probabilistic approach using Eq. 1. v www.esajournals.org values as the inverse of the realistic minimum and maximum possible values in our study area (0 and 10) with the equation: (3) This index ranges from 0 to 1, wherein a niche breadth value of 0 indicates extreme uniqueness of environmental variables at occupied cells, and results from a median-squared Mahalanobis distance value of 10, the maximum realistic value in our study area. In contrast, a niche breadth value of one indicates the most generalist species possible, namely one that occupies the most diverse range of environmental conditions such as cells with the Mahalanobis distances near zero.

Sampling bias
We introduced environmental and geographic sampling bias into observation datasets of the virtual species to explore how each affects the performance of SDM (Fig. 2). We aimed to mimic sampling bias that might occur under a wide array of conditions, and therefore did not replicate biased sampling from a single source, such as proximity to roads. Instead, we used a generic clustering approach to create biased observation datasets for each species at three levels of bias intensity (low, medium, and high) that might be found in datasets obtained from haphazard sampling inherent in PB modeling approaches (Ponder et al. 2001). These levels ranged from almost none (nearly indistinguishable from randomly sampled observations) to an extreme level greater than that expected in most observation datasets. Bias was introduced with inclusion weights, which were used to preferentially sample a portion of occupied cells for inclusion to each biased observation dataset. The schemas used to set inclusion weights were designed to mimic patterns that might arise from a wide array of biased or haphazard sampling, such as sampling more often near roads in some areas, or not sampling in difficult to access areas such as military lands or private property.
Inclusion weights for the environmentally biased datasets were defined by identifying 15 unique clusters in the multivariate space of the ten possible explanatory variables. Clusters were identified using a partitioning-around-medoids algorithm in the package cluster (Maechler et al. 2016) in R 3.3.2 (R Core Team 2020) and mapped to geographic space. Inclusion weights were assigned to occupied cells in each cluster based on the total area of each cluster with the equation: where a = 1, 3, or 5 for the low, medium, or high bias scenarios, and cluster area is the proportion of the study area that is covered by the mapped cluster containing cell (i). This ensured that mapped clusters with larger areas were less likely to be selected. Inclusion weights were linearly rescaled between 0 and 1 prior to use and were used to randomly select N occupied cells according to their inclusion weights. The degree of sampling bias for each level was assessed with the random skewers method of covariance similarity (Cheverud 1996) in the package phytools (Revell 2012) in R 3.3.2 (R Core Team 2020), by measuring the collinearity between two random selection vectors, each operating on a single covariance matrix (Cheverud and Marroig 2007). This method is used in evolutionary biology and genetics to estimate similarity among genetic covariance matrices (Roff et al. 2012). Here, we use it to estimate the relatedness between the covariance matrix of environmental explanatory variables at locations in the biased observation datasets and the covariance matrix from all true occupied cells, with the most extreme biased datasets showing the least relatedness. The random skewer covariance similarity coefficient among the 100 virtual species ranged from 0.586 to 0.999 for each of the four levels of environmentally biased datasets, and each class (none, low, medium, and high bias) was significantly different when accounting for repeated measures of the species (F 3,273 = 31.19; P < 0.0001; Fig. 3A). Inclusion weights for the geographically biased observation datasets were calculated using a spatially clustered sampling schema by randomly seeding 5, 8, or 10 cluster centers for the low, medium, and high intensities of bias, respectively. For each level of intensity, we created a Gaussian density kernel raster using the equal split method of Okabe et al. (2009) with a bandwidth of 500, 250, or 100 km for the 5, 8, or 10 cluster centers, respectively, in GRASS GIS (Neteler et al. 2012). The inclusion weight for each occupied cell was defined by rescaling each Gaussian density kernel raster between 0 and 1. This schema ensures that occupied cells closest to the randomly seeded cluster were selected with higher probabilities than those further away. We created biased observation datasets by randomly selecting occupied cells according to their inclusion weights. The number of cells selected (N) for each species was determined by p, the prevalence of occupied cells across the study area, where: We limited the maximum number of observations to 500 to ensure a realistic sample size for each biased observation dataset. In order to compare the degree of bias in both the environmentally biased and geographically biased datasets, we used the random skewer method to assess the degree of environmental sampling bias in each dataset. Environmental bias among the unbiased and geographically biased observation datasets for the 100 virtual species ranged from 0.647 to 0.999, and each class (none, low, medium, and high bias) was significantly different when accounting for repeated measures of the 100 virtual species (F 3,272 = 31.19; P < 0.0001; Fig. 3B). Ripley's K function (Ripley 1976) also confirmed that the differences in spatial bias among the levels were significant when accounting for repeated measures across the 100 virtual species (F 3,272 = 121.5716; P < 0.0001; Fig. 3C), and was assessed as the mean difference between theoretical K(r) and bordercorrected K(r) Ripley's K density function for Fig. 2. 100 virtual species ranging in landscape prevalence and niche breadth from habitat specialists to generalists were simulated. We introduce geographic and environmental sampling bias at three intensity levels designed to mimic that found in real-world situations to measure the effectiveness of each correction method to (1) predict the true habitat potential across our study area, (2) recreate true species-environment relationships, and (3) identify true explanatory variables.
inhomogeneous point processes of the observation datasets out to a maximum distance of 350 km.

Bias correction methods
We consider three commonly used methods for addressing biased observational datasets in PB SDM (Fig. 2): geographic filtering (G-Filter), environmental filtering (E-Filter), and background weight correction (FactorBiasOut). We implemented the G-Filter method of bias correction using a sampling mesh with equally sized rectangular cells, each 225 km 2 in area (15 km per side), resulting in 4119 unique sampling areas throughout the study area. We limited observations in each sampling area to two observations per 225 km 2 , a rate that has been effective in minimizing sampling bias for other desert species (Inman et al. 2014). This method results in a spatially thinned point set with an effective sampling density of 1 observation per 112 km 2 , or a thinning distance of approximately 10 km between observations. We implemented the E-Filter correction by sampling observations at a uniform rate within multiple multivariate clusters identified in the n-dimensional environmental space defined by the explanatory variables for a given model specification. We draw on the same clustering algorithm used to introduce environmental sampling bias into the observation datasets, but since the degree of multivariate clustering or number of clusters in an observation dataset is rarely known, we estimate an optimal number by maximizing the average silhouette width with package fpc (Hennig 2015) in R 3.3.2 (R Core Team 2020). In each cluster, a random sample with a size equal to the minimum number of observations in any cluster was selected, and these were pooled to create an environmentally thinned observation dataset. This method ensures an equal sampling intensity across environmental space, though not necessarily across geographic space.
We implement background weight correction with the FactorBiasOut algorithm (Dudik et al. 2005) in Maxent (Phillips et al. 2006). Fac-torBiasOut estimates the combined distribution of the biased sampling distribution (biased survey effort; σ) and the true species distribution (π) Fig. 3. Environmental and geographic bias exhibited in virtual species observation datasets. Environmental bias was assessed with random skewers covariance similarity coefficient for observation datasets biased environmentally (A) or geographically (B). Similarity coefficients below 0.9 are omitted for visual clarity. Geographically biased datasets (C) were also measured with the mean difference between theoretical K (r) and border-corrected K(r) Ripley's K density function for inhomogeneous point processes of the spatial point pattern for each dataset out to a maximum distance of 350 km. and factors out σ, which is assumed to be known and represented as an auxiliary variable (Phillips et al. 2009), often in the form of a bias grid (Elith et al. 2011, Merow et al. 2013). This method relies on knowledge of the biased sampling distribution (σ), which, in practice, is rarely known. However, because the biased observation dataset is a sample of σ (i.e., the observations are sampled with the biased sampling distribution), σ can be estimated when the observation dataset is large (Phillips et al. 2009). We create a bias grid as an estimate of σ by creating a kernel density raster of each biased observation dataset. The bandwidth for each kernel was estimated using cross-validation to minimize mean square error (Baddeley et al. 2015).

Models and performance criteria
Because it is very widely used by SDM practitioners, we used MaxEnt version 3.4.0 (Phillips et al. 2018) with the default settings to create distribution models for each virtual species. Newer versions of Maxent software (e.g., version 3.4.0 and later) implement a new function to produce an estimate of occurrence probability: the complementary log-log function (Phillips et al. 2017). We used this transformation as an indication of habitat potential and allow inclusion of all feature classes (linear, quadratic, product, threshold, and hinge) to assess the three bias correction methods in their ability to (1) predict true habitat potential (distributions) across the study area, (2) recover the shape of true response curves, and (3) identify true explanatory variables (Fig. 2).
To address the first and second objectives, we created a single model for each of the 100 virtual species specified with the true explanatory variables because for these analyses we assumed that the true explanatory variables were known. While this is never truly known in the real world, rare is it that ecologists lack hypotheses about the environmental drivers of their study organisms' geographic distribution. Here, we use the true predictors of each virtual species' distribution to focus on the effects of sampling bias and the commonly used methods to address it.

Predicting distributions
We assessed the ability of each bias correction method to correctly estimate the true habitat potential for each virtual species using the expected fraction of shared presences (Godsoe 2013) by comparing the estimated habitat potential to the true habitat potential across all cells with the equation: where P 1(j) denotes the true habitat potential at location (j) and P 2(j) denotes the prediction generated from the model at location (j). ESP is a modified Sorenson similarity index (Sorenson 1948) used to compare predicted probabilities that each species is present at a given cell rather than relying on presence/absence information (Godsoe 2013). An ESP value of one indicates perfect agreement between the two maps, while a value of zero indicates complete geographic separation (Godsoe 2013).

Recovering response curve shape
The ability to recreate the form of the true response curves was assessed with the Spearman rank correlation coefficient between the true relationship and that estimated by the model for each explanatory variable. We used the nonparametric rank correlation coefficient because of the nonlinear relationships found between many of the true and estimated response curves. We measured collinearity between the marginal response curves estimated by Maxent for each explanatory variable by holding all other variables at their mean for the study area (Phillips et al. 2006) and those used to create each corresponding virtual species.

Identifying explanatory variables
We address the third objective by doing variable selection with 10 randomly selected virtual species to determine whether sampling bias or any of the bias correction methods affect the ability to correctly identify the true drivers of a species' geographic distribution. Variable selection is a key step in SDM development, yet few studies explicitly consider the choice of potential environmental covariates, opting instead to include widely available climatic variables with little justification (Bell and Schlaepfer 2016). We consider a plausible set of candidate models with unique v www.esajournals.org combinations of two to five of the ten possible explanatory variables, resulting in a set of 534 models considered for each species.
The two filtering methods were applied in a replicated fashion because each involves random thinning to remove observations. We replicated the G-Filter and E-Filter methods 20 times each to generate a set of 20 filtered training datasets for each biased observation dataset. We estimated a Maxent model with each of the replicated filtered datasets for each of the candidate model specifications. The background weight correction was implemented using FactorBiasOut with the bootstrap option, also with 20 replications. For each of the 80 biased observation datasets (two bias types, four bias intensities, 10 virtual species) and four bias correction methods (no correction, G-Filter, E-Filter, and Fac-torBiasOut), we selected the single best performing combination of variables with the average area under the receiver operating characteristic curve value (Fielding and Bell 1997) across the 20 replicates. AUC is commonly used for a jackknife test on variable importance and provides a measure of a model's ability to discriminate between presence and absence localities, independent of an arbitrary cutoff threshold (Cumming 2000), although it has been criticized for its sensitivity to geographic extent, among other factors ). However, because our models were evaluated using the same geographic extent and because our goal was to select a single model with which to evaluate multiple bias correction methods, we use AUC as an evaluation metric.
We evaluated each bias correction method's ability to identify the correct explanatory variables with the Jaccard similarity coefficient using the equation: where A is the set of true explanatory variables, B is the set of explanatory variables selected by the model, and A∩B is the count of correctly identified explanatory variables and A∪B is the count of unique explanatory variables in the true virtual species model and those in the best model selected by a mean AUC for a given bias correction method. Higher J values indicate a greater ability to discriminate the correct set of predictors. For all three objectives and their respective performance metrics (ESP, Spearman's correlation coefficient, Jaccard's similarity), we test for loss in performance due to each type and level of bias intensity, as well as a gain in performance due to the three correction methods. We use mixed models to account for random effects among species and denote differences where significant.

Observing sample bias in practice
In order to provide real-world context for our low, medium, high, and no bias levels of geographic bias, we assess the degree of geographic bias that might be expected in PB species locality datasets obtained from biodiversity databases such as the Global Biodiversity Information Facility (GBIF; www.gbif.org). Sampling effort in these datasets is often unknown, and therefore, we use Ripley's K function for inhomogeneous point processes to check for spatial clustering. We did not check for environmental bias because this depends on the environmental explanatory variables and geographic study region considered for SDM and can only be assessed after these have been identified. We obtained locality records for nine species of diverse flora and fauna, including Squamata, Aves, Rodentia, Asterales, and Zygophyllales from GBIF with the package rgbif (Chamberlain and Boettiger 2017) in R 3.3.2 (R Core Team 2020), which are provided in Appendix S1.

Predicting distributions
When using unbiased observations, ESP scores measuring the similarity between virtual species' true and estimated distributions ranged from 0.177 to 0.756, with a mean of 0.487 (SD = 0.128) across 100 virtual species. Species with high landscape prevalence showed greater agreement between the true and estimated distributions (Fig. 4A), as did generalists (Fig. 4B). When bias was introduced, ESP scores were reduced, with mean scores across both types of bias for the 100 virtual species of 0.461 (SD = 0.128), 0.442 (SD = 0.125), and 0.397 (SD = 0.140), for the low, medium, and high bias levels, respectively. Loss of prediction accuracy was greatest for the high environmental bias level, equating to an average reduction of 26% in ESP scores across all species v www.esajournals.org (Table 1). However, not all species showed reduced ESP scores when bias was introduced; 20% resulted in better ESP scores when bias was introduced, and nearly half of these were at the low bias level. Species that were predicted better with biased observation datasets tended to be very rare, with low landscape prevalence (Fig. 5A). We found no effect of landscape prevalence on the loss of ESP scores when bias was introduced (F 1,90 = 0.400; P = 0.529), though rare species showed the greatest variability in loss of ESP scores (Fig. 5A). We did, however, find an effect of niche breadth at the high bias level only (F 1,90 = 5.382; P = 0.023), suggesting that specialist species were more susceptible to sampling bias than were the most generalist species in our study area (Fig. 5B).
Bias correction methods were usually able to increase prediction accuracy under both environmental bias and geographic bias ( Fig. 6A and 6B, respectively), though the E-Filter method resulted in additional loss of prediction accuracy 43% of the time. This loss was nearly equally distributed among the three bias levels, suggesting that the E-Filter correction method was poor at compensating for any level of bias. The environmentally biased datasets saw greater improvement than the geographically biased ones (Fig. 6A vs. 6B), especially with the FactorBiasOut correction method. Overall, FactorBiasOut provided the greatest improvements of the three methods across both types of bias (F 2, 1560 = 331.856; P < 0.001). FactorBiasOut was able to improve ESP scores in 96% of cases, whereas the G-Filter and E-Filter only improved ESP scores 82% and 55% of the time, respectively. On average, all three correction methods provided 2% greater improvements with the environmentally biased datasets than with the geographically biased datasets (F 1, 1377 = 29.269; P < 0.001), suggesting that while high environmental bias caused the greatest reduction in ESP scores, it was also improved the most by correction methods. We found an interaction between prevalence and the bias correction methods (F 2, 1558 = 26.776; P < 0.001; Fig. 7A), as well as an interaction between niche breadth and bias correction methods (F 2, 1558 = 83.259; P < 0.001; Fig. 7B), with specialist and rare species showing greater improvements with the Fac-torBiasOut method, but not with the other two correction methods.

Recovering true response curve shape
On average, the correlation between true and estimated response curves was 0.781 (SD = 0.383) when no sampling bias was present, though Spearman's rank correlation coefficients ranged from −0.999 to 0.999. While some Fig. 4. In absence of bias, species with high landscape prevalence showed greater agreement between the true and estimated distributions (A), as did generalist species (greater niche breadth, measured by the Mahalanobis distance) (B) where spatial agreement between predictions of habitat potential was measured by expected fraction of shared presences (ESP). ESP of 1 indicates perfect agreement between the two habitat potential maps, and an ESP value of 0 indicates complete geographic separation. response curves were estimated poorly, the majority were estimated reasonably well; fewer than 14% of correlation coefficients were below 0.5 for our simulations where the true explanatory variables are known. It was rare for the estimated response curve to be completely wrong, that is, negative correlation occurred only 5% of the time. Under conditions of no sampling bias, we found no apparent relationship between the correlation of the true and estimated response curves with landscape prevalence (F 1,91 = 0.305; P = 0.582), or niche breadth (F 1,91 = 0.148; P = 0.701), though the variability of correlation coefficients was greater among species with wide niches.
When environmental bias was introduced, the mean of the correlations between actual and estimated response curves decreased significantly (F 3, 1315 = 9.286; P < 0.001) to 0.725 (SD = 0.437), 0.682 (SD = 0.493), and 0.617 (SD = 0.525), with proportional reductions of 30%, 37%, and 45% for the low, medium, and high levels of environmental bias, respectively (Table 2). Geographic bias caused fewer overall proportional changes in correlation between actual and estimated response curves with gains of 8%, and 1% and a loss of 19% for the low, medium, and high levels of bias, respectively (Table 2). There was no apparent relationship between decreases in correlation coefficients and landscape prevalence (F 1,90 = 0.786; P = 0.377), with no discernible difference between rare and widespread species. This pattern held for niche breadth as well (F 1,90 = 0.551; P = 0.460); specialist species showed the same reduction as generalist species. Generally, bias correction methods did not improve estimation of response curves across any of the bias levels (F 3, 10440 = 2.050; P = 0.105), though the FactorBiasOut correction method made recovering true response curve shapes worse across all levels of bias with the environmentally biased datasets (F 1, 4904 = 4.355; P = 0.037; Table 2).

Identifying true explanatory variables
In the absence of sampling bias, the ability of SDMs to identify the correct explanatory variables for the randomly selected ten virtual species was poor (mean Jaccard Index scores J = 0.62; SD = 0.22). The true set of explanatory variables was selected only 15% of the time. When bias was introduced, J scores were reduced, with losses of 0.06 (environmental) and 0.10 (geographic) for the low bias levels to 0.14 Notes: Models were calibrated with and without environmental and geographic sampling bias. Mean and standard deviation (SD) ESP without bias (original) and percent change after bias were introduced or correction methods were applied (% change) are shown.
(environmental) and 0.10 (geographic) for the high bias levels (Table 3). Reductions in J up to 22% were found in the high environmental bias level; the correct set of explanatory variables was selected only 3% of the time. Models of specialist species did somewhat better at identifying the correct variables than did models of generalist species, though this effect was only significant at the α = 0.1 level (F 1,18 = 3.927; P = 0.063). Bias correction methods rarely improved the ability to select the correct explanatory variables, and we found no difference among the three correction methods in their ability to improve J scores (F 2,168 = 0.377, P = 0.686). Similarly, we found no relationship between landscape prevalence (F 2,166 = 0.721; P = 0.488) or niche breadth (F 2,166 = 1.065; P = 0.347) and the three methods' ability to improve J scores.

Observing sampling bias in practice
We found that none of the GBIF observation datasets tested here were geographically unbiased and that most produced Ripley's K function scores indicative of low or medium geographic bias levels defined in this study. Only two species, Western shovel-nosed snake Chionactis occipitalis, and little pocket mouse Perognathus longimembris, yielded geographic bias equivalent to our most extreme high bias observation datasets (Fig. 8). Both of these species of conservation concern are well-studied and represented by over 1500 observations; we suspect each suffers from extreme sample bias wherein observations are locally dense along roads and near monitoring sites.

DISCUSSION
We use a simulation study with virtual species to compare three sampling bias correction methods that are often used in presence-only SDM but have not been systematically compared across a diverse set of virtual species. We found that the objective for which SDM is used had a large influence on the ability of these methods to compensate for biased observation data. When the intent of SDM is to explain ecological processes, such as identifying environmental factors influencing a species' distribution or characterizing their response curve shapes, the bias correction methods examined here do little to compensate for biased observation data. Bias correction methods did, however, improve the accuracy of mapped predictions of habitat potential, which is often the goal of SDM studies. For those studies focused on predicting spatial patterns of habitat potential Fig. 5. Most habitat potential predictions were less accurate (decrease in ESP) when bias was introduced, though some with low landscape prevalence were actually improved (A). Prediction accuracy was worsened more in specialist species than generalists, though only at the high bias level (brown circles), suggesting that specialists (narrow niche breadth measured by the Mahalanobis distance) may be more susceptible to sampling bias than generalists in our study (B). rather than understanding the mechanisms driving those patterns, the FactorBiasOut background weight correction method is well-suited. We found that FactorBiasOut outperformed the geographic filtering (G-Filter) and environmental filtering (E-Filter) bias correction methods across all levels and types of bias.

Predicting distributions
SDM is most often used for spatial prediction of distributions, and in this case, the use of bias correction methods is clearly recommended. Previous work with virtual species has shown contrasting results, however, with some studies suggesting that G-Filter methods may outperform the Fac-torBiasOut method in some instances (Kramer-Schadt et al. 2013, Fourcade et al. 2014, Stolar and Nielsen 2015, but others showing E-Filter methods as superior to G-Filter methods when using climatic variables (Varela et al. 2014), or that that E-Filtering can lead to mixed results depending on environmental gradients (Gábor et al. 2019) and may not necessarily provide improvements. Each of these studies considered a differing number of virtual species, with at least one, but no more than ten considered. Yet, other work suggests that the FactorBiasOut is superior, especially when targetgroup background data are incorporated (e.g., Syfert et al. 2013, Fourcade et al. 2014, Vollering et al. 2019. The target-group background approach incorporates an estimate of the biased sampling distribution (σ) from pooled Fig. 6. All three correction methods were able to increase prediction ability with environmentally (A) and geographically (B) biased datasets in most species, though the FactorBiasOut correction method (FO) showed the greatest improvement with an average 11% greater increase in ESP than the other two methods: E-Filter (EF) and G-Filter (GF) methods overall. EF resulted in poorer (lower) ESP scores, especially for datasets that were geographically biased (B). observations of multiple species within a target group of similar taxa whose locality data were collected in the same way as the modeled species, and has shown promise for reducing sampling bias (Phillips and Dudik 2008, Elith et al. 2011, Ranc et al. 2016. The benefit of pooling observations from multiple taxa is an increase in the sample size used to estimate σ, thereby improving its precision (Phillips and Dudik 2008). However, this method assumes that σ is the same across all species of the target group, which may not be true for many species represented in biodiversity repositories composed of aggregate survey efforts (Newbold 2010). Fig. 7. Rare species (low landscape prevalence; A) and specialists (narrow niche breadth measured by the Mahalanobis distance; B) showed greater improvements with the FactorBiasOut method than with either the G-Filter or E-Filter methods. Notes: Mean and standard deviation (SD) of correlation coefficients across 100 virtual species assess the ability to represent mechanistic determinants of species distributions when the true explanatory variables are known. Correlation coefficients for models without bias correction (original) and the percent change from the original value (% change) are shown.
Our results suggest that the FactorBiasOut method consistently outperformed both of the two filtering methods in their ability to accurately estimate geographic distributions. While our results may contradict some previous findings, they are not too dissimilar. For example, Kramer-Schadt et al. (2013) found that a G-Filter method resulted in lower omission errors than the FactorBiasOut method, but those differences were within 7%, a margin that was found only in a single, wide-ranging species. Similarly, other work has shown considerable variability in the performance of correction methods among differing levels of sampling bias, with the FactorBiasOut method outperforming other methods in a single virtual species, but not two (actual) species (Fourcade et al. 2014). To address this ambiguity, we simulated 100 virtual species with niche characteristics ranging from rare specialists to common generalists and found that the difference between methods was greatest for rare specialists across all levels of sampling bias.

Inferring niche characteristics
We found that while bias correction methods improved spatial predictions for virtual species, sampling bias greatly reduced the ability to identify the shape of species response curves or select the true explanatory variables using PB SDM, and bias correction methods did not greatly improve this ability. This result highlights differences between predictive and inferential applications of SDM and of statistical ecology generally.
The emphasis on predictive models over explanation has been a trend in ecology over the past several decades (Clark et al. 2001, Evans et al. 2012, Houlahan et al. 2017, Maris et al. 2018) and has led to substantial developments in predictive ecology, such as machine learning methods (Christin et al. 2019, Rammer andSeidl 2019). In general, machine learning methods are better suited for prediction and less so at identifying and explaining causal relationships (Shmueli 2010). This difference is exacerbated when making predictions under novel or no-analog conditions (Williams andJackson 2007, Dietze et al. 2018) where machine learning methods favor model complexity motivated by model fit over transferability and ecological theory (Sinclair et al. 2010, Bell andSchlaepfer 2016). As a correlative statistical approach, PB SDM can be used much like other statistical methods for explanatory or predictive modeling, though for the former, it is important to include only ecologically relevant and plausible environmental covariates (e.g., Guisan and Zimmermann 2000, Bell and  Notes: Agreement scores were calculated for the top performing species distribution model selected by area under the receiver/operator curve (AUC). Mean and standard deviation (SD) agreement scores measure the proportion of the true explanatory variables identified by the model with the highest AUC score.
Schlaepfer 2016), and to keep models simple when variable selection is the focus (e.g., Warren and Seifert 2011).
However, due to the inherent analyses of species-environment relationships that serve as the statistical underpinnings of spatial predictions in SDM (Guisan and Zimmermann 2000), PB SDM has often been leveraged to study niche traits and niche characteristics. One niche characteristic often explored is climatic niche conservation (Wiens and Graham 2005), the condition wherein changes in climate may cause shifts in geographic distributions over evolutionary timescales. Some formal tests, such as niche equivalency (Warren et al. 2008), have been based on SDM and have been used extensively to investigate niche conservatism in related extant taxa (e.g., Strubbe et al. 2014), though these tests compare spatial predictions of habitat potential between species to infer niche similarity rather than direct comparisons of response curves or selected variables. It seems then that using SDM for identifying the drivers of species distributions (e.g., variable selection) may be better accomplished with unbiased presence and absence data (Guisan and Zimmermann 2000) or in combination with mechanistic models incorporating biophysical ecology and functional traits (Kearney et al. 2010a, b) rather than PB SDM, even with bias correction.

Effect of niche characteristics on sensitivity to sampling bias
In the absence of sampling bias, our simulations confirm previously well-established relationships between sample size (landscape prevalence in our case) and the ability to estimate true distributions (Hernandez et al. 2006, Wisz et al. 2008, Proosdij et al. 2015. However, relationships between niche breadth and the effects of sampling bias are not well-understood. Previous work has suggested that the occupied niche may be more difficult to distinguish from background environmental space in generalists with broad tolerances than for specialists (Brotons et al. 2004, Elith et al. 2006, Franklin et al. 2009); on the other hand, recent work has shown that specialist species are more susceptible to sampling noise because of their often reduced sample sizes (Soultan and Safi 2017). If distributions of specialists are in fact more easily estimated than generalists, this would suggest that specialists may be less susceptible to biased sampling than generalists because their smaller niches are likely to be sampled more completely than larger ones. However, in previous studies, niche breadth is often conflated with landscape prevalence (or rarity), assuming that generalists with wide niches are more common than specialists.
Species with wide niches may in fact be rare if their realized and fundamental niches are substantially differentiated by biotic factors such as predation, disease, or competition (Hutchinson 1991). Studies using SDM rarely incorporate biotic factors, and instead often mix observational occurrence records from realized niches with environmental data describing fundamental niche properties, especially when scenopoetic variables are the only type considered (Soberón and Nakamura 2009). Thus, the type of niche being quantified (e.g., fundamental, potential, or realized) is often not well-defined and can influence perception of niche breadth. Our method used occupied environmental uniqueness to Fig. 8. Geographic sampling bias in nine observation datasets obtained from the Global Biodiversity Information Facility gives context to the no, low, medium, and high bias observation datasets used for all virtual species. Geographic sampling bias was measured as the mean difference between theoretical K(r) and border-corrected K(r) Ripley's K density function for inhomogeneous point processes of the spatial point pattern for each biased dataset out to a maximum distance of 350 km. define niche breadth, and with this definition, we found evidence that specialists with narrow niches are more susceptible to high intensities of biased sampling than generalists (Fig. 5B), which we also found for actual species locality data (Fig. 8). We found no evidence of an effect of landscape prevalence on susceptibility to sampling bias, and no interaction between landscape prevalence and niche breadth, suggesting that any susceptibility to sampling bias was due to niche breadth alone, not rarity. Our results suggest that it is niche breadth, not landscape prevalence, that may influence sensitivity to sampling bias, except in cases of extreme rarity. Here, biased sampling actually improved predictions of habitat potential in a few, very rare, species. By design, our study sampled each virtual species proportionally to landscape prevalence, resulting in small sample sizes for rare species. However, rare species that are well-studied may be heavily sampled (e.g., Fig. 8); in these cases, they may be more easily modeled than generalists, as previously suggested.

Practical notes about implementing bias correction methods
Of the three bias correction methods compared here, FactorBiasOut and G-Filter are used most often. These methods are rather straight forward to implement, though each requires tuning with parameters and little guidance is provided in the literature. The FactorBiasOut method implemented in Maxent requires a bias grid, or prior representing hypothesized sampling intensity across the study area (Merow et al. 2013). Where documented, these grids are often created as the Gaussian kernel densities of observations as recommended by Elith et al. (2011), though the choice of standard deviation (SD) for the kernel is not often reported. This choice influences the distance at which adjacent cells affect estimated densities, wherein a small SD equates to a local influence among neighboring cells. For G-Filter methods, the choice of thinning grid size or thinning parameter will influence the local density of observations. The consequences of different choices of these parameters are not well-understood. Elith et al. (2011) suggest SD might reflect some property of a species' dispersal ability to capture the possibility of animal movement between surveys. However, the haphazard nature of opportunistic sampling inherent in biodiversity databases likely confounds the interpretation of SD, and therefore, we chose to use a cross-validation approach based on the spatial pattern of observations to estimate an optimal value for SD. This method minimizes the mean squared error in the Gaussian kernel density field at multiple SD values and identifies an optimum value for a given observation dataset (Diggle et al. 1998). The trade-offs and implications of using statistical measures over biological knowledge are not well-understood; additional work is needed to provide guidelines for selecting SD and developing bias grids.
The G-Filter method also relies on selecting a spatial parameter, namely the bandwidth used to identify areas with high observation densities. Previous studies have filtered observations based on nearest neighbor distances, such as removing observations that are within a certain distance of one another (e.g., Kramer-Schadt et al. 2013, Boria et al. 2014, while others use spatial grids to filter observations from cells with high sample densities (e.g., Vandergast et al. 2013). The decision for the bandwidth size is rarely discussed, though Veloz et al. (2009) use a novel method of exploring spatial autocorrelation in model residuals to determine an optimal value. Our selection of bandwidth and choice of filtering methods were based on previous work to reduce sampling bias in aggregated observations for the southwestern United States (Vandergast et al. 2013, Inman et al. 2014. As with the FactorBiasOut method, additional research is needed to explore differences among alternative implementations of the G-Filter method.
In contrast to the G-Filter method, the E-Filter method is less well represented in the literature. This is likely because it is more tedious to implement and is dependent on explanatory variables specified in any given model. Whereas G-Filter thinning occurs entirely in geographic space and results in a single dataset that can be used to calibrate multiple model specifications, the E-Filter method results in a separate dataset for each model specification. This precludes the use of multi-model inference or complexity metrics, such as Akaike information criterion (Burnham and Anderson 2002) or Bayesian information criterion (Schwarz 1978). Model complexity is not often discussed, but has been shown to be an important factor affecting predictions across geographic regions and time periods (Warren and Seifert 2011, Syfert et al. 2013, Merow et al. 2014, Bell and Schlaepfer 2016.

Conclusions
Due to the proliferation and availability of geolocated species observations, SDM has become a primary tool for tasks such as designing conservation and monitoring programs, evaluating the efficacy of proposed land management actions, and developing recovery planning for threatened or endangered species. In these applications, biased observations are all but assured. Bias correction methods may offer improvement in estimating spatial distributions at low and moderate levels of bias, but we suggest practitioners be wary of using model fit to identify explanatory variables or infer ecological meaning from response curves when using SDM in PB frameworks.

ACKNOWLEDGMENTS
Any use of trade, firm, or product names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.