# Resolving misaligned spatial data with integrated species distribution models

**Editors’ Note**: Papers in this Special Feature are linked online in a virtual table of contents at: www.wiley.com/go/ecologyjournal

## Abstract

Advances in species distribution modeling continue to be driven by a need to predict species responses to environmental change coupled with increasing data availability. Recent work has focused on development of methods that integrate multiple streams of data to model species distributions. Combining sources of information increases spatial coverage and can improve accuracy in estimates of species distributions. However, when fusing multiple streams of data, the temporal and spatial resolutions of data sources may be mismatched. This occurs when data sources have fluctuating geographic coverage, varying spatial scales and resolutions, and differing sources of bias and sparsity. It is well documented in the spatial statistics literature that ignoring the misalignment of different data sources will result in bias in both the point estimates and uncertainty. This will ultimately lead to inaccurate predictions of species distributions. Here, we examine the issue of misaligned data as it relates specifically to integrated species distribution models. We then provide a general solution that builds off work in the statistical literature for the change-of-support problem. Specifically, we leverage spatial correlation and repeat observations at multiple scales to make statistically valid predictions at the ecologically relevant scale of inference. An added feature of the approach is that addressing differences in spatial resolution between data sets can allow for the evaluation and calibration of lesser-quality sources in many instances. Using both simulations and data examples, we highlight the utility of this modeling approach and the consequences of not reconciling misaligned spatial data. We conclude with a brief discussion of the upcoming challenges and obstacles for species distribution modeling via data fusion.

## Introduction

Determining how species respond to changing environmental conditions is fundamental to sound management and species conservation (Yoccoz et al. 2001). Accomplishing this requires leveraging empirical evidence to inform and ultimately validate decision making. This need for data-driven decision making has motivated significant advances in the ability to collect and store spatially and temporally referenced data. At the same time there has been a surge in the development and application of methods that integrate multiple streams of data. These new data-integration approaches seek to use all available data sources to model species distributions while explicitly accounting for differences among data types (Dorazio 2014, Fithian et al. 2015, Giraud et al. 2016, Pacifici et al. 2017, Coron et al. 2018). The advantages of combining multiple data sources in integrated species distribution models (ISDMs) include increased spatial coverage, bias reduction, and overall improvement in estimator accuracy (Dorazio 2014, Fithian et al. 2015, Giraud et al. 2016, Pacifici et al. 2017). Several authors have put forth different approaches for integrating different data sources, typically when one source is collected through standardized surveys and the other source is not (Fletcher et al. 2019, Miller et al. 2019). As a result, we now have a range of methods that leverage information across different data types (Dorazio 2014, Pacifici et al. 2017, Zipkin et al. 2017), among multiple species (Giraud et al. 2016, Thorson et al. 2016, 2017), and among neighboring locations by incorporating spatial correlation (Thorson et al. 2017). As data become more available and easier to access, the propensity to combine data sources will only increase, as will the demand to apply them rigorously to inform decision making.

In light of the increased interest in ISDMs, it is essential to explore the implications that come with combining different data sources. As with all species distribution modeling the goal is to correlate observations of individual species with environmental layers that are driving the observed patterns of occurrence. In some cases, the focus will be on large geographic areas or on species that are difficult to sample. Alternative data sources can fill in gaps that might occur in data collection and improve inference (Pacifici et al. 2017, Fletcher et al. 2019, Miller et al. 2019). Integrated species distribution models can increase precision and reduce bias in certain settings (Pacifici et al. 2017) and are flexible enough to incorporate a wide range of auxiliary data sources (Fletcher et al. 2019, Miller et al. 2019). Despite this, two major problems need to be addressed when fusing multiple streams of data. The first problem is to ensure that the ISDM rigorously combines each data source so that relevant and valid statistical inference is possible. The second is to reconcile spatial and temporal observations properly when they are collected at multiple differing spatial and temporal resolutions. The first problem has already received significant attention (Fletcher et al. 2019, Miller et al. 2019). The result is a range of flexible approaches that have been developed to integrate multiple data sources rigorously (Pacifici et al. 2017, Fletcher et al. 2019). The second problem, however, has not been formally addressed for ISDMs in the ecological literature. This stands in contrast to significant coverage given to the topic in the spatial statistics literature, where it is often referenced as the general *change of support* (COS) problem (Mugglin et al. 2000, Gelfand et al. 2001, Gotway and Young 2002, 2007, Wikle and Berliner 2005, Young and Gotway 2007, Berrocal et al. 2010a,b, Ren and Banerjee 2013, Reich et al. 2014, Parker et al. 2015, Kim and Berliner 2016).

Before exploring the challenges of combining data sources and COS we first need to understand COS as it relates to a single data source. Here, we briefly describe three general COS problems. We encourage the reader to explore the topic more thoroughly (Gelfand et al. 2001, Gotway and Young 2002); *Journal of the Royal Statistical Society, Series A*, Volume 164 Issue 1 is dedicated to the topic. Generally, COS arises from three causes: (1) spatial or temporal misalignment, (2) modifiable areal unit problem (MAUP), and (3) the ecological fallacy problem. It is important to recognize that the effect of COS can exist in relation to either the response variable (e.g., counts, occurrences), the covariates driving the response (e.g., landcover, elevation), or both.

First, data may be “misaligned” either spatially or temporally (Mugglin et al. 2000, Cressie and Wikle 2015), meaning that data may come from different classifications or partitions of parcels of land or from different years or seasons. Take for example the case where a predictor variable (e.g., elevation) is measured at one spatial scale (e.g., county) and another variable (e.g., human population density) is measured at a different spatial scale (e.g., zip code). Our interest may be in using both covariates to explain variation in, say, abundance. However, the misalignment of the covariate information needs to be reconciled to make proper inference (Mugglin et al. 2000). The same holds for temporal mismatch, wherein one covariate may be measured at a different temporal resolution (e.g., annually) than a second covariate (e.g., daily) and the differences must be recognized (Cressie and Wikle 2015). The problem of misalignment also occurs when the response variable (e.g., counts, presence/absence data) is mismatched with the covariate information either spatially (e.g., counts occur at a different spatial scale than the covariate) or temporally (e.g., covariate information, say land cover, comes from a different year than the one in which counts were collected).

The second problem classified under COS is referred to as the modifiable areal unit problem (MAUP) in the geography and statistics literature (Gotway and Young 2002). The MAUP comprises two separate problems: spatial aggregation and the grouping effect. Spatial aggregation is the process of grouping data into increasingly larger geographic units. This might occur when covariate information is aggregated or grouped to a larger scale to match another covariate or response variable (Latimer et al. 2006), or when response variables (e.g., counts) are summarized at increasingly larger geographic scales (e.g., collected at a point, summarized to the county level). Spatial aggregation will change inferences for estimated parameters. The second problem of MAUP, the grouping effect, occurs when there are differences in the size, shape, or formation of the geographic units (Gotway and Young 2002). Grouping effects have been studied extensively in ecology for some time (Turner 1989, Levin 1992).

A third challenge, “ecological fallacy,” is often listed separately from MAUP, but also can be considered a special case of MAUP. Ecological fallacy deals with the case where the underlying individual response to a covariate differs from the response estimated from grouping the individuals (Gotway and Young 2002, Bradley et al. 2016, 2017). The result is that conclusions based on an analysis using fine-resolution data differ from analyses that are conducted using an aggregate or summary of the fine-resolution data (Gotway and Young 2002). “Downscaling” is often used to address this problem in the environmental and remote sensing fields (Bradley et al. 2017). Often the most difficult piece is identifying which variables are responsible for significantly altering the results when data are scaled up or aggregated from individual level to group level (Gotway and Young 2002).

In all three cases, notable bias can occur if COS is not properly handled, and choosing different scales to conduct the analysis results in different magnitudes of error (Bradley et al. 2017). Bias can occur not only in estimating the mean and variance of parameters of interest, but also in any statistic that is estimated at multiple scales (Waller and Gotway 2004, Bradley et al. 2017). The consequences of ignoring COS are hard to predict, and they can include severe biases.

Although the statistical literature is rich with examples of COS (Gotway and Young 2002), it generally remains unaddressed in species distribution modeling. Several authors note that COS occurs when species presence/absence data are referenced to point locations while the environmental data used to predict occurrence are referenced to grid cells (Latimer et al. 2006, Finley et al. 2014). Another example is location error arising from georeferenced covariate information that is summarized or aggregated to a grid cell (Hefley et al. 2017). Latimer et al. (2006) describe two solutions: working at the scale of the responses by assigning the environmental data to that level, or working at the grid cell level by scaling up the response data to match the environmental data. However, either choice loses information through rescaling to match the level of the response or the level of the environmental data. Ideally, we want a method that formally accounts for and circumvents the loss of information due to aggregating data, and that recognizes the variation within and between the aggregated units. COS and its associated issues arise more frequently when multiple sources of data are combined, because both the covariate information and the auxiliary response data can then come from mismatched scales.

The nuances of each type of COS warrant careful consideration of the appropriate solution. A wide range of methods exist (Gotway and Young 2002), with the general goal being to make spatial predictions or estimate variables (covariates or responses) on regions over which they were not measured (Mugglin et al. 2000, Gotway and Young 2002). As Cressie and Wikle (2015) recommend, the only logical solution is to build models at different scales and evaluate the differences in inference when doing so, ideally first building the model at the finest scale and then aggregating or scaling up to fit additional models. We will apply this general philosophy to COS with multiple data sources as well.

Here we lay out a framework for accommodating COS when combining multiple sources of data in ISDMs. First we describe the theoretical underpinnings of COS in the context of ISDMs, then develop COS extensions to a suite of data fusion models that vary in the level of shared information between data sources (Pacifici et al. 2017). We explore the properties of these models via simulation and apply these methods to our motivating data set on Black-throated Blue Warblers (*Setophaga caerulescens*; BTBW) in Pennsylvania, USA (Fig. 1). Our overall objectives are twofold: (1) introduce the concept of COS and demonstrate its relevancy to ISDMs, and (2) identify specific situations when COS most matters and provide recommendations for how it should be handled.

## Change of Support for Integrated Species Distribution Models

We will use a case study of BTBW in Pennsylvania, USA to demonstrate the challenges of accounting for COS in ISDMs (Fig. 1). Here two different data sources provide useful, yet different information about the distribution of BTBW. The first data source is collected at a finer resolution through standardized surveys across the state (Breeding Bird Atlas point counts), and the second data source (eBird) has been summarized at a coarser resolution to account for data that are not collected at a single point location. Combining these data creates a conflict in spatial resolution and necessitates a method that addresses the misalignment. This requires being able to reconcile the misalignment between the two data sources and the differing spatial scales to make inferences about the underlying distribution of BTBW.

### Modeling framework

We envision two general approaches to handle misalignment to accommodate COS. The first naive method is a two-stage approach (we formally define this approach below as the “Covariate” model). The first step consists of imputing the second data source in the spatial locations where the response of the first data source is observed. The prediction could be done in any number of ways depending on the characteristics of the second data source (e.g., presence only, presence–absence, counts) and could be accomplished using any number of appropriate species distribution modeling techniques (Guillera-Arroita et al. 2015). The second step uses those predicted values as known constants and linear predictors in an ISDM (Dorazio 2014, Fithian et al. 2015, Fletcher et al. 2016, 2019, Pacifici et al. 2017, Miller et al. 2019). However, this approach does not account for uncertainty in the predictions from the second data source during the first step and can result in potentially biased inference. The second general approach, and the one we will focus on here, is a joint-modeling strategy. In this case, both sources of data are modeled simultaneously. As a result, uncertainty is properly accounted for and propagated through to the predictions of the joint response. Below we describe the framework for joint modeling of ISDMs to account for spatial misalignment.

Species distributions can generally be thought of as a continuous point process that describes the distribution of individuals across a species’ range. The local intensity of the process (i.e., the local probability an individual occurs at any point in space) determines the density of animals across space. Building a statistical model for the distribution requires carefully aggregating the intensity function for a point process to the scale of the data. As with any probability density function of a continuous random variable, the probability of an observation at any single spatial location is zero. As a result, nonzero probabilities arise only when considering the number of observations in a spatial region. Therefore, some minimum form of aggregation is required. For example, if a camera trap is placed at location $\mathbf{s}_{0}$ and animals that pass within distance $r$ from the camera are recorded, then the region of the survey, $\mathcal{B}$, is the circle with center $\mathbf{s}_{0}$ and radius $r$, and the expected number of observations in $\mathcal{B}$ is $\tilde{\lambda}(\mathcal{B})$, which increases with $r$. Given that all observations are made with reference to an area, it is generally difficult to estimate the function $\lambda(\mathbf{s})$ for all $\mathbf{s}$ without simplifying assumptions about its smoothness.
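The aggregation step in the camera-trap example can be illustrated numerically. For a Poisson point process, the expected count in a region is the integral of the intensity over that region, $\tilde{\lambda}(\mathcal{B})=\int_{\mathcal{B}}\lambda(\mathbf{s})\,d\mathbf{s}$ (the standard definition, assumed here). The Python sketch below approximates that integral over a circular survey region by midpoint quadrature. The function names and the toy intensity surface are hypothetical and serve only as an illustration.

```python
import math


def expected_count(intensity, center, radius, n=400):
    """Approximate the integral of `intensity` over a disc.

    The bounding square of the disc is split into n x n small squares and
    the intensity is evaluated at each square's midpoint (midpoint rule).
    """
    cx, cy = center
    h = 2.0 * radius / n  # side length of each small square
    total = 0.0
    for a in range(n):
        x = cx - radius + (a + 0.5) * h
        for b in range(n):
            y = cy - radius + (b + 0.5) * h
            # Keep only midpoints that fall inside the survey disc.
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += intensity(x, y) * h * h
    return total


def toy_intensity(x, y):
    """Hypothetical intensity surface that increases from west to east."""
    return math.exp(0.5 * x)


# Expected counts for three detection radii around a camera at the origin.
counts = [expected_count(toy_intensity, (0.0, 0.0), r) for r in (0.5, 1.0, 2.0)]
print(counts)
```

Because the intensity is positive everywhere, the expected count grows with the detection radius, which matches the statement that $\tilde{\lambda}(\mathcal{B})$ increases with $r$.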

The log-intensity process can be regressed onto covariates via the model $\log[\lambda(\mathbf{s})]=\mathbf{X}(\mathbf{s})^{T}\boldsymbol{\beta}+\theta(\mathbf{s})$, where $\mathbf{X}(\mathbf{s})$ is a vector of spatial covariates, $\boldsymbol{\beta}$ are the corresponding effects, and $\theta(\mathbf{s})$ is the residual spatial process. Several models for the spatial intensity function have been proposed and we discuss three in Appendix S1.

## Data Fusion Models with COS

As we noted previously, the focus of our paper is to integrate multiple data sources collected at different spatial resolutions. Assume that data source $k$ is available for $m_{k}$ regions $\mathcal{G}_{k1},\dots ,\mathcal{G}_{km_{k}}$. Consistent with our motivating data example with BTBW in Pennsylvania, we address the case where there are two data sources and (1) the first data source $Y_{1j}$ is the number of the $N_{j}$ sampling occasions in region $\mathcal{G}_{1j}$ for which the species was observed, so that $Y_{1j}\in \{0,1,\dots ,N_{j}\}$; and (2) the second data source $Y_{2l}$ is the total number of individuals observed in grid cell $\mathcal{G}_{2l}$, so that $Y_{2l}\in \{0,1,2,\dots \}$. The approaches below are easily generalized to other cases. In our analyses we will treat the first data source as the “gold standard.” These data are collected using a systematic sampling design where effort and location are well defined, and they offer a benchmark for our data integration model. The second data source contains auxiliary data in which we have less confidence, and this is reflected in how we formulate models in some cases (Pacifici et al. 2017). We describe the methods below in the context of the discretized model that assumes the true intensity is constant within each fine-resolution grid cell $\mathcal{B}_{1},\dots ,\mathcal{B}_{n}$ described above. For our motivating example this model is amenable to implementation in standard software (e.g., OpenBUGS, see available code in Data S1). However, we emphasize that other approaches can also be used in the data-fusion models developed in this section.
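Under the discretized model, aggregating the intensity to a data region reduces to a sum over the fine-resolution cells that the region contains. One way to write this, as a sketch, uses $\mathcal{S}_{kj}$ for the set of indices of fine cells inside $\mathcal{G}_{kj}$ (a hypothetical extension of the index-set notation used in the simulation study) and takes $\lambda_{i}$ to be the expected number of individuals in $\mathcal{B}_{i}$. The presence probability on the right assumes Poisson counts and matches the form used to generate the simulated data later in the paper.

$$
\tilde{\lambda}(\mathcal{G}_{kj})=\sum_{i\in \mathcal{S}_{kj}}\lambda_{i},
\qquad
\mathrm{Prob}(\text{species present in }\mathcal{G}_{kj})=1-\exp\{-\tilde{\lambda}(\mathcal{G}_{kj})\}.
$$

Written this way, both data sources are tied to the same fine-resolution intensities $\lambda_{1},\dots ,\lambda_{n}$ even though their regions differ in size.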

Here we lay out three approaches for data fusion that vary in the degree of influence and reliance on the auxiliary data source (Pacifici et al. 2017) and extend each to allow for COS.

### Covariate model

### Shared model

### Correlation model

## Simulation Study: Aggregating Spatial Covariates with a Single Data Source

Now that we have formally defined COS in ISDMs, we want to explore one of the most common challenges that researchers first face when fitting SDMs: how to use spatial covariates that have been collected at different spatial scales. Below we describe a brief simulation study to evaluate the effect of aggregating spatial covariates with a single data source. In this simulation the true intensity surface is generated on a $20\times 20$ fine grid of $n=400$ grid cells $\mathcal{B}_{1},\dots ,\mathcal{B}_{n}$. Data are generated on grid cells $\mathcal{G}_{1},\dots ,\mathcal{G}_{m}$, where each cell contains a regular grid of $k^{2}$ of the $n$ fine-resolution cells, with $\mathcal{S}_{j}$ denoting the indices of the fine-resolution cells in $\mathcal{G}_{j}$ so that $\mathcal{G}_{j}=\bigcup_{i\in \mathcal{S}_{j}}\mathcal{B}_{i}$ (e.g., Appendix S1: Fig. S1a for $k=3$). We first simulate the spatial random effects $\boldsymbol{\theta}=(\theta_{1},\dots ,\theta_{n})\sim \mathrm{CAR}(0.99,1,\mathbf{A})$ and covariate $\mathbf{X}=(X_{1},\dots ,X_{n})^{T}\sim \mathrm{CAR}(\rho,1,\mathbf{A})$. The true intensity is then set to $\log(\lambda_{i})=\beta_{1}+X_{i}\beta_{2}+\theta_{i}$ with $\beta_{1}=0$ and $\beta_{2}=1$. The data for $\mathcal{G}_{j}$ are then generated as $Y_{j}\sim \mathrm{Binomial}(5,pZ_{j})$, where $\mathrm{Prob}(Z_{j}=1)=1-\exp(-\sum_{i\in \mathcal{S}_{j}}\lambda_{i})$, with detection probability $p=0.5$. Data are simulated with aggregation level either $k=2$ or $k=3$ and spatial correlation of the covariate equal to either $\rho=0.50$ or $\rho=0.99$. For all combinations of these settings we simulate 500 data sets.
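The data-generating steps above can be sketched in a few lines of Python. This is a simplified illustration and not the code in Data S1: independent standard normal draws stand in for the CAR-distributed covariate and spatial effects, the sketch assumes that $k$ divides the grid side (so it uses $k=2$), and all function and variable names are hypothetical.

```python
import math
import random


def block_index_sets(side, k):
    """Index sets S_j: row-major fine-cell indices inside each k x k coarse cell."""
    if side % k != 0:
        raise ValueError("this sketch assumes k divides the grid side")
    sets = []
    for top in range(0, side, k):
        for left in range(0, side, k):
            sets.append([r * side + c
                         for r in range(top, top + k)
                         for c in range(left, left + k)])
    return sets


def simulate_coarse_data(lam, index_sets, n_visits=5, p=0.5, rng=random):
    """Draw Y_j ~ Binomial(n_visits, p * Z_j), Prob(Z_j = 1) = 1 - exp(-sum lam_i)."""
    data = []
    for s_j in index_sets:
        psi = 1.0 - math.exp(-sum(lam[i] for i in s_j))
        z = 1 if rng.random() < psi else 0
        data.append(sum(1 for _ in range(n_visits) if rng.random() < p * z))
    return data


rng = random.Random(1)
side, k = 20, 2
beta1, beta2 = 0.0, 1.0

# Assumption: iid N(0, 1) draws replace the CAR fields used in the paper.
x = [rng.gauss(0.0, 1.0) for _ in range(side * side)]
theta = [rng.gauss(0.0, 1.0) for _ in range(side * side)]
lam = [math.exp(beta1 + beta2 * xi + ti) for xi, ti in zip(x, theta)]

index_sets = block_index_sets(side, k)
y = simulate_coarse_data(lam, index_sets, rng=rng)

# The block-averaged covariate is the coarse-scale summary a naive fit would use.
x_bar = [sum(x[i] for i in s_j) / len(s_j) for s_j in index_sets]
print(len(index_sets), y[:10])
```

The contrast of interest is visible in the last two steps: the data depend on the sum of the fine-resolution intensities within each coarse cell, whereas a coarse-scale analysis sees only the block average of the covariate.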

For each simulated data set we fit two models. The first model (“naive”) ignores COS and fits a standard spatial occupancy model using $m$ observations where the log intensity in cell $\mathcal{G}_{j}$ is $\beta_{1}+\tilde{X}_{j}\beta_{2}+\gamma_{j}$, where $\tilde{X}_{j}$ is the average of $X_{i}$ over $\mathcal{G}_{j}$ and the $m$ spatial effects $\gamma_{1},\dots ,\gamma_{m}$ follow a CAR prior defined via the adjacency matrix of $\mathcal{G}_{1},\dots ,\mathcal{G}_{m}$. The second model (“COS”) is the COS model used to generate the data, wherein we account for COS by modeling the process at the same fine resolution used to generate the data (instead of using the average as in the naive model). Both models assume priors $\beta_{1},\beta_{2}\sim \mathrm{Normal}(0,10)$, $\sigma^{2}\sim \mathrm{InvGamma}(0.1,0.1)$, $\rho\sim \mathrm{Beta}(10,1)$ and $p\sim \mathrm{Uniform}(0,1)$. Models are fit in OpenBUGS using 10,000 MCMC samples after a burn-in period of 2,500 iterations (see Data S1 for code). For each model and each data set we compute the posterior distribution of the slope $\beta_{2}$, and present the bias and mean square error of the posterior mean and empirical coverage of 90% intervals averaged over the 500 data sets in Table 1.

$k$ | $\rho$ | Bias (naive) | Bias (COS) | MSE (naive) | MSE (COS) | Coverage (naive) | Coverage (COS) |
---|---|---|---|---|---|---|---|
2 | 0.50 | 0.69 | 0.14 | 1.57 | 0.58 | 0.89 | 0.88 |
2 | 0.99 | 1.02 | 0.25 | 1.80 | 0.39 | 0.79 | 0.86 |
3 | 0.50 | 0.34 | 0.09 | 1.77 | 1.27 | 0.94 | 0.92 |
3 | 0.99 | 1.07 | 0.41 | 2.11 | 0.76 | 0.84 | 0.85 |

The naive method that ignores COS is positively biased in all cases. Its bias and MSE are largest in the cases with a highly correlated covariate process ($\rho=0.99$). Although the COS method does not completely eliminate the bias, it greatly reduces it, especially in the cases with strong spatial correlation, highlighting the need to account for COS even when the only mismatch is between covariate and response data.
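The three summaries reported in Table 1 can be computed from per-data-set results as in the minimal Python sketch below. The function name and the toy inputs are hypothetical; in the study the estimates would be posterior means of $\beta_{2}$ and the intervals would be 90% posterior intervals from each of the 500 fits.

```python
def summarize(estimates, intervals, truth):
    """Return (bias, mean square error, empirical interval coverage).

    estimates: one point estimate per simulated data set
    intervals: one (lower, upper) interval per simulated data set
    truth:     the parameter value used to generate the data
    """
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    coverage = sum(1 for lo, hi in intervals if lo <= truth <= hi) / n
    return bias, mse, coverage


# Toy inputs for four hypothetical data sets with true slope 1.
toy_estimates = [1.2, 0.9, 1.5, 1.0]
toy_intervals = [(0.8, 1.6), (0.5, 1.3), (1.1, 1.9), (0.6, 1.4)]
bias, mse, coverage = summarize(toy_estimates, toy_intervals, truth=1.0)
print(bias, mse, coverage)
```

Bias and MSE together separate systematic error from overall error, while coverage close to the nominal 0.90 indicates that the reported uncertainty is calibrated.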

## Simulation Study: ISDMs with and without COS

$N=5$, the detection probability $p$ is either 0.2 or 0.5 and $E=10$ is the offset for the second data source. The latent intensities $\lambda_{i}$ are simulated as $\log(\lambda_{i})=S_{i}$, where $(S_{1},\dots ,S_{n})$ is generated from the CAR model (with rook neighbors) with mean zero, variance parameter $\sigma^{2}=1$, and spatial dependence parameter $\rho$ set to either 0.50 or 0.99. The first data source, $Y_{1i}$, is observed for all $n=400$ grid cells; the second data source, $Y_{2i}$, is only observed as aggregated counts over $k\times k$ ($k$ is either 2 or 4) rectangular grids, denoted $\overline{Y}_{2j}$ for coarse-resolution grid cell $j=1,\dots ,n/k^{2}$. Appendix S1: Fig. S1 plots one realization with $p=0.2$, $\rho=0.99$, and $k=4$.
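The rook-neighbor structure that underlies the CAR model can be built directly from the grid layout. The Python sketch below constructs only the neighbor lists for a square grid with row-major cell indices; the CAR specification itself is given in Appendix S1 and is not reproduced here, and the function name is hypothetical.

```python
def rook_neighbors(side):
    """Map each cell of a side x side grid to the cells sharing an edge with it."""
    neighbors = {}
    for r in range(side):
        for c in range(side):
            cell = r * side + c
            adjacent = []
            if r > 0:
                adjacent.append(cell - side)  # cell above
            if r < side - 1:
                adjacent.append(cell + side)  # cell below
            if c > 0:
                adjacent.append(cell - 1)     # cell to the left
            if c < side - 1:
                adjacent.append(cell + 1)     # cell to the right
            neighbors[cell] = adjacent
    return neighbors


nb = rook_neighbors(20)

# Number of coarse cells, n / k^2, for the two aggregation levels considered.
n_coarse = {k: (20 * 20) // (k * k) for k in (2, 4)}
print(len(nb), n_coarse)
```

Corner cells have two neighbors, edge cells three, and interior cells four, so the spatial smoothing induced by the CAR model is weaker along the boundary of the study region.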

For each combination of $k$, $\mathrm{\rho}$, and $p$ we generate 100 data sets and fit the following models:

- Single: The second data source is ignored
- Covariate: The covariate model, with $\log(\overline{Y}_{2j}+1)$ used as a covariate
- Shared: The joint model for ${\overline{Y}}_{2j}$ and ${Y}_{1i}$
- Correlation: The correlation model for ${\overline{Y}}_{2j}$ and ${Y}_{1i}$
- Shared (no COS): $\overline{Y}_{2j}$ is assumed to represent one central fine-scale grid cell and the data are analyzed using the shared method without COS (Appendix S1: Fig. S1d)
- Correlation (no COS): $\overline{Y}_{2j}$ is assumed to represent one central fine-scale grid cell and the data are analyzed using the correlation method without COS (Appendix S1: Fig. S1d)

Each model is fit using OpenBUGS with three chains each with 20,000 iterations and the first 5,000 iterations discarded as burn-in (see Data S1 for code). We used uninformative priors for all parameters and evaluated convergence using the Gelman–Rubin statistic and examining trace plots.
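For readers unfamiliar with the Gelman–Rubin statistic, the Python sketch below computes the classic potential scale reduction factor from several chains of draws for one parameter. It is an illustration of the general idea under simplifying assumptions and not necessarily the exact variant reported by OpenBUGS; the function name and the simulated chains are hypothetical.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B / n: variance of the chain means.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


rng = random.Random(0)

# Three chains sampling the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]

# Three chains where one is centered elsewhere (not converged).
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 0.0, 3.0)]

print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values near 1 indicate that the chains agree with one another, whereas values well above 1 suggest that at least one chain is exploring a different region and more iterations are needed.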

$k$ | $p$ | $\rho$ | Single | Shared (COS) | Correlation (COS) | Covariate | Shared (no COS) | Correlation (no COS) |
---|---|---|---|---|---|---|---|---|
**(a) Brier score** | | | | | | | | |
2 | 0.2 | 0.50 | 0.141 | 0.126 | 0.135 | 0.132 | 0.133 | 0.137 |
2 | 0.2 | 0.99 | 0.117 | 0.101 | 0.110 | 0.102 | 0.108 | 0.117 |
2 | 0.5 | 0.50 | 0.018 | 0.018 | 0.018 | 0.018 | 0.019 | 0.018 |
2 | 0.5 | 0.99 | 0.017 | 0.016 | 0.016 | 0.016 | 0.017 | 0.017 |
4 | 0.2 | 0.50 | 0.138 | 0.135 | 0.139 | 0.138 | 0.137 | 0.143 |
4 | 0.2 | 0.99 | 0.118 | 0.110 | 0.116 | 0.112 | 0.112 | 0.119 |
4 | 0.5 | 0.50 | 0.018 | 0.018 | 0.018 | 0.018 | 0.018 | 0.018 |
4 | 0.5 | 0.99 | 0.017 | 0.016 | 0.017 | 0.016 | 0.017 | 0.017 |
**(b) Classification accuracy** | | | | | | | | |
2 | 0.2 | 0.50 | 0.775 | 0.802 | 0.794 | 0.789 | 0.791 | 0.786 |
2 | 0.2 | 0.99 | 0.820 | 0.852 | 0.833 | 0.848 | 0.840 | 0.823 |
2 | 0.5 | 0.50 | 0.981 | 0.981 | 0.981 | 0.981 | 0.980 | 0.981 |
2 | 0.5 | 0.99 | 0.982 | 0.982 | 0.982 | 0.981 | 0.980 | 0.981 |
4 | 0.2 | 0.50 | 0.777 | 0.784 | 0.786 | 0.780 | 0.781 | 0.780 |
4 | 0.2 | 0.99 | 0.821 | 0.836 | 0.824 | 0.831 | 0.834 | 0.820 |
4 | 0.5 | 0.50 | 0.981 | 0.981 | 0.981 | 0.981 | 0.981 | 0.981 |
4 | 0.5 | 0.99 | 0.982 | 0.982 | 0.982 | 0.981 | 0.981 | 0.981 |
**(c) CPU times (min)** | | | | | | | | |
2 | 0.2 | 0.50 | 2.57 | 2.75 | 4.54 | 2.56 | 2.54 | 4.27 |

Including the second data source yields substantial improvement over the single-data-source model only when the grid cells are small ($k=2$) and detection is low ($p=0.2$). With large grid cells the aggregated data are too coarse to provide useful spatial information, and with high detection the first data source provides sufficient information to produce precise maps, because we included data from all cells within the area for this data source. Strong spatial correlation improves classification accuracy for all methods, but the second data source provides roughly the same increase in precision regardless of the spatial correlation.

Focusing on the two cases with $k=2$ and $p=0.2$ where including the second data source is useful, the results are fairly robust to the COS method. The two simplest COS methods are the covariate model and the naive methods that include the aggregated data as a data point without accounting for COS. These two simple models perform comparably to the more sophisticated shared and correlation models. The average run times for these methods (Table 2c) are approximately 50% less than the full correlation model. In summary, these two methods provide simple and effective means of accommodating COS in ISDMs.
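The two accuracy measures in Table 2 can be computed from predicted occupancy probabilities and true occupancy states as in the Python sketch below. The 0.5 classification threshold is an assumption made for illustration (the threshold used in the study is not stated here), and the function names and toy values are hypothetical.

```python
def brier_score(probabilities, states):
    """Mean squared difference between predicted probability and true 0/1 state."""
    return sum((p - z) ** 2 for p, z in zip(probabilities, states)) / len(states)


def classification_accuracy(probabilities, states, threshold=0.5):
    """Share of cells whose thresholded prediction matches the true state.

    Assumption: a cell is classified as occupied when its predicted
    probability exceeds `threshold`.
    """
    predicted = [1 if p > threshold else 0 for p in probabilities]
    return sum(1 for yhat, z in zip(predicted, states) if yhat == z) / len(states)


# Toy predictions for four hypothetical grid cells.
toy_probabilities = [0.9, 0.2, 0.6, 0.1]
toy_states = [1, 0, 0, 0]

brier = brier_score(toy_probabilities, toy_states)
accuracy = classification_accuracy(toy_probabilities, toy_states)
print(brier, accuracy)
```

Lower Brier scores and higher classification accuracy both indicate better maps, but the Brier score also penalizes overconfident probabilities that a thresholded classification would hide.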

## Case Study: Black-throated Blue Warblers in Pennsylvania

We next apply the data fusion models with and without COS on a data set for BTBW in Pennsylvania, USA. Our goal is to examine the real-world consequences of ignoring COS and to make recommendations for modeling. We have two data sets collected from two different sources. We further subsample these data at different spatial scales (i.e., observations are assigned to cells of increasing sizes) to understand the utility of incorporating COS into ISDMs.

The first data set we use includes point count survey data collected as part of the second Pennsylvania Breeding Bird Atlas (BBA data; Wilson et al. 2012). During a 5-yr period from 2005 to 2009, 33,846 point count surveys were conducted across the state of Pennsylvania. An even distribution of points was achieved by randomly selecting eight roadside locations within each standard 1/24-degree latitude by 1/16-degree longitude block used for the atlas (Grid 1; Table 3). Point counts occurred during morning hours in the peak breeding season (last week of May through the end of June). Observers recorded singing males of all species during a 6 min 15 s survey. Observations were divided into five 75-s intervals and classified by whether the bird was located less than or greater than 150 m from the observer. In our analysis we used all observations of singing male BTBW within 150 m of the observer and excluded observations >150 m from the observer.

Spatial resolution | Grid size (degrees) | Grid size (km$^{2}$) |
---|---|---|
Grid 1 | 1/24 × 1/16 | 24.3 |
Grid 2 | 1/12 × 1/8 | 97.5 |
Grid 3 | 1/3 × 1/2 | 1,553.6 |
Grid 4 | 2/3 × 1 | 6,230.5 |

Our second data set consists of eBird observations (Sullivan et al. 2009). We filtered eBird records to only include observations during the same 5-yr period (2005–2009) and only included records during the breeding season (late May–July). Records that did not include measures of survey effort were excluded. A subset of the BBA data was entered into the eBird database. To avoid duplication these records were also removed for analysis. A total of 4,937 checklists were included in our analyses. eBird data were summarized at three different resolutions, not including the original scale of the BBA data (Grid 1; Table 3).

Preliminary analyses found that percent forest cover has a positive relationship with the occurrence of Black-throated Blue Warbler. We therefore include this covariate in all of the models to understand the consequences of spatial misalignment on the ability to estimate the covariate effects. In addition, we summarize the second data source (eBird) in two different ways. First, we take the sum of the eBird counts for a particular grid size and average it across all of the BBA cells at Grid 1 within the larger grid (denoted by “Avg” following the model name). Second, we explore the effects of an ad hoc approach wherein we reconcile the misalignment by matching the grids for all of the data (denoted by “Scaled” following the model name). That is, we scale up the BBA data to match the eBird grid. This mimics the case where nothing is known about the location of the finer-resolution data, which are instead scaled up to match the second data source.
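The difference between the two summaries can be seen in a small Python sketch for a single coarse cell. The “Avg” function follows the description above; the pooling rule in the “Scaled” function (summing detections and sampling occasions across the fine cells) is an assumption made for illustration, and all names and toy values are hypothetical.

```python
def average_to_fine(coarse_total, fine_cells):
    """'Avg': spread a coarse-cell eBird total evenly over its Grid 1 cells."""
    share = coarse_total / len(fine_cells)
    return {cell: share for cell in fine_cells}


def scale_up(detections, occasions, fine_cells):
    """'Scaled': pool fine-resolution BBA records up to the coarse cell.

    Assumption: detections and sampling occasions are summed over the
    fine cells, which discards where within the coarse cell they occurred.
    """
    pooled_detections = sum(detections[cell] for cell in fine_cells)
    pooled_occasions = sum(occasions[cell] for cell in fine_cells)
    return pooled_detections, pooled_occasions


# One hypothetical coarse cell made up of four Grid 1 cells.
cells = ["a", "b", "c", "d"]
ebird_total = 12
bba_detections = {"a": 2, "b": 0, "c": 1, "d": 0}
bba_occasions = {"a": 5, "b": 5, "c": 5, "d": 5}

avg_covariate = average_to_fine(ebird_total, cells)
scaled_bba = scale_up(bba_detections, bba_occasions, cells)
print(avg_covariate, scaled_bba)
```

The “Avg” summary keeps the BBA response at its original resolution and only coarsens the auxiliary information, whereas the “Scaled” summary coarsens the response itself, so any within-cell variation in the BBA data can no longer inform the model.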

To evaluate the effects of ignoring vs. accommodating COS fully, we fit the data fusion models described in the *Data Fusion Models with COS* section with and without COS to 20% of the BBA data and compare the results with a model fit to all of the BBA data. The full BBA data set (33,846 points across Pennsylvania) has excellent geographic coverage, and by subsetting this data set we were able to explore the contrast in performance among the approaches.

## Case Study Results

Overall, models ignoring COS perform poorly compared to models incorporating COS. Fig. 2 shows the estimated occupancy probability across data fusion models and whether COS was incorporated. All of the models incorporating COS had smaller credible intervals and were centered around the value estimated by the full BBA data set. Models ignoring COS and using both data sources equally (Shared) produced estimates that were mostly much higher than those from the full data set, although this is not the case when the covariate is aggregated up to match the eBird grid size (models with “Scaled” after name). The covariate model using the averaged covariate across all of the finer-resolution cells (models with “Avg” after name) performs well compared to more complex models (shared and correlation).

Individual site-level estimates of $\psi$ show similar results. Appendix S1: Fig. S2 plots the estimates when both data sources are at grid level 1. Models that do not account for COS tend to oversmooth the estimated occurrence probabilities compared to the full data set. This becomes more pronounced as the degree of spatial misalignment increases (Fig. 3). Again, the covariate model performs competitively with the more complex shared COS model and outperforms the models ignoring COS.

We can compare the performance of the two models using different approaches to summarizing the second data source in the covariate model. Fig. 4 shows the performance at grid level 2 and Appendix S1: Fig. S4 depicts the performance at grid level 4. Here we can see how the second approach (scaling up the first data source to match the second) clearly averages over the spatial variation at a finer scale and oversmooths the predictions.

Fig. 5 shows the differences in estimated effects of percent forest cover when ignoring COS vs. accommodating it for data fusion models. The full data set (denoted by “Single”) shows a positive relationship between percent forest cover and occurrence probabilities. This relationship is not as clear with the data fusion models, although this is probably due to the reduction in data (full data set vs. 20% of the data being used for all of the data fusion models). Overall, the models incorporating COS tend to perform less variably and have reduced uncertainty estimates. It is also important to note that as the degree of misalignment increases, the amount of uncertainty increases as well. Models using the second data source summarized at Grids 3 and 4 have highly variable and uncertain estimates relative to models using the second data source summarized at Grids 1 and 2. This pattern is especially pronounced for the models ignoring COS.

## Discussion

We present the first comprehensive treatment of spatial misalignment for ISDMs in the ecological literature. Within the spatial statistics literature, it is well known that spatial alignment matters when making predictions (Gelfand et al. 2001, Gotway and Young 2002). Thus, it is not surprising that our results show that COS matters and, when unaddressed, leads to biased parameter estimates when combining data sources to build ISDMs. Data integration methods have shown both utility and future promise to improve our inferences about species distributions as well as population and community dynamics (Zipkin and Saunders 2018, Fletcher et al. 2019, Miller et al. 2019). Although much of the current effort has focused on the development of estimators for different data types (Dorazio 2014, Fletcher et al. 2016, Pacifici et al. 2017), a parallel effort is needed to deal with scale and alignment in building models.

Our results highlight cases where not accounting for COS may be especially prone to introduce bias and reduce accuracy in results. We found bias and misclassification errors to be greatest when spatial correlation was high and when detection was low. Error due to COS was also greater when the relationship between distribution and the environment was defined by small-scale processes. For example, greater bias would be expected in our estimated relationships for BTBW if abundance were more strongly correlated with local forest cover within 100 m of a location than with forest cover at the landscape scale, as measured when values are taken for whole grid cells. In general, summarizing covariate information to match the grid size of observations smooths over important spatial variation and can result in a loss of power to detect relationships and fine-scale trends. The likely result is that the strength of ecological relationships is underestimated. This result is not unique to data integration methods, but arises any time we fit models at coarse scales and ignore the COS issue.
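
The loss of signal from coarsening a covariate can be illustrated with a small simulation. This is a toy sketch under strong assumptions, not a reanalysis of the case study: the covariate is simulated as independent standard normal noise on a 64 × 64 grid (which exaggerates fine-scale variation), occurrence follows a logistic relationship with a hypothetical slope `beta`, detection is assumed perfect, and all names are illustrative.

```python
import numpy as np


def coarsen_covariate(x, k):
    """Replace each fine cell's value with the mean of its k x k block."""
    rows, cols = x.shape
    block_mean = x.reshape(rows // k, k, cols // k, k).mean(axis=(1, 3))
    return np.kron(block_mean, np.ones((k, k)))


rng = np.random.default_rng(1)
n, k, beta = 64, 4, 1.5                      # grid size, block size, assumed slope

x_fine = rng.normal(size=(n, n))             # fine-scale covariate (assumed i.i.d.)
psi = 1.0 / (1.0 + np.exp(-beta * x_fine))   # occurrence probability per fine cell
z = rng.binomial(1, psi)                     # simulated occurrence, perfect detection

x_coarse = coarsen_covariate(x_fine, k)      # covariate summarized at the coarse grid

# Association between occurrence and the covariate at each resolution.
corr_fine = np.corrcoef(x_fine.ravel(), z.ravel())[0, 1]
corr_coarse = np.corrcoef(x_coarse.ravel(), z.ravel())[0, 1]
print(f"fine-scale covariate:  r = {corr_fine:.2f}")
print(f"coarsened covariate:   r = {corr_coarse:.2f}")
```

Because occurrence here is driven entirely by fine-scale variation, the block-averaged covariate retains only a fraction of the association. A covariate that varies smoothly over space would lose much less, consistent with the contrast drawn above between small-scale and landscape-scale processes.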

We explored three general approaches to integrating two data sources, which we refer to as shared, correlation, and covariate models (Pacifici et al. 2017). The covariate modeling approach provides a simple and efficient method for dealing with COS when it occurs between two data sets. By using data collected at a coarser scale as a covariate, it is possible to estimate the relationship of fine-level processes while sufficiently accounting for information loss due to spatial misalignment. The extent of the spatial misalignment will define the extent to which the two data sets are correlated. As demonstrated previously (Pacifici et al. 2017, Miller et al. 2019), the covariate approach also provides a flexible method to deal with other observational errors, such as misidentification and misspecification of locations.

What we refer to as a shared modeling approach, or a joint-likelihood approach, leads to the greatest preservation of information when COS is accounted for while combining data sources. Using a shared approach requires that both data sets be of high quality and that COS can be modeled accurately between the two data sets. If this is the case, then information from both data sets is placed on equal footing and is used to model a shared (or joint) underlying process. In contrast, when it is difficult to specify the COS, the covariate approach performed relatively well, especially when the primary data set can be specified at a fine scale.

Our results point to some recommendations for SDMs in general, not just when data integration is used. Misalignment between covariate resolution and the size of the grid cell for which responses are modeled is not unique to integrated methods (Latimer et al. 2006). One insight from our specific results is that fine-scale relationships between covariate and species distribution are more affected by ignoring misalignment than coarse-scale relationships. This suggests that covariates such as average climate, which tend to follow smoother gradients, especially in nonmountainous regions, should be relatively robust to spatial misalignment. In contrast, estimates of fine-scale habitat relationships, such as the effect of forest cover in a fragmented landscape, will be more sensitive to misalignment. In addition, many of the data sets we use to predict species distributions, such as museum records, citizen science data, or even large-scale designed surveys, include large uncertainty about the spatial location of records (Dickinson et al. 2010). Therefore, there is a need to understand better how scale influences inferences made from all SDMs (Steenweg et al. 2018).

### COS model steps

We are unable to provide recommendations that apply universally to fitting ISDMs. However, we provide five steps that we believe should be followed when addressing COS in ISDMs.

- Define the stochastic model for ecological process at the finest scale or resolution.
- Define support for observed data and determine the desired scale for predictions, i.e., the scale at which conservation and management decisions will be made.
- Identify best way to link data sources based on underlying ecological process. Here a second data source may provide a diversity of information including sources of error or effort.
- Develop joint model for data sources and the underlying ecological process and conduct inference.
- Conduct model evaluation and check for sensitivity (e.g., significant change in results when adding new data sources) specifically when rescaling the data.
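
As an illustration of the first two steps, the standard change-of-support relationship from the spatial statistics literature (Gelfand et al. 2001, Gotway and Young 2002) links a process defined at fine support to its block-level counterpart by averaging. The notation below is illustrative only and is not taken from the model section: $Y(s)$ is a hypothetical fine-scale process at location $s$, $B$ is a coarse grid cell with area $|B|$, and $s_1, \dots, s_{n_B}$ are the $n_B$ equal-area fine cells that tile $B$.

$$
Y(B) = \frac{1}{|B|}\int_{B} Y(s)\,ds \approx \frac{1}{n_B}\sum_{i=1}^{n_B} Y(s_i)
$$

The approximation is exact if $Y$ is constant within each fine cell. Defining the process at the finest resolution and then expressing the support of each data source in terms of blocks such as $B$ is what allows coarse observations to inform predictions at the finer scale.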

### Temporal mismatch

Here we have purposely excluded a full evaluation of temporal mismatch because we believe it deserves its own treatment in a separate paper. However, we can provide a few insights into handling temporal mismatch based on our experiences with ISDMs. The first question an analyst must address is whether interest lies in a static or a dynamic model of species distribution. This question dictates the types of data collected and the temporal resolution necessary to assume that distributional patterns are changing through time. If the analyst is interested in modeling distributional changes via dynamic models, then the temporal resolution of the data must represent the appropriate time scale to allow changes in the distribution at an ecologically relevant scale. Combining multiple sources of data can present challenges when opportunistic data come from historical records (e.g., museum records), creating a gap in time. For example, it is common to use presence-only data that may have originated decades earlier than survey data. In this case, the appropriate inference depends on the interpretation of “distribution,” in that a coarse time scale suggests a coarser definition of distribution and is akin to the results of redefining the response of interest (Guillera-Arroita et al. 2015). We believe that this definition can be relaxed when interest involves a static distribution of species occurrence, but understanding the full implications of temporal mismatch when combining multiple sources of data remains an important and active area of research.

Furthermore, to understand the implications of combining different data sources fully, it is necessary to classify auxiliary data by how they are used to inform SDMs. Similar to integrated population models (IPMs), wherein the goal is to include supplemental data sources that inform specific vital rates that drive populations (Zipkin and Saunders 2018), we can identify the components of SDMs and how integrating new data improves our understanding of distribution and distributional changes in populations. Specifically, we are interested in how additional data sources improve our understanding of the drivers of distributions, and we do this by classifying new sources of data into two categories, spatial and/or temporal, wherein new information can be added. The spatial category can be thought of as including additional observations (presences and/or absences) that modify the geographic footprint of a species; that provide information about sampling effort or variation in sampling effort; that provide information about sources of bias or error (e.g., false positives or false negatives) or help reduce these sources of error; or that uncover or identify relationships with environmental covariates or other species (especially at different spatial scales). The temporal category includes observations (presences and/or absences) that modify the geographic range over a temporal scale of interest (e.g., annual or seasonal variation), or that improve our understanding of error and/or sampling effort in the same way as spatial information, but over a temporal rather than a spatial scale. The classification of how additional data will inform SDMs is a critical step in fully understanding whether it is worth using auxiliary information and how it will help.

## Future Directions

As we move forward and the number of opportunities to combine data sources increases, we believe future directions for research include the need to explore more fully the situations where spatial misalignment has the greatest influence on SDMs. In addition, new applications such as dynamic distribution models are also likely to be affected by COS, specifically because the ability to estimate changes in distribution is dependent on differentiating when local changes did and did not occur, often at a finer scale than the resolution of many data sets (Kery et al. 2013, Zurell et al. 2016). Finally, spatial alignment is not a problem unique to data integration for SDMs. Other integrated models, such as IPMs, will benefit from a better understanding of the effects of spatial misalignment and accounting for COS (Schaub and Abadi 2011, Zipkin et al. 2017).

## Acknowledgments

We would like to thank Steve Beissinger, Brian Inouye, Elise Zipkin, and two anonymous reviewers for helpful comments on earlier drafts of the manuscript.

## Literature Cited

## Data Availability

Data are available from GitHub/Zenodo: http://doi.org/10.5281/zenodo.2541844